New individual (int-area) draft minimizing diversity of timestamp formats


New individual (int-area) draft minimizing diversity of timestamp formats

Bob Briscoe-4
Wei, and authors of draft-wang-tcpm-low-latency-opt,

See
draft-mizrahi-intarea-packet-timestamps-00
(just discussed in IETF int-area)

I've made the author aware of timestamp resolution requirements in tcpm



Bob

--
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/

_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Review of draft-wang-tcpm-low-latency-opt-00

Bob Briscoe-4
Wei, Yuchung, Neal and Eric, as authors of draft-wang-tcpm-low-latency-opt-00,

I promised a review. It questions the technical logic behind the draft, so I haven't bothered to give a detailed review of the wording of the draft, because that might be irrelevant if you agree with my arguments.

1/ MAD by configuration?
   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.
That sentence triggered my "anti-human-intervention" reflex. My train of thought went as follows:

* Let's consider what advice we would give on what MAD value ought to be configured.
* You say that MAD can be smaller in DCs. So I assume your advice would be that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.
* So why configure one value of MAD for all RTTs? That only makes sense in DC environments where the range of RTTs is small.
* However, for the range of RTTs on the public Internet, why not calculate MAD from RTT and granularity, then standardize the calculation so that both ends arrive at the same result when starting from the same RTT and granularity parameters? (The sender and receiver might measure different smoothed (SRTT) values, but they will converge as the flow progresses.)

Then the receiver only needs to communicate its clock granularity to the sender, and the fact that it is driving MAD off its SRTT. Then the sender can use a formula for RTO derived from the value of MAD that it calculates the receiver will be using. Then its RTO will be completely tailored to the RTT of the flow.
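To make this concrete, here is a rough C sketch of the sender side under that scheme. Everything in it (the helper names, the constants k and d, microsecond units) is illustrative only; the actual standardized formula would be whatever the WG agreed on, e.g. along the lines of {Note 1} below.

    #include <stdint.h>

    static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

    /* Hypothetical standardized formula that both ends compute identically,
     * e.g. MAD = max(k * d * SRTT / W, G_r) as in {Note 1}; times in us. */
    static uint32_t assumed_receiver_mad(uint32_t srtt_us, uint32_t w_segments,
                                         uint32_t g_r_us)
    {
        const uint32_t k = 2;   /* bunching allowance, to be checked empirically */
        const uint32_t d = 2;   /* delayed-ACK factor: ACK every other packet */
        return max_u32(k * d * srtt_us / w_segments, g_r_us);
    }

    /* RTO derived from the MAD the sender calculates the receiver will use,
     * so the timeout is tailored to the measured RTT of this flow. */
    static uint32_t derived_rto_us(uint32_t srtt_us, uint32_t rttvar_us,
                                   uint32_t g_s_us, uint32_t w_segments,
                                   uint32_t g_r_us)
    {
        const uint32_t K = 4;   /* as in RFC 6298 */
        uint32_t mad = assumed_receiver_mad(srtt_us, w_segments, g_r_us);
        return srtt_us + max_u32(g_s_us, K * rttvar_us) + mad;
    }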

Note: There are two different uses for the min RTO that need to be separated:
    a) Before an initial RTT value has been measured, to determine the RTO during the 3WHS.
    b) Once either end has measured the RTT for a connection.
(a) needs to cope with the whole range of possible RTTs, whereas (b) is the subject of this email, because it can be tailored for the measured RTT.

2/ The problem, and its prevalence

With gradual removal of bufferbloat and more prevalent usage of CDNs, typical base RTTs on the public Internet now make the value of minRTO and of MAD look silly.

As can be seen above, the problem is indeed that each end only has partial knowledge of the config of the other end.
However, the problem is not just that MAD needs to be communicated to the other end so it can be hard-coded to a lower value.
The problem is that MAD is hard-coded in the first place.

The draft needs to say how prevalent the problem is (on the public Internet) where the sender has to wait for the receiver's delayed ACK timer at the end of a flow or between the end of a volley of packets and the start of the next.

The draft also needs to say what tradeoff is considered acceptable between a residual level of spurious retransmissions and lower timeout delay. Eliminating all spurious retransmissions is not the goal.

The draft also needs to say that introducing a new TCP Option is itself a problem (on the public Internet), because of middleboxes particularly proxies. Therefore a solution that does not need a new TCP Option would be preferable....

Perhaps the solution for communicating timestamp resolution in draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites draft-trammell-tcpm-timestamp-interval-01) could be modified to also communicate:
* TCP's clock granularity (closely related to TCP timestamp resolution),
*  and the fact that the host is calculating MAD as a function of RTT and granularity.
Then the existing timestamp option could be repurposed, which should drastically reduce deployment problems.

3/ Only DC?

All the related work references are solely in the context of a DC. Pls include refs about this problem in a public Internet context. You will find there is a pretty good search engine at www.google.com.

The only non-DC ref I can find about minRTO is [Psaras07], which is mainly about a proposal to apply minRTO if the sender expects the next ACK to be delayed. Nonetheless, the simulation experiment in Section 5.1 provides good evidence for how RTO latency is dependent on uncertainty about the MAD that the other end is using.

[Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks, Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991 Springer-Verlag (2007)
https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited

4/ Status

Normally, I wouldn't want to hold up a draft that has been proven over years of practice, such as the technique in low-latency-opt, which has been proven in Google's DCs over the last few years. Whereas, my ideas are just that: ideas, not proven. However, the technique in low-latency-opt has only been proven in DC environments where the range of RTTs is limited. So, now that you are proposing to transplant it onto the public Internet, it also only has the status of an unproven idea.

To be clear, as it stands, I do not think low-latency-opt is applicable to the public Internet.


5/ Nits
These nits depart from my promise not to comment on details that could become irrelevant if you agree with my idea. Hey, whatever,...

S.3.5:
	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
My immediate reaction to this was that G should not appear twice. However, perhaps you meant them to be G_s and G_r (sender and receiver) respectively. {Note 2}

S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than the default (given some companies are already investing in commercial public space flight programmes, so TCP could need to routinely support RTTs that are longer than typical not just shorter).


Cheers



Bob


{Note 1}: On average, if not app-limited, the time between ACKs will be d_r*R_r/W_s where:
   R is SRTT
   d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
   W is the window in units of segments
   subscripts X_r or X_s denote receiver or sender for the half-connection.

So as long as the receiver can estimate the varying value of W at the sender, the receiver's MAD could be
    MAD_r = max(k*d_r*R_r / W_s, G_r),
The factor k (lower case) allows for some bunching of packets e.g. due to link layer aggregation or the residual effects of slow-start, which leaves some bunching even if SS uses pacing. Let's say k=2, but it would need to be checked empirically.

For example, take R=100us, d=2, W=8 and G = 1us.
Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to be greater, but there would certainly be no need for MAD to be 5ms, which is perhaps 100 times greater than necessary.
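As a quick check of that arithmetic, a throwaway C snippet (values exactly as in the example above, nothing from the draft):

    #include <stdio.h>

    int main(void)
    {
        /* all times in microseconds, values from the example above */
        double R = 100.0, d = 2.0, W = 8.0, G = 1.0, k = 2.0;
        double inter_ack = d * R / W;                          /* 25 us */
        double mad = (k * inter_ack > G) ? k * inter_ack : G;  /* 50 us */
        printf("inter-ACK time %.0f us, MAD_r %.0f us\n", inter_ack, mad);
        return 0;
    }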

{Note 2}: Why is there no field in the Low Latency option to communicate receiver clock granularity to the sender?


Bob

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/

_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Re: Review of draft-wang-tcpm-low-latency-opt-00

Wei Wang
Hi Bob,

Thanks a lot for your review and detailed feedback on the draft.
Please see my comments inline below:

On Wed, Aug 2, 2017 at 8:54 AM, Bob Briscoe <[hidden email]> wrote:
Wei, Yuchung, Neal and Eric, as authors of draft-wang-tcpm-low-latency-opt-00,

I promised a review. It questions the technical logic behind the draft, so I haven't bothered to give a detailed review of the wording of the draft, because that might be irrelevant if you agree with my arguments.

1/ MAD by configuration?
   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.
That sentence triggered my "anti-human-intervention" reflex. My train of thought went as follows:

* Let's consider what advice we would give on what MAD value ought to be configured.
* You say that MAD can be smaller in DCs. So I assume your advice would be that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.
* So why configure one value of MAD for all RTTs? That only makes sense in DC environments where the range of RTTs is small.
* However, for the range of RTTs on the public Internet, why not calculate MAD from RTT and granularity, then standardize the calculation so that both ends arrive at the same result when starting from the same RTT and granularity parameters? (The sender and receiver might measure different smoothed (SRTT) values, but they will converge as the flow progresses.)

Then the receiver only needs to communicate its clock granularity to the sender, and the fact that it is driving MAD off its SRTT. Then the sender can use a formula for RTO derived from the value of MAD that it calculates the receiver will be using. Then its RTO will be completely tailored to the RTT of the flow.

First of all, we recommend that operating systems have a per-route MAD configuration API and a per-connection MAD configuration API, so different connections can have different MAD values configured. It is not one value for all.

And in my opinion, the MAD value does not depend only on RTT and clock granularity. It also depends on how the application wants delayed ACKs to behave. Some applications might send data only every 1 ms, say, and so would delay their ACKs up to 2 ms so the ACK can always be piggybacked on the data.
That is why a per-connection MAD configuration makes sense, so the application can fine-tune MAD according to its own needs.

And when a user tries to set a new MAD value, we do a boundary check to make sure it is less than the current default MAD value. This is a safety check so the user cannot configure something worse than the current default.
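Roughly, the check looks like the following sketch (hypothetical names and constants, not our actual implementation):

    #include <errno.h>
    #include <stdint.h>

    #define DEFAULT_MAD_US  200000   /* e.g. a 200 ms default delayed-ACK maximum */
    #define TIMER_GRAN_US    10000   /* e.g. 10 ms delayed-ACK timer granularity */

    static int set_connection_mad(uint32_t *conn_mad_us, uint32_t requested_us)
    {
        if (requested_us >= DEFAULT_MAD_US)
            return -EINVAL;          /* reject values no better than the default */
        /* round up to what the delayed-ACK timer can actually deliver;
         * this rounded value is what would be advertised in the option */
        *conn_mad_us = ((requested_us + TIMER_GRAN_US - 1) / TIMER_GRAN_US)
                       * TIMER_GRAN_US;
        return 0;
    }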

About your question in {Note 2} on why the receiver does not communicate its clock granularity to the sender: I don't really see a reason why the receiver-side clock granularity is needed, because the MAD value sent by the receiver is already rounded to the clock granularity. Say a user wants to set MAD to 1 ms and the clock granularity is 10 ms; the receiver will send a MAD value of 10 ms. In the draft, we specify that:

      If specified, then the MAD value in the Low Latency option MUST be
      set, as close as possible, to the implementation's actual delayed
      ACK timeout for the connection.  Note that the actual maximum
      delayed ACK timeout of the connection may be larger than the
      actual user specified value because of implementation constraints 
      (e.g. timer granularity limitations).



Note: There are two different uses for the min RTO that need to be separated:
    a) Before an initial RTT value has been measured, to determine the RTO during the 3WHS.
    b) Once either end has measured the RTT for a connection.
(a) needs to cope with the whole range of possible RTTs, whereas (b) is the subject of this email, because it can be tailored for the measured RTT.

Again, we don't think the MAD value is only a function of RTT and clock granularity.

 

2/ The problem, and its prevalence

With gradual removal of bufferbloat and more prevalent usage of CDNs, typical base RTTs on the public Internet now make the value of minRTO and of MAD look silly.

As can be seen above, the problem is indeed that each end only has partial knowledge of the config of the other end.
However, the problem is not just that MAD needs to be communicated to the other end so it can be hard-coded to a lower value.
The problem is that MAD is hard-coded in the first place.

The draft needs to say how prevalent the problem is (on the public Internet) where the sender has to wait for the receiver's delayed ACK timer at the end of a flow or between the end of a volley of packets and the start of the next.

Noted. We will add more context on how delayed ACKs work and why a long delayed ACK time hurts performance. We are also planning to add some history about why the delayed ACK timeout was configured as a constant in the first place and why the current constant value was chosen.
 

The draft also needs to say what tradeoff is considered acceptable between a residual level of spurious retransmissions and lower timeout delay. Eliminating all spurious retransmissions is not the goal.

Noted.
 

The draft also needs to say that introducing a new TCP Option is itself a problem (on the public Internet), because of middleboxes particularly proxies. Therefore a solution that does not need a new TCP Option would be preferable....


There is already a section in the draft that addresses the middlebox issue:
        5. Middlebox Considerations 
Is that section a good enough explanation of this?
 
Perhaps the solution for communicating timestamp resolution in draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites draft-trammell-tcpm-timestamp-interval-01) could be modified to also communicate:
* TCP's clock granularity (closely related to TCP timestamp resolution),
*  and the fact that the host is calculating MAD as a function of RTT and granularity.
Then the existing timestamp option could be repurposed, which should drastically reduce deployment problems.

I am not sure if this is doable but will look into it.

 

3/ Only DC?

All the related work references are solely in the context of a DC. Pls include refs about this problem in a public Internet context. You will find there is a pretty good search engine at www.google.com.

The only non-DC ref I can find about minRTO is [Psaras07], which is mainly about a proposal to apply minRTO if the sender expects the next ACK to be delayed. Nonetheless, the simulation experiment in Section 5.1 provides good evidence for how RTO latency is dependent on uncertainty about the MAD that the other end is using.

[Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks, Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991 Springer-Verlag (2007)
https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited

Noted. Thanks a lot for the pointers. Will look into them and add to the draft.
 


4/ Status

Normally, I wouldn't want to hold up a draft that has been proven over years of practice, such as the technique in low-latency-opt, which has been proven in Google's DCs over the last few years. Whereas, my ideas are just that: ideas, not proven. However, the technique in low-latency-opt has only been proven in DC environments where the range of RTTs is limited. So, now that you are proposing to transplant it onto the public Internet, it also only has the status of an unproven idea.

To be clear, as it stands, I do not think low-latency-opt is applicable to the public Internet.


Hmm... I think overall, this approach should not do any harm to the network. It provides an additional feature to let the user configure the MAD if the user cares about it. If not, they can leave the default behavior as it is right now.
To your concerns about RTT variation in the Internet: first, as I explained, this MAD value will be set per connection or per route. Secondly, I think it is doable to do some bounds checking or error correction on the MAD value set by the user if we find that it is far below the RTT and does not make sense. But again, we don't think the MAD value is only a function of RTT. The user should be able to configure it to a value suitable for his/her needs.
We want to make this a standard so that all operating systems implement it in the same way and can understand each other. One use case is a cloud environment where different operating systems run in the same DC; they should all be able to interpret this option with no issue.

 

5/ Nits
These nits depart from my promise not to comment on details that could become irrelevant if you agree with my idea. Hey, whatever,...

S.3.5:
	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
My immediate reaction to this was that G should not appear twice. However, perhaps you meant them to be G_s and G_r (sender and receiver) respectively. {Note 2}


As explained earlier, the receiver's clock granularity is already accounted for in the MAD value itself. In the above formula, both G terms are the clock granularity on the sender side.

 
S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than the default (given some companies are already investing in commercial public space flight programmes, so TCP could need to routinely support RTTs that are longer than typical not just shorter).
   

Noted. Will take this into consideration.
 
 
Cheers



Bob


{Note 1}: On average, if not app-limited, the time between ACKs will be d_r*R_r/W_s where:
   R is SRTT
   d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
   W is the window in units of segments
   subscripts X_r or X_s denote receiver or sender for the half-connection.

So as long as the receiver can estimate the varying value of W at the sender, the receiver's MAD could be
    MAD_r = max(k*d_r*R_r / W_s, G_r),
The factor k (lower case) allows for some bunching of packets e.g. due to link layer aggregation or the residual effects of slow-start, which leaves some bunching even if SS uses pacing. Let's say k=2, but it would need to be checked empirically.

For example, take R=100us, d=2, W=8 and G = 1us.
Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to be greater, but there would certainly be no need for MAD to be 5ms, which is perhaps 100 times greater than necessary.

{Note 2}: Why is there no field in the Low Latency option to communicate receiver clock granularity to the sender?


Bob

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/


_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Re: Review of draft-wang-tcpm-low-latency-opt-00

Neal Cardwell
In reply to this post by Bob Briscoe-4
Thanks, Bob, for your detailed and thoughtful review! This is very insightful and useful.

Sorry I'm coming to this discussion a little late. I wanted to add a few points, beyond what Wei has already noted.

On Wed, Aug 2, 2017 at 11:54 AM, Bob Briscoe <[hidden email]> wrote:
Wei, Yuchung, Neal and Eric, as authors of draft-wang-tcpm-low-latency-opt-00,

I promised a review. It questions the technical logic behind the draft, so I haven't bothered to give a detailed review of the wording of the draft, because that might be irrelevant if you agree with my arguments.

1/ MAD by configuration?
   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.
That sentence triggered my "anti-human-intervention" reflex. My train of thought went as follows:

Bob's remark about his "anti-human-intervention" reflex being
triggered got me thinking.

I, too, would like to minimize the amount of human (application)
intervention this proposal involves (to avoid errors, maintenance,
etc).

It occurs to me that at Google our experience has shown that apps have
repeatedly made mistakes with this value, and we have found it
convenient to progressively narrow their freedom in tuning this knob,
to the point where in our deployment there is very little freedom
left. In reality the OS and TCP stack developers know the timer
granularity considerations, and the apps don't (and tend to use values
5 years out of date). So we've found it useful to have the OS tightly
clamp the app's request for a MAD value.

So in the interests of simplicity and avoiding human intervention,
what if we do not have the MAD value as part of the API, but rather
just allow the API to express a single "please use MAD" bit? And then
the transport implementation uses the smallest value that it can
support on this end host.

Can we go further, and make MAD an automatic feature of the TCP
implementation (so the transport implementation hard-wires MAD to "on"
or "off")? My sense is that we don't want to go that far, and that
instead we want to still allow apps to decide whether to use the
"please use MAD" bit. Why? There may be middlebox or remote host
compatibility issues with MAD. So we want apps (like browsers) to be
able to do A/B experiments to validate that sending the MAD option on
SYNs does not cause problems. We don't want to turn on MAD in Linux
and then find compatibility issues, and have to wait for a client OS
upgrade to everyone's cell phone to turn off MAD; instead we want to
only have to wait for an app update.

So... suppose an app decides it is latency-sensitive and wants to
reduce ACK delays and negotiate a MAD value. And furthermore, the app
is either (a) doing A/B experiments, or (b) has already convinced
itself that MAD will work on this path.

Then the app could enable MAD with a simple API like:
   int mad = 1; // enable
   err = setsockopt(fd, SOL_TCP, TCP_MAD, &mad, sizeof(mad));

For better or for worse, that makes the TCP_MAD option much like the
TCP_NODELAY option: both in the sense that latency-sensitive apps
should remember to set this bit if they want low-latency behavior, and
in the sense that the APIs would look very similar. TCP_NODELAY and
TCP_MAD would be complementary: TCP_NODELAY is the app saying "I want
low latency for my sends" and TCP_MAD is the app saying "I want low
latency for my ACKs". My guess is that most low-latency apps will want
both.
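So a low-latency app might end up with something like this (TCP_MAD is hypothetical, so the #define here is just a placeholder for illustration):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_MAD
    #define TCP_MAD 99   /* placeholder; no such socket option exists today */
    #endif

    static int enable_low_latency(int fd)
    {
        int one = 1;
        /* "I want low latency for my sends": disable Nagle */
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            return -1;
        /* "I want low latency for my ACKs": ask the stack to negotiate MAD */
        if (setsockopt(fd, IPPROTO_TCP, TCP_MAD, &one, sizeof(one)) < 0)
            return -1;
        return 0;
    }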

For the MAD API, I think this might be the "as simple as possible, but
no simpler" point.

That said, that's an API issue. And I think for TCPM we should focus
more on the wire protocol issues.
 
* Let's consider what advice we would give on what MAD value ought to be configured.

I would suggest that the advice be that when an app requests TCP_MAD,
then transport implementors would have the transport implementation
use the lowest feasible value based on the end host hardware/OS/app
capabilities and workloads. Our sense from our deployment at Google
is that for many current technologies and workloads this is probably
currently in the range of 5ms - 10ms.

But I don't think we should get bogged down in a discussion of what this
configured value ought to be. I think we should focus on the simplest
protocol mechanism that can convey to the remote host the minimum
info needed for the remote transport endpoint to achieve excellent
performance.

Here I think of the MSS option as a good analogy (and that's why we
suggested the name "MAD").

For MSS, the point is not to spend time discussing what MSS should be
used, or to come up with complicated formulas to derive MSS. The point
is to have a simple but general mechanism so that, no matter what the
MSS value is (or the underlying hardware constraints are), there is a
simple option that can convey a hint to the remote host. Then the
remote host can use that hint to tune its sending behavior to achieve
good performance.

Now substitute "MAD" in the place of "MSS" in the preceding paragraph. :-)
 
* You say that MAD can be smaller in DCs. So I assume your advice would be that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.

Personally I do not think that MAD should depend on RTT. And I don't think the draft says that it should (though let me know if there is some spot I didn't notice).

I'd vote for keeping MAD as simple as possible, which means keeping RTT out of it. :-)

* So why configure one value of MAD for all RTTs? That only makes sense in DC environments where the range of RTTs is small.

I'd recommend one value of MAD for all RTTs for the sake of simplicity. If we keep MAD as simple as possible, then it stays just about the practical delay limitations of the end host (OS timers, CPU power, CPU load, app behavior, end host queuing delays, etc). That is what we have found makes sense in our deployment. And note that our deployment of a MAD-like option covers RTTs that span quite a range, from <1 ms up to hundreds of ms.

Most OSes I know already have a constant that defines the maximum interval over which they can delay their ACKs. We are basically just suggesting a simple wire format for transport endpoints to advertise this existing value as a hint.
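Purely as an illustration of how little is involved (the option kind, length, and units below are placeholders, not the encoding the draft defines), the advertisement could be built like this:

    #include <stddef.h>
    #include <stdint.h>

    #define OPT_KIND_LOW_LATENCY  0xFD   /* placeholder experimental option kind */
    #define STACK_MAX_ACK_DELAY_MS  25   /* the OS's existing delayed-ACK maximum */

    static size_t build_low_latency_option(uint8_t *buf)
    {
        uint16_t mad_ms = STACK_MAX_ACK_DELAY_MS;
        buf[0] = OPT_KIND_LOW_LATENCY;
        buf[1] = 4;                        /* kind + length + 2-byte MAD value */
        buf[2] = (uint8_t)(mad_ms >> 8);   /* network byte order */
        buf[3] = (uint8_t)(mad_ms & 0xff);
        return 4;
    }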
 
* However, for the range of RTTs on the public Internet, why not calculate MAD from RTT and granularity, then standardize the calculation so that both ends arrive at the same result when starting from the same RTT and granularity parameters? (The sender and receiver might measure different smoothed (SRTT) values, but they will converge as the flow progresses.)

Then the receiver only needs to communicate its clock granularity to the sender, and the fact that it is driving MAD off its SRTT. Then the sender can use a formula for RTO derived from the value of MAD that it calculates the receiver will be using. Then its RTO will be completely tailored to the RTT of the flow.

A couple questions here:

- Why should we add the complexity of making MAD dependent on RTT? I'm not clear on what the argument would be for the benefit of introducing this complexity.

- Even if the receiver only communicates its clock granularity to the sender, and the fact that it is driving MAD off its SRTT, then there's the question of *how* it is deriving MAD. Presumably this could change, as we come up with better ideas. So then we would want a version number field to indicate which calculation is being used. It seems much simpler to me to allow the endpoint to just communicate a numerical delay value, rather than negotiate a version number of a formula that can take a clock granularity and RTT as input and produce a delay as output.

- Introducing RTT as a dependence also introduces the question of what to do when there is no RTT estimate (because all packets so far have been retransmitted, with no timestamps). And as we discussed in Prague and you mention here, the two sides often have slightly different RTT estimates. There are probably other wrinkles as well.
 

Note: There are two different uses for the min RTO that need to be separated:
    a) Before an initial RTT value has been measured, to determine the RTO during the 3WHS.
    b) Once either end has measured the RTT for a connection.
(a) needs to cope with the whole range of possible RTTs, whereas (b) is the subject of this email, because it can be tailored for the measured RTT.

2/ The problem, and its prevalence

With gradual removal of bufferbloat and more prevalent usage of CDNs, typical base RTTs on the public Internet now make the value of minRTO and of MAD look silly.

As can be seen above, the problem is indeed that each end only has partial knowledge of the config of the other end.
However, the problem is not just that MAD needs to be communicated to the other end so it can be hard-coded to a lower value.
The problem is that MAD is hard-coded in the first place.

The draft needs to say how prevalent the problem is (on the public Internet) where the sender has to wait for the receiver's delayed ACK timer at the end of a flow or between the end of a volley of packets and the start of the next.

The draft also needs to say what tradeoff is considered acceptable between a residual level of spurious retransmissions and lower timeout delay. Eliminating all spurious retransmissions is not the goal.

The draft also needs to say that introducing a new TCP Option is itself a problem (on the public Internet), because of middleboxes particularly proxies. Therefore a solution that does not need a new TCP Option would be preferable....

Perhaps the solution for communicating timestamp resolution in draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites draft-trammell-tcpm-timestamp-interval-01) could be modified to also communicate:
* TCP's clock granularity (closely related to TCP timestamp resolution),
*  and the fact that the host is calculating MAD as a function of RTT and granularity.
Then the existing timestamp option could be repurposed, which should drastically reduce deployment problems.

3/ Only DC?

All the related work references are solely in the context of a DC. Pls include refs about this problem in a public Internet context. You will find there is a pretty good search engine at www.google.com.

The only non-DC ref I can find about minRTO is [Psaras07], which is mainly about a proposal to apply minRTO if the sender expects the next ACK to be delayed. Nonetheless, the simulation experiment in Section 5.1 provides good evidence for how RTO latency is dependent on uncertainty about the MAD that the other end is using.

[Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks, Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991 Springer-Verlag (2007)
https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited

All great points. Thanks!
 

4/ Status

Normally, I wouldn't want to hold up a draft that has been proven over years of practice, such as the technique in low-latency-opt, which has been proven in Google's DCs over the last few years. Whereas, my ideas are just that: ideas, not proven. However, the technique in low-latency-opt has only been proven in DC environments where the range of RTTs is limited. So, now that you are proposing to transplant it onto the public Internet, it also only has the status of an unproven idea.

To be clear, as it stands, I do not think low-latency-opt is applicable to the public Internet.

Can you please elaborate on this? Is this because you think there ought to be a dependence on RTT?
 


5/ Nits
These nits depart from my promise not to comment on details that could become irrelevant if you agree with my idea. Hey, whatever,...

S.3.5:
	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
My immediate reaction to this was that G should not appear twice. However, perhaps you meant them to be G_s and G_r (sender and receiver) respectively. {Note 2}

S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than the default (given some companies are already investing in commercial public space flight programmes, so TCP could need to routinely support RTTs that are longer than typical not just shorter).


Cheers



Bob


{Note 1}: On average, if not app-limited, the time between ACKs will be d_r*R_r/W_s where:
   R is SRTT
   d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
   W is the window in units of segments
   subscripts X_r or X_s denote receiver or sender for the half-connection.

So as long as the receiver can estimate the varying value of W at the sender, the receiver's MAD could be
    MAD_r = max(k*d_r*R_r / W_s, G_r),
The factor k (lower case) allows for some bunching of packets e.g. due to link layer aggregation or the residual effects of slow-start, which leaves some bunching even if SS uses pacing. Let's say k=2, but it would need to be checked empirically.

For example, take R=100us, d=2, W=8 and G = 1us.
Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to be greater, but there would certainly be no need for MAD to be 5ms, which is perhaps 100 times greater than necessary.

With currently popular OS implementations I'm aware of, 50us for a delayed ACK timer is infeasible. Most have a minimum granularity of 1ms, or 10ms, or even larger, for delayed ACKs. And part of the point of delayed ACKs is to wait for applications to respond, so that data can be combined with the ACK. And 50us does not give the app much time to respond.

Again, IMHO the MAD needs to incorporate hardware, software, and workload constraints on the receiving end host. 
 

{Note 2}: Why is there no field in the Low Latency option to communicate receiver clock granularity to the sender?


The idea is that the MAD value is a function of many parameters on the end host. The clock granularity is only one of them. The simplest way to convey on the wire a MAD parameter that is a function of many other parameters is just to convey the MAD value itself.

Bob, thanks again for your detailed and insightful feedback!

neal



_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Re: Review of draft-wang-tcpm-low-latency-opt-00

Bob Briscoe-4
In reply to this post by Wei Wang
Wei,

On 04/08/17 17:55, Wei Wang wrote:
Hi Bob,

Thanks a lot for your review and detailed feedback on the draft.
Please see my comments inline below:

On Wed, Aug 2, 2017 at 8:54 AM, Bob Briscoe <[hidden email]> wrote:
Wei, Yuchung, Neal and Eric, as authors of draft-wang-tcpm-low-latency-opt-00,

I promised a review. It questions the technical logic behind the draft, so I haven't bothered to give a detailed review of the wording of the draft, because that might be irrelevant if you agree with my arguments.

1/ MAD by configuration?
   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.
That sentence triggered my "anti-human-intervention" reflex. My train of thought went as follows:

* Let's consider what advice we would give on what MAD value ought to be configured.
* You say that MAD can be smaller in DCs. So I assume your advice would be that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.
* So why configure one value of MAD for all RTTs? That only makes sense in DC environments where the range of RTTs is small.
* However, for the range of RTTs on the public Internet, why not calculate MAD from RTT and granularity, then standardize the calculation so that both ends arrive at the same result when starting from the same RTT and granularity parameters? (The sender and receiver might measure different smoothed (SRTT) values, but they will converge as the flow progresses.)

Then the receiver only needs to communicate its clock granularity to the sender, and the fact that it is driving MAD off its SRTT. Then the sender can use a formula for RTO derived from the value of MAD that it calculates the receiver will be using. Then its RTO will be completely tailored to the RTT of the flow.

First of all, we recommend that operating system should have a per-route MAD configuration API and a per-connection MAD configuration API. So different connections could have different MAD values configured. It is not one value for all.
[BB]: I prefer Neal's subsequent response agreeing that a MAD API is fraught with human-intervention problems, and preferring a binary API (use MAD or not).

I saw that per-route config is already possible when I checked the Linux code. However, per-route config just makes the likelihood of errors greater, particularly because IETF standardization is primarily for the Internet, not just DCs. And on the Internet, a large proportion of clients are not controlled by a management system.


And in my opinion, what MAD value should be set to is not only depending on RTT and clock granularity. It also depends on how the application wants the delayed ack behavior to be. Some application might only send data say every 1ms, so it will delay its ack up to 2ms so that it can always piggy back the ack to the data.
That is why a per-connection MAD configuration makes sense for the application to fine tune MAD according to its own demand.
[BB]: This has to be automated for the Internet.

An app only cares if MAD is too long. An app doesn't care if the ACK delay is too short. But 'the network' cares if there are too many unnecessary ACKs (and this knocks on to every other app, including the original app). So on the public Internet, the stack, not the app, is the appropriate place to determine MAD. The app can only be trusted to do this in a managed environment.

See response to Neal for further thoughts.



And when user tries to set a new MAD value, we do boundary check to make sure it is less than the current default MAD value. This is a safety check to make sure user does not configure something that is worse than current default value.
[BB]: That warrants a warning on the UI, not prohibition, and certainly not silently ignoring the input (see the point I already made below about large-RTT environments).


About your question in {Note 2} that why receiver does not communicate its clock granularity to the sender, I don't really see a reason why receiver side clock granularity is needed. Because the MAD value sent by receiver is already a value that is rounded to the clock granularity. Say if a user wants to set MAD to 1ms, and the clock granularity is 10ms, receiver will send MAD value as 10ms. In the draft, we specify that:

      If specified, then the MAD value in the Low Latency option MUST be
      set, as close as possible, to the implementation's actual delayed
      ACK timeout for the connection.  Note that the actual maximum
      delayed ACK timeout of the connection may be larger than the
      actual user specified value because of implementation constraints 
             (e.g. timer granularity limitations). 
[BB]: Understood. I should have made clear that my question was only relevant if you accepted my argument that the sender would calculate what the receiver would use for MAD (from RTT and granularity).

See my response to Neal for further thoughts.




Note: There are two different uses for the min RTO that need to be separated:
    a) Before an initial RTT value has been measured, to determine the RTO during the 3WHS.
    b) Once either end has measured the RTT for a connection.
(a) needs to cope with the whole range of possible RTTs, whereas (b) is the subject of this email, because it can be tailored for the measured RTT.

Again, we don't think MAD value is only a function of RTT and clock granularity.

 

2/ The problem, and its prevalence

With gradual removal of bufferbloat and more prevalent usage of CDNs, typical base RTTs on the public Internet now make the value of minRTO and of MAD look silly.

As can be seen above, the problem is indeed that each end only has partial knowledge of the config of the other end.
However, the problem is not just that MAD needs to be communicated to the other end so it can be hard-coded to a lower value.
The problem is that MAD is hard-coded in the first place.

The draft needs to say how prevalent the problem is (on the public Internet) where the sender has to wait for the receiver's delayed ACK timer at the end of a flow or between the end of a volley of packets and the start of the next.

Noted. We will add more contexts on how delayed ack works and why long delayed ack time is hurting performance. We are also planning on adding some history about why delayed ack was configured as a constant in the first place and why the current constant value was chosen.
 

The draft also needs to say what tradeoff is considered acceptable between a residual level of spurious retransmissions and lower timeout delay. Eliminating all spurious retransmissions is not the goal.

Noted.
 

The draft also needs to say that introducing a new TCP Option is itself a problem (on the public Internet), because of middleboxes particularly proxies. Therefore a solution that does not need a new TCP Option would be preferable....


There is already a section in the draft that states the middle box issue:
        5. Middlebox Considerations 
Is that portion a good enough explanation on this?
[BB]: I'm afraid not.

1/ The likelihood that the option is stripped (e.g. by proxies) is not mentioned; the section only mentions the likelihood that the whole SYN is discarded because of the option. That was why I pointed out that it may be possible to redesign this without a new TCP option, by repurposing the timestamp option in a similar way to tcpm-timestamp-negotiation (note: 'similar' means using the ideas, not necessarily the exact same scheme).

2/ The first bullet relies on data about middleboxes that Michio gathered 6 years ago. Google has the ability to verify the current position.

3/ The second bullet would be irrelevant if you accept my point that the option needs to support larger RTTs, not just smaller ones. Nonetheless, there is little evidence that middleboxes alter the fields in unknown options.

 
Perhaps the solution for communicating timestamp resolution in draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites draft-trammell-tcpm-timestamp-interval-01) could be modified to also communicate:
* TCP's clock granularity (closely related to TCP timestamp resolution),
*  and the fact that the host is calculating MAD as a function of RTT and granularity.
Then the existing timestamp option could be repurposed, which should drastically reduce deployment problems.

I am not sure if this is doable but will look into it.

 

3/ Only DC?

All the related work references are solely in the context of a DC. Pls include refs about this problem in a public Internet context. You will find there is a pretty good search engine at www.google.com.

The only non-DC ref I can find about minRTO is [Psaras07], which is mainly about a proposal to apply minRTO if the sender expects the next ACK to be delayed. Nonetheless, the simulation experiment in Section 5.1 provides good evidence for how RTO latency is dependent on uncertainty about the MAD that the other end is using.

[Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks, Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991 Springer-Verlag (2007)
https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited

Noted. Thanks a lot for the pointers. Will look into them and add to the draft.
 


4/ Status

Normally, I wouldn't want to hold up a draft that has been proven over years of practice, such as the technique in low-latency-opt, which has been proven in Google's DCs over the last few years. Whereas, my ideas are just that: ideas, not proven. However, the technique in low-latency-opt has only been proven in DC environments where the range of RTTs is limited. So, now that you are proposing to transplant it onto the public Internet, it also only has the status of an unproven idea.

To be clear, as it stands, I do not think low-latency-opt is applicable to the public Internet.


Hmm... I think overall, this approach should not do any harm to the network. It provides an additional feature to let the user configure the MAD if the user cares about it. If not, they can leave it as the default behavior as it is right now.
To your concerns about the RTT variation in the internet, first, as I explained, this MAD value will be set per connection or per route. Secondly, I would think it is doable to do some bound check or error correction on the MAD value set by the user if we find that it is way below RTT and does not make sense. But again, we don't think MAD value is only a function of RTT. User should be able to configure it to a value suitable for his/her need.
We want to make it as a standard so that all operating systems could implement this in the same way so that they could understand each other. One use case is that in a cloud environment where different operating systems are running in the same DC, they should be able to interpret this option with no issue.
[BB]: Yes, I guessed that this was probably what Google really wants to standardize this for. With the current config-constraint text, it is limited to managed environments, which would make it uninteresting to many at the IETF.

Fortunately, I think the line of thinking between Neal & me is already widening applicability to unmanaged environments.


 

5/ Nits
These nits depart from my promise not to comment on details that could become irrelevant if you agree with my idea. Hey, whatever,...

S.3.5:
	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
My immediate reaction to this was that G should not appear twice. However, perhaps you meant them to be G_s and G_r (sender and receiver) respectively. {Note 2}


As explained earlier, the receiver's clock granularity is already accounted for in the MAD value itself. In the above formula, both G terms are the clock granularity on the sender side.
[BB]: Then it should not be necessary to round two terms up to the same granularity. Would it not be correct to use:
	RTO <- SRTT + max(G, K*RTTVAR + max_ACK_delay)
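A quick numeric illustration of the difference, with made-up values:

    #include <stdio.h>

    static double max2(double a, double b) { return a > b ? a : b; }

    int main(void)
    {
        /* illustrative values only, all in ms */
        double SRTT = 10, RTTVAR = 0.5, G = 10, MAD = 5, K = 4;

        /* draft's formula: each term rounded up to G separately */
        double rto_draft = SRTT + max2(G, K * RTTVAR) + max2(G, MAD);
        /* suggested alternative: round the combined term up to G once */
        double rto_alt   = SRTT + max2(G, K * RTTVAR + MAD);

        printf("draft: %g ms, alternative: %g ms\n", rto_draft, rto_alt);
        /* prints "draft: 30 ms, alternative: 20 ms" */
        return 0;
    }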


 
S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than the default (given some companies are already investing in commercial public space flight programmes, so TCP could need to routinely support RTTs that are longer than typical not just shorter).
   

Noted. Will take this into consideration.

Regards



Bob
 
 
Cheers



Bob


{Note 1}: On average, if not app-limited, the time between ACKs will be d_r*R_r/W_s where:
   R is SRTT
   d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
   W is the window in units of segments
   subscripts X_r or X_s denote receiver or sender for the half-connection.

So as long as the receiver can estimate the varying value of W at the sender, the receiver's MAD could be
    MAD_r = max(k*d_r*R_r / W_s, G_r),
The factor k (lower case) allows for some bunching of packets e.g. due to link layer aggregation or the residual effects of slow-start, which leaves some bunching even if SS uses pacing. Let's say k=2, but it would need to be checked empirically.

For example, take R=100us, d=2, W=8 and G = 1us.
Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to be greater, but there would certainly be no need for MAD to be 5ms, which is perhaps 100 times greater than necessary.

{Note 2}: Why is there no field in the Low Latency option to communicate receiver clock granularity to the sender?


Bob

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/


-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/

_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Re: Review of draft-wang-tcpm-low-latency-opt-00

Bob Briscoe-4
In reply to this post by Neal Cardwell
Neal,

On 04/08/17 23:20, Neal Cardwell wrote:
Thanks, Bob, for your detailed and thoughtful review! This is very insightful and useful.

Sorry I'm coming to this discussion a little late. I wanted to add a few points, beyond what Wei has already noted.

On Wed, Aug 2, 2017 at 11:54 AM, Bob Briscoe <[hidden email]> wrote:
Wei, Yuchung, Neal and Eric, as authors of draft-wang-tcpm-low-latency-opt-00,

I promised a review. It questions the technical logic behind the draft, so I haven't bothered to give a detailed review of the wording of the draft, because that might be irrelevant if you agree with my arguments.

1/ MAD by configuration?
   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.
That sentence triggered my "anti-human-intervention" reflex. My train of thought went as follows:

Bob's remark about his "anti-human-intervention" reflex being
triggered got me thinking.

I, too, would like to minimize the amount of human (application)
intervention this proposal involves (to avoid errors, maintenance,
etc).

It occurs to me that actually at Google our experience has shown that
indeed apps have repeatedly made mistakes with this value, and we have
found it convenient to progressively narrow their freedom in tuning
this knob. To the point where actually in our deployment there is very
little freedom left. Because in reality the OS and TCP stack
developers know the timer granularity considerations, and the apps
don't (and tend to use values 5 years out of date). So we've found it
useful to have the OS tightly clamp the app's request for a MAD value.

So in the interests of simplicity and avoiding human intervention,
what if we do not have the MAD value as part of the API, but rather
just allow the API to express a single "please use MAD" bit? And then
the transport implementation uses the smallest value that it can
support on this end host.

Can we go further, and make MAD an automatic feature of the TCP
implementation (so the transport implementation hard-wires MAD to "on"
or "off")? My sense is that we don't want to go that far, and that
instead we want to still allow apps to decide whether to use the
"please use MAD" bit. Why? There may be middlebox or remote host
compatibility issues with MAD. So we want apps (like browsers) to be
able to do A/B experiments to validate that sending the MAD option on
SYNs does not cause problems. We don't want to turn on MAD in Linux
and then find compatibility issues, and have to wait for a client OS
upgrade to everyone's cell phone to turn off MAD; instead we want to
only have to wait for an app update.
[BB]: If there are problems, they will be per path, not per app. So there could be a cache to record per-path black-holing of packets carrying the option (no need to record stripping the option, which would be benign). Then no API at all would be needed.
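Roughly the sort of thing I mean (all names and the retry timeout are made up):

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define CACHE_SIZE      256
    #define BLACKHOLE_TTL   (24 * 60 * 60)   /* retry the option after a day */

    struct blackhole_entry {
        uint32_t daddr;      /* destination IPv4 address; 0 = empty slot */
        time_t   expires;
    };

    static struct blackhole_entry cache[CACHE_SIZE];

    /* Consulted before sending a SYN: skip the option for destinations where
     * SYNs carrying it have recently appeared to be black-holed. */
    static bool option_blackholed(uint32_t daddr)
    {
        struct blackhole_entry *e = &cache[daddr % CACHE_SIZE];
        return e->daddr == daddr && time(NULL) < e->expires;
    }

    /* Called when a SYN carrying the option times out but a retried SYN
     * without it succeeds. */
    static void record_blackhole(uint32_t daddr)
    {
        struct blackhole_entry *e = &cache[daddr % CACHE_SIZE];
        e->daddr = daddr;
        e->expires = time(NULL) + BLACKHOLE_TTL;
    }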

As a fail-safe, you would want a system-wide sysctl to turn on MAD. I guess switching that would require an OS upgrade.

Whatever, as you say below, these are not really interop standardization issues (but it's still worth airing the possibilities).


So... suppose an app decides it is latency-sensitive and wants to
reduce ACK delays and negotiate a MAD value. And furthermore, the app
is either (a) doing A/B experiments, or (b) has already convinced
itself that MAD will work on this path.

Then the app could enable MAD with a simple API like:
   int mad = 1; // enable
   err = setsockopt(fd, SOL_TCP, TCP_MAD, &mad, sizeof(mad));

For better or for worse, that makes the TCP_MAD option much like the
TCP_NODELAY option: both in the sense that latency-sensitive apps
should remember to set this bit if they want low-latency behavior, and
in the sense that the APIs would look very similar. TCP_NODELAY and
TCP_MAD would be complementary: TCP_NODELAY is the app
saying "I want low latency for my sends" and TCP_MAD is the app
saying "I want low latency for my ACKs". My guess is that most
low-latency apps will want both.

For the MAD API, I think this might be the "as simple as possible, but
no simpler" point.
[BB]: Is there an app that wants high-delay loss recovery?
There is no tradeoff here, so pls keep it simple and just enable low latency for all connections.


That said, that's an API issue. And I think for TCPM we should focus
more on the wire protocol issues.
 
* Let's consider what advice we would give on what MAD value ought to be configured.

I would suggest that the advice be that when an app requests TCP_MAD,
then transport implementors would have the transport implementation
use the lowest feasible value based on the end host hardware/OS/app
capabilities and workloads.
Our sense from our deployment at Google
is that for many current technologies and workloads this is probably
currently in the range of 5ms - 10ms.

But I don't think we should get bogged down in a discussion of what this
configured value ought to be.
[BB]: Sry, perhaps I wasn't clear. I wrote that sentence to ask:
* not "what specific MAD value ought to be configured"
* but rather "what a good MAD value ought to depend on". I pick up on this question later...

I think we should focus on the simplest
protocol mechanism that can convey to the remote host the minimum
info needed for the remote transport endpoint to achieve excellent
performance.

Here I think of the MSS option as a good analogy (and that's why we
suggested the name "MAD").

For MSS, the point is not to spend time discussing what MSS should be
used, or to come up with complicated formulas to derive MSS. The point
is to have a simple but general mechanism so that, no matter what the
MSS value is (or the underlying hardware constraints are), there is a
simple option that can convey a hint to the remote host. Then the
remote host can use that hint to tune its sending behavior to achieve
good performance.

Now substitute "MAD" in the place of "MSS" in the preceding paragraph. :-)
 
* You say that MAD can be smaller in DCs. So I assume your advice would be that MAD should depend on RTT {Note 1} and clock granularity {Note 2}.

Personally I do not think that MAD should depend on RTT. And I don't think the draft says that it should (though let me know if there is some spot I didn't notice).

I'd vote for keeping MAD as simple as possible, which means keeping RTT out of it. :-)

* So why configure one value of MAD for all RTTs? That only makes sense in DC environments where the range of RTTs is small.

I'd recommend one value of MAD for all RTTs for the sake of simplicity. If we keep MAD as simple as possible, then it stays just about the practical delay limitations of the end host (OS timers, CPU power, CPU load, app behavior, end host queuing delays, etc). That is what we have found makes sense in our deployment. And note that our deployment of a MAD-like option covers RTTs that span quite a range, from <1 ms up to hundreds of ms.

Most OSes I know already have a constant that defines the maximum interval over which they can delay their ACKs. We are basically just suggesting a simple wire format for transport endpoints to advertise this existing value as a hint.
 
* However, for the range of RTTs on the public Internet, why not calculate MAD from RTT and granularity, then standardize the calculation so that both ends arrive at the same result when starting from the same RTT and granularity parameters? (The sender and receiver might measure different smoothed (SRTT) values, but they will converge as the flow progresses.)

Then the receiver only needs to communicate its clock granularity to the sender, and the fact that it is driving MAD off its SRTT. Then the sender can use a formula for RTO derived from the value of MAD that it calculates the receiver will be using. Then its RTO will be completely tailored to the RTT of the flow.

A couple questions here:

- Why should we add the complexity of making MAD dependent on RTT? I'm not clear on what the argument would be for the benefit of introducing this complexity.

- Even if the receiver only communicates its clock granularity to the sender, and the fact that it is driving MAD off its SRTT, then there's the question of *how* it is deriving MAD. Presumably this could change, as we come up with better ideas. So then we would want a version number field to indicate which calculation is being used. It seems much simpler to me to allow the endpoint to just communicate a numerical delay value, rather than negotiate a version number of a formula that can take a clock granularity and RTT as input and produce a delay as output.
[BB]: Good point.

- Introducing RTT as a dependence also introduces the question of what to do when there is no RTT estimate (because all packets so far have been retransmitted, with no timestamps). And as we discussed in Prague and you mention here, the two sides often have slightly different RTT estimates. There are probably other wrinkles as well.
[BB]: OK. I understood that the pretext for this draft was that the max ACK delay is too long for the low RTTs that are often in use these days. So I hadn't appreciated that you would advise that MAD should not depend on RTT.

Fair enough. I'll go along with this advice for now (but see later). However, let's just check that your proposal makes sense in other respects.

Q1. Is there not a risk that a value of MAD solely dependent on the receiver's OS parameters will be lower than the typical inter-packet arrival time for some flows? E.g.
    If data packets arrive every 7 ms {Note 3} then, even with a del_ack factor of 2, a receiver with MAD = 5 ms will ACK every packet. In fact, I think it will immediately ACK the first packet, then delay the ACK of every subsequent packet by 5ms. {Note 4}
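
    As a minimal sketch (mine, not from the draft, and ignoring the immediate ACK of the very first packet mentioned above), the timing works out as follows; the point is that when MAD is below the inter-arrival time, the delayed-ACK timer always fires before a second packet can arrive, so every packet gets its own ACK:

    # Assumed numbers from {Note 3}/{Note 4}: 7 ms inter-arrival, MAD = 5 ms.
    inter_arrival_ms = 7.0
    mad_ms = 5.0
    assert mad_ms < inter_arrival_ms   # the case in question

    for pkt in range(5):
        arrival_ms = pkt * inter_arrival_ms
        # a 2nd unacked packet can never arrive within MAD, so the delayed-ACK
        # timer fires for every packet: one ACK per packet, each delayed by MAD
        ack_ms = arrival_ms + mad_ms
        print(f"data at t={arrival_ms:4.1f} ms -> ACK at t={ack_ms:4.1f} ms")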

I guess you are saying that would be OK from the point of view of the receiver's workload (otherwise it would not have set MAD=5ms). However, delayed ACKs are also intended to reduce network workload. {Note 5}.

{Note 3}: With 1500B packets that implies 1.7Mb/s, which is more than 3x my own ADSL uplink (I live in the developed world, but in a rural part of it, where such rates are common and the only alternative is 3G, which offers an even slower uplink) :(

{Note 4}: I don't know what implementations do, but RFC5681 implies that a receiver delays the next ACK whenever it sent the previous ACK, even if it delayed the previous one. The words are: "MUST be generated within <MAD> of the arrival of the first unacknowledged packet,"

{Note 5}: Not to mention that delaying every ACK makes it hard for the sender to use the ACKs to monitor queuing delay. However, this might be fixed by separate introduction of a way to measure one-way delay using timestamps.

 

Note: There are two different uses for the min RTO that need to be separated:
    a) Before an initial RTT value has been measured, to determine the RTO during the 3WHS.
    b) Once either end has measured the RTT for a connection.
(a) needs to cope with the whole range of possible RTTs, whereas (b) is the subject of this email, because it can be tailored for the measured RTT.

2/ The problem, and its prevalence

With gradual removal of bufferbloat and more prevalent usage of CDNs, typical base RTTs on the public Internet now make the value of minRTO and of MAD look silly.

As can be seen above, the problem is indeed that each end only has partial knowledge of the config of the other end.
However, the problem is not just that MAD needs to be communicated to the other end so it can be hard-coded to a lower value.
The problem is that MAD is hard-coded in the first place.

The draft needs to say how prevalent the problem is (on the public Internet) where the sender has to wait for the receiver's delayed ACK timer at the end of a flow or between the end of a volley of packets and the start of the next.

The draft also needs to say what tradeoff is considered acceptable between a residual level of spurious retransmissions and lower timeout delay. Eliminating all spurious retransmissions is not the goal.

The draft also needs to say that introducing a new TCP Option is itself a problem (on the public Internet), because of middleboxes, particularly proxies. Therefore a solution that does not need a new TCP Option would be preferable...

Perhaps the solution for communicating timestamp resolution in draft-scheffenegger-tcpm-timestamp-negotiation-05 (which cites draft-trammell-tcpm-timestamp-interval-01) could be modified to also communicate:
* TCP's clock granularity (closely related to TCP timestamp resolution),
*  and the fact that the host is calculating MAD as a function of RTT and granularity.
Then the existing timestamp option could be repurposed, which should drastically reduce deployment problems.

3/ Only DC?

All the related work references are solely in the context of a DC. Pls include refs about this problem in a public Internet context. You will find there is a pretty good search engine at www.google.com.

The only non-DC ref I can find about minRTO is [Psaras07], which is mainly about a proposal to apply minRTO if the sender expects the next ACK to be delayed. Nonetheless, the simulation experiment in Section 5.1 provides good evidence for how RTO latency is dependent on uncertainty about the MAD that the other end is using.

[Psaras07] Psaras, I. & Tsaoussidis, V., "The TCP Minimum RTO Revisited," In: Proc. 6th Int'l IFIP-TC6 Conference on Ad Hoc and Sensor Networks, Wireless Networks, Next Generation Internet NETWORKING'07 pp.981-991 Springer-Verlag (2007)
https://www.researchgate.net/publication/225442912_The_TCP_Minimum_RTO_Revisited

All great points. Thanks!
 

4/ Status

Normally, I wouldn't want to hold up a draft describing a technique that has been proven over years of practice, as the one in low-latency-opt has been in Google's DCs over the last few years. My ideas, in contrast, are just that: ideas, not proven. However, the technique in low-latency-opt has only been proven in DC environments where the range of RTTs is limited. So, now that you are proposing to transplant it onto the public Internet, it too only has the status of an unproven idea.

To be clear, as it stands, I do not think low-latency-opt is applicable to the public Internet.

Can you please elaborate on this? Is this because you think there ought to be a dependence on RTT?
[BB]: I was trying to judge whether this is a straightforward standardization of tried and tested technology, or experimental.

The opinion about inapplicability to the Internet was based on the way the config requirements were written, which limited the draft to environments covered by a configuration management system, which is not typical for the public Internet.

I'm happier now that the focus is moving towards auto-tuning. However, this makes my first point about unproven territory even more applicable: Google's previous experience becomes less relevant, and the proposal becomes more experimental/researchy. For instance, the case I pointed out above for my own uplink would double the ACK rate, which might lead to knock-on problems - perhaps an increase in server processing load, or even processor overload on intermediate network equipment. We are also likely to discover interactions with ACK-thinning middleboxes.

more...
 


5/ Nits
These nits depart from my promise not to comment on details that could become irrelevant if you agree with my idea. Hey, whatever,...

S.3.5:
	RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)
My immediate reaction to this was that G should not appear twice. However, perhaps you meant them to be G_s and G_r (sender and receiver) respectively. {Note 2}
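
For concreteness, here is a sketch of how I am reading that formula with the two granularities split out (the G_s/G_r split is my suggestion, not what S.3.5 says; K = 4 as in RFC 6298):

    # Sketch of the S.3.5 RTO formula with sender/receiver granularities split
    # out (my reading, not the draft's text). Units: seconds. K = 4 per RFC 6298.
    K = 4

    def rto(srtt, rttvar, g_s, g_r, max_ack_delay):
        # g_s bounds the sender's variance term; g_r bounds how small the
        # receiver's advertised ACK delay can usefully be
        return srtt + max(g_s, K * rttvar) + max(g_r, max_ack_delay)

    # e.g. a short-RTT path: SRTT = 500 us, RTTVAR = 100 us, 1 ms clocks, MAD = 5 ms
    print(rto(500e-6, 100e-6, 1e-3, 1e-3, 5e-3))   # -> 0.0065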

S.3.5 & S.5. It seems unnecessary to prohibit values of MAD greater than the default (given that some companies are already investing in commercial public space flight programmes, TCP could need to routinely support RTTs that are longer than typical, not just shorter).


Cheers



Bob


{Note 1}: On average, if not app-limited, the time between ACKs will be d_r*R_r/W_s where:
   R is SRTT
   d is the delayed ACK factor, e.g. d=2 for ACKing every other packet
   W is the window in units of segments
   subscripts X_r or X_s denote receiver or sender for the half-connection.

So as long as the receiver can estimate the varying value of W at the sender, the receiver's MAD could be
    MAD_r = max(k*d_r*R_r / W_s, G_r),
The factor k (lower case) allows for some bunching of packets, e.g. due to link-layer aggregation or the residual effects of slow-start (which leaves some bunching even if SS uses pacing). Let's say k=2, but it would need to be checked empirically.

For example, take R=100us, d=2, W=8 and G = 1us.
Given d*R/W = 25us, MAD could be perhaps 50us (i.e. k=2). k might need to be greater, but there would certainly be no need for MAD to be 5ms, which is perhaps 100 times greater than necessary.
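
A tiny sketch of that calculation (my own formula from {Note 1}, not anything in the draft), using the example numbers:

    # MAD_r = max(k * d_r * R_r / W_s, G_r); all times in seconds.
    def mad_estimate(k, d_r, R_r, W_s, G_r):
        # k: bunching allowance; d_r: delayed-ACK factor; R_r: receiver SRTT;
        # W_s: sender window in segments; G_r: receiver clock granularity
        return max(k * d_r * R_r / W_s, G_r)

    # R = 100 us, d = 2, W = 8, G = 1 us, k = 2  ->  50 us
    print(mad_estimate(k=2, d_r=2, R_r=100e-6, W_s=8, G_r=1e-6))   # 5e-05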

With the currently popular OS implementations I'm aware of, 50us for a delayed ACK timer is infeasible. Most have a minimum granularity of 1ms, or 10ms, or even larger, for delayed ACKs. And part of the point of delayed ACKs is to wait for applications to respond, so that data can be combined with the ACK. And 50us does not give the app much time to respond.
[BB]: A modern processor can do as much in 50us as a processor from the 1990s could do in about 10 mins.

The min clock interrupt period has not changed much from the typical value of 10ms in 1990 [Dovrolis00]. This minimum is meant to maintain performance by keeping a healthy ratio between real work and context switching. However, the number of ops that can be processed in this duration has increased by about 10^7 over the same period.

I am (genuinely) interested to know: what is the underlying factor that limits ACK delay to no less than 1-10ms? Is it Wirth's Law of software bloat (that the same task takes just as long, because increases in processing speed are absorbed by increases in code complexity)?

Nonetheless, whatever the clock granularity on a particular OS/machine, a stack should still be able to calculate what MAD ought to be. Then the final step in the calculation would round MAD up to the interrupt clock granularity. At least then the code would automatically perform better on OSs that reduce their clock granularity.
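
As a sketch of that final rounding step (the function name and tick values are illustrative assumptions, not from any particular stack):

    import math

    def round_up_to_tick(mad_us, tick_us):
        # a delayed-ACK timer can only fire on a tick boundary, so round up
        return math.ceil(mad_us / tick_us) * tick_us

    print(round_up_to_tick(50, 10000))   # 10 ms tick -> 10000 us
    print(round_up_to_tick(50, 1000))    #  1 ms tick ->  1000 us
    print(round_up_to_tick(50, 10))      # finer-grained OS -> 50 us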


[Dovrolis00] Dovrolis, C. & Ramanathan, P., "Increasing the Clock Interrupt Frequency for Better Support of Real-Time Applications," Uni Wisconsin-Madison, Dept. Electrical & Computer Engineering http://www.cc.gatech.edu/~dovrolis/Papers/timers.ps (March 2000).


This returns us to the question:
What ought MAD depend on?

I prefer to start by analyzing what the best function and dependencies should be, then try to approximate for simplicity. I don't think we should start from the other direction ("let's think up a simple way to do it, and see if it works"). The former gives insight. The latter risks a random stumble across a territory of countless unforeseen problems.

The thought experiment I was conducting in my 'Note 1' above started from the idea that MAD ought to be somehow related to the average inter-arrival time in a flow. The (unstated) reasoning went like this:
* the retransmission delay after a pause/stop in the data stream ought to be similar to the retransmission delay without a pause/stop.
* so the ack delay after a pause/stop in the data stream ought to be similar to the ack delay without a pause/stop.

Put more succinctly:
    R + MAD ~= R + d * t_i            (1)
Then by definition:
    t_i = R/W
Which is how I got to
    MAD ~= d * R / W

where the notation is as earlier, plus
    t_i = avg inter-arrival time

Now, moving on to how to simplify this...

I accept your point that MAD has to be above "the practical delay limitations of the end host (OS timers, CPU power, CPU load, app behavior, end host queuing delays, etc)". Let's wrap that all into a variable we'll call g_r (which is itself lower bounded by clock granularity G_r).

Eqn (1) shows that the approximation for MAD only has to be within the order of the RTT.

Also, setting a lower bound for MAD of the order of an RTT would help to prevent the case I raised earlier where MAD is less than the average inter-arrival time (because it is abnormal to have <1 packet per RTT). Even for ultra-low RTTs, this also protects the network, because network processing should be sized to cope with ~1 ACK per RTT.

So, in summary, this is my current preferred approximation (but I'm open to others):

    MAD ~= max(c*R_r, g_r)

c is a constant factor determined empirically. I'm not sure whether it will be less or greater than 1, so let's assume nominally c=1.

To be clear, I'm accepting your argument that it is simplest for the receiver to communicate MAD in the TCP option. So (for now) I'm no longer proposing that the sender bases MAD on the RTT it measures itself. I.e. the receiver calculates MAD based on its initial estimate of the RTT of the connection and other local parameters, then communicates it to the sender. It's not important how accurate the RTT estimate is. This is just to get a lower bound of roughly the right order of magnitude.
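
Putting that together as a sketch (mine, not the draft's procedure): the receiver computes MAD once from its initial RTT estimate and its local limitations, then advertises the result:

    # MAD ~= max(c * R_r, g_r), computed on the receiver; times in microseconds.
    # c is the empirical constant (nominally 1); g_r wraps up all the host's
    # practical delay limitations and is itself lower-bounded by G_r.
    def receiver_mad_us(rtt_est_us, g_r_us, c=1.0):
        return max(c * rtt_est_us, g_r_us)

    print(receiver_mad_us(rtt_est_us=100, g_r_us=1000))     # host-limited: 1000 us
    print(receiver_mad_us(rtt_est_us=20000, g_r_us=1000))   # RTT-limited: 20000 us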

You are right that the receiver might not have a good RTT estimate if packets within the 3WHS were retransmitted. But, let me assume (for now) that we are using TCP timestamps, so a host can get a good RTT estimate even with retransmissions (...because I believe it will be easiest to deploy this MAD option by repurposing the TCP timestamp, similar to draft-scheffenegger-tcpm-timestamp-negotiation-05).


Again, IMHO the MAD needs to incorporate hardware, software, and workload constraints on the receiving end host.

[BB]: As above, delayed ACKs are also about reducing processing load in network equipment. If we do not take this into account, we risk networks deploying boxes that take it into account for themselves (e.g. ACK thinning).


 

{Note 2}: Why is there no field in the Low Latency option to communicate receiver clock granularity to the sender?


The idea is that the MAD value is a function of many parameters on the end host. The clock granularity is only one of them. The simplest way to convey on the wire a MAD parameter that is a function of many other parameters is just to convey the MAD value itself.

Bob, thanks again for your detailed and insightful feedback!
[BB]: I think we're getting somewhere.

Cheers


Bob


neal



-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/

_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Re: Review of draft-wang-tcpm-low-latency-opt-00

Jeremy Harris
In reply to this post by Neal Cardwell
The draft proposes a SYN-time option to notify a MAD value
for the connection.  Would it not be preferable to use
a data-time option, permitting an implementation to track
(from the TCP endpoint's view) the application response time,
adjusting the delayed-ACK timer to suit - and notifying the
peer occasionally?
--
Cheers,
  Jeremy

_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm

Re: Review of draft-wang-tcpm-low-latency-opt-00

Jeremy Harris
In reply to this post by Bob Briscoe-4
On 06/08/17 18:39, Bob Briscoe wrote:
> [BB]: Is there an app that wants high delay loss recovery?
> There is no tradeoff here, so pls keep it simple and just enable low
> latency for all connections.

The tradeoff is against potentially mistaken retransmissions, which
could matter on a data-costly channel or for an energy-constrained
endpoint.
--
Cheers,
  Jeremy

_______________________________________________
tcpm mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/tcpm