questions about flow control

questions about flow control

Chuck Lever III
Howdy-

We're working on prototyping RPC/RDMA version two. As many of
you know, RPC/RDMA uses credit-based flow control.

I've presented to the WG before on the kinds of improvements
to credit accounting we need to make over version one of
RPC/RDMA in order to support control plane operations and
message continuation -- cases where we no longer have
perfectly symmetrical Call/Reply pairing.

I'm looking at Section 4.2.1.1 of draft-ietf-nfsv4-rpcrdma-version-two
as it is currently constructed and I'm finding it ...
underwhelming.

I'm thinking of replacing it with something more akin to the
original forms of credit-based flow control, as described in
Chapter 4 of:

https://dl.acm.org/doi/pdf/10.1145/190314.190324

and implemented in the form of Chapter 5 of that paper. The
rdma_credits field would be filled in with the sender's Vr,
in both directions, and N2 + N3 would be the credit limit. We
would need to add some kind of "reset credit accounting"
message as well.
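
A minimal sketch of the accounting described above, assuming the paper's update rule is that the sender keeps its own send count (called Vs here) and subtracts the peer's most recently reported Vr. The class and method names are illustrative and do not come from the draft or the paper.

```python
from dataclasses import dataclass


@dataclass
class CreditSender:
    """Sender-side view of one direction of the connection (illustrative)."""
    credit_limit: int      # N2 + N3: receive buffers the peer has set aside
    vs: int = 0            # messages this side has sent so far
    peer_vr: int = 0       # latest Vr reported by the peer

    def on_credit_update(self, vr: int) -> None:
        # Vr only grows, so an older report that arrives late is ignored.
        self.peer_vr = max(self.peer_vr, vr)

    def balance(self) -> int:
        # Messages sent but not yet reported as consumed may still
        # occupy receive buffers on the peer.
        return self.credit_limit - (self.vs - self.peer_vr)

    def try_send(self) -> bool:
        if self.balance() <= 0:
            return False   # sending now could arrive with no Receive posted
        self.vs += 1
        return True


sender = CreditSender(credit_limit=4)
sent = sum(sender.try_send() for _ in range(6))   # the limit stops it at 4
sender.on_credit_update(vr=3)                     # peer reports 3 consumed
```

The sketch uses unbounded integers. Counters carried in a fixed-width header field would wrap, so a real implementation would need modular comparison, and a "reset credit accounting" message would presumably return both counters to a known state.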

I'm not feeling confident about this choice, however. Does
anyone know a person or people who could answer some questions
about flow control design? Or is there a good reference I could
read to help me understand fundamentals and common pitfalls?

Many thanks in advance!


--
Chuck Lever



_______________________________________________
nfsv4 mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/nfsv4

Re: questions about flow control

Tom Talpey-3
I'd be happy to discuss it with you, and you and I may have
done so in the past, regarding the approach made by SMB Direct.
This 2012 presentation* discusses some of that design, and
it's perhaps my suggestion to separate the credit protocol
from the credit policy. The protocol is the easy part, honestly.

I'd advise against using the methods in the ATM paper you
cite, at least on their face. These crediting approaches are
largely attempting to meter available bandwidth. They also
are for protocols which are tolerant of loss. This is a very
different goal from the RDMA crediting required to provide a
reliable stream of requests and responses.

Tom.

*
https://www.snia.org/sites/default/orig/SDC2012/presentations/Revisions/TomTalpeyKramer-High_Performance__File.pdf

On 4/29/2021 2:16 PM, Chuck Lever III wrote:

> Howdy-
>
> We're working on prototyping RPC/RDMA version two. As many of
> you know, RPC/RDMA uses credit-based flow control.
>
> I've presented to the WG before on the kinds of improvements
> to credit accounting we need to make over version one of
> RPC/RDMA in order to support control plane operations and
> message continuation -- cases where we no longer have
> perfectly symmetrical Call/Reply pairing.
>
> I'm looking at Section 4.2.1.1 of draft-ietf-nfsv4-rpcrdma-version-two
> as it is currently constructed and I'm finding it ...
> underwhelming.
>
> I'm thinking of replacing it with something more akin to the
> original forms of credit-based flow control, as described in
> Chapter 4 of:
>
> https://dl.acm.org/doi/pdf/10.1145/190314.190324
>
> and implemented in the form of Chapter 5 of that paper. The
> rdma_credits field would be filled in with the sender's Vr,
> in both directions, and N2 + N3 would be the credit limit. We
> would need to add some kind of "reset credit accounting"
> message as well.
>
> I'm not feeling confident about this choice, however. Does
> anyone know a person or people who could answer some questions
> about flow control design? Or is there a good reference I could
> read to help me understand fundamentals and common pitfalls?
>
> Many thanks in advance!
>
>
> --
> Chuck Lever

Re: questions about flow control

Chuck Lever III

> On Apr 30, 2021, at 2:47 PM, Tom Talpey <[hidden email]> wrote:
>
> I'd be happy to discuss it with you, and you and I may have
> done so in the past, regarding the approach made by SMB Direct.
> This 2012 presentation* discusses some of that design, and
> it's perhaps my suggestion to separate the credit protocol
> from the credit policy. The protocol is the easy part, honestly.

A design conference call might be best, once I've had a chance
to digest more literature.

In terms of protocol design I'd like to avoid splitting the
rdma_credits field into two 16-bit fields, as RPC/RDMA v2 does
currently. We did that because of real or perceived restrictions
that RFC 8166 places on the structure of the transport headers
in all RPC/RDMA versions.
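
For illustration, splitting one 32-bit XDR word into two 16-bit halves might look like the sketch below. Which half carries which value is not stated in this thread, so `high` and `low` are placeholder names, and the helper functions are hypothetical.

```python
import struct
from typing import Tuple


def pack_credits(high: int, low: int) -> bytes:
    """Pack two 16-bit values into one big-endian 32-bit word."""
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError("each half must fit in 16 bits")
    return struct.pack(">I", (high << 16) | low)


def unpack_credits(word: bytes) -> Tuple[int, int]:
    """Recover the two 16-bit halves from a 32-bit word."""
    (value,) = struct.unpack(">I", word)
    return value >> 16, value & 0xFFFF


word = pack_credits(high=32, low=16)
halves = unpack_credits(word)
# A parser that treats the same word as a single 32-bit count sees this:
combined = struct.unpack(">I", word)[0]
```

The split keeps the header the same size, but it caps each value at 65535 and means a reader that expects one 32-bit count (here `combined`, 0x00200010) gets a number unrelated to either half.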

The credits in v1 are RPC direction-aware; in v2 they need to
be bound to the peer Receivers, which operate in both directions.

RPC/RDMA v1 implementations ignore the credit request value
completely, so it's not clear it needs to be retained in v2.


> I'd advise against using the methods in the ATM paper you
> cite, at least on their face. These crediting approaches are
> largely attempting to meter available bandwidth. They also
> are for protocols which are tolerant of loss. This is a very
> different goal from the RDMA crediting required to provide a
> reliable stream of requests and responses.

The ATM paper has a design that veers away from request/grant,
and looks more like a sliding window, but it seems easily
adaptable to a pair of bi-directional streams.


> * https://www.snia.org/sites/default/orig/SDC2012/presentations/Revisions/TomTalpeyKramer-High_Performance__File.pdf


--
Chuck Lever




Re: questions about flow control

Tom Talpey-3
On 4/30/2021 5:01 PM, Chuck Lever III wrote:

>
>> On Apr 30, 2021, at 2:47 PM, Tom Talpey <[hidden email]> wrote:
>>
>> I'd be happy to discuss it with you, and you and I may have
>> done so in the past, regarding the approach made by SMB Direct.
>> This 2012 presentation* discusses some of that design, and
>> it's perhaps my suggestion to separate the credit protocol
>> from the credit policy. The protocol is the easy part, honestly.
>
> A design conference call might be best, once I've had a chance
> to digest more literature.
>
> In terms of protocol design I'd like to avoid splitting the
> rdma_credits field into two 16-bit fields, as RPC/RDMA v2 does
> currently. We did that because of real or perceived restrictions
> that RFC 8166 places on the structure of the transport headers
> in all RPC/RDMA versions.
>
> The credits in v1 are RPC direction-aware; in v2 they need to
> be bound to the peer Receivers, which operate in both directions.
>
> RPC/RDMA v1 implementations ignore the credit request value
> completely, so it's not clear it needs to be retained in v2.

Well, ok. It's a good approach to avoid unnecessary backwards
compatibility. And the v1 protocol was rather limited, as we've
discovered.

Just as a point of discussion, I'd ask whether you are intending
to design an actual credit algorithm for the protocol, or only
the protocol itself. These are two different things, and in my
view, standardizing an algorithm is something to be avoided. There
are any number of ways to implement a credit scheme, and over-
specifying the details constrains the implementations. This can
quickly lead to v3, v4, ... So, I would suggest a clear goal for
the discussion, if it's to be had.

Tom.

>> I'd advise against using the methods in the ATM paper you
>> cite, at least on their face. These crediting approaches are
>> largely attempting to meter available bandwidth. They also
>> are for protocols which are tolerant of loss. This is a very
>> different goal from the RDMA crediting required to provide a
>> reliable stream of requests and responses.
>
> The ATM paper has a design that veers away from request/grant,
> and looks more like a sliding window, but it seems easily
> adaptable to a pair of bi-directional streams.
>
>
>> * https://www.snia.org/sites/default/orig/SDC2012/presentations/Revisions/TomTalpeyKramer-High_Performance__File.pdf
>
>
> --
> Chuck Lever

Re: questions about flow control

David Noveck
In reply to this post by Chuck Lever III


On Fri, Apr 30, 2021, 5:01 PM Chuck Lever III <[hidden email]> wrote:

>> On Apr 30, 2021, at 2:47 PM, Tom Talpey <[hidden email]> wrote:
>>
>> I'd be happy to discuss it with you, and you and I may have
>> done so in the past, regarding the approach made by SMB Direct.
>> This 2012 presentation* discusses some of that design, and
>> it's perhaps my suggestion to separate the credit protocol
>> from the credit policy. The protocol is the easy part, honestly.
>
> A design conference call might be best, once I've had a chance
> to digest more literature.

Sounds like a good idea to me.

> In terms of protocol design I'd like to avoid splitting the
> rdma_credits field into two 16-bit fields, as RPC/RDMA v2 does
> currently. We did that because of real or perceived restrictions
> that RFC 8166 places on the structure of the transport headers
> in all RPC/RDMA versions.

I'd hope we will be able to address this particular issue in the design conference and decide what to do about the RFC 8166 versioning text. When that text was written, major changes in the credit approach had not really been contemplated. Nevertheless, if we do need two fields, we could put the second outside of the magic four words. In any case, if the credit scheme changes in a major way, we need to address the implications of the same-forever text in RFC 8166 and figure out how to correct our mistakes in this area.

> The credits in v1 are RPC direction-aware; in v2 they need to
> be bound to the peer Receivers, which operate in both directions.
>
> RPC/RDMA v1 implementations ignore the credit request value
> completely, so it's not clear it needs to be retained in v2.

If there is no need for it, it should not be retained.

>> I'd advise against using the methods in the ATM paper you
>> cite, at least on their face. These crediting approaches are
>> largely attempting to meter available bandwidth. They also
>> are for protocols which are tolerant of loss. This is a very
>> different goal from the RDMA crediting required to provide a
>> reliable stream of requests and responses.
>
> The ATM paper has a design that veers away from request/grant,

That's fine with me.

> and looks more like a sliding window, but it seems easily
> adaptable to a pair of bi-directional streams.

If you go away from request/grant, maybe a pair of uni-directional streams is a better way to think of what we are building.


Re: questions about flow control

Chuck Lever III
In reply to this post by Tom Talpey-3


> On Apr 30, 2021, at 8:44 PM, Tom Talpey <[hidden email]> wrote:
>
> On 4/30/2021 5:01 PM, Chuck Lever III wrote:
>>> On Apr 30, 2021, at 2:47 PM, Tom Talpey <[hidden email]> wrote:
>>>
>>> I'd be happy to discuss it with you, and you and I may have
>>> done so in the past, regarding the approach made by SMB Direct.
>>> This 2012 presentation* discusses some of that design, and
>>> it's perhaps my suggestion to separate the credit protocol
>>> from the credit policy. The protocol is the easy part, honestly.
>> A design conference call might be best, once I've had a chance
>> to digest more literature.
>> In terms of protocol design I'd like to avoid splitting the
>> rdma_credits field into two 16-bit fields, as RPC/RDMA v2 does
>> currently. We did that because of real or perceived restrictions
>> that RFC 8166 places on the structure of the transport headers
>> in all RPC/RDMA versions.
>> The credits in v1 are RPC direction-aware; in v2 they need to
>> be bound to the peer Receivers, which operate in both directions.
>> RPC/RDMA v1 implementations ignore the credit request value
>> completely, so it's not clear it needs to be retained in v2.
>
> Well, ok. It's a good approach to avoid unnecessary backwards
> compatibility. And the v1 protocol was rather limited, as we've
> discovered.
>
> Just as a point of discussion, I'd ask whether you are intending
> to design an actual credit algorithm for the protocol, or only
> the protocol itself. These are two different things, and in my
> view, standardizing an algorithm is something to be avoided. There
> are any number of ways to implement a credit scheme, and over-
> specifying the details constrains the implementations. This can
> quickly lead to v3, v4, ... So, I would suggest a clear goal for
> the discussion, if it's to be had.

I don't believe we can avoid discussing potential algorithms during
a conversation about protocol design. We are likely going to have
to provide some constraints on implementations to guarantee correct
interoperation of the protocol. However, I expect that any suggested
credit accounting algorithms in the finished text will take the form
of implementation guidance.

Speaking as a co-author of RFC 8166 and rpcrdma-version-two, it has
been a challenge to separate the concept of credit from a particular
(small) set of hardware resources. I will need some help selecting
appropriately narrow and abstract language for this text.


>>> I'd advise against using the methods in the ATM paper you
>>> cite, at least on their face. These crediting approaches are
>>> largely attempting to meter available bandwidth. They also
>>> are for protocols which are tolerant of loss. This is a very
>>> different goal from the RDMA crediting required to provide a
>>> reliable stream of requests and responses.
>> The ATM paper has a design that veers away from request/grant,
>> and looks more like a sliding window, but it seems easily
>> adaptable to a pair of bi-directional streams.

As Dave noted elsewhere, I meant rather "a pair of uni-directional
streams" on a single connection. I believe that has to be our model
for v2.

Where I struggle is that the Send and Receive queues on a peer are
completely independent, yet credit accounting ties the queues
together.

- The Vr values for one stream are passed to the remote peer via
the other stream asynchronously, thus the values are probably stale
by the time they reach the remote.

- The Vr value on a peer would be maintained by its Receive
completion handler, which is a single process often tied to a
single CPU core, but that value would have to be read by Sender
threads on other cores. We can mitigate the cost of the cross
memory traffic somewhat but it is not ideal.
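
On the staleness point in the first bullet: if Vr only ever increases, an out-of-date copy can only make the sender underestimate its balance. The small simulation below is a sketch under that assumption, with a made-up model in which the receiver consumes one message per step and each Vr report takes `lag` steps to reach the sender. None of the names come from the draft.

```python
from collections import deque


def simulate(limit: int, lag: int, steps: int) -> int:
    """Return the peak number of receive buffers in use at the receiver."""
    vs = 0                 # sender: messages sent
    vr = 0                 # receiver: messages consumed, buffers reposted
    seen_vr = 0            # sender's possibly stale copy of Vr
    in_flight = deque()    # Vr reports still travelling back, oldest first
    peak = 0
    for _ in range(steps):
        # The sender transmits for as long as its stale view allows.
        while limit - (vs - seen_vr) > 0:
            vs += 1
        peak = max(peak, vs - vr)
        # The receiver consumes one message and reports its new Vr.
        if vr < vs:
            vr += 1
        in_flight.append(vr)
        # A report is delivered only after `lag` further steps.
        if len(in_flight) > lag:
            seen_vr = in_flight.popleft()
    return peak


peak_fresh = simulate(limit=4, lag=0, steps=50)
peak_stale = simulate(limit=4, lag=3, steps=50)
```

In both runs the receiver never needs more than `limit` buffers, because the sender's copy of Vr is never ahead of the real one. What staleness costs in this model is idle time on the sender, not correctness.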


--
Chuck Lever




Re: questions about flow control

David Noveck


On Sat, May 1, 2021, 3:03 PM Chuck Lever III <[hidden email]> wrote:


>> On Apr 30, 2021, at 8:44 PM, Tom Talpey <[hidden email]> wrote:
>>
>> On 4/30/2021 5:01 PM, Chuck Lever III wrote:
>>>> On Apr 30, 2021, at 2:47 PM, Tom Talpey <[hidden email]> wrote:
>>>>
>>>> I'd be happy to discuss it with you, and you and I may have
>>>> done so in the past, regarding the approach made by SMB Direct.
>>>> This 2012 presentation* discusses some of that design, and
>>>> it's perhaps my suggestion to separate the credit protocol
>>>> from the credit policy. The protocol is the easy part, honestly.
>>> A design conference call might be best, once I've had a chance
>>> to digest more literature.
>>> In terms of protocol design I'd like to avoid splitting the
>>> rdma_credits field into two 16-bit fields, as RPC/RDMA v2 does
>>> currently. We did that because of real or perceived restrictions
>>> that RFC 8166 places on the structure of the transport headers
>>> in all RPC/RDMA versions.
>>> The credits in v1 are RPC direction-aware; in v2 they need to
>>> be bound to the peer Receivers, which operate in both directions.
>>> RPC/RDMA v1 implementations ignore the credit request value
>>> completely, so it's not clear it needs to be retained in v2.
>>
>> Well, ok. It's a good approach to avoid unnecessary backwards
>> compatibility. And the v1 protocol was rather limited, as we've
>> discovered.
>>
>> Just as a point of discussion, I'd ask whether you are intending
>> to design an actual credit algorithm for the protocol, or only
>> the protocol itself. These are two different things, and in my
>> view, standardizing an algorithm is something to be avoided. There
>> are any number of ways to implement a credit scheme, and over-
>> specifying the details constrains the implementations. This can
>> quickly lead to v3, v4, ... So, I would suggest a clear goal for
>> the discussion, if it's to be had.
>
> I don't believe we can avoid discussing potential algorithms during
> a conversation about protocol design.

We at least have to make sure that we do not wind up with something unimplementable.

> We are likely going to have
> to provide some constraints on implementations to guarantee correct
> interoperation of the protocol.

We have to do that.

> However, I expect that any suggested
> credit accounting algorithms in the finished text will take the form
> of implementation guidance.

I agree.

> Speaking as a co-author of RFC 8166 and rpcrdma-version-two, it has
> been a challenge to separate the concept of credit from a particular
> (small) set of hardware resources. I will need some help selecting
> appropriately narrow and abstract language for this text.

I think the refinement of language is necessary but hope that we can defer that process until after the design discussion has reached its conclusion, using language that might not state things as an RFC would.

>>>> I'd advise against using the methods in the ATM paper you
>>>> cite, at least on their face. These crediting approaches are
>>>> largely attempting to meter available bandwidth. They also
>>>> are for protocols which are tolerant of loss. This is a very
>>>> different goal from the RDMA crediting required to provide a
>>>> reliable stream of requests and responses.
>>> The ATM paper has a design that veers away from request/grant,
>>> and looks more like a sliding window, but it seems easily
>>> adaptable to a pair of bi-directional streams.
>
> As Dave noted elsewhere, I meant rather "a pair of uni-directional
> streams" on a single connection. I believe that has to be our model
> for v2.
>
> Where I struggle is that the Send and Receive queues on a peer are
> completely independent, yet credit accounting ties the queues
> together.
>
> - The Vr values for one stream are passed to the remote peer via
> the other stream asynchronously, thus the values are probably stale
> by the time they reach the remote.

They may be out-of-date but we have to be sure that they are not so wrong that a message is sent when there is no matching receive.

> - The Vr value on a peer would be maintained by its Receive
> completion handler, which is a single process often tied to a
> single CPU core, but that value would have to be read by Sender
> threads on other cores.

They probably also have to share writable data, in order to prevent them from sending messages that would exhaust available receiver credits.

> We can mitigate the cost of the cross
> memory traffic somewhat but it is not ideal.

In most cases that is dealt with by the cpu's memory coherence logic. There is a cost but it will vary with the cpu cache implementation. In the case of multiple coherence domains, I'm sure that mitigation is possible but I prefer we don't spend too much discussion time on optimizing that case or make it the focus of our implementation guidance.


> --
> Chuck Lever




Re: questions about flow control

Chuck Lever III

On May 2, 2021, at 7:26 AM, David Noveck <[hidden email]> wrote:

> On Sat, May 1, 2021, 3:03 PM Chuck Lever III <[hidden email]> wrote:
>> Where I struggle is that the Send and Receive queues on a peer are
>> completely independent, yet credit accounting ties the queues
>> together.
>>
>> - The Vr values for one stream are passed to the remote peer via
>> the other stream asynchronously, thus the values are probably stale
>> by the time they reach the remote.
>>
> They may be out-of-date but we have to be sure that they are not so wrong that a message is sent when there is no matching receive.
>
>> - The Vr value on a peer would be maintained by its Receive
>> completion handler, which is a single process often tied to a
>> single CPU core, but that value would have to be read by Sender
>> threads on other cores.
>
> They probably also have to share writable data, in order to prevent them from sending messages that would exhaust available receiver credits.

Yes, a value like Vr is updated on every Receive completion, and
then used when constructing each transport header to Send to the
remote.
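
The shared writable state raised in the quoted text can be sketched as a single reservation point that every Sender thread goes through. This is only an illustration of the contention being discussed: `SharedCredits` and its methods are hypothetical, and a lock stands in for whatever atomic primitive a real implementation would choose.

```python
import threading


class SharedCredits:
    """Credit state shared by all Sender threads (illustrative only)."""

    def __init__(self, limit: int) -> None:
        self._lock = threading.Lock()
        self._limit = limit
        self._vs = 0         # messages sent by any thread
        self._peer_vr = 0    # latest Vr reported by the remote peer

    def try_reserve(self) -> bool:
        # The check and the increment must be one atomic step, otherwise
        # two threads could both spend the last remaining credit.
        with self._lock:
            if self._limit - (self._vs - self._peer_vr) <= 0:
                return False
            self._vs += 1
            return True

    def on_peer_vr(self, vr: int) -> None:
        with self._lock:
            if vr > self._peer_vr:
                self._peer_vr = vr


credits = SharedCredits(limit=16)
wins = []


def worker() -> None:
    # Each thread tries far more sends than there are credits.
    wins.append(sum(credits.try_reserve() for _ in range(100)))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total_sent = sum(wins)
```

With no Vr updates arriving, the eight threads together get exactly 16 reservations. Every send touches the same shared state, which is the per-message cost under discussion here.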


>> We can mitigate the cost of the cross
>> memory traffic somewhat but it is not ideal.
>
> In most cases that is dealt with by the cpu's memory coherence logic.  There is a cost but it will vary with the cpu cache implementation.  In the case of multiple coherence domains, I'm sure that mitigation is possible but I prefer we don't spend too much discussion time on optimizing that case or make it the focus of our implementation guidance.

AIUI, the iSCSI protocol has a sequence number element that is
updated on every transmit and receive. This has been demonstrated
to be a significant scalability issue in implementations.

Given the network throughputs typical for NFS/RDMA implementations,
we should be careful to avoid defining protocol that will be
impossible to turn into a scalable implementation. IME that
requires sensitivity to this detail as we are designing the
protocol, not as an afterthought.
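
One possible mitigation, sketched under the assumption that a slightly stale Vr is acceptable: the Receive completion handler counts privately and publishes the value Sender threads read only once per batch. `ReceiveSide` and `publish_every` are made-up names, and Python does not model cache behavior, so this shows only the logic.

```python
class ReceiveSide:
    """Batches updates to the Vr value that Sender threads read."""

    def __init__(self, publish_every: int) -> None:
        self._private_vr = 0     # touched only by the completion handler
        self.published_vr = 0    # read by Sender threads building headers
        self._publish_every = publish_every
        self.publishes = 0       # number of writes to the shared value

    def on_receive_completion(self) -> None:
        self._private_vr += 1
        if self._private_vr - self.published_vr >= self._publish_every:
            self.published_vr = self._private_vr
            self.publishes += 1


rx = ReceiveSide(publish_every=8)
for _ in range(100):
    rx.on_receive_completion()
```

In this run the shared value is written 12 times instead of 100, and the advertised Vr trails the real one by at most 7. The pitfall is that the batch size has to stay below the credit limit (or the handler has to flush when it goes idle); otherwise the remote can run out of credits while waiting for an update that is never published.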


--
Chuck Lever




Re: questions about flow control

Martin Duke
In reply to this post by Chuck Lever III
There are tons of people I would trust with this, but I might recommend Jana Iyengar as a first stop.

On Thu, Apr 29, 2021 at 11:17 AM Chuck Lever III <[hidden email]> wrote:
Howdy-

We're working on prototyping RPC/RDMA version two. As many of
you know, RPC/RDMA uses credit-based flow control.

I've presented to the WG before on the kinds of improvements
to credit accounting we need to make over version one of
RPC/RDMA in order to support control plane operations and
message continuation -- cases where we no longer have
perfectly symmetrical Call/Reply pairing.

I'm looking at Section 4.2.1.1 of draft-ietf-nfsv4-rpcrdma-version-two
as it is currently constructed and I'm finding it ...
underwhelming.

I'm thinking of replacing it with something more akin to the
original forms of credit-based flow control, as described in
Chapter 4 of:

https://dl.acm.org/doi/pdf/10.1145/190314.190324

and implemented in the form of Chapter 5 of that paper. The
rdma_credits field would be filled in with the sender's Vr,
in both directions, and N2 + N3 would be the credit limit. We
would need to add some kind of "reset credit accounting"
message as well.

I'm not feeling confident about this choice, however. Does
anyone know a person or people who could answer some questions
about flow control design? Or is there a good reference I could
read to help me understand fundamentals and common pitfalls?

Many thanks in advance!


--
Chuck Lever





Re: questions about flow control

Tom Talpey-3
In reply to this post by Chuck Lever III
On 5/3/2021 10:30 AM, Chuck Lever III wrote:

>
> On May 2, 2021, at 7:26 AM, David Noveck <[hidden email]> wrote:
>> On Sat, May 1, 2021, 3:03 PM Chuck Lever III <[hidden email]> wrote:
>>> Where I struggle is that the Send and Receive queues on a peer are
>>> completely independent, yet credit accounting ties the queues
>>> together.
>>>
>>> - The Vr values for one stream are passed to the remote peer via
>>> the other stream asynchronously, thus the values are probably stale
>>> by the time they reach the remote.
>>>
>> They may be out-of-date but we have to be sure that they are not so wrong that a message is sent when there is no matching receive.
>>
>>> - The Vr value on a peer would be maintained by its Receive
>>> completion handler, which is a single process often tied to a
>>> single CPU core, but that value would have to be read by Sender
>>> threads on other cores.
>>
>> They probably also have to share writable data, in order to prevent them from sending messages that would exhaust available receiver credits.
>
> Yes, a value like Vr is updated on every Receive completion, and
> then used when constructing each transport header to Send to the
> remote.
>
>
>>> We can mitigate the cost of the cross
>>> memory traffic somewhat but it is not ideal.
>>
>> In most cases that is dealt with by the cpu's memory coherence logic.  There is a cost but it will vary with the cpu cache implementation.  In the case of multiple coherence domains, I'm sure that mitigation is possible but I prefer we don't spend too much discussion time on optimizing that case or make it the focus of our implementation guidance.
>
> AIUI, the iSCSI protocol has a sequence number element that is
> updated on every transmit and receive. This has been demonstrated
> to be a significant scalability issue in implementations.
>
> Given the network throughputs typical for NFS/RDMA implementations,
> we should be careful to avoid defining protocol that will be
> impossible to turn into a scalable implementation. IME that
> requires sensitivity to this detail as we are designing the
> protocol, not as an afterthought.

For the record, I'm fine with informative text which sets out some
basic principles. But, no MUSTs.

It's tricky to write such text, and often I find that it's a
counterproductive effort. A separate "implementation experience"
document can serve the purpose much more effectively.

Tom.
