[internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

hch
As promised here is an initial cut at the RDMA layout before the
meeting in Prague.

I have to admit it's not the highest quality draft, but I wanted to
get it out - it's probably unsuitable for readers without deep
knowledge of RDMA at this point.

The idea of the layout is to provide RDMA READ / WRITE access to
remote memory regions - usually persistent memory in some form,
but to some extent it will also work with volatile caching of
data, e.g. in features like the NVMe controller memory buffer or
even host memory.  It is done by registering these regions on
the server and performing the RDMA READ / WRITE operations from
the client, that is it inverts the model used by RDMA RPC or
other storage models.
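
To give an idea of the client-side data path, here is a rough sketch
against libibverbs (the rkey / base offset arguments are placeholders
for whatever the layout XDR ends up carrying, not the draft's final
field names):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Push one extent's worth of data straight into the server's
     * registered region.  'rkey' and 'remote_base' would come from the
     * layout, 'local_mr' is an ordinary client-side registration of the
     * I/O buffer.
     */
    static int rdma_layout_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                                 void *buf, uint32_t len,
                                 uint32_t rkey, uint64_t remote_base,
                                 uint64_t offset_in_extent)
    {
            struct ibv_sge sge = {
                    .addr   = (uintptr_t)buf,
                    .length = len,
                    .lkey   = local_mr->lkey,
            };
            struct ibv_send_wr wr, *bad_wr;

            memset(&wr, 0, sizeof(wr));
            wr.opcode              = IBV_WR_RDMA_WRITE;
            wr.sg_list             = &sge;
            wr.num_sge             = 1;
            wr.send_flags          = IBV_SEND_SIGNALED;
            wr.wr.rdma.rkey        = rkey;
            wr.wr.rdma.remote_addr = remote_base + offset_in_extent;

            return ibv_post_send(qp, &wr, &bad_wr);
    }

Reads work the same way with IBV_WR_RDMA_READ; making the written data
durable is exactly the FLUSH / COMMIT problem in the list below.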

Besides improving the spec language there still is a lot left to
be done:

 - define the exact connection establishment model.  I'd really
   like to rely on RDMA/CM for that
 - figure out if we can get rid of the sub-layout extent inherited
   from the block layout.  This should be possible by providing two
   handles in the layout
 - find a way to future-proof for the introduction of a RDMA FLUSH
   or COMMIT operation, where we don't have to do a LAYOUTCOMMIT
   for every write

----- Forwarded message from [hidden email] -----

Date: Sun, 02 Jul 2017 16:04:45 -0700
From: [hidden email]
Subject: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt
To: Christoph Hellwig <[hidden email]>


A new version of I-D, draft-hellwig-nfsv4-rdma-layout-00.txt
has been successfully submitted by Christoph Hellwig and posted to the
IETF repository.

Name: draft-hellwig-nfsv4-rdma-layout
Revision: 00
Title: Parallel NFS (pNFS) RDMA Layout
Document date: 2017-07-02
Group: Individual Submission
Pages: 18
URL:            https://www.ietf.org/internet-drafts/draft-hellwig-nfsv4-rdma-layout-00.txt
Status:         https://datatracker.ietf.org/doc/draft-hellwig-nfsv4-rdma-layout/
Htmlized:       https://tools.ietf.org/html/draft-hellwig-nfsv4-rdma-layout-00
Htmlized:       https://datatracker.ietf.org/doc/html/draft-hellwig-nfsv4-rdma-layout-00


Abstract:
   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The RDMA Layout Type is defined in this document
   as an extension to pNFS to allow the use of RDMA Verbs operations to
   access remote storage, with a special focus on accessing byte
   addressable persistent memory.

                                                                                 


Please note that it may take a couple of minutes from the time of submission
until the htmlized version and diff are available at tools.ietf.org.

The IETF Secretariat

----- End forwarded message -----


Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Chuck Lever-2

> On Jul 3, 2017, at 01:10, Christoph Hellwig <[hidden email]> wrote:
>
> As promised here is an initial cut at the RDMA layout before the
> meeting in Prague.
>
> I have to admit it's not the highest quality draft, but I wanted to
> get it out - it's probably unsuitable for readers without deep
> knowledge of RDMA at this point.
>
> The idea of the layout is to provide RDMA READ / WRITE access to
> remote memory regions - usually persistent memory in some form,
> but to some extent it will also work with volatile caching of
> data, e.g. in features like the NVMe controller memory buffer or
> even host memory.  It is done by registering these regions on
> the server and performing the RDMA READ / WRITE operations from
> the client, that is it inverts the model used by RDMA RPC or
> other storage models.
>
> Besides improving the spec language there still is a lot left to
> be done:
>
> - define the exact connection establishment model.  I'd really
>   like to rely on RDMA/CM for that

The connection model is critical, because the handles returned
to clients can work only on one connection (QP). For example,
you can't assume that NFS/RDMA will be used to access the MDS,
nor can you assume that the MDS and DS's are accessed through
the same HCA/RNIC on the same connection.

To make it work, there will have to be some way of binding a
layout (containing handles) to a particular connection to a
particular storage device.

Also, if the connection to a DS is lost, more than a
reconnect is necessary. The client will need to take steps to
get the server to re-register the memory and send fresh
handles.


> - figure out if we can get rid of the sub-layout extent inherited
>   from the block layout.  This should be possible by providing two
>   handles in the layout
> - find a way to future-proof for the introduction of a RDMA FLUSH
>   or COMMIT operation, where we don't have to do a LAYOUTCOMMIT
>   for every write

--
Chuck Lever
[hidden email]




Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

David Noveck
> - define the exact connection establishment model.  I'd really
>   like to rely on RDMA/CM for that

Connection establishment can use RDMA/CM to establish an RC 
connection.  Unfortunately, as Chuck points out, establishing the
connection is only part of the problem.
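
The RDMA/CM piece itself is straightforward; roughly the following on
the client side (a sketch only, no error handling, and it assumes the
DS address has already been obtained, e.g. from GETDEVICEINFO):

    #include <rdma/rdma_cma.h>

    /* Establish an RC connection to a data server address using the
     * librdmacm event-channel flow. */
    static struct rdma_cm_id *connect_to_ds(struct sockaddr *ds_addr,
                                            struct ibv_qp_init_attr *qp_attr)
    {
            struct rdma_event_channel *ec = rdma_create_event_channel();
            struct rdma_conn_param param = { 0 };
            struct rdma_cm_event *ev;
            struct rdma_cm_id *id;

            rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);

            rdma_resolve_addr(id, NULL, ds_addr, 2000);
            rdma_get_cm_event(ec, &ev);    /* RDMA_CM_EVENT_ADDR_RESOLVED */
            rdma_ack_cm_event(ev);

            rdma_resolve_route(id, 2000);
            rdma_get_cm_event(ec, &ev);    /* RDMA_CM_EVENT_ROUTE_RESOLVED */
            rdma_ack_cm_event(ev);

            rdma_create_qp(id, NULL, qp_attr);  /* QP dedicated to the layout */

            rdma_connect(id, &param);
            rdma_get_cm_event(ec, &ev);    /* RDMA_CM_EVENT_ESTABLISHED */
            rdma_ack_cm_event(ev);

            return id;
    }

The hard part, as Chuck points out, comes after ESTABLISHED: deciding
which of possibly several such connections the handles in a given
layout are valid on.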

> The connection model is critical, because the handles returned
> to clients can work only on one connection (QP).

True :-(

> For example,
> you can't assume that NFS/RDMA will be used to access the MDS,
> nor can you assume that the MDS and DS's are accessed through
> the same HCA/RNIC on the same connection.

Also, you can't assume  there is only a single connection between any
client-DS pair.  Someone will have to choose one and it isn't clear whether the
client or the MDS is the best one to do the choosing.

> To make it work, there will have to be some way of binding a
> layout (containing handles) to a particular connection to a
> particular storage device.

That sounds like it requires a (new) bind operation, separate from the creation of
the layout.  That is doable, but it increases the complexity of defining both a
protocol extension and a pNFS layout type that work together.  Also, IIRC, RFC 8178
may require a new minor version to do this.

An alternative is for the MDS to do the bind operation as part of assigning the
layout, which it can do as long as it knows the connection.  One possibility is for
the client to pass a description of the chosen connection as a layout hint.  Another
is for the MDS to be able to find out the set of possible connections, have it
choose one, and let the client know, in the layout, which connection is associated
with the layout.

> Also, if the connection to a DS is lost, more than a
> reconnect is necessary. The client will need to take steps to
> get the server to re-register the memory and send fresh
> handles.

The simplest way to deal with this, although maybe not the best, is to consider the 
connection break as  effectively revoking all associated layouts.  The client needs
to get new layouts to replace the ones lost.


Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Steve Byan's Lists
I have a number of comments on draft-hellwig-nfsv4-rdma-layout-00.txt.

I'm coming at this from the perspective of a user-space pNFS-RDMA client (for the RDMA layout part of the protocol, not necessarily for the NFS part) and a user-space pNFS-RDMA server (for example, an extended NFS-Ganesha that supports pNFS-RDMA).

As a result, I don't presuppose that there is a pre-existing RDMA reliable connection between the rdma layout client and server. This has implications for the connection establishment model. The NFS portions of the protocol, including exchanging the layout, could occur over a TCP connection (either over the RDMA network or over an Ethernet network), and the identity of the RDMA layout server would not be known until the client receives the layout.

2.3. Device Addressing and Discovery

I think addressing and discovery should support multipathing, to enhance availability. So rather than a single netaddr4, I think the struct pnfs_rdma_device_addr4 should be defined to contain a multipath_list4, as in the File layout.
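
Rendered as C structs rather than the draft's XDR (purely a sketch; the rda_* names are made up for illustration), what I have in mind is roughly:

    #include <stdint.h>

    /* netaddr4 as in RFC 5661, shown as C for illustration. */
    struct netaddr4 {
            char *na_r_netid;       /* e.g. "rdma" */
            char *na_r_addr;        /* universal address string */
    };

    /* Device address carrying a list of equivalent paths to the data
     * server, as the File layout does with multipath_list4, rather
     * than a single netaddr4. */
    struct pnfs_rdma_device_addr_sketch {
            uint32_t         rda_addr_count;
            struct netaddr4 *rda_addrs;
            /* remaining fields as in draft-hellwig-nfsv4-rdma-layout-00 */
    };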

Combined with an assumption of not requiring a pre-existing RDMA connection, this has major implications for the protocol.

In later sections, the draft defines the rdma layout to contain a registered memory handle. If there is no pre-existing connection, the server has to provide an unconnected queue pair, register memory for the file with it, and pass the memory registration handle back to the client in the layout, along with the identity of the rdma server. Finally it must supply the unconnected queue pair to the RDMA Communication Manager when the server accepts the client's connection request.

This dance is possible (I think, I haven't tried it) if there is only one address for the server, as the server can bind its rdma_cm_id to one RDMA device before listening. However, if the server supports multiple addresses (for multipathing), then it is not possible to pre-create the server-side queue pair, because an unbound listening rdma_cm_id doesn't have a valid ibv_context until a connection attempt is received.
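
Concretely, the server-side flow I'm describing looks roughly like this (a librdmacm sketch with no error handling); with a wildcard bind, the verbs context - and hence the protection domain and memory registrations - only becomes available once the connect request arrives:

    #include <rdma/rdma_cma.h>
    #include <netinet/in.h>

    /* Sketch: listen on a wildcard address and defer PD/MR/QP creation
     * until the client's connect request tells us which RDMA device the
     * connection will use. */
    static void serve_one_client(void *file_buf, size_t file_len)
    {
            struct rdma_event_channel *ec = rdma_create_event_channel();
            struct sockaddr_in any = { .sin_family = AF_INET };
            struct rdma_conn_param param = { 0 };
            struct ibv_qp_init_attr qp_attr = {
                    .qp_type = IBV_QPT_RC,
                    .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                             .max_send_sge = 1, .max_recv_sge = 1 },
            };
            struct rdma_cm_id *listen_id;
            struct rdma_cm_event *ev;

            rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
            rdma_bind_addr(listen_id, (struct sockaddr *)&any);
            rdma_listen(listen_id, 1);

            rdma_get_cm_event(ec, &ev);   /* RDMA_CM_EVENT_CONNECT_REQUEST */
            struct rdma_cm_id *id = ev->id;

            /* Only now is id->verbs valid, so only now can the file's
             * memory be registered and the rkey for the layout learned. */
            struct ibv_pd *pd = ibv_alloc_pd(id->verbs);
            struct ibv_mr *mr = ibv_reg_mr(pd, file_buf, file_len,
                                           IBV_ACCESS_LOCAL_WRITE |
                                           IBV_ACCESS_REMOTE_READ |
                                           IBV_ACCESS_REMOTE_WRITE);

            rdma_create_qp(id, pd, &qp_attr);
            rdma_accept(id, &param);
            rdma_ack_cm_event(ev);

            (void)mr;   /* mr->rkey is what a LAYOUTGET reply would carry */
    }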

Consequently I think the rdma layout should contain only the file offset, length and extent state. The client would then obtain the handles using a pNFS-RDMA protocol exchange. This is unpalatable, but I think it is necessary. Trying to fit it all into the LAYOUTGET confronts a chicken or the egg problem for RDMA connection establishment.


2.4.  Data Structures: Extents and Extent Lists

The layout definition seems to presuppose a kernel (or at least highly-privileged user space) server, because it exposes portions of the whole persistent memory device address space via the re_storage_offset field in the extent. This means the pNFS-RDMA server must be cognizant of the file system on the device.

It seems better to me to model the extent using strictly file-local information, i.e. the registered handle is simply that resulting from, for example, a user-space server mapping the file, determining its sparseness, and registering the non-sparse extents of the file. Thus re_storage_offset would not be needed in struct pnfs_rdma_extent4. The re_device_id, re_state, re_file_offset, re_length, and a separately provided set of re_handles are sufficient. I view getting a pNFS-RDMA layout as analogous to mmap’ing a local file - the layout mmap’s a file into the RDMA address space of an RC queue pair.
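
In rough C-struct form (again just a sketch; the draft's actual definitions are XDR), the simplified extent would be something like:

    #include <stdint.h>

    typedef uint64_t offset4;
    typedef uint64_t length4;

    /* Purely file-local extent description; the handle(s) covering the
     * extent are delivered by a separate exchange. */
    struct pnfs_rdma_extent_sketch {
            /* re_device_id and re_state as in the draft, omitted here */
            offset4 re_file_offset;   /* where the extent sits in the file */
            length4 re_length;        /* how much of the file it covers */
            /* no re_storage_offset: the registered handle covers exactly
             * this extent, so the client addresses it relative to
             * re_file_offset */
    };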

Exposing the portions of the persistent memory device address space seems to be motivated by a desire to enable client-offload for filling of holes in a sparse file and copy-on-write. However, I question whether these client-offloads are very useful.

Offloaded sparse-hole-filling and copy-on-write are not available to a user-space client for its local persistent memory - the local file system has to provide for them using page-mapping tricks. User-space servers don't have access to these offloads either, unless they implement the entire file system in user space, and so again must rely on page-mapping tricks by the local file system. In either case, writing to a sparse (unallocated) page or copy-on-write of a page in a file in local persistent memory is expected to be a high-latency operation. Given that, why not just have the client send a plain old NFSv4 write to the server when it encounters a copy-on-write extent?

Removing client-offloaded sparse-hole-filling and copy-on-write would considerably simplify the layout, and it makes server implementation possible in a user-space process without the server having to have intimate knowledge of the underlying local persistent memory file system.


Best regards,
-Steve

--
Steve Byan <[hidden email]>
Littleton, MA






Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

hch
In reply to this post by Chuck Lever-2
On Mon, Jul 17, 2017 at 11:49:35AM +0200, Chuck Lever wrote:
> The connection model is critical, because the handles returned
> to clients can work only on one connection (QP). For example,
> you can't assume that NFS/RDMA will be used to access the MDS,
> nor can you assume that the MDS and DS's are accessed through
> the same HCA/RNIC on the same connection.

My crappy prototype wasn't even tested using NFS/RDMA, but the
way it opens this additional backchannel is a mess, so I wanted
to go back to the drawing board before writing anything down -
what I had is definitely not how I want it to look in the end.

> To make it work, there will have to be some way of binding a
> layout (containing handles) to a particular connection to a
> particular storage device.

My original idea was to open explicit new connections for it,
mostly so that the QP won't have to deal with any of the
NFS/RDMA issues and can be used purely for the operations of
the layout.

> Also, if the connection to a DS is lost, more than a
> reconnect is necessary. The client will need to take steps to
> get the server to re-register the memory and send fresh
> handles.

Yes - a connection loss with this layout is an implicit CB_RECALL
with LAYOUTRECALL4_ALL scope.
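
On the client that would look something like the following (a sketch;
the nfs_* calls are placeholders for whatever the client implementation
provides, not existing functions):

    #include <rdma/rdma_cma.h>

    /* Placeholder hooks into the pNFS client - not real functions. */
    extern void nfs_forget_layouts_for_device(const void *deviceid);
    extern void nfs_redo_getdeviceinfo_and_layoutget(const void *deviceid);

    /* Treat the RDMA connection dropping as a recall of every layout
     * that referenced this connection, then start over. */
    static void handle_cm_event(struct rdma_cm_event *ev,
                                const void *deviceid)
    {
            switch (ev->event) {
            case RDMA_CM_EVENT_DISCONNECTED:
            case RDMA_CM_EVENT_DEVICE_REMOVAL:
                    /* handles and registrations are gone with the QP */
                    nfs_forget_layouts_for_device(deviceid);
                    /* reconnect and get fresh handles via new layouts */
                    nfs_redo_getdeviceinfo_and_layoutget(deviceid);
                    break;
            default:
                    break;
            }
    }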


Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

hch
In reply to this post by David Noveck
On Mon, Jul 17, 2017 at 07:56:31AM -0400, David Noveck wrote:
> An alternative is for the MDS to do the bind operation as part of
> assigning the layout, which it can do as long as it knows the connection.
> One possibility is for the client to pass a description of the chosen
> connection as a layout hint.  Another is for the MDS to be able to find
> out the set of possible connections, have it choose one, and let the
> client know, in the layout, which connection is associated with the
> layout.

My idea (although it's not described very well) is that a device_addr4
describes a specific QP.  So by referencing that addr from a layout
we're bound to the specific QP.

> The simplest way to deal with this, although maybe not the best, is to
> consider the connection break as effectively revoking all associated
> layouts.  The client needs to get new layouts to replace the ones lost.

Agreed.


Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

hch
In reply to this post by Steve Byan's Lists
[hi Steve, can you properly break lines after ~75 chars in your mail,
that would make them a lot more readable]

On Tue, Jul 18, 2017 at 05:33:31PM -0400, Steve Byan's Lists wrote:
> As a result, I don't presuppose that there is a pre-existing RDMA reliable
> connection between the rdma layout client and server.

I do not assume that; it's just that my prototype CM code is so bad
that I don't want to document it in its current form.

> 2.3. Device Addressing and Discovery
>
> I think addressing and discovery should support multipathing, to enhance
> availability. So rather than a single netaddr4, I think the
> struct pnfs_rdma_device_addr4 should be defined to contain a
> multipath_list4, as in the File layout.

Memory registrations are bound to a protection domain, which at least
for NFS is generally bound to a specific QP, so simply returning
multiple addresses for interchangeable use might not be a good idea.

It also is a very bad idea for load balancing purposes - I'd much rather
have the MDS control explicitly which layouts go to which QP, to e.g.
steer them to different HCAs.  (And note that nothing in the draft
requires the multiple HCAs used for the RDMA operations to even be in the
same system.)


> Consequently I think the rdma layout should contain only the file offset,
> length and extent state. The client would then obtain the handles using
> a pNFS-RDMA protocol exchange. This is unpalatable, but I think it is
> necessary. Trying to fit it all into the LAYOUTGET confronts a chicken
> or the egg problem for RDMA connection establishment.

Connection establishment is something done at GETDEVICEINFO time,
although that is indeed usually triggered by the first LAYOUTGET.

But the basic idea behind the protocol is that indeed the memory
registration is generally done at LAYOUTGET time.


> The layout definition seems to presuppose a kernel (or at least
> highly-privileged user space) server, because it exposes portions of the
> whole persistent memory device address space via the re_storage_offset
> field in the extent. This means the pNFS-RDMA server must be cognizant
> of the file system on the device.

With RDMA memory registrations this offset is relative to the MR, similar
to the address fields in NVMeoF, SRP or iSER.  If you use the (rather unsafe)
global MR, your above observations are indeed true.  But I would recommend
against such an implementation and instead use safer registration methods at
LAYOUTGET time (e.g. FRs), in which case re_storage_offset is an offset
inside the MR.

And yes, I agree the naming and description need improvement; the
current text is copy-and-paste material from the SCSI layout.

> It seems better to me to model the extent using strictly file-local
> information, i.e. the registered handle is simply that resulting from,
> for example, a user-space server mapping the file, determining its
> sparseness, and registering the non-sparse extents of the file. Thus
> re_storage_offset would not be needed in struct pnfs_rdma_extent4. The
> re_device_id, re_state, re_file_offset, re_length, and a separately
> provided set of re_handles are sufficient. I view getting a pNFS-RDMA
> layout as analogous to mmap’ing a local file - the layout mmap’s a
> file into the RDMA address space of a RC queue pair.

File mappings are very much an implementation detail.  E.g. one of the
scenarios I want to support with this layout is indirect writes,
where the client gets a write buffer that only gets moved into the
file itself by the layoutcommit (or an RDMA FLUSH/COMMIT operation once
standardized).

That would work together with an NFS extension to support O_ATOMIC
out of place updates ala:

https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
https://lwn.net/Articles/715918/

to provide byte level write persistent memory semantics over NFS.

> Offloaded sparse-hole-filling and copy-on-write are not available to a user-space client for its local persistent memory - the local file system has to provide for them using page-mapping tricks. User-space servers don't have offloaded access to the offloads either, unless they implement the entire file system in user space, and so again must rely on page-mapping tricks by the local file system. In either case, writing to a sparse (unallocated) page or copy-on-write of a page in a file in local persistent memory is expected to be a high-latency operation. Given that, why not just have the client send a plain old NFSv4 write to the server when it encounters a copy-on-write extent?

You don't have to hand out a layout for this case, but at least for my
server it's a natural operation that adds no additional latency in
the write path, and very little additional latency in the commit path.

> Removing client-offloaded sparse-hole-filling and copy-on-write would considerably simplify the layout, and it makes server implementation possible in a user-space process without the server having to have intimate knowledge of the underlying local persistent memory file system.

Again, just because the protocol specifies this doesn't mean you have to
implement it.  For example, I've not seen an implementation of this in
the block and SCSI layouts so far, although I'm looking into implementing
it in the future.


Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Tom Talpey-3
In reply to this post by hch
On 7/19/2017 3:48 AM, Christoph Hellwig wrote:
> On Mon, Jul 17, 2017 at 11:49:35AM +0200, Chuck Lever wrote:
>> Also, if the connection to a DS is lost, more than a
>> reconnect is necessary. The client will need to take steps to
>> get the server to re-register the memory and send fresh
>> handles.
>
> Yes - a connection loss with this layout is an implicit CB_RECALL
> with LAYOUTRECALL4_ALL scope.

I'm concerned about making the upper layer statement based on the
lower layer event. The connection loss is experienced by each end
at different times, and a very different state is entered depending
on which end sees it first.

If the client sees the RDMA layout connection loss, it may promptly
reconnect and attempt to reacquire the layout. When the server sees
this, it believes that a layout is already granted. Is the server
required to grant it again? How does it know which old layout to free,
with its queue pair and memory handles, etc?

On the other hand, if the server sees a connection loss, your
"implicit" statement seems to imply it will not recall the layout
on the operation channel. Does the client need to see its layout
connection break to discover this?

Tom.


Re: [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Steve Byan's Lists
In reply to this post by hch

> On Jul 19, 2017, at 4:10 AM, Christoph Hellwig <[hidden email]> wrote:
>
> On Tue, Jul 18, 2017 at 05:33:31PM -0400, Steve Byan's Lists wrote:
>
>> 2.3. Device Addressing and Discovery
>>
>> I think addressing and discovery should support multipathing, to enhance
>> availability. So rather than a single netaddr4, I think the
>> struct pnfs_rdma_device_addr4 should be defined to contain a
>> multipath_list4, as in the File layout.
>
> Memory registrations are bound to a protection domain, which at least
> for NFS is generally bound to a specific QP, so simply returning
> multiple addresses for interchangeable use might not be a good idea.
>
> It also is a very bad idea for load balancing purposes - I'd much rather
> have the MDS control explicitly which layouts go to which QP, to e.g.
> steer them to different HCAs.  (And note that nothing in the draft
> requires the multiple HCAs used for the RDMA operations to even be in the
> same system.)

I think the server may not be the right place for load balancing for
pNFS-RDMA. The pNFS-RDMA client is much more likely to know if it
is experiencing congestion delays than even the server HCA/RNIC, much
less the pNFS-RDMA server software, which is not even involved in the
data transfer path.

Also, if the client is unable to establish a connection to the single address
specified by the layout, how can it request the server to fail over to a
redundant path? Is the server required to hand out a different path
(assuming it has one) the next time the client requests a layout?

>> It seems better to me to model the extent using strictly file-local
>> information, i.e. the registered handle is simply that resulting from,
>> for example, a user-space server mapping the file, determining its
>> sparseness, and registering the non-sparse extents of the file. Thus
>> re_storage_offset would not be needed in struct pnfs_rdma_extent4. The
>> re_device_id, re_state, re_file_offset, re_length, and a separately
>> provided set of re_handles are sufficient. I view getting a pNFS-RDMA
>> layout as analogous to mmap’ing a local file - the layout mmap’s a
>> file into the RDMA address space of a RC queue pair.
>
> File mappings are very much an implementation detail.  E.g. one of the
> scenarios I want to support with this layout is indirect writes,
> where the client gets a write buffer that only gets moved into the
> file itself by the layoutcommit (or an RDMA FLUSH/COMMIT operation once
> standardized).
>
> That would work together with an NFS extension to support O_ATOMIC
> out of place updates ala:
>
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
> https://lwn.net/Articles/715918/
>
> to provide byte level write persistent memory semantics over NFS.

I think using an RDMA RPC for atomic writes might be a better approach.
I’m not convinced that using one-sided RDMA ops is lower latency — the
client still has to send the layoutcommit RPC. Or if one tries to include
the atomic commit semantic in the proposed RDMA FLUSH/COMMIT,
that effectively turns it into an RPC.

Why not just send the data along with the commit, given that you need
an RPC anyway?

Also, forcing the interface to the failure-atomic transaction to look like
copy-on-write is awkward for servers that implement it using undo or
redo logging.

It would be good to see some data on the performance of these
approaches before we bake something into the protocol.

Best regards,
-Steve

--
Steve Byan <[hidden email]>
Littleton, MA




