[PATCH v3 00/19][RFC] virtio-fs: Enable DAX support
by Vivek Goyal
Hi,
This patch series enables DAX support for the virtio-fs filesystem. The
patches are based on the 5.3-rc5 kernel and require the first patch series
posted for virtio-fs support, with the subject "virtio-fs: shared file system
for virtual machines".
https://www.redhat.com/archives/virtio-fs/2019-August/msg00281.html
Enabling DAX seems to improve performance a great deal for most
operations. I reported performance numbers in the first patch
series, so I am not repeating them here.
Any comments or feedback are welcome.
Thanks
Vivek
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (4):
dax: remove block device dependencies
fuse, dax: add fuse_conn->dax_dev field
virtio_fs, dax: Set up virtio_fs dax_device
fuse, dax: add DAX mmap support
Vivek Goyal (12):
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Keep a list of free dax memory ranges
fuse: implement FUSE_INIT map_alignment field
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Maintain a list of busy elements
dax: Create a range version of dax_layout_busy_page()
fuse: Add logic to free up a memory range
fuse: Release file in process context
fuse: Take inode lock for dax inode truncation
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 108 +++
fs/dax.c | 89 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/fuse/cuse.c | 3 +-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 1206 +++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 99 ++-
fs/fuse/inode.c | 138 +++-
fs/fuse/virtio_fs.c | 134 +++-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 12 +-
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 47 +-
include/uapi/linux/virtio_fs.h | 3 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 11 +-
19 files changed, 1868 insertions(+), 53 deletions(-)
--
2.20.1
11 months, 2 weeks
[LSF/MM TOPIC] The end of the DAX experiment
by Dan Williams
Before people get too excited this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. There are 2 primary concerns to
resolve: enumerate the remaining features/fixes, and identify a path
to implement it all without regressing any existing application use
cases.
An enumeration of remaining projects follows, please expand this list
if I missed something:
* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance where the latter aspect of DAX has no current api to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application visible
semantics, the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards incompatible ways.
* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
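For the MAP_SYNC use case in the first item above, a minimal userspace sketch of how an application might request a synchronous mapping and fall back to an ordinary shared mapping when the file is not DAX-capable. The helper name `map_prefer_sync` is hypothetical, and the fallback `#define` values are the generic Linux UAPI values, supplied only in case the libc headers predate them. MAP_DIRECT is only proposed in this topic, so it is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers (generic Linux UAPI values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: map the first len bytes of fd, preferring MAP_SYNC.
 * Sets *got_sync to 1 if the kernel accepted MAP_SYNC, or to 0 if we fell
 * back to a plain MAP_SHARED mapping. Returns MAP_FAILED if both fail.
 */
void *map_prefer_sync(int fd, size_t len, int *got_sync)
{
	void *p;

	assert(got_sync != NULL);

	/* MAP_SYNC is only honoured together with MAP_SHARED_VALIDATE. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*got_sync = 1;
		return p;
	}

	/* Not a DAX file, or a kernel without MAP_SYNC: use the page cache. */
	*got_sync = 0;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

An application that gets `*got_sync == 0` back still has to call msync() or fsync() to make its stores durable, which is the distinction the MAP_SYNC guarantee is about.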
Secondary projects, i.e. important but I would submit are not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granularity smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful.
* Userfaultfd for file-backed mappings and DAX
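For reference, the sigaction() based mechanism mentioned in the badblock item can be sketched as below. This is an illustrative userspace handler, not code from any of the projects discussed; the global variable names and helper names are made up for the example. The kernel reports the extent of the damage through `si_addr_lsb`, so the reported region is 2^si_addr_lsb bytes, which in practice tends to be page-sized or larger, hence the complaint about missing sub-page information.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Last memory-error report seen by the handler (illustrative globals). */
volatile sig_atomic_t last_mce_code;
volatile sig_atomic_t last_mce_lsb;
void *volatile last_mce_addr;

/* Size in bytes of the region described by a given si_addr_lsb value. */
size_t mce_granule_bytes(int addr_lsb)
{
	assert(addr_lsb >= 0 && addr_lsb < (int)(sizeof(size_t) * 8));
	return (size_t)1 << addr_lsb;
}

/* SIGBUS handler: record machine-check reports, ignore other causes. */
void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		last_mce_code = si->si_code;
		last_mce_addr = si->si_addr;
		last_mce_lsb = si->si_addr_lsb;
	}
}

/* Install the handler with SA_SIGINFO so siginfo_t is delivered. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With `si_addr_lsb == 12` the handler learns only that something inside a 4096-byte region is bad; it cannot tell which cache line or sector to repair.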
Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations
1 year, 1 month
[PATCH] Consider namespace with size as active namespace
by Aneesh Kumar K.V
This enables us to mark a namespace as disabled due to pfn_sb
mismatch. We have pending kernel patches that will mark the
namespace disabled when the PAGE_SIZE or struct page size does not
match the value stored in pfn_sb.
We need to make sure we don't use this disabled namespace as seed namespace
for new namespace creation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
ndctl/namespace.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 58a9e3c53474..1f212a2b3a9b 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -455,7 +455,8 @@ static int is_namespace_active(struct ndctl_namespace *ndns)
return ndns && (ndctl_namespace_is_enabled(ndns)
|| ndctl_namespace_get_pfn(ndns)
|| ndctl_namespace_get_dax(ndns)
- || ndctl_namespace_get_btt(ndns));
+ || ndctl_namespace_get_btt(ndns)
+ || ndctl_namespace_get_size(ndns));
}
/*
--
2.21.0
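The effect of the one-line change can be illustrated with a small standalone model. The struct below is a hypothetical stand-in for the state ndctl queries through the ndctl_namespace_*() accessors; it is not the libndctl data structure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the namespace state consulted by the check. */
struct ns_model {
	bool enabled;
	bool has_pfn;
	bool has_dax;
	bool has_btt;
	unsigned long long size;
};

/* The check as it was before the patch. */
bool is_active_before(const struct ns_model *ns)
{
	return ns && (ns->enabled || ns->has_pfn || ns->has_dax || ns->has_btt);
}

/* The check after the patch: a non-zero size also counts as active. */
bool is_active_after(const struct ns_model *ns)
{
	return ns && (ns->enabled || ns->has_pfn || ns->has_dax ||
		      ns->has_btt || ns->size != 0);
}
```

A namespace that the kernel left disabled because of a pfn_sb mismatch, but which still has a size, is inactive under the old check and active under the new one, so it is no longer picked as a seed.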
1 year, 2 months
[PATCH] ndctl: Use the same align value as original namespace on reconfigure
by Aneesh Kumar K.V
When using the reconfigure command to add a `name` to the namespace, we end
up updating the align attribute. Avoid this by using the value from
the original namespace. Do this only if we are keeping the namespace mode
the same.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
ndctl/namespace.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 1f212a2b3a9b..24e51bb35ae1 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -596,6 +596,22 @@ static int validate_namespace_options(struct ndctl_region *region,
return -ENXIO;
}
} else {
+
+ /*
+ * If we are trying to reconfigure with the same namespace mode,
+ * use the align details from the original namespace. Otherwise
+ * pick the align details from the seed namespace.
+ */
+ if (ndns && p->mode == ndctl_namespace_get_mode(ndns)) {
+ struct ndctl_pfn *ns_pfn = ndctl_namespace_get_pfn(ndns);
+ struct ndctl_dax *ns_dax = ndctl_namespace_get_dax(ndns);
+ if (ns_pfn)
+ p->align = ndctl_pfn_get_align(ns_pfn);
+ else if (ns_dax)
+ p->align = ndctl_dax_get_align(ns_dax);
+ else
+ p->align = sysconf(_SC_PAGE_SIZE);
+ } else
/*
* Use the seed namespace alignment as the default if we need
* one. If we don't then use PAGE_SIZE so the size_align
--
2.21.0
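The selection order added by this hunk can be sketched as a standalone function. The enum, struct, and function names here are hypothetical and only model the decision; a return value of 0 stands for "fall through to the existing seed-namespace default".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace modes, for illustration only. */
enum ns_mode { MODE_RAW, MODE_FSDAX, MODE_DEVDAX };

/* Hypothetical model of the original namespace being reconfigured. */
struct ns_align_model {
	enum ns_mode mode;
	bool has_pfn;
	unsigned long pfn_align;
	bool has_dax;
	unsigned long dax_align;
};

/*
 * Returns the alignment to keep on reconfigure, or 0 when the mode changes
 * (or there is no original namespace) and the seed default should apply.
 */
unsigned long reconfigure_align(const struct ns_align_model *orig,
				enum ns_mode requested,
				unsigned long page_size)
{
	if (!orig || orig->mode != requested)
		return 0;
	if (orig->has_pfn)
		return orig->pfn_align;
	if (orig->has_dax)
		return orig->dax_align;
	return page_size;
}
```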
1 year, 3 months
[RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.
by Shiyang Ruan
This patchset aims to take care of this issue to make reflink and dedupe
work correctly in XFS.
It is based on Goldwyn's patchsets: "v4 Btrfs dax support" and "Btrfs
iomap". I picked up some related patches and made a few fixes so that it
basically works.
For dax framework:
1. adapt to the latest change in iomap.
For XFS:
1. report the source address and set IOMAP_COW type for those write
operations that need COW.
2. update extent list at the end.
3. add file contents comparison function based on dax framework.
4. use xfs_break_layouts() to support dax.
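To illustrate why a COW write needs the source address, here is a userspace sketch of the "copy the edges" idea: when a write covers only part of a block, the bytes before and after the written range must be copied from the shared source block into the newly allocated one. This is an assumption about the general technique, written against plain memory buffers; it is not the dax_copy_edges() implementation from the series, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: prepare a freshly allocated destination block for a
 * partial write of len bytes at offset off by copying the untouched head
 * and tail from the shared source block.
 */
void cow_copy_edges(unsigned char *dst, const unsigned char *src,
		    size_t block_size, size_t off, size_t len)
{
	assert(off <= block_size && len <= block_size - off);

	/* Head: bytes [0, off) are not covered by the write. */
	memcpy(dst, src, off);

	/* Tail: bytes [off + len, block_size) are not covered either. */
	memcpy(dst + off + len, src + off + len, block_size - off - len);
}
```

After this step the caller writes the new data into [off, off + len) of the destination, and the extent list can then be updated to point at the new block.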
Goldwyn Rodrigues (3):
dax: replace mmap entry in case of CoW
fs: dedup file range to use a compare function
dax: memcpy before zeroing range
Shiyang Ruan (4):
dax: Introduce dax_copy_edges() for COW.
dax: copy data before write.
xfs: Add COW handle for fsdax.
xfs: Add dedupe support for fsdax.
fs/btrfs/ioctl.c | 11 ++-
fs/dax.c | 203 ++++++++++++++++++++++++++++++++++++++----
fs/iomap.c | 9 +-
fs/ocfs2/file.c | 2 +-
fs/read_write.c | 11 +--
fs/xfs/xfs_iomap.c | 42 +++++----
fs/xfs/xfs_reflink.c | 84 +++++++++--------
include/linux/dax.h | 15 ++--
include/linux/fs.h | 8 +-
include/linux/iomap.h | 6 ++
10 files changed, 294 insertions(+), 97 deletions(-)
--
2.17.0
1 year, 3 months
[RFC v3 00/19] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
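As a rough illustration of the error-handling point, the sketch below uses plain userspace C rather than KUnit's API. The code under test takes its allocator as a parameter, so a test can substitute one that always fails and reach the error path deterministically. All names here are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The dependency is passed in, so a test can replace it. */
typedef void *(*alloc_fn)(size_t);

/* Code under test: allocate and zero a buffer, or report -ENOMEM. */
int make_zeroed_buffer(size_t n, alloc_fn alloc, void **out)
{
	void *p = alloc(n);

	if (!p) {
		*out = NULL;
		return -ENOMEM;	/* the error path we want to exercise */
	}
	memset(p, 0, n);
	*out = p;
	return 0;
}

/* Test double: an allocator that always fails. */
void *failing_alloc(size_t n)
{
	(void)n;
	return NULL;
}
```

Calling `make_zeroed_buffer(8, failing_alloc, &p)` reaches the -ENOMEM branch on every run, with no need to exhaust memory on a real system; passing `malloc` exercises the success path.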
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/4.19/v3
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/4.19/v3 branch.
## Changes Since Last Version
- Changed namespace prefix from `test_*` to `kunit_*` as requested by
Shuah.
- Started converting/cleaning up the device tree unittest to use KUnit.
- Started adding KUnit expectations with custom messages.
--
2.20.0.rc0.387.gc7a69e6b6c-goog
1 year, 4 months
[PATCH v5] mm/nvdimm: Fix endian conversion issues
by Aneesh Kumar K.V
The nd_label->dpa issue was observed when trying to enable a namespace created
with a little-endian kernel on a big-endian kernel. That made me run
`sparse` on the rest of the code; the other changes are the result of that.
Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing")
Fixes: 9dedc73a4658 ("libnvdimm/btt: Fix LBA masking during 'free list' population")
Reviewed-by: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
Changes from V4:
* Rebase to latest kernel
drivers/nvdimm/btt.c | 8 ++++----
drivers/nvdimm/namespace_devs.c | 7 ++++---
2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index a8d56887ec88..3e9f45aec8d1 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -392,9 +392,9 @@ static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
if (++(arena->freelist[lane].seq) == 4)
arena->freelist[lane].seq = 1;
- if (ent_e_flag(ent->old_map))
+ if (ent_e_flag(le32_to_cpu(ent->old_map)))
arena->freelist[lane].has_err = 1;
- arena->freelist[lane].block = le32_to_cpu(ent_lba(ent->old_map));
+ arena->freelist[lane].block = ent_lba(le32_to_cpu(ent->old_map));
return ret;
}
@@ -560,8 +560,8 @@ static int btt_freelist_init(struct arena_info *arena)
* FIXME: if error clearing fails during init, we want to make
* the BTT read-only
*/
- if (ent_e_flag(log_new.old_map) &&
- !ent_normal(log_new.old_map)) {
+ if (ent_e_flag(le32_to_cpu(log_new.old_map)) &&
+ !ent_normal(le32_to_cpu(log_new.old_map))) {
arena->freelist[i].has_err = 1;
ret = arena_clear_freelist_error(arena, i);
if (ret)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index a9c76df12cb9..f779cb2b0c69 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1987,7 +1987,7 @@ static struct device *create_namespace_pmem(struct nd_region *nd_region,
nd_mapping = &nd_region->mapping[i];
label_ent = list_first_entry_or_null(&nd_mapping->labels,
typeof(*label_ent), list);
- label0 = label_ent ? label_ent->label : 0;
+ label0 = label_ent ? label_ent->label : NULL;
if (!label0) {
WARN_ON(1);
@@ -2322,8 +2322,9 @@ static struct device **scan_labels(struct nd_region *nd_region)
continue;
/* skip labels that describe extents outside of the region */
- if (nd_label->dpa < nd_mapping->start || nd_label->dpa > map_end)
- continue;
+ if (__le64_to_cpu(nd_label->dpa) < nd_mapping->start ||
+ __le64_to_cpu(nd_label->dpa) > map_end)
+ continue;
i = add_namespace_resource(nd_region, nd_label, devs, count);
if (i < 0)
--
2.21.0
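The reason the order of conversion and masking matters in the btt.c hunks can be shown with a portable userspace simulation. The mask values below are an assumption about the BTT map-entry layout (bit 30 as the error flag, low 30 bits as the LBA), and the function names are made up for the example. On a big-endian CPU, le32_to_cpu() is a byte swap, which is modelled here with an explicit `bswap32()`.

```c
#include <stdint.h>

/* Assumed map-entry layout: bit 30 = error flag, low 30 bits = LBA. */
#define ERR_FLAG 0x40000000u
#define LBA_MASK 0x3fffffffu

/* Byte swap, i.e. what le32_to_cpu() does on a big-endian CPU. */
uint32_t bswap32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v >> 8) & 0x0000ff00u) | (v >> 24);
}

/* The value a big-endian CPU sees when loading the raw on-media bytes. */
uint32_t raw_load_on_big_endian(const uint8_t b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Old order: mask the unconverted value, then convert. */
uint32_t lba_mask_then_convert(uint32_t raw)
{
	return bswap32(raw & LBA_MASK);
}

/* Patched order: convert first, then mask. */
uint32_t lba_convert_then_mask(uint32_t raw)
{
	return bswap32(raw) & LBA_MASK;
}

/* Old flag test on the unconverted value. */
int err_flag_unconverted(uint32_t raw)
{
	return !!(raw & ERR_FLAG);
}

/* Patched flag test on the converted value. */
int err_flag_converted(uint32_t raw)
{
	return !!(bswap32(raw) & ERR_FLAG);
}
```

For an on-media entry with the error flag set and LBA 5 (bytes 05 00 00 40), the old order yields an LBA of 0x40000005 and misses the error flag, while the patched order yields LBA 5 with the flag detected. On a little-endian CPU both orders agree, which is why the bug only showed up on big-endian.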
1 year, 4 months
[RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
by ira.weiny@intel.com
From: Ira Weiny <ira.weiny(a)intel.com>
Pre-requisites
==============
Based on mmotm tree.
Based on the feedback from LSFmm, the LWN article, the RFC series since
then, and a ton of scenarios I've worked in my mind and/or tested...[1]
Solution summary
================
The real issue is that there is no use case for a user to have RDMA pinned
memory which is then truncated. So really any solution we present which:
A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong
Should be an acceptable solution.
Because this is slightly new behavior, and because it is going to be
specific to DAX (because of the lack of a page cache), we have made the user
"opt in" to this behavior.
The following patches implement the following solution.
0) Registrations to Device DAX char devs are not affected
1) The user has to opt in to allowing page pins on a file with an exclusive
layout lease. Both exclusive and layout lease flags are user visible now.
2) Page pins will fail if the lease is not active when the file-backed page is
encountered.
3) Any truncate or hole punch operation on a pinned DAX page will fail.
4) The user has the option of holding the lease or releasing it. If they
release it no other pin calls will work on the file.
5) Closing the file is ok.
6) Unmapping the file is ok
7) Pins against the files are tracked back to an owning file or an owning mm
depending on the internal subsystem needs. With RDMA there is an owning
file which is related to the pinned file.
8) Only RDMA is currently supported
9) Truncation of pages which are not actively pinned nor covered by a lease
will succeed.
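A few of the rules above (2, 3, 4 and 9) can be restated as a toy state model. This is a hypothetical userspace sketch for reading the rules, not the kernel implementation: it tracks one file with a single lease flag and a pin count, ignores page ranges, and does not model what happens when a truncate meets a lease that has no pins.

```c
#include <stdbool.h>

/* Hypothetical per-file model: is a layout lease held, how many pins. */
struct dax_file_model {
	bool lease_held;
	unsigned int pins;
};

/* Rule 2: a pin succeeds only while the lease is active. */
bool model_pin(struct dax_file_model *f)
{
	if (!f->lease_held)
		return false;
	f->pins++;
	return true;
}

/* Drop one pin, if any. */
void model_unpin(struct dax_file_model *f)
{
	if (f->pins > 0)
		f->pins--;
}

/* Rule 4: the lease may be released; existing pins stay in place. */
void model_release_lease(struct dax_file_model *f)
{
	f->lease_held = false;
}

/* Rules 3 and 9: truncate fails while pinned, succeeds otherwise. */
bool model_truncate(const struct dax_file_model *f)
{
	return f->pins == 0;
}
```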
Reporting of pinned files in procfs
===================================
A number of alternatives were explored for how to report the file pins within
procfs. The following incorporates ideas from Jan Kara, Jason Gunthorpe, Dave
Chinner, Dan Williams and myself.
A new entry is added to procfs
/proc/<pid>/file_pins
For processes which have pinned DAX file memory, file_pins references come in 2
flavors: those which are attached to another open file descriptor (for example,
what is done in the RDMA subsystem) and those which are attached to a process
mm.
For those which are attached to another open file descriptor (such as RDMA)
the file pin references go through the 'struct file' associated with that pin.
In RDMA this is the RDMA context struct file.
The resulting output from proc fs is something like.
$ cat /proc/<pid>/file_pins
3: /dev/infiniband/uverbs0
/mnt/pmem/foo
Where '3' is the file descriptor (and file path) of the rdma context within the
process. The paths of the files pinned using that context are then listed.
RDMA contexts may have multiple MR each of which may have multiple files pinned
within them. So an output like the following is possible.
$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
/mnt/pmem/foo
/mnt/pmem/bar
/mnt/pmem/another
/mnt/pmem/one
The actual memory regions associated with the file pins are not reported.
For processes which are pinning memory which is not associated with a specific
file descriptor memory pins are reported directly as paths to the file.
$ cat /proc/<pid>/file_pins
/mnt/pmem/foo
Putting the above together if a process was using RDMA and another subsystem
the output could be something like:
$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
/mnt/pmem/foo
/mnt/pmem/bar
/mnt/pmem/another
/mnt/pmem/one
/mnt/pmem/foo
/mnt/pmem/another
/mnt/pmem/mm_mapped_file
[1] https://lkml.org/lkml/2019/6/5/1046
Background
==========
It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP). There are 2 big reasons this may not work.
1) The hardware being used for RDMA may not support ODP
2) ODP may be detrimental to the over all network (cluster or cloud)
performance
Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.
1) "longterm" GUP users need to inform other subsystems that they have taken a
pin on a page which may remain pinned for a very "long time". The
definition of long time is debatable but it has been established that RDMA's
use of pages for minutes, hours, or even days after the pin is the extreme
case which makes this problem most severe.
2) Any page which is "controlled" by a file system needs to have special
handling. The details of the handling depend on whether the page is page cache
fronted or not.
2a) A page cache fronted page which has been pinned by GUP long term can use a
bounce buffer to allow the file system to write back snap shots of the page.
This is handled by the FS recognizing the GUP long term pin and making a copy
of the page to be written back.
NOTE: this patch set does not address this path.
2b) A FS "controlled" page which is not page cache fronted is either easier
to deal with or harder depending on the operation the filesystem is trying
to do.
2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
FS can no longer use the pages in question until the pin has been
removed. This patch set presents a solution to this by introducing
some reasonable restrictions on user space applications.
2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
then there is nothing which need be done. Data is Read or Written
directly to the page. This is an easy case which would currently work
if not for GUP long term pins being disabled. Therefore this patch set
need not change access to the file data but does allow for GUP pins
after 2ba above is dealt with.
This patch series presents a solution for problem 2ba).
Ira Weiny (19):
fs/locks: Export F_LAYOUT lease to user space
fs/locks: Add Exclusive flag to user Layout lease
mm/gup: Pass flags down to __gup_device_huge* calls
mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
fs/ext4: Teach ext4 to break layout leases
fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
fs/xfs: Teach xfs to use new dax_layout_busy_page()
fs/xfs: Fail truncate if page lease can't be broken
mm/gup: Introduce vaddr_pin structure
mm/gup: Pass a NULL vaddr_pin through GUP fast
mm/gup: Pass follow_page_context further down the call stack
mm/gup: Prep put_user_pages() to take an vaddr_pin struct
{mm,file}: Add file_pins objects
fs/locks: Associate file pins while performing GUP
mm/gup: Introduce vaddr_pin_pages()
RDMA/uverbs: Add back pointer to system file object
RDMA/umem: Convert to vaddr_[pin|unpin]* operations.
{mm,procfs}: Add display file_pins proc
mm/gup: Remove FOLL_LONGTERM DAX exclusion
drivers/infiniband/core/umem.c | 26 +-
drivers/infiniband/core/umem_odp.c | 16 +-
drivers/infiniband/core/uverbs.h | 1 +
drivers/infiniband/core/uverbs_main.c | 1 +
fs/Kconfig | 1 +
fs/dax.c | 38 ++-
fs/ext4/ext4.h | 2 +-
fs/ext4/extents.c | 6 +-
fs/ext4/inode.c | 26 +-
fs/file_table.c | 4 +
fs/locks.c | 291 +++++++++++++++++-
fs/proc/base.c | 214 +++++++++++++
fs/xfs/xfs_file.c | 21 +-
fs/xfs/xfs_inode.h | 5 +-
fs/xfs/xfs_ioctl.c | 15 +-
fs/xfs/xfs_iops.c | 14 +-
include/linux/dax.h | 12 +-
include/linux/file.h | 49 +++
include/linux/fs.h | 5 +-
include/linux/huge_mm.h | 17 --
include/linux/mm.h | 69 +++--
include/linux/mm_types.h | 2 +
include/rdma/ib_umem.h | 2 +-
include/uapi/asm-generic/fcntl.h | 5 +
kernel/fork.c | 3 +
mm/gup.c | 418 ++++++++++++++++----------
mm/huge_memory.c | 18 +-
mm/internal.h | 28 ++
28 files changed, 1048 insertions(+), 261 deletions(-)
--
2.20.1
1 year, 4 months
[PATCH v6 0/7] Mark the namespace disabled on pfn superblock mismatch
by Aneesh Kumar K.V
We add new members to pfn superblock (PAGE_SIZE and struct page size) in this series.
This is now checked while initializing the namespace. If we find a mismatch we mark
the namespace disabled.
This series also handles configs where hugepage support is not enabled by default.
This can result in different align restrictions for dax namespace. We mark the
dax namespace disabled if we find the alignment not supported.
Changes from v5:
* Split patch 3
* Update commit message
* Add MAX_STRUCT_PAGE_SIZE with value 64 and use that when allocating reserve block
* Add BUILD_BUG_ON if we find sizeof(struct page) > 64
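The two checks described in this cover letter can be sketched in standalone C11. The struct layouts and field names below are hypothetical stand-ins (the real pfn superblock and struct page are kernel-internal), and `_Static_assert` plays the role that BUILD_BUG_ON plays in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_STRUCT_PAGE_SIZE 64

/* Hypothetical stand-in for the kernel's struct page. */
struct page_model {
	unsigned char bytes[56];
};

/* Build-time check: fail compilation if the structure outgrows the limit. */
_Static_assert(sizeof(struct page_model) <= MAX_STRUCT_PAGE_SIZE,
	       "struct page grew beyond MAX_STRUCT_PAGE_SIZE");

/* Hypothetical view of the two new pfn superblock members. */
struct pfn_sb_model {
	uint32_t page_size;
	uint16_t page_struct_size;
};

/* Probe-time check: the namespace is usable only if both values match. */
bool pfn_sb_matches(const struct pfn_sb_model *sb,
		    uint32_t page_size, uint16_t page_struct_size)
{
	return sb->page_size == page_size &&
	       sb->page_struct_size == page_struct_size;
}
```

A namespace created on a 64K-page kernel would fail this check on a 4K-page kernel, which is the case where the series marks the namespace disabled.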
Aneesh Kumar K.V (6):
libnvdimm/pmem: Advance namespace seed for specific probe errors
libnvdimm/pfn_dev: Add a build check to make sure we notice when
struct page size change
libnvdimm/pfn_dev: Add page size and struct page size to pfn
superblock
libnvdimm/label: Remove the dpa align check
libnvdimm: Use PAGE_SIZE instead of SZ_4K for align check
libnvdimm/dax: Pick the right alignment default when creating dax
devices
Dan Williams (1):
libnvdimm/region: Rewrite _probe_success() to _advance_seeds()
arch/powerpc/include/asm/libnvdimm.h | 9 ++++
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/nvdimm.c | 34 +++++++++++++
arch/x86/include/asm/libnvdimm.h | 19 +++++++
drivers/nvdimm/bus.c | 8 ++-
drivers/nvdimm/label.c | 5 --
drivers/nvdimm/namespace_devs.c | 40 +++++++++++----
drivers/nvdimm/nd-core.h | 3 +-
drivers/nvdimm/nd.h | 10 ++--
drivers/nvdimm/pfn.h | 5 +-
drivers/nvdimm/pfn_devs.c | 67 ++++++++++++++++++++++--
drivers/nvdimm/pmem.c | 29 +++++++++--
drivers/nvdimm/region_devs.c | 76 +++++-----------------------
include/linux/huge_mm.h | 7 ++-
14 files changed, 215 insertions(+), 98 deletions(-)
create mode 100644 arch/powerpc/include/asm/libnvdimm.h
create mode 100644 arch/powerpc/mm/nvdimm.c
create mode 100644 arch/x86/include/asm/libnvdimm.h
--
2.21.0
1 year, 4 months
Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
by Dave Chinner
On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
>
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to work out.
> >
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process. All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
>
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.
AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and its
resources.
> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.
That still doesn't work. Leases are not individually trackable or
reference counted objects - they are attached to a struct
file but, in reality, they are far more restricted than a struct
file.
That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications. i.e. the layout break signal gets delivered to
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.
So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with
the new fd in the new process and so layout break signals can't be
directed at the lease fd....
This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that's where signals are sent when the lease needs to be broken.
So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range....
Cheers,
Dave.
--
Dave Chinner
david(a)fromorbit.com
1 year, 4 months