[PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
by Joao Martins
Hey,
This series attempts to minimize 'struct page' overhead by
pursuing a similar approach to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.
[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@byteda...
The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, particularly the area which only describes tail pages.
So a vmemmap page describes 64 struct pages, and the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
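The page counts quoted above can be reproduced with a small stand-alone calculation. This is only a sketch under stated assumptions: 4K base pages and a 64-byte 'struct page' (implied by "a vmemmap page describes 64 struct pages"), with the head vmemmap page and one tail-only vmemmap page kept per compound page. The helper names are hypothetical and not taken from the series.

```c
#include <assert.h>

#define BASE_PAGE_SIZE   4096UL  /* assumption: 4K base pages */
#define STRUCT_PAGE_SIZE 64UL    /* assumption: 64-byte struct page */

static_assert(BASE_PAGE_SIZE % STRUCT_PAGE_SIZE == 0,
	      "a vmemmap page must hold a whole number of struct pages");

/* vmemmap pages needed to describe one compound page of `size` bytes */
unsigned long vmemmap_pages(unsigned long size)
{
	unsigned long nr_struct_pages = size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/*
 * vmemmap pages saved when only two are kept: the one holding the head
 * page and a single tail-only page that is reused for the remainder.
 */
unsigned long vmemmap_pages_saved(unsigned long size)
{
	unsigned long total = vmemmap_pages(size);

	return total > 2 ? total - 2 : 0;
}
```

With these assumptions a 2M page needs 8 vmemmap pages and saves 6, and a 1G page needs 4096 and saves 4094, matching the figures above.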
This series also goes one step further on 1GB pages and *also* reuses PMD pages
which only contain tail pages, which allows keeping parity with the current
hugepage-based memmap. This further lets us more than halve the overhead with
1GB pages (40M -> 16M per Tb).
In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound pagemap:
* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)
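The 2M figure can be checked with a back-of-the-envelope calculation. The sketch below assumes 4K base pages, a 64-byte 'struct page', and two vmemmap pages kept per compound page; the function names are made up for illustration. It deliberately does not model the 1G case, whose 16MB figure also depends on the PMD reuse described above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES        UINT64_C(4096)  /* assumption: 4K base pages */
#define STRUCT_PAGE_BYTES UINT64_C(64)    /* assumption: 64-byte struct page */

static_assert(PAGE_BYTES % STRUCT_PAGE_BYTES == 0, "unexpected struct page size");

/* memmap bytes for `mem` bytes of memory, one struct page per base page */
uint64_t memmap_bytes_base(uint64_t mem)
{
	return mem / PAGE_BYTES * STRUCT_PAGE_BYTES;
}

/*
 * memmap bytes when every compound page of `align` bytes keeps only two
 * vmemmap pages (head page plus one reused tail-only page).
 */
uint64_t memmap_bytes_compound(uint64_t mem, uint64_t align)
{
	return mem / align * 2 * PAGE_BYTES;
}
```

For 1Tb this yields 16G without reuse and 4G with 2M compound pages, i.e. roughly 1.5% versus 0.39% of total memory.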
Along the way I've extended it past 'struct page' overhead, *trying* to address a
few performance issues we knew about for pmem, specifically on
{pin,get}_user_pages_fast with device-dax vmas, which are really
slow even in the fast variants. THP is great on -fast variants but all except
hugetlbfs perform rather poorly on non-fast gup. I have deferred the
__get_user_pages() improvements, though (to a follow-up series I have stashed,
as it's orthogonal to device-dax: THP suffers from the same syndrome).
So to summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function of prep_compound_page() into head
and tail and use those two helpers where appropriate to take advantage of caches
being warm after __init_single_page(). Since RFC this also lets us further speed
up from 190ms down to 80ms init time.
Patches 5-10: Much like Muchun's series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely @align, as it is referred to by the
remaining memremap/dax code) and enable memremap to initialize the ZONE_DEVICE
pages as compound pages of a given @align order. The main difference, though, is
that contrary to the hugetlbfs series, there's no vmemmap for the area, because we
are populating it as opposed to remapping it. IOW there is no freeing of pages of
already initialized vmemmap as in the hugetlbfs case, which simplifies the
logic (besides not being arch-specific). After these, there's a quite visible
improvement in region bootstrap of pmem memmap, given that we would initialize
fewer struct pages depending on the page size with DRAM backed struct pages.
altmap sees no difference in bootstrap.
NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100/<1ms on 128G NVDIMMs
with 2M and 1G respectively.
Patch 11: Optimize grabbing page refcount changes given that we
are working with compound pages, i.e. we do 1 increment to the head
page for a given set of N subpages as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:
                                                  before     after
(16G get_user_pages_fast 2M page size)            ~59 ms  -> ~6.1 ms
(16G pin_user_pages_fast 2M page size)            ~87 ms  -> ~6.2 ms
(16G get_user_pages_fast altmap 2M page size)     ~494 ms -> ~9 ms
(16G pin_user_pages_fast altmap 2M page size)     ~494 ms -> ~10 ms
altmap performance gets especially interesting when pinning a pmem dimm:
                                                  before     after
(128G get_user_pages_fast 2M page size)           ~492 ms -> ~49 ms
(128G pin_user_pages_fast 2M page size)           ~493 ms -> ~50 ms
(128G get_user_pages_fast altmap 2M page size)    ~3.91 ms -> ~70 ms
(128G pin_user_pages_fast altmap 2M page size)    ~3.97 ms -> ~74 ms
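The idea behind Patch 11 (one refcount update on the head page instead of N individual writes) can be illustrated with a user-space toy model. This is not the mm/gup.c code: 'struct toy_page' and both helpers are hypothetical, and C11 atomics stand in for the kernel's page refcount primitives.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy stand-in for a compound head page; not the kernel's struct page. */
struct toy_page {
	atomic_int refcount;
};

/* One atomic write per subpage. Returns the number of atomic operations. */
int grab_refs_per_subpage(struct toy_page *head, int nr_subpages)
{
	int ops = 0;

	assert(head && nr_subpages >= 0);
	for (int i = 0; i < nr_subpages; i++) {
		atomic_fetch_add(&head->refcount, 1);
		ops++;
	}
	return ops;
}

/* Single atomic write covering all subpages of the compound page. */
int grab_refs_batched(struct toy_page *head, int nr_subpages)
{
	assert(head && nr_subpages >= 0);
	atomic_fetch_add(&head->refcount, nr_subpages);
	return 1;
}
```

Both variants leave the head refcount at the same value; for a 2M compound page with 512 subpages the batched variant issues 1 atomic operation instead of 512, which is the kind of reduction the gup_test numbers above reflect.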
The unpinning improvement patches are in mmotm/linux-next so removed from this
series.
I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@o...),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow-up to this series.
Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
Comments and suggestions very much appreciated!
Changelog,
RFC -> v1:
(New patches 1-3, 5-8 but the diffstat isn't that different)
* Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
* Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
* Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
* Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
* Remove the gup_device_compound_huge special path and have the same code
work both ways while special casing when devmap page is compound (Jason, John)
* Avoid usage of vmemmap_populate_basepages() and introduce a first class
loop that doesn't care about passing an altmap for memmap reuse. (Dan)
* Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
hack into passing block across sparse_add_section calls. It's a lot easier to
follow and more explicit in what it does.
* Replace the vmemmap refactoring with adding a @pgmap argument and moving
parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
* Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
* Improve memmap_init_zone_device() to initialize compound pages when
struct pages are cache warm. That led to an even further speedup over the
RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
as a result (Dan)
* Remove PGMAP_COMPOUND and use @align as the property to detect whether
or not to reuse vmemmap areas (Dan)
Thanks,
Joao
Joao Martins (11):
memory-failure: fetch compound_head after pgmap_pfn_valid()
mm/page_alloc: split prep_compound_page into head and tail subparts
mm/page_alloc: refactor memmap_init_zone_device() page init
mm/memremap: add ZONE_DEVICE support for compound pages
mm/sparse-vmemmap: add a pgmap argument to section activation
mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
mm/sparse-vmemmap: populate compound pagemaps
mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
mm/page_alloc: reuse tail struct pages for compound pagemaps
device-dax: compound pagemap support
mm/gup: grab head page refcount once for group of subpages
drivers/dax/device.c | 58 +++++++--
include/linux/memory_hotplug.h | 5 +-
include/linux/memremap.h | 13 ++
include/linux/mm.h | 8 +-
mm/gup.c | 52 +++++---
mm/memory-failure.c | 2 +
mm/memory_hotplug.c | 3 +-
mm/memremap.c | 9 +-
mm/page_alloc.c | 126 +++++++++++++------
mm/sparse-vmemmap.c | 221 +++++++++++++++++++++++++++++----
mm/sparse.c | 24 ++--
11 files changed, 406 insertions(+), 115 deletions(-)
--
2.17.1
[PATCH] MAINTAINERS: Move nvdimm mailing list
by Dan Williams
After seeing some users have subscription management trouble, more spam
than other Linux development lists, and considering some of the benefits
of kernel.org hosted lists, nvdimm and persistent memory development is
moving to nvdimm(a)lists.linux.dev.
The old list will remain up until v5.14-rc1 and shutdown thereafter.
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Oliver O'Halloran <oohall(a)gmail.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
Documentation/ABI/obsolete/sysfs-class-dax | 2 +
Documentation/ABI/removed/sysfs-bus-nfit | 2 +
Documentation/ABI/testing/sysfs-bus-nfit | 40 +++++++++++++------------
Documentation/ABI/testing/sysfs-bus-papr-pmem | 4 +--
Documentation/driver-api/nvdimm/nvdimm.rst | 2 +
MAINTAINERS | 14 ++++-----
6 files changed, 32 insertions(+), 32 deletions(-)
diff --git a/Documentation/ABI/obsolete/sysfs-class-dax b/Documentation/ABI/obsolete/sysfs-class-dax
index 0faf1354cd05..5bcce27458e3 100644
--- a/Documentation/ABI/obsolete/sysfs-class-dax
+++ b/Documentation/ABI/obsolete/sysfs-class-dax
@@ -1,7 +1,7 @@
What: /sys/class/dax/
Date: May, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description: Device DAX is the device-centric analogue of Filesystem
DAX (CONFIG_FS_DAX). It allows memory ranges to be
allocated and mapped without need of an intervening file
diff --git a/Documentation/ABI/removed/sysfs-bus-nfit b/Documentation/ABI/removed/sysfs-bus-nfit
index ae8c1ca53828..277437005def 100644
--- a/Documentation/ABI/removed/sysfs-bus-nfit
+++ b/Documentation/ABI/removed/sysfs-bus-nfit
@@ -1,7 +1,7 @@
What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
Date: Aug, 2017
KernelVersion: v4.14 (Removed v4.18)
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Size of a write request to a DIMM that will not incur a
read-modify-write cycle at the memory controller.
diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit
index 63ef0b9ecce7..e7282d184a74 100644
--- a/Documentation/ABI/testing/sysfs-bus-nfit
+++ b/Documentation/ABI/testing/sysfs-bus-nfit
@@ -5,7 +5,7 @@ Interface Table (NFIT)' section in the ACPI specification
What: /sys/bus/nd/devices/nmemX/nfit/serial
Date: Jun, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Serial number of the NVDIMM (non-volatile dual in-line
memory module), assigned by the module vendor.
@@ -14,7 +14,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/handle
Date: Apr, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) The address (given by the _ADR object) of the device on its
parent bus of the NVDIMM device containing the NVDIMM region.
@@ -23,7 +23,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/device
Date: Apr, 2015
KernelVersion: v4.1
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Device id for the NVDIMM, assigned by the module vendor.
@@ -31,7 +31,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/rev_id
Date: Jun, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Revision of the NVDIMM, assigned by the module vendor.
@@ -39,7 +39,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/phys_id
Date: Apr, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Handle (i.e., instance number) for the SMBIOS (system
management BIOS) Memory Device structure describing the NVDIMM
@@ -49,7 +49,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/flags
Date: Jun, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) The flags in the NFIT memory device sub-structure indicate
the state of the data on the nvdimm relative to its energy
@@ -68,7 +68,7 @@ What: /sys/bus/nd/devices/nmemX/nfit/format1
What: /sys/bus/nd/devices/nmemX/nfit/formats
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) The interface codes indicate support for persistent memory
mapped directly into system physical address space and / or a
@@ -84,7 +84,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/vendor
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Vendor id of the NVDIMM.
@@ -92,7 +92,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/dsm_mask
Date: May, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) The bitmask indicates the supported device specific control
functions relative to the NVDIMM command family supported by the
@@ -102,7 +102,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/family
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Displays the NVDIMM family command sets. Values
0, 1, 2 and 3 correspond to NVDIMM_FAMILY_INTEL,
@@ -118,7 +118,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/id
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) ACPI specification 6.2 section 5.2.25.9, defines an
identifier for an NVDIMM, which refelects the id attribute.
@@ -127,7 +127,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/subsystem_vendor
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Sub-system vendor id of the NVDIMM non-volatile memory
subsystem controller.
@@ -136,7 +136,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/subsystem_rev_id
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Sub-system revision id of the NVDIMM non-volatile memory subsystem
controller, assigned by the non-volatile memory subsystem
@@ -146,7 +146,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/nfit/subsystem_device
Date: Apr, 2016
KernelVersion: v4.7
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) Sub-system device id for the NVDIMM non-volatile memory
subsystem controller, assigned by the non-volatile memory
@@ -156,7 +156,7 @@ Description:
What: /sys/bus/nd/devices/ndbusX/nfit/revision
Date: Jun, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) ACPI NFIT table revision number.
@@ -164,7 +164,7 @@ Description:
What: /sys/bus/nd/devices/ndbusX/nfit/scrub
Date: Sep, 2016
KernelVersion: v4.9
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RW) This shows the number of full Address Range Scrubs (ARS)
that have been completed since driver load time. Userspace can
@@ -177,7 +177,7 @@ Description:
What: /sys/bus/nd/devices/ndbusX/nfit/hw_error_scrub
Date: Sep, 2016
KernelVersion: v4.9
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RW) Provides a way to toggle the behavior between just adding
the address (cache line) where the MCE happened to the poison
@@ -196,7 +196,7 @@ Description:
What: /sys/bus/nd/devices/ndbusX/nfit/dsm_mask
Date: Jun, 2017
KernelVersion: v4.13
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) The bitmask indicates the supported bus specific control
functions. See the section named 'NVDIMM Root Device _DSMs' in
@@ -205,7 +205,7 @@ Description:
What: /sys/bus/nd/devices/ndbusX/nfit/firmware_activate_noidle
Date: Apr, 2020
KernelVersion: v5.8
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RW) The Intel platform implementation of firmware activate
support exposes an option let the platform force idle devices in
@@ -225,7 +225,7 @@ Description:
What: /sys/bus/nd/devices/regionX/nfit/range_index
Date: Jun, 2015
KernelVersion: v4.2
-Contact: linux-nvdimm(a)lists.01.org
+Contact: nvdimm(a)lists.linux.dev
Description:
(RO) A unique number provided by the BIOS to identify an address
range. Used by NVDIMM Region Mapping Structure to uniquely refer
diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 8316c33862a0..92e2db0e2d3d 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -1,7 +1,7 @@
What: /sys/bus/nd/devices/nmemX/papr/flags
Date: Apr, 2020
KernelVersion: v5.8
-Contact: linuxppc-dev <linuxppc-dev(a)lists.ozlabs.org>, linux-nvdimm(a)lists.01.org,
+Contact: linuxppc-dev <linuxppc-dev(a)lists.ozlabs.org>, nvdimm(a)lists.linux.dev,
Description:
(RO) Report flags indicating various states of a
papr-pmem NVDIMM device. Each flag maps to a one or
@@ -36,7 +36,7 @@ Description:
What: /sys/bus/nd/devices/nmemX/papr/perf_stats
Date: May, 2020
KernelVersion: v5.9
-Contact: linuxppc-dev <linuxppc-dev(a)lists.ozlabs.org>, linux-nvdimm(a)lists.01.org,
+Contact: linuxppc-dev <linuxppc-dev(a)lists.ozlabs.org>, nvdimm(a)lists.linux.dev,
Description:
(RO) Report various performance stats related to papr-scm NVDIMM
device. Each stat is reported on a new line with each line
diff --git a/Documentation/driver-api/nvdimm/nvdimm.rst b/Documentation/driver-api/nvdimm/nvdimm.rst
index ef6d59e0978e..1d8302b89bd4 100644
--- a/Documentation/driver-api/nvdimm/nvdimm.rst
+++ b/Documentation/driver-api/nvdimm/nvdimm.rst
@@ -4,7 +4,7 @@ LIBNVDIMM: Non-Volatile Devices
libnvdimm - kernel / libndctl - userspace helper library
-linux-nvdimm(a)lists.01.org
+nvdimm(a)lists.linux.dev
Version 13
diff --git a/MAINTAINERS b/MAINTAINERS
index 9450e052f1b1..4d18fa67f71b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5146,7 +5146,7 @@ DEVICE DIRECT ACCESS (DAX)
M: Dan Williams <dan.j.williams(a)intel.com>
M: Vishal Verma <vishal.l.verma(a)intel.com>
M: Dave Jiang <dave.jiang(a)intel.com>
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
F: drivers/dax/
@@ -6887,7 +6887,7 @@ M: Dan Williams <dan.j.williams(a)intel.com>
R: Matthew Wilcox <willy(a)infradead.org>
R: Jan Kara <jack(a)suse.cz>
L: linux-fsdevel(a)vger.kernel.org
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
F: fs/dax.c
F: include/linux/dax.h
@@ -10146,7 +10146,7 @@ LIBNVDIMM BLK: MMIO-APERTURE DRIVER
M: Dan Williams <dan.j.williams(a)intel.com>
M: Vishal Verma <vishal.l.verma(a)intel.com>
M: Dave Jiang <dave.jiang(a)intel.com>
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
P: Documentation/nvdimm/maintainer-entry-profile.rst
@@ -10157,7 +10157,7 @@ LIBNVDIMM BTT: BLOCK TRANSLATION TABLE
M: Vishal Verma <vishal.l.verma(a)intel.com>
M: Dan Williams <dan.j.williams(a)intel.com>
M: Dave Jiang <dave.jiang(a)intel.com>
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
P: Documentation/nvdimm/maintainer-entry-profile.rst
@@ -10167,7 +10167,7 @@ LIBNVDIMM PMEM: PERSISTENT MEMORY DRIVER
M: Dan Williams <dan.j.williams(a)intel.com>
M: Vishal Verma <vishal.l.verma(a)intel.com>
M: Dave Jiang <dave.jiang(a)intel.com>
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
P: Documentation/nvdimm/maintainer-entry-profile.rst
@@ -10175,7 +10175,7 @@ F: drivers/nvdimm/pmem*
LIBNVDIMM: DEVICETREE BINDINGS
M: Oliver O'Halloran <oohall(a)gmail.com>
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
F: Documentation/devicetree/bindings/pmem/pmem-region.txt
@@ -10186,7 +10186,7 @@ M: Dan Williams <dan.j.williams(a)intel.com>
M: Vishal Verma <vishal.l.verma(a)intel.com>
M: Dave Jiang <dave.jiang(a)intel.com>
M: Ira Weiny <ira.weiny(a)intel.com>
-L: linux-nvdimm(a)lists.01.org
+L: nvdimm(a)lists.linux.dev
S: Supported
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
P: Documentation/nvdimm/maintainer-entry-profile.rst
[PATCH 1/4] libndctl: Unify adding dimms for papr and nfit families
by Santosh Sivaraj
In preparation for enabling tests on non-nfit devices, unify both, already very
similar, functions into one. This will help in adding all attributes needed for
the unit tests. Since the function doesn't fail if some of the dimm attributes
are missing, this will work fine on PAPR platforms though only part of the DIMM
attributes are provided (This doesn't mean that all of the DIMM attributes can
be missing).
Signed-off-by: Santosh Sivaraj <santosh(a)fossix.org>
---
ndctl/lib/libndctl.c | 103 ++++++++++++++++---------------------------
1 file changed, 38 insertions(+), 65 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index 36fb6fe..26b9317 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -1646,41 +1646,9 @@ static int ndctl_bind(struct ndctl_ctx *ctx, struct kmod_module *module,
static int ndctl_unbind(struct ndctl_ctx *ctx, const char *devpath);
static struct kmod_module *to_module(struct ndctl_ctx *ctx, const char *alias);
-static int add_papr_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
-{
- int rc = -ENODEV;
- char buf[SYSFS_ATTR_SIZE];
- struct ndctl_ctx *ctx = dimm->bus->ctx;
- char *path = calloc(1, strlen(dimm_base) + 100);
- const char * const devname = ndctl_dimm_get_devname(dimm);
-
- dbg(ctx, "%s: Probing of_pmem dimm at %s\n", devname, dimm_base);
-
- if (!path)
- return -ENOMEM;
-
- /* construct path to the papr compatible dimm flags file */
- sprintf(path, "%s/papr/flags", dimm_base);
-
- if (ndctl_bus_is_papr_scm(dimm->bus) &&
- sysfs_read_attr(ctx, path, buf) == 0) {
-
- dbg(ctx, "%s: Adding papr-scm dimm flags:\"%s\"\n", devname, buf);
- dimm->cmd_family = NVDIMM_FAMILY_PAPR;
-
- /* Parse dimm flags */
- parse_papr_flags(dimm, buf);
-
- /* Allocate monitor mode fd */
- dimm->health_eventfd = open(path, O_RDONLY|O_CLOEXEC);
- rc = 0;
- }
-
- free(path);
- return rc;
-}
-
-static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
+static int populate_dimm_attributes(struct ndctl_dimm *dimm,
+ const char *dimm_base,
+ const char *bus_prefix)
{
int i, rc = -1;
char buf[SYSFS_ATTR_SIZE];
@@ -1694,7 +1662,7 @@ static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
* 'unique_id' may not be available on older kernels, so don't
* fail if the read fails.
*/
- sprintf(path, "%s/nfit/id", dimm_base);
+ sprintf(path, "%s/%s/id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0) {
unsigned int b[9];
@@ -1709,68 +1677,74 @@ static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
}
}
- sprintf(path, "%s/nfit/handle", dimm_base);
+ sprintf(path, "%s/%s/handle", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) < 0)
goto err_read;
dimm->handle = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/phys_id", dimm_base);
+ sprintf(path, "%s/%s/phys_id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) < 0)
goto err_read;
dimm->phys_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/serial", dimm_base);
+ sprintf(path, "%s/%s/serial", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->serial = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/vendor", dimm_base);
+ sprintf(path, "%s/%s/vendor", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->vendor_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/device", dimm_base);
+ sprintf(path, "%s/%s/device", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->device_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/rev_id", dimm_base);
+ sprintf(path, "%s/%s/rev_id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->revision_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/dirty_shutdown", dimm_base);
+ sprintf(path, "%s/%s/dirty_shutdown", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->dirty_shutdown = strtoll(buf, NULL, 0);
- sprintf(path, "%s/nfit/subsystem_vendor", dimm_base);
+ sprintf(path, "%s/%s/subsystem_vendor", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->subsystem_vendor_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/subsystem_device", dimm_base);
+ sprintf(path, "%s/%s/subsystem_device", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->subsystem_device_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/subsystem_rev_id", dimm_base);
+ sprintf(path, "%s/%s/subsystem_rev_id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->subsystem_revision_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/family", dimm_base);
+ sprintf(path, "%s/%s/family", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->cmd_family = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/dsm_mask", dimm_base);
+ sprintf(path, "%s/%s/dsm_mask", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->nfit_dsm_mask = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/format", dimm_base);
+ sprintf(path, "%s/%s/format", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->format[0] = strtoul(buf, NULL, 0);
for (i = 1; i < dimm->formats; i++) {
- sprintf(path, "%s/nfit/format%d", dimm_base, i);
+ sprintf(path, "%s/%s/format%d", dimm_base, bus_prefix, i);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->format[i] = strtoul(buf, NULL, 0);
}
- sprintf(path, "%s/nfit/flags", dimm_base);
- if (sysfs_read_attr(ctx, path, buf) == 0)
- parse_nfit_mem_flags(dimm, buf);
+ sprintf(path, "%s/%s/flags", dimm_base, bus_prefix);
+ if (sysfs_read_attr(ctx, path, buf) == 0) {
+ if (ndctl_bus_has_nfit(dimm->bus))
+ parse_nfit_mem_flags(dimm, buf);
+ else if (ndctl_bus_is_papr_scm(dimm->bus)) {
+ dimm->cmd_family = NVDIMM_FAMILY_PAPR;
+ parse_papr_flags(dimm, buf);
+ }
+ }
dimm->health_eventfd = open(path, O_RDONLY|O_CLOEXEC);
rc = 0;
@@ -1792,7 +1766,8 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
if (!path)
return NULL;
- sprintf(path, "%s/nfit/formats", dimm_base);
+ sprintf(path, "%s/%s/formats", dimm_base,
+ ndctl_bus_has_nfit(bus) ? "nfit" : "papr");
if (sysfs_read_attr(ctx, path, buf) < 0)
formats = 1;
else
@@ -1866,13 +1841,12 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
else
dimm->fwa_result = fwa_result_to_result(buf);
+ dimm->formats = formats;
/* Check if the given dimm supports nfit */
if (ndctl_bus_has_nfit(bus)) {
- dimm->formats = formats;
- rc = add_nfit_dimm(dimm, dimm_base);
- } else if (ndctl_bus_has_of_node(bus)) {
- rc = add_papr_dimm(dimm, dimm_base);
- }
+ rc = populate_dimm_attributes(dimm, dimm_base, "nfit");
+ } else if (ndctl_bus_has_of_node(bus))
+ rc = populate_dimm_attributes(dimm, dimm_base, "papr");
if (rc == -ENODEV) {
/* Unprobed dimm with no family */
@@ -2531,13 +2505,12 @@ static void *add_region(void *parent, int id, const char *region_base)
goto err_read;
region->num_mappings = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/range_index", region_base);
- if (ndctl_bus_has_nfit(bus)) {
- if (sysfs_read_attr(ctx, path, buf) < 0)
- goto err_read;
- region->range_index = strtoul(buf, NULL, 0);
- } else
+ sprintf(path, "%s/%s/range_index", region_base,
+ ndctl_bus_has_nfit(bus) ? "nfit": "papr");
+ if (sysfs_read_attr(ctx, path, buf) < 0)
region->range_index = -1;
+ else
+ region->range_index = strtoul(buf, NULL, 0);
sprintf(path, "%s/read_only", region_base);
if (sysfs_read_attr(ctx, path, buf) < 0)
--
2.30.2
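The core of the unification above is that every attribute path is built as "<dimm_base>/<bus_prefix>/<attr>", with the prefix chosen per bus family. A minimal stand-alone sketch of that pattern follows. The helper names are hypothetical and not part of libndctl, and it uses a bounded snprintf() rather than the sprintf() into a pre-sized buffer that the patch keeps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Pick the sysfs sub-directory for the bus family, as the patch does. */
const char *bus_prefix_for(int has_nfit)
{
	return has_nfit ? "nfit" : "papr";
}

/*
 * Build "<dimm_base>/<bus_prefix>/<attr>" into buf.
 * Returns 0 on success, -1 if the result did not fit.
 */
int dimm_attr_path(char *buf, size_t len, const char *dimm_base,
		   const char *bus_prefix, const char *attr)
{
	int n;

	assert(buf && len > 0);
	n = snprintf(buf, len, "%s/%s/%s", dimm_base, bus_prefix, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For a non-nfit bus and a base of /sys/bus/nd/devices/nmem0, asking for "flags" yields /sys/bus/nd/devices/nmem0/papr/flags, which is the file the unified function parses with parse_papr_flags().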
[PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
@Andrew, this is based on v5.12-rc1, I can rebase whatever way you prefer.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
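From userspace, the flow described above could look roughly like the sketch below. It makes several assumptions that are not stated in this cover letter: the syscall number comes from the installed headers as SYS_memfd_secret (no libc wrapper is assumed), flags are passed as 0, the file is sized with ftruncate(), and the mapping is MAP_SHARED. The helper name secret_map() is hypothetical. On kernels without the syscall, or with secretmem disabled, the helper simply fails with errno set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns a "secret" mapping of len bytes, or NULL with errno set. */
void *secret_map(size_t len)
{
	assert(len > 0);
#ifdef SYS_memfd_secret
	/* Assumption: flags == 0 requests the default protection mode. */
	int fd = (int)syscall(SYS_memfd_secret, 0u);
	if (fd < 0)
		return NULL;	/* e.g. ENOSYS when secretmem is unavailable */

	if (ftruncate(fd, (off_t)len) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return NULL;
	}

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	int saved = errno;

	close(fd);	/* the mapping stays valid after the fd is closed */
	if (p == MAP_FAILED) {
		errno = saved;
		return NULL;
	}
	return p;
#else
	/* Headers predate the syscall: report it as unsupported. */
	errno = ENOSYS;
	return NULL;
#endif
}
```

A caller would store key material in the returned region and release it with munmap() when done; a NULL return should be treated as "secret memory not available" and handled with a fallback.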
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.
Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloa...
that does two things: first, it acts as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory, meaning any secret
keys get automatically protected this way; second, it exposes the API to
the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
Removing pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory, which affects
system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.
In addition, there is also a long term goal to improve management of the
direct map.
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@lin...
v18:
* rebase on v5.12-rc1
* merge kfence fix into the original patch
* massage commit message of the patch introducing the memfd_secret syscall
v17: https://lore.kernel.org/lkml/20210208084920.2884-1-rppt@kernel.org
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
v16: https://lore.kernel.org/lkml/20210121122723.3446-1-rppt@kernel.org
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
the userspace
v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
* Add riscv/Kconfig update to disable set_memory operations for nommu
builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
(patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems
v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
* Finally s/mod_node_page_state/mod_lruvec_page_state/
v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
* Added Reviewed-by, thanks Catalin and David
* s/mod_node_page_state/mod_lruvec_page_state/ as Shakeel suggested
Older history:
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel...
Mike Rapoport (9):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
riscv/Kconfig: make direct map manipulation options depend on MMU
set_memory: allow set_direct_map_*_noflush() for multiple pages
set_memory: allow querying whether set_direct_map_*() is actually enabled
mm: introduce memfd_secret system call to create "secret" memory areas
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/kfence.h | 2 +-
arch/arm64/include/asm/set_memory.h | 17 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 23 +-
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/set_memory.h | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/riscv/mm/pageattr.c | 8 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/set_memory.h | 4 +-
arch/x86/mm/pat/set_memory.c | 8 +-
fs/dax.c | 11 +-
include/linux/pgtable.h | 3 +
include/linux/secretmem.h | 30 +++
include/linux/set_memory.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 6 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/power/snapshot.c | 4 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/gup.c | 10 +
mm/internal.h | 3 +
mm/mlock.c | 3 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 261 +++++++++++++++++++
mm/vmalloc.c | 5 +-
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 17 ++
39 files changed, 726 insertions(+), 53 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
--
2.28.0
1 year
[PATCH v3 0/2] secretmem: optimize page_is_secretmem()
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
This is an updated version of page_is_secretmem() changes.
This is based on v5.12-rc7-mmots-2021-04-15-16-28.
@Andrew, please let me know if you'd like me to rebase it differently or
resend the entire set.
v3:
* add missing put_compound_head() if we are to return NULL from
gup_page_range(), thanks David.
* add unlikely() to test for page_is_secretmem.
v2:
* move the check for secretmem page in gup_pte_range after we get a
reference to the page, per Matthew.
Mike Rapoport (2):
secretmem/gup: don't check if page is secretmem without reference
secretmem: optimize page_is_secretmem()
include/linux/secretmem.h | 26 +++++++++++++++++++++++++-
mm/gup.c | 6 +++---
mm/secretmem.c | 12 +-----------
3 files changed, 29 insertions(+), 15 deletions(-)
--
2.28.0
Mike Rapoport (2):
secretmem/gup: don't check if page is secretmem without reference
secretmem: optimize page_is_secretmem()
include/linux/secretmem.h | 26 +++++++++++++++++++++++++-
mm/gup.c | 8 +++++---
mm/secretmem.c | 12 +-----------
3 files changed, 31 insertions(+), 15 deletions(-)
--
2.28.0
1 year
[PATCH v3 0/3] fsdax: Factor helper functions to simplify the code
by Shiyang Ruan
From: Shiyang Ruan <ruansy.fnst(a)cn.fujitsu.com>
The page fault part of the fsdax code is somewhat complex. To add the CoW
feature and make the code easier to understand, it was suggested that I
factor out some helper functions to simplify the current dax code.
This is separated from the previous patchset called "V3 fsdax,xfs: Add
reflink&dedupe support for fsdax", and the previous comments are here[1].
[1]: https://patchwork.kernel.org/project/linux-nvdimm/patch/20210319015237.99...
Changes from V2:
- fix the type of 'major' in patch 2
- Rebased on v5.12-rc8
Changes from V1:
- fix Ritesh's email address
- simplify return logic in dax_fault_cow_page()
(Rebased on v5.12-rc8)
==
Shiyang Ruan (3):
fsdax: Factor helpers to simplify dax fault code
fsdax: Factor helper: dax_fault_actor()
fsdax: Output address in dax_iomap_pfn() and rename it
fs/dax.c | 443 +++++++++++++++++++++++++++++--------------------------
1 file changed, 234 insertions(+), 209 deletions(-)
--
2.31.1
1 year
[PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
by Shivaprasad G Bhat
The nvdimm devices are expected to ensure write persistence in scenarios
such as power failure.
libpmem uses architecture-specific instructions, like dcbf on POWER, to
flush the cache data to the backend nvdimm device during normal writes,
followed by explicit flushes if the backend devices are not synchronous
DAX capable.
QEMU virtual nvdimm devices are memory mapped. The dcbf in the guest
and the subsequent flush do not translate to an actual flush to the backend
file on the host in the case of file-backed v-nvdimms. On x86_64 this is
addressed by virtio-pmem, which translates explicit flushes to fsync in
QEMU.
On SPAPR, the issue is addressed by adding a new hcall with which the
guest ndctl driver requests an explicit flush when the backend nvdimm
cannot ensure write persistence with dcbf alone. So, the approach here is
to convey in a device tree property when the hcall flush is required. The
guest makes the hcall when the property is found, instead of relying on
dcbf.
A new device property, sync-dax, is added to the nvdimm device. When
sync-dax is 'writeback' (the default for PPC), the device tree property
"hcall-flush-required" is set, and the guest makes the H_SCM_FLUSH hcall
to request an explicit flush.
sync-dax is 'unsafe' on all other platforms (x86, ARM) and on old pseries
machines prior to 5.2 on PPC. sync-dax='writeback' on ARM and x86_64 is now
prevented, as the flush semantics are unimplemented.
When the backend file is actually synchronous DAX capable and no explicit
flushes are required, the sync-dax mode 'direct' is to be used.
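The three modes described above can be summarised in a small C sketch.
This is only an illustration of the semantics stated in this cover letter,
not code from the series: the enum, the function names and the boolean
parameters are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the series defines its own type in qapi/common.json. */
enum sync_dax_mode { SYNC_DAX_UNSAFE, SYNC_DAX_WRITEBACK, SYNC_DAX_DIRECT };

/* Default as described: 'writeback' on current pseries machines,
 * 'unsafe' on x86, ARM and pseries machines prior to 5.2. */
static enum sync_dax_mode sync_dax_default(bool current_pseries)
{
    return current_pseries ? SYNC_DAX_WRITEBACK : SYNC_DAX_UNSAFE;
}

/* 'writeback' is rejected where the flush semantics are unimplemented
 * (ARM and x86_64 in this series). */
static bool sync_dax_mode_supported(enum sync_dax_mode mode, bool is_ppc)
{
    assert((int)mode >= 0 && (int)mode <= SYNC_DAX_DIRECT);
    return mode != SYNC_DAX_WRITEBACK || is_ppc;
}

/* Only 'writeback' sets "hcall-flush-required", i.e. the guest must
 * issue H_SCM_FLUSH instead of relying on dcbf alone. */
static bool hcall_flush_required(enum sync_dax_mode mode)
{
    return mode == SYNC_DAX_WRITEBACK;
}
```

Whether 'writeback' may be selected explicitly on old pseries machines is
not stated here, so the sketch only models the ARM/x86_64 restriction.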
The below demonstration shows the map_sync behavior with sync-dax writeback &
direct.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master...)
pmem0 is from the nvdimm with sync-dax=direct, and pmem1 is from the
nvdimm with sync-dax=writeback, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback
Failed to mmap with Operation not supported
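For readers unfamiliar with what such a test exercises, the following is a
minimal sketch of a MAP_SYNC probe in C. It is not the mapsync program
from the link above; the function name probe_map_sync() and the fallback
flag values (taken from the generic Linux UAPI headers) are assumptions
for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers (generic Linux UAPI values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Create (or open) 'path', size it to 'len' bytes and try to map it
 * with MAP_SYNC. Returns 0 if the mapping succeeded, otherwise a
 * negative errno value.
 */
static int probe_map_sync(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -errno;

    if (ftruncate(fd, (off_t)len) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    /* MAP_SYNC is only honoured together with MAP_SHARED_VALIDATE. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    int ret = 0;
    if (addr == MAP_FAILED)
        ret = -errno;
    else
        munmap(addr, len);

    close(fd);
    return ret;
}
```

On a mount that cannot honour MAP_SYNC, mmap() is expected to fail with
EOPNOTSUPP, which strerror() renders as "Operation not supported",
matching the /mnt2 output above.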
The first patch does the header file cleanup necessary for the
subsequent ones. The second patch implements the hcall and adds the
necessary vmstate properties to the spapr machine structure for carrying
the hcall status during save-restore. Since the hcall is asynchronous,
the patch uses aio utilities to offload the flush. The third patch adds
the 'sync-dax' device property and enables the device tree property
for the guest to utilise the hcall.
The kernel changes to exploit this hcall are at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
---
v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
- Fixed the forward declaration coding guideline violations in 1st patch.
- Removed the code waiting for the flushes to complete during migration,
instead restart the flush worker on destination qemu in post load.
- Got rid of the randomization of the flush tokens, using simple
counter.
- Got rid of the redundant flush state lock, relying on the BQL now.
- Handling the memory-backend-ram usage
- Changed the sync-dax semantics from on/off to 'unsafe','writeback' and 'direct'.
Added prevention code using 'writeback' on arm and x86_64.
- Fixed all the miscellaneous comments.
v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
- Using the thread pool based approach as suggested
- Moved the async hcall handling code to spapr_nvdimm.c along
with some simplifications
- Added vmstate to preserve the hcall status during save-restore
along with pre_save handler code to complete all ongoing flushes.
- Added hw_compat magic for sync-dax 'on' on previous machines.
- Miscellaneous minor fixes.
v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
- Fixed a missed-out unlock
- using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
Shivaprasad G Bhat (3):
spapr: nvdimm: Forward declare and move the definitions
spapr: nvdimm: Implement H_SCM_FLUSH hcall
nvdimm: Enable sync-dax device property for nvdimm
hw/arm/virt.c | 28 ++++
hw/i386/pc.c | 28 ++++
hw/mem/nvdimm.c | 52 +++++++
hw/ppc/spapr.c | 16 ++
hw/ppc/spapr_nvdimm.c | 285 +++++++++++++++++++++++++++++++++++++++++
include/hw/mem/nvdimm.h | 11 ++
include/hw/ppc/spapr.h | 11 +-
include/hw/ppc/spapr_nvdimm.h | 27 ++--
qapi/common.json | 20 +++
9 files changed, 455 insertions(+), 23 deletions(-)
--
Signature
1 year
BUG_ON(!mapping_empty(&inode->i_data))
by Hugh Dickins
Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
machines, within an hour or few, repeatably though not to order.
The stack backtrace has always been clear_inode < ext4_clear_inode <
ext4_evict_inode < evict < dispose_list < prune_icache_sb <
super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.
ext4 is the disk filesystem I read the source to build from, and also
the filesystem I use on a loop device on a tmpfs file: I have not tried
with other filesystems, nor checked whether perhaps it happens always on
the loop one or always on the disk one. I have not seen it happen with
tmpfs - probably because its inodes cannot be evicted by the shrinker
anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
inodes (but suspect that may be down to timing, or less pressure).
I doubt it's a matter of filesystem: think it's an XArray thing.
Whenever I've looked at the XArray nodes involved, the root node
(shift 6) contained one or three (adjacent) pointers to empty shift
0 nodes, which each had offset and parent and array correctly set.
Is there some way in which empty nodes can get left behind, and so
fail eviction's mapping_empty() check?
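To make the question concrete, here is a toy two-level tree in C. It is
not the XArray code and does not claim to reproduce the actual bug: every
name in it is hypothetical. It only shows how an emptiness test that, like
xa_empty(), looks at whether the head pointer is NULL can report
"not empty" when no entries remain, if an erase path fails to prune
empty nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS 64                    /* fan-out per node, as with shift 6 */

struct toy_node {
    unsigned int count;             /* number of populated slots */
    void *slots[SLOTS];
};

struct toy_array {
    struct toy_node *head;          /* NULL is the only "empty" state */
};

static bool toy_empty(const struct toy_array *xa)
{
    return xa->head == NULL;
}

/* Store 'entry' at 'index', allocating the root and leaf as needed. */
static int toy_store(struct toy_array *xa, unsigned int index, void *entry)
{
    assert(index < SLOTS * SLOTS && entry != NULL);

    if (!xa->head) {
        xa->head = calloc(1, sizeof(*xa->head));
        if (!xa->head)
            return -1;
    }
    struct toy_node *root = xa->head;
    struct toy_node *leaf = root->slots[index / SLOTS];
    if (!leaf) {
        leaf = calloc(1, sizeof(*leaf));
        if (!leaf) {
            if (root->count == 0) {     /* do not leave an empty root */
                free(root);
                xa->head = NULL;
            }
            return -1;
        }
        root->slots[index / SLOTS] = leaf;
        root->count++;
    }
    if (!leaf->slots[index % SLOTS])
        leaf->count++;
    leaf->slots[index % SLOTS] = entry;
    return 0;
}

/* Erase 'index'. With prune == false, empty nodes are deliberately
 * left behind to model the suspected failure mode. */
static void toy_erase(struct toy_array *xa, unsigned int index, bool prune)
{
    assert(index < SLOTS * SLOTS);

    struct toy_node *root = xa->head;
    if (!root)
        return;
    struct toy_node *leaf = root->slots[index / SLOTS];
    if (!leaf || !leaf->slots[index % SLOTS])
        return;

    leaf->slots[index % SLOTS] = NULL;
    leaf->count--;
    if (!prune || leaf->count)
        return;

    free(leaf);
    root->slots[index / SLOTS] = NULL;
    if (--root->count == 0) {
        free(root);
        xa->head = NULL;
    }
}
```

After a store followed by an unpruned erase, the toy root still points at
a leaf whose count is zero, so toy_empty() returns false: the same shape
as the root node with pointers to empty shift 0 nodes described above.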
I did wonder whether some might get left behind if xas_alloc() fails
(though probably the tree here is too shallow to show that). Printks
showed that occasionally xas_alloc() did fail while testing (maybe at
memcg limit), but there was no correlation with the BUG_ONs.
I did wonder whether this is a long-standing issue, which your new
BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
nrexceptional BUG_ONs. The result there surprised me: I expected
it to behave the same way, but it hits that BUG_ON in a minute or
so, instead of an hour or so. Was there a fix you made somewhere,
to avoid the BUG_ON(!mapping_empty) most of the time? but needs
more work. I looked around a little, but didn't find any.
I had hoped to work this out myself, and save us both some writing:
but better hand over to you, in the hope that you'll quickly guess
what's up, then I can try patches. I do like the no-nrexceptionals
series, but there's something still to be fixed.
Hugh
1 year