[PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall
by Shivaprasad G Bhat
Add support for ND_REGION_ASYNC capability if the device tree
indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
If the flush request fails, the hypervisor is expected to
reflect the problem in a subsequent nvdimm H_SCM_HEALTH call.
This patch prevents mmap of namespaces with MAP_SYNC flag if the
nvdimm requires an explicit flush[1].
References:
[1] https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master...
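For illustration, a minimal userspace sketch of the resulting behaviour on such a
namespace (the file path is made up; the mmap is expected to fail, typically with
EOPNOTSUPP, because the dax layer rejects MAP_SYNC on regions marked ND_REGION_ASYNC):
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #ifndef MAP_SHARED_VALIDATE
  #define MAP_SHARED_VALIDATE 0x03
  #endif
  #ifndef MAP_SYNC
  #define MAP_SYNC 0x80000
  #endif
  int main(void)
  {
  	int fd = open("/mnt/pmem0/datafile", O_RDWR);	/* hypothetical fsdax file */
  	void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
  			  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
  	if (addr == MAP_FAILED)
  		perror("mmap");	/* expected on an hcall-flush-required nvdimm */
  	return 0;
  }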
Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.ibm.com>
---
v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
Changes from v2:
- Fixed the commit message.
- Add dev_dbg before the H_SCM_FLUSH hcall
v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
Changes from v1:
- Hcall semantics finalized, all changes are to accommodate them.
Documentation/powerpc/papr_hcalls.rst | 14 ++++++++++
arch/powerpc/include/asm/hvcall.h | 3 +-
arch/powerpc/platforms/pseries/papr_scm.c | 40 +++++++++++++++++++++++++++++
3 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/Documentation/powerpc/papr_hcalls.rst b/Documentation/powerpc/papr_hcalls.rst
index 48fcf1255a33..648f278eea8f 100644
--- a/Documentation/powerpc/papr_hcalls.rst
+++ b/Documentation/powerpc/papr_hcalls.rst
@@ -275,6 +275,20 @@ Health Bitmap Flags:
Given a DRC Index collect the performance statistics for NVDIMM and copy them
to the resultBuffer.
+**H_SCM_FLUSH**
+
+| Input: *drcIndex, continue-token*
+| Out: *continue-token*
+| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
+
+Given a DRC Index, flush the data to the backend NVDIMM device.
+
+The hcall returns H_BUSY when the flush takes a longer time and needs to be
+issued multiple times in order to be completely serviced. The *continue-token*
+from the output is to be passed in the argument list of subsequent hcalls to
+the hypervisor until the hcall is completely serviced, at which point H_SUCCESS
+or an error is returned by the hypervisor.
+
References
==========
.. [1] "Power Architecture Platform Reference"
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index ed6086d57b22..9f7729a97ebd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -315,7 +315,8 @@
#define H_SCM_HEALTH 0x400
#define H_SCM_PERFORMANCE_STATS 0x418
#define H_RPT_INVALIDATE 0x448
-#define MAX_HCALL_OPCODE H_RPT_INVALIDATE
+#define H_SCM_FLUSH 0x44C
+#define MAX_HCALL_OPCODE H_SCM_FLUSH
/* Scope args for H_SCM_UNBIND_ALL */
#define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..b7a47fcc5aa5 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -93,6 +93,7 @@ struct papr_scm_priv {
uint64_t block_size;
int metadata_size;
bool is_volatile;
+ bool hcall_flush_required;
uint64_t bound_addr;
@@ -117,6 +118,39 @@ struct papr_scm_priv {
size_t stat_buffer_len;
};
+static int papr_scm_pmem_flush(struct nd_region *nd_region,
+ struct bio *bio __maybe_unused)
+{
+ struct papr_scm_priv *p = nd_region_provider_data(nd_region);
+ unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
+ uint64_t token = 0;
+ int64_t rc;
+
+ dev_dbg(&p->pdev->dev, "flush drc 0x%x", p->drc_index);
+
+ do {
+ rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);
+ token = ret_buf[0];
+
+ /* Check if we are stalled for some time */
+ if (H_IS_LONG_BUSY(rc)) {
+ msleep(get_longbusy_msecs(rc));
+ rc = H_BUSY;
+ } else if (rc == H_BUSY) {
+ cond_resched();
+ }
+ } while (rc == H_BUSY);
+
+ if (rc) {
+ dev_err(&p->pdev->dev, "flush error: %lld", rc);
+ rc = -EIO;
+ } else {
+ dev_dbg(&p->pdev->dev, "flush drc 0x%x complete", p->drc_index);
+ }
+
+ return rc;
+}
+
static LIST_HEAD(papr_nd_regions);
static DEFINE_MUTEX(papr_ndr_lock);
@@ -943,6 +977,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.num_mappings = 1;
ndr_desc.nd_set = &p->nd_set;
+ if (p->hcall_flush_required) {
+ set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+ ndr_desc.flush = papr_scm_pmem_flush;
+ }
+
if (p->is_volatile)
p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
else {
@@ -1088,6 +1127,7 @@ static int papr_scm_probe(struct platform_device *pdev)
p->block_size = block_size;
p->blocks = blocks;
p->is_volatile = !of_property_read_bool(dn, "ibm,cache-flush-required");
+ p->hcall_flush_required = of_property_read_bool(dn, "ibm,hcall-flush-required");
/* We just need to ensure that set cookies are unique across */
uuid_parse(uuid_str, (uuid_t *) uuid);
[PATCH v3 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax
by Shiyang Ruan
From: Shiyang Ruan <ruansy.fnst(a)cn.fujitsu.com>
This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.
Changes from V2:
- Fix the mistake in iomap_apply2() and dax_dedupe_file_range_compare()
- Add CoW judgement in dax_iomap_zero()
- Fix other code style problems and mistakes
Changes from V1:
- Factor some helper functions to simplify dax fault code
- Introduce iomap_apply2() for dax_dedupe_file_range_compare()
- Fix mistakes and other problems
- Rebased on v5.11
One of the key mechanisms that needs to be implemented in fsdax is CoW: copy
the data from the srcmap before we actually write data to the destination
iomap, and only copy the ranges in which the data won't be changed.
Another mechanism is range comparison. In the page cache case, readpage()
is used to load on-disk data into the page cache so that the data can be
compared. In the fsdax case, readpage() does not work, so we need another
way to compare data, one with direct-access support.
With these two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
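A minimal sketch of the CoW copy idea described above (this is not the actual
dax_iomap_cow_copy() from the series; the name and signature are illustrative):
  /*
   * Illustrative only: copy the parts of the source extent that the write
   * will NOT overwrite (the head before @pos and the tail after @pos + @len)
   * from the source mapping into the destination, so the new extent is
   * fully populated before the actual write lands in the middle.
   */
  static void cow_copy_edges(loff_t pos, size_t len, loff_t extent_start,
  			   size_t extent_size, char *src, char *dst)
  {
  	size_t head = pos - extent_start;	/* bytes before the write */
  	size_t tail = extent_size - head - len;	/* bytes after the write  */
  	if (head)
  		memcpy(dst, src, head);
  	if (tail)
  		memcpy(dst + head + len, src + head + len, tail);
  }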
Some of the patches are picked up from Goldwyn's patchset. I made some
changes to adapt to this patchset.
(Rebased on v5.11)
==
Shiyang Ruan (10):
fsdax: Factor helpers to simplify dax fault code
fsdax: Factor helper: dax_fault_actor()
fsdax: Output address in dax_iomap_pfn() and rename it
fsdax: Introduce dax_iomap_cow_copy()
fsdax: Replace mmap entry in case of CoW
fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
iomap: Introduce iomap_apply2() for operations on two files
fsdax: Dedup file range to use a compare function
fs/xfs: Handle CoW for fsdax write() path
fs/xfs: Add dedupe support for fsdax
fs/dax.c | 596 ++++++++++++++++++++++++++---------------
fs/iomap/apply.c | 56 ++++
fs/iomap/buffered-io.c | 2 +-
fs/remap_range.c | 45 +++-
fs/xfs/xfs_bmap_util.c | 3 +-
fs/xfs/xfs_file.c | 29 +-
fs/xfs/xfs_inode.c | 8 +-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 58 +++-
fs/xfs/xfs_iomap.h | 4 +
fs/xfs/xfs_iops.c | 7 +-
fs/xfs/xfs_reflink.c | 17 +-
include/linux/dax.h | 7 +-
include/linux/fs.h | 15 +-
include/linux/iomap.h | 7 +-
15 files changed, 602 insertions(+), 253 deletions(-)
--
2.30.1
[PATCH 1/4] libndctl: Unify adding dimms for papr and nfit families
by Santosh Sivaraj
In preparation for enabling tests on non-nfit devices, unify both already
very similar functions into one. This will help in adding all the attributes
needed for the unit tests. Since the function doesn't fail if some of the
DIMM attributes are missing, it will work fine on PAPR platforms even though
only a subset of the DIMM attributes is provided (this doesn't mean that all
of the DIMM attributes can be missing).
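For illustration, the unified helper only swaps the directory under the dimm's
sysfs node; assuming the usual sysfs layout, the two cases look like:
  /sys/bus/nd/devices/nmem0/nfit/flags    (buses with an NFIT)
  /sys/bus/nd/devices/nmem0/papr/flags    (PAPR SCM buses)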
Signed-off-by: Santosh Sivaraj <santosh(a)fossix.org>
---
ndctl/lib/libndctl.c | 103 ++++++++++++++++---------------------------
1 file changed, 38 insertions(+), 65 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index 36fb6fe..26b9317 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -1646,41 +1646,9 @@ static int ndctl_bind(struct ndctl_ctx *ctx, struct kmod_module *module,
static int ndctl_unbind(struct ndctl_ctx *ctx, const char *devpath);
static struct kmod_module *to_module(struct ndctl_ctx *ctx, const char *alias);
-static int add_papr_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
-{
- int rc = -ENODEV;
- char buf[SYSFS_ATTR_SIZE];
- struct ndctl_ctx *ctx = dimm->bus->ctx;
- char *path = calloc(1, strlen(dimm_base) + 100);
- const char * const devname = ndctl_dimm_get_devname(dimm);
-
- dbg(ctx, "%s: Probing of_pmem dimm at %s\n", devname, dimm_base);
-
- if (!path)
- return -ENOMEM;
-
- /* construct path to the papr compatible dimm flags file */
- sprintf(path, "%s/papr/flags", dimm_base);
-
- if (ndctl_bus_is_papr_scm(dimm->bus) &&
- sysfs_read_attr(ctx, path, buf) == 0) {
-
- dbg(ctx, "%s: Adding papr-scm dimm flags:\"%s\"\n", devname, buf);
- dimm->cmd_family = NVDIMM_FAMILY_PAPR;
-
- /* Parse dimm flags */
- parse_papr_flags(dimm, buf);
-
- /* Allocate monitor mode fd */
- dimm->health_eventfd = open(path, O_RDONLY|O_CLOEXEC);
- rc = 0;
- }
-
- free(path);
- return rc;
-}
-
-static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
+static int populate_dimm_attributes(struct ndctl_dimm *dimm,
+ const char *dimm_base,
+ const char *bus_prefix)
{
int i, rc = -1;
char buf[SYSFS_ATTR_SIZE];
@@ -1694,7 +1662,7 @@ static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
* 'unique_id' may not be available on older kernels, so don't
* fail if the read fails.
*/
- sprintf(path, "%s/nfit/id", dimm_base);
+ sprintf(path, "%s/%s/id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0) {
unsigned int b[9];
@@ -1709,68 +1677,74 @@ static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
}
}
- sprintf(path, "%s/nfit/handle", dimm_base);
+ sprintf(path, "%s/%s/handle", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) < 0)
goto err_read;
dimm->handle = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/phys_id", dimm_base);
+ sprintf(path, "%s/%s/phys_id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) < 0)
goto err_read;
dimm->phys_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/serial", dimm_base);
+ sprintf(path, "%s/%s/serial", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->serial = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/vendor", dimm_base);
+ sprintf(path, "%s/%s/vendor", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->vendor_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/device", dimm_base);
+ sprintf(path, "%s/%s/device", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->device_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/rev_id", dimm_base);
+ sprintf(path, "%s/%s/rev_id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->revision_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/dirty_shutdown", dimm_base);
+ sprintf(path, "%s/%s/dirty_shutdown", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->dirty_shutdown = strtoll(buf, NULL, 0);
- sprintf(path, "%s/nfit/subsystem_vendor", dimm_base);
+ sprintf(path, "%s/%s/subsystem_vendor", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->subsystem_vendor_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/subsystem_device", dimm_base);
+ sprintf(path, "%s/%s/subsystem_device", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->subsystem_device_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/subsystem_rev_id", dimm_base);
+ sprintf(path, "%s/%s/subsystem_rev_id", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->subsystem_revision_id = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/family", dimm_base);
+ sprintf(path, "%s/%s/family", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->cmd_family = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/dsm_mask", dimm_base);
+ sprintf(path, "%s/%s/dsm_mask", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->nfit_dsm_mask = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/format", dimm_base);
+ sprintf(path, "%s/%s/format", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->format[0] = strtoul(buf, NULL, 0);
for (i = 1; i < dimm->formats; i++) {
- sprintf(path, "%s/nfit/format%d", dimm_base, i);
+ sprintf(path, "%s/%s/format%d", dimm_base, bus_prefix, i);
if (sysfs_read_attr(ctx, path, buf) == 0)
dimm->format[i] = strtoul(buf, NULL, 0);
}
- sprintf(path, "%s/nfit/flags", dimm_base);
- if (sysfs_read_attr(ctx, path, buf) == 0)
- parse_nfit_mem_flags(dimm, buf);
+ sprintf(path, "%s/%s/flags", dimm_base, bus_prefix);
+ if (sysfs_read_attr(ctx, path, buf) == 0) {
+ if (ndctl_bus_has_nfit(dimm->bus))
+ parse_nfit_mem_flags(dimm, buf);
+ else if (ndctl_bus_is_papr_scm(dimm->bus)) {
+ dimm->cmd_family = NVDIMM_FAMILY_PAPR;
+ parse_papr_flags(dimm, buf);
+ }
+ }
dimm->health_eventfd = open(path, O_RDONLY|O_CLOEXEC);
rc = 0;
@@ -1792,7 +1766,8 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
if (!path)
return NULL;
- sprintf(path, "%s/nfit/formats", dimm_base);
+ sprintf(path, "%s/%s/formats", dimm_base,
+ ndctl_bus_has_nfit(bus) ? "nfit" : "papr");
if (sysfs_read_attr(ctx, path, buf) < 0)
formats = 1;
else
@@ -1866,13 +1841,12 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
else
dimm->fwa_result = fwa_result_to_result(buf);
+ dimm->formats = formats;
/* Check if the given dimm supports nfit */
if (ndctl_bus_has_nfit(bus)) {
- dimm->formats = formats;
- rc = add_nfit_dimm(dimm, dimm_base);
- } else if (ndctl_bus_has_of_node(bus)) {
- rc = add_papr_dimm(dimm, dimm_base);
- }
+ rc = populate_dimm_attributes(dimm, dimm_base, "nfit");
+ } else if (ndctl_bus_has_of_node(bus))
+ rc = populate_dimm_attributes(dimm, dimm_base, "papr");
if (rc == -ENODEV) {
/* Unprobed dimm with no family */
@@ -2531,13 +2505,12 @@ static void *add_region(void *parent, int id, const char *region_base)
goto err_read;
region->num_mappings = strtoul(buf, NULL, 0);
- sprintf(path, "%s/nfit/range_index", region_base);
- if (ndctl_bus_has_nfit(bus)) {
- if (sysfs_read_attr(ctx, path, buf) < 0)
- goto err_read;
- region->range_index = strtoul(buf, NULL, 0);
- } else
+ sprintf(path, "%s/%s/range_index", region_base,
+ ndctl_bus_has_nfit(bus) ? "nfit": "papr");
+ if (sysfs_read_attr(ctx, path, buf) < 0)
region->range_index = -1;
+ else
+ region->range_index = strtoul(buf, NULL, 0);
sprintf(path, "%s/read_only", region_base);
if (sysfs_read_attr(ctx, path, buf) < 0)
--
2.30.2
BUG_ON(!mapping_empty(&inode->i_data))
by Hugh Dickins
Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
machines, within an hour or few, repeatably though not to order.
The stack backtrace has always been clear_inode < ext4_clear_inode <
ext4_evict_inode < evict < dispose_list < prune_icache_sb <
super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.
ext4 is the disk filesystem I read the source to build from, and also
the filesystem I use on a loop device on a tmpfs file: I have not tried
with other filesystems, nor checked whether perhaps it happens always on
the loop one or always on the disk one. I have not seen it happen with
tmpfs - probably because its inodes cannot be evicted by the shrinker
anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
inodes (but suspect that may be down to timing, or less pressure).
I doubt it's a matter of filesystem: think it's an XArray thing.
Whenever I've looked at the XArray nodes involved, the root node
(shift 6) contained one or three (adjacent) pointers to empty shift
0 nodes, which each had offset and parent and array correctly set.
Is there some way in which empty nodes can get left behind, and so
fail eviction's mapping_empty() check?
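For reference, a minimal sketch of what that check amounts to, assuming the
mmotm definition of mapping_empty() (just xa_empty() on the pagecache XArray):
  /* clear_inode(), with the new check that is firing: */
  BUG_ON(!mapping_empty(&inode->i_data));
  /* mapping_empty() is essentially: */
  static inline bool mapping_empty(struct address_space *mapping)
  {
  	return xa_empty(&mapping->i_pages);
  }
  /*
   * xa_empty() only tests whether the root entry is NULL, so a leftover
   * node hanging off the root - even one containing no entries - is
   * enough to make mapping_empty() return false and trip the BUG_ON.
   */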
I did wonder whether some might get left behind if xas_alloc() fails
(though probably the tree here is too shallow to show that). Printks
showed that occasionally xas_alloc() did fail while testing (maybe at
memcg limit), but there was no correlation with the BUG_ONs.
I did wonder whether this is a long-standing issue, which your new
BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
nrexceptional BUG_ONs. The result there surprised me: I expected
it to behave the same way, but it hits that BUG_ON in a minute or
so, instead of an hour or so. Was there a fix you made somewhere
to avoid the BUG_ON(!mapping_empty) most of the time, but which still
needs more work? I looked around a little, but didn't find any.
I had hoped to work this out myself, and save us both some writing:
but better hand over to you, in the hope that you'll quickly guess
what's up, then I can try patches. I do like the no-nrexceptionals
series, but there's something still to be fixed.
Hugh
[PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove
by Dan Williams
Summary:
A dax_dev can be unbound from its driver at any time. Unbind can not
fail. The driver-core will always trigger ->remove() and the result from
->remove() is ignored. After ->remove() the driver-core proceeds to tear
down context. The filesystem-dax implementation can leave pfns mapped
after ->remove() if it is triggered while the filesystem is mounted.
Security and data-integrity is forfeit if the dax_dev is repurposed for
another security domain (new filesystem or change device modes), or if
the dax_dev is physically replaced. CXL is a hotplug bus that makes
dax_dev physical replace a real world prospect.
All dax_dev pfns must be unmapped at remove. Detect the "remove while
mounted" case and trigger memory_failure() over the entire dax_dev
range.
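A rough sketch of that "remove while mounted" sweep, under the assumption that it
is implemented as a loop of memory_failure() calls over the device's pfn range
(the function name and shape here are illustrative, not the actual patch):
  /* Illustrative only: fail every pfn backing the dax_dev at ->remove() */
  static void dax_dev_failure_sweep(unsigned long start_pfn, unsigned long nr_pfns)
  {
  	unsigned long pfn;
  	for (pfn = start_pfn; pfn < start_pfn + nr_pfns; pfn++)
  		memory_failure(pfn, 0);	/* unmaps/poisons any lingering mappings */
  }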
Details:
The get_user_pages_fast() path expects all synchronization to be handled
by the pattern of checking for pte presence, taking a page reference,
and then validating that the pte was stable over that event. The
gup-fast path for devmap / DAX pages additionally attempts to take/hold
a live reference against the hosting pgmap over the page pin. The
rationale for the pgmap reference is to synchronize against a dax-device
unbind / ->remove() event, but that is unnecessary if pte invalidation
is guaranteed in the ->remove() path.
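The gup-fast pattern referred to above, in very reduced form (a sketch of the
synchronization idea, not the actual mm/gup.c code):
  /*
   * Sketch: read the pte, take a page reference, then re-check that the
   * pte did not change underneath us. A ->remove() path that invalidates
   * ptes and then waits for page references to drain therefore cannot
   * race with this window.
   */
  static int gup_fast_one(pte_t *ptep, struct page **pagep)
  {
  	pte_t pte = ptep_get(ptep);	/* the real code uses a lockless helper */
  	struct page *page = pte_page(pte);
  	if (!try_get_page(page))
  		return 0;
  	if (!pte_same(ptep_get(ptep), pte)) {
  		put_page(page);		/* pte changed: fall back to slow path */
  		return 0;
  	}
  	*pagep = page;
  	return 1;
  }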
Global dax-device pte invalidation *does* happen when the device is in
raw "device-dax" mode where there is a single shared inode to unmap at
remove, but the filesystem-dax path has a large number of actively
mapped inodes unknown to the driver at ->remove() time. So, that unmap
does not happen today for filesystem-dax. However, as Jason points out,
that unmap / invalidation *needs* to happen not only to cleanup
get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
->remove() is correlated with actual physical removal / replacement the
implications of allowing a physical pfn to be exchanged without tearing
down old mappings are severe (security and data-integrity).
What is not in this patch set is coordination with the dax_kmem driver
to trigger memory_failure() when the dax_dev is onlined as "System
RAM". The remove_memory() API was built with the assumption that
platform firmware negotiates all removal requests and the OS has a
chance to say "no". This is why dax_kmem today simply leaks
request_region() to burn that physical address space for any other
usage until the next reboot on a manual unbind event if the memory can't
be offlined. However a future to make sure that remove_memory() succeeds
after memory_failure() of the same range seems a better semantic than
permanently burning physical address space.
The topic of remove_memory() failures gets to the question of what
happens to active page references when the inopportune ->remove() event
happens. For transient pins the ->remove() event will wait for all
pins to be dropped before allowing ->remove() to complete. Since
filesystem-dax forbids longterm pins, all those pins are transient.
Device-dax, on the other hand, does allow longterm pins which means that
->remove() will hang unless / until the longterm pin is dropped.
Hopefully an unmap_mapping_range() event is sufficient to get the pin
dropped, but I suspect device-dax might need to trigger memory_failure()
as well to get the longterm pin holder to wake up and get out of the
way (TBD).
Lest we repeat the "longterm-pin-revoke" debate, which highlighted that
RDMA devices do not respond well to having context torn down, keep in
mind that this proposal is to do a best effort recovery of an event that
should not happen (surprise removal) under nominal operation.
---
Dan Williams (3):
mm/memory-failure: Prepare for mass memory_failure()
mm, dax, pmem: Introduce dev_pagemap_failure()
mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
drivers/dax/super.c | 15 +++++++++++++++
drivers/nvdimm/pmem.c | 10 +++++++++-
drivers/nvdimm/pmem.h | 1 +
include/linux/dax.h | 5 +++++
include/linux/memremap.h | 5 +++++
include/linux/mm.h | 3 +++
mm/gup.c | 38 ++++++++++++++++----------------------
mm/memory-failure.c | 36 +++++++++++++++++++++++-------------
mm/memremap.c | 11 +++++++++++
9 files changed, 88 insertions(+), 36 deletions(-)
[PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
by Joao Martins
Hey,
This series attempts to minimize 'struct page' overhead by
pursuing an approach similar to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.
[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@byteda...
The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the areas which only describe tail pages.
So a vmemmap page describes 64 struct pages, and the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
This series also takes one step further for 1GB pages and *also* reuses PMD pages
which only contain tail pages, which lets us keep parity with the current hugepage
based memmap. This further lets us more than halve the overhead with 1GB pages
(40M -> 16M per Tb).
In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound pagemap:
* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)
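Spelling out the per-hugepage arithmetic behind those numbers (assuming 64-byte
struct pages and 4K vmemmap pages):
  2M hugepage:    512 struct pages x 64B = 32K of memmap =    8 vmemmap pages
                  -> keep the head page plus 1 reused tail page, save 6
  1G hugepage: 262144 struct pages x 64B = 16M of memmap = 4096 vmemmap pages
                  -> keep 2, save 4094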
Along the way I've extended it past 'struct page' overhead *trying* to address a
few performance issues we knew about for pmem, specifically on the
{pin,get}_user_pages_fast paths with device-dax vmas, which are really
slow even for the fast variants. THP is great on the -fast variants, but all except
hugetlbfs perform rather poorly on non-fast gup. I deferred the
__get_user_pages() improvements, though (in a follow-up series I have stashed,
as they're orthogonal to device-dax and THP suffers from the same syndrome).
So to summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function of prep_compound_page() into head
and tail and use those two helpers where appropriate to take advantage of caches
being warm after __init_single_page(). Since RFC this also lets us further speed
up from 190ms down to 80ms init time.
Patches 5-10: Much like Muchun series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely @align was referred by remaining
memremap/dax code) and enabling of memremap to initialize the ZONE_DEVICE pages
as compound pages or a given @align order. The main difference though, is that
contrary to the hugetlbfs series, there's no vmemmap for the area, because we
are populating it as opposed to remapping it. IOW no freeing of pages of
already initialized vmemmap like the case for hugetlbfs, which simplifies the
logic (besides not being arch-specific). After these, there's quite visible
region bootstrap of pmem memmap given that we would initialize fewer struct
pages depending on the page size with DRAM backed struct pages. altmap sees no
difference in bootstrap.
NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100/<1ms on 128G NVDIMMs
with 2M and 1G respectively.
Patch 11: Optimize grabbing page refcount changes given that we
are working with compound pages i.e. we do 1 increment to the head
page for a given set of N subpages, as opposed to N individual writes (a sketch
follows the results below).
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:
before after
(16G get_user_pages_fast 2M page size) ~59 ms -> ~6.1 ms
(16G pin_user_pages_fast 2M page size) ~87 ms -> ~6.2 ms
(16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
(16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms
altmap performance gets especially interesting when pinning a pmem dimm:
before after
(128G get_user_pages_fast 2M page size) ~492 ms -> ~49 ms
(128G pin_user_pages_fast 2M page size) ~493 ms -> ~50 ms
(128G get_user_pages_fast altmap 2M page size) ~3.91 secs -> ~70 ms
(128G pin_user_pages_fast altmap 2M page size) ~3.97 secs -> ~74 ms
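A sketch of the patch 11 idea, assuming the usual compound-page helpers (this is
not the exact hunk from the patch):
  /*
   * Sketch: take all @refs references on the compound head in one atomic
   * update instead of N get_page() calls on the individual subpages.
   */
  static struct page *grab_compound_head(struct page *page, int refs)
  {
  	struct page *head = compound_head(page);
  	if (!page_ref_add_unless(head, refs, 0))
  		return NULL;	/* refcount was zero: caller falls back */
  	return head;
  }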
The unpinning improvement patches are in mmotm/linux-next, so they have been
removed from this series.
I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@o...),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow-up to this.
Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
Comments and suggestions very much appreciated!
Changelog,
RFC -> v1:
(New patches 1-3, 5-8 but the diffstat is that different)
* Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
* Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
* Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
* Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
* Remove the gup_device_compound_huge special path and have the same code
work both ways while special casing when devmap page is compound (Jason, John)
* Avoid usage of vmemmap_populate_basepages() and introduce a first class
loop that doesn't care about passing an altmap for memmap reuse. (Dan)
* Completely rework vmemmap_populate_compound() to avoid the hack of passing a
block across sparse_add_section() calls. It's a lot easier to
follow and more explicit in what it does.
* Replace the vmemmap refactoring with adding a @pgmap argument and moving
parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
* Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
* Improve memmap_init_zone_device() to initialize compound pages when
struct pages are cache warm. That led to an even further speedup over the
RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
as a result (Dan)
* Remove PGMAP_COMPOUND and use @align as the property to detect whether
or not to reuse vmemmap areas (Dan)
Thanks,
Joao
Joao Martins (11):
memory-failure: fetch compound_head after pgmap_pfn_valid()
mm/page_alloc: split prep_compound_page into head and tail subparts
mm/page_alloc: refactor memmap_init_zone_device() page init
mm/memremap: add ZONE_DEVICE support for compound pages
mm/sparse-vmemmap: add a pgmap argument to section activation
mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
mm/sparse-vmemmap: populate compound pagemaps
mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
mm/page_alloc: reuse tail struct pages for compound pagemaps
device-dax: compound pagemap support
mm/gup: grab head page refcount once for group of subpages
drivers/dax/device.c | 58 +++++++--
include/linux/memory_hotplug.h | 5 +-
include/linux/memremap.h | 13 ++
include/linux/mm.h | 8 +-
mm/gup.c | 52 +++++---
mm/memory-failure.c | 2 +
mm/memory_hotplug.c | 3 +-
mm/memremap.c | 9 +-
mm/page_alloc.c | 126 +++++++++++++------
mm/sparse-vmemmap.c | 221 +++++++++++++++++++++++++++++----
mm/sparse.c | 24 ++--
11 files changed, 406 insertions(+), 115 deletions(-)
--
2.17.1
[PATCH] memfd_secret: use unsigned int rather than long as syscall flags type
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Yuri Norov says:
If parameter size is the same for native and compat ABIs, we may
wire a syscall made by compat client to native handler. This is
true for unsigned int, but not true for unsigned long or pointer.
That's why I suggest using unsigned int and so avoid creating compat
entry point.
Use unsigned int as the type of the flags parameter in memfd_secret()
system call.
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
---
@Andrew,
The patch is vs v5.12-rc5-mmots-2021-03-30-23, I'd appreciate if it would
be added as a fixup to the memfd_secret series.
include/linux/syscalls.h | 2 +-
mm/secretmem.c | 2 +-
tools/testing/selftests/vm/memfd_secret.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 49c93c906893..1a1b5d724497 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1050,7 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
-asmlinkage long sys_memfd_secret(unsigned long flags);
+asmlinkage long sys_memfd_secret(unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/mm/secretmem.c b/mm/secretmem.c
index f2ae3f32a193..3b1ba3991964 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -199,7 +199,7 @@ static struct file *secretmem_file_create(unsigned long flags)
return file;
}
-SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
{
struct file *file;
int fd, err;
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
index c878c2b841fc..2462f52e9c96 100644
--- a/tools/testing/selftests/vm/memfd_secret.c
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -38,7 +38,7 @@ static unsigned long page_size;
static unsigned long mlock_limit_cur;
static unsigned long mlock_limit_max;
-static int memfd_secret(unsigned long flags)
+static int memfd_secret(unsigned int flags)
{
return syscall(__NR_memfd_secret, flags);
}
--
2.28.0
[ndctl PATCH] daxctl: emit counts of total and online memblocks
by Vishal Verma
For daxctl device listings, if in 'system-ram' mode, it is useful to
know whether the memory associated with the device is online or not.
Since the memory is comprised of a number of 'memblocks', and it is
possible (albeit rare) to have a subset of them online, and the rest
offline, we can't just use a boolean online-or-offline flag for the
state.
Add a couple of counts, one for the total number of memblocks associated
with the device, and another for the ones that are online.
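With the patch applied, a device listing could look roughly like this
(illustrative output; the values and surrounding fields are made up, only the
two new keys come from this patch):
  # daxctl list
  [
    {
      "chardev":"dax0.0",
      "size":17179869184,
      "mode":"system-ram",
      "online_memblocks":62,
      "total_memblocks":64
    }
  ]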
Link: https://github.com/pmem/ndctl/issues/139
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Reported-by: Steve Scargall <steve.scargall(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
util/json.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/util/json.c b/util/json.c
index ca0167b..a8d2412 100644
--- a/util/json.c
+++ b/util/json.c
@@ -482,6 +482,17 @@ struct json_object *util_daxctl_dev_to_json(struct daxctl_dev *dev,
json_object_object_add(jdev, "mode", jobj);
if (mem && daxctl_dev_get_resource(dev) != 0) {
+ int num_sections = daxctl_memory_num_sections(mem);
+ int num_online = daxctl_memory_is_online(mem);
+
+ jobj = json_object_new_int(num_online);
+ if (jobj)
+ json_object_object_add(jdev, "online_memblocks", jobj);
+
+ jobj = json_object_new_int(num_sections);
+ if (jobj)
+ json_object_object_add(jdev, "total_memblocks", jobj);
+
movable = daxctl_memory_is_movable(mem);
if (movable == 1)
jobj = json_object_new_boolean(true);
--
2.30.2