[PATCH v2] nvdimm: Avoid race between probe and reading device attributes
by Richard Palethorpe
It is possible to cause a division error and use-after-free by querying the
nmem device before the driver data is fully initialised in nvdimm_probe. E.g
by doing
(while true; do
cat /sys/bus/nd/devices/nmem*/available_slots 2>&1 > /dev/null
done) &
while true; do
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/bind
done
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/unbind
done
done
On 5.7-rc3 this causes:
[ 12.711578] divide error: 0000 [#1] SMP KASAN PTI
[ 12.712321] CPU: 0 PID: 231 Comm: cat Not tainted 5.7.0-rc3 #48
[ 12.713188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.714857] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.715772] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.718311] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.719030] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.720005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.720970] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.721889] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.722744] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.723602] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.724600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.725308] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.726268] Call Trace:
[ 12.726633] available_slots_show+0x4e/0x120 [libnvdimm]
[ 12.727380] dev_attr_show+0x42/0x80
[ 12.727891] ? memset+0x20/0x40
[ 12.728341] sysfs_kf_seq_show+0x218/0x410
[ 12.728923] seq_read+0x389/0xe10
[ 12.729415] vfs_read+0x101/0x2d0
[ 12.729891] ksys_read+0xf9/0x1d0
[ 12.730361] ? kernel_write+0x120/0x120
[ 12.730915] do_syscall_64+0x95/0x4a0
[ 12.731435] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.732163] RIP: 0033:0x7fd16f2fe4be
[ 12.732685] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 2e 12 0a 00 e8 69 e9 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[ 12.735207] RSP: 002b:00007ffd3177b838 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 12.736261] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fd16f2fe4be
[ 12.737233] RDX: 0000000000020000 RSI: 00007fd16f1ed000 RDI: 0000000000000003
[ 12.738203] RBP: 00007fd16f1ed000 R08: 00007fd16f1ec010 R09: 0000000000000000
[ 12.739172] R10: 00007fd16f3f4f70 R11: 0000000000000246 R12: 00007ffd3177ce23
[ 12.740144] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[ 12.741139] Modules linked in: nfit libnvdimm
[ 12.741783] ---[ end trace 99532e4b82410044 ]---
[ 12.742452] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.743167] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.745709] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.746340] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.747209] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.748081] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.748977] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.749849] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.750729] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.751708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.752441] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.821357] ==================================================================
[ 12.822284] BUG: KASAN: use-after-free in __mutex_lock+0x111c/0x11a0
[ 12.823084] Read of size 4 at addr ffff888065c26238 by task reproducer/218
[ 12.823968]
[ 12.824183] CPU: 2 PID: 218 Comm: reproducer Tainted: G D 5.7.0-rc3 #48
[ 12.825167] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.826595] Call Trace:
[ 12.826926] dump_stack+0x97/0xe0
[ 12.827362] print_address_description.constprop.0+0x1b/0x210
[ 12.828111] ? __mutex_lock+0x111c/0x11a0
[ 12.828645] __kasan_report.cold+0x37/0x92
[ 12.829179] ? __mutex_lock+0x111c/0x11a0
[ 12.829706] kasan_report+0x38/0x50
[ 12.830158] __mutex_lock+0x111c/0x11a0
[ 12.830666] ? ftrace_graph_stop+0x10/0x10
[ 12.831193] ? is_nvdimm_bus+0x40/0x40 [libnvdimm]
[ 12.831820] ? mutex_trylock+0x2b0/0x2b0
[ 12.832333] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.832975] ? mutex_trylock+0x2b0/0x2b0
[ 12.833500] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.834122] ? prepare_ftrace_return+0xa1/0xf0
[ 12.834724] ? ftrace_graph_caller+0x6b/0xa0
[ 12.835269] ? acpi_label_write+0x390/0x390 [nfit]
[ 12.835909] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.836558] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837179] nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837802] nvdimm_bus_probe+0x110/0x6b0 [libnvdimm]
[ 12.838470] really_probe+0x212/0x9a0
[ 12.838954] driver_probe_device+0x1cd/0x300
[ 12.839511] ? driver_probe_device+0x5/0x300
[ 12.840063] device_driver_attach+0xe7/0x120
[ 12.840623] bind_store+0x18d/0x230
[ 12.841075] kernfs_fop_write+0x200/0x420
[ 12.841606] vfs_write+0x154/0x450
[ 12.842047] ksys_write+0xf9/0x1d0
[ 12.842497] ? __ia32_sys_read+0xb0/0xb0
[ 12.843010] do_syscall_64+0x95/0x4a0
[ 12.843495] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.844140] RIP: 0033:0x7f5b235d3563
[ 12.844607] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[ 12.846877] RSP: 002b:00007fff1c3bc578 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 12.847822] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f5b235d3563
[ 12.848717] RDX: 0000000000000006 RSI: 000055f9576710d0 RDI: 0000000000000001
[ 12.849594] RBP: 000055f9576710d0 R08: 000000000000000a R09: 0000000000000000
[ 12.850470] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
[ 12.851333] R13: 00007f5b236a3500 R14: 0000000000000006 R15: 00007f5b236a3700
[ 12.852247]
[ 12.852466] Allocated by task 225:
[ 12.852893] save_stack+0x1b/0x40
[ 12.853310] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 12.853918] kmem_cache_alloc_node+0xef/0x270
[ 12.854475] copy_process+0x485/0x6130
[ 12.854945] _do_fork+0xf1/0xb40
[ 12.855353] __do_sys_clone+0xc3/0x100
[ 12.855843] do_syscall_64+0x95/0x4a0
[ 12.856302] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.856939]
[ 12.857140] Freed by task 0:
[ 12.857522] save_stack+0x1b/0x40
[ 12.857940] __kasan_slab_free+0x12c/0x170
[ 12.858464] kmem_cache_free+0xb0/0x330
[ 12.858945] rcu_core+0x55f/0x19f0
[ 12.859385] __do_softirq+0x228/0x944
[ 12.859869]
[ 12.860075] The buggy address belongs to the object at ffff888065c26200
[ 12.860075] which belongs to the cache task_struct of size 6016
[ 12.861638] The buggy address is located 56 bytes inside of
[ 12.861638] 6016-byte region [ffff888065c26200, ffff888065c27980)
[ 12.863084] The buggy address belongs to the page:
[ 12.863702] page:ffffea0001970800 refcount:1 mapcount:0 mapping:0000000021ee3712 index:0x0 head:ffffea0001970800 order:3 compound_mapcount:0 compound_pincount:0
[ 12.865478] flags: 0x80000000010200(slab|head)
[ 12.866039] raw: 0080000000010200 0000000000000000 0000000100000001 ffff888066c0f980
[ 12.867010] raw: 0000000000000000 0000000080050005 00000001ffffffff 0000000000000000
[ 12.867986] page dumped because: kasan: bad access detected
[ 12.868696]
[ 12.868900] Memory state around the buggy address:
[ 12.869514] ffff888065c26100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.870414] ffff888065c26180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.871318] >ffff888065c26200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.872238] ^
[ 12.872870] ffff888065c26280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.873754] ffff888065c26300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.874640]
==================================================================
This can be prevented by setting the driver data after initialisation is
complete.
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure")
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-kernel(a)vger.kernel.org
Cc: Coly Li <colyli(a)suse.com>
Signed-off-by: Richard Palethorpe <rpalethorpe(a)suse.com>
---
V2:
+ Reviewed by Coly and removed unecessary lock
drivers/nvdimm/dimm.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 7d4ddc4d9322..3d3988e1d9a0 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -43,7 +43,6 @@ static int nvdimm_probe(struct device *dev)
if (!ndd)
return -ENOMEM;
- dev_set_drvdata(dev, ndd);
ndd->dpa.name = dev_name(dev);
ndd->ns_current = -1;
ndd->ns_next = -1;
@@ -106,6 +105,8 @@ static int nvdimm_probe(struct device *dev)
if (rc)
goto err;
+ dev_set_drvdata(dev, ndd);
+
return 0;
err:
--
2.26.2
1 month
[PATCH 1/1] ndctl/namespace: Fix disable-namespace accounting relative to seed devices
by Redhairer Li
Seed namespaces are included in "ndctl disable-namespace all". However
since the user never "creates" them it is surprising to see
"disable-namespace" report 1 more namespace relative to the number that
have been created. Catch attempts to disable a zero-sized namespace:
Before:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
{
"dev":"namespace1.2",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.2"
}
disabled 4 namespaces
After:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.3",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.3"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
disabled 3 namespaces
Signed-off-by: Redhairer Li <redhairer.li(a)intel.com>
---
ndctl/lib/libndctl.c | 11 ++++++++---
ndctl/region.c | 4 +++-
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index ee737cb..49f362b 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -4231,6 +4231,7 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
const char *bdev = NULL;
char path[50];
int fd;
+ unsigned long long size = ndctl_namespace_get_size(ndns);
if (pfn && ndctl_pfn_is_enabled(pfn))
bdev = ndctl_pfn_get_block_device(pfn);
@@ -4260,9 +4261,13 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
devname, bdev, strerror(errno));
return -errno;
}
- } else
- ndctl_namespace_disable_invalidate(ndns);
-
+ } else {
+ if (size == 0)
+ /* Don't try to disable idle namespace (no capacity allocated) */
+ return -ENXIO;
+ else
+ ndctl_namespace_disable_invalidate(ndns);
+ }
return 0;
}
diff --git a/ndctl/region.c b/ndctl/region.c
index 7945007..0014bb9 100644
--- a/ndctl/region.c
+++ b/ndctl/region.c
@@ -72,6 +72,7 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
{
struct ndctl_namespace *ndns;
int rc = 0;
+ unsigned long long size;
switch (mode) {
case ACTION_ENABLE:
@@ -80,7 +81,8 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
case ACTION_DISABLE:
ndctl_namespace_foreach(region, ndns) {
rc = ndctl_namespace_disable_safe(ndns);
- if (rc)
+ size = ndctl_namespace_get_size(ndns);
+ if (rc && size != 0)
return rc;
}
rc = ndctl_region_disable_invalidate(region);
--
2.20.1.windows.1
1 month, 1 week
[PATCH ndctl v1 0/8] daxctl: Add device align and range mapping allocation
by Joao Martins
Hey,
This series builds on top of this one[0] and does the following improvements
to the Soft-Reserved subdivision:
1) Support for {create,reconfigure}-device for selecting @align (hugepage size).
Here we add a '-a|--align 4K|2M|1G' option to the existing commands;
2) Listing improvements for device alignment and mappings;
Note: Perhaps it is better to hide the mappings by default, and only
print with -v|--verbose. This would align with ndctl, as the mappings
info can be quite large.
3) Allow creating devices from selecting ranges. This allows to keep the
same GPA->HPA mapping as before we kexec the hypervisor with running guests:
daxctl list -d dax0.1 > /var/log/dax0.1.json
kexec -d -l bzImage
systemctl kexec
daxctl create -u --restore /var/log/dax0.1.json
The JSON was what I though it would be easier for an user, given that it is
the data format daxctl outputs. Alternatives could be adding multiple:
--mapping <pgoff>:<start>-<end>
But that could end up in a gigantic line and a little more
unmanageable I think.
This series requires this series[0] on top of Dan's patches[1]:
[0] https://lore.kernel.org/linux-nvdimm/20200716172913.19658-1-joao.m.martin...
[1] https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
The only TODO here is docs and improving tests to validate mappings, and test
the restore path.
Suggestions/comments are welcome.
Joao
Joao Martins (8):
daxctl: add daxctl_dev_{get,set}_align()
util/json: Print device align
daxctl: add align support in reconfigure-device
daxctl: add align support in create-device
libdaxctl: add mapping iterator APIs
daxctl: include mappings when listing
libdaxctl: add daxctl_dev_set_mapping()
daxctl: Allow restore devices from JSON metadata
daxctl/device.c | 154 +++++++++++++++++++++++++++++++++++++++--
daxctl/lib/libdaxctl-private.h | 9 +++
daxctl/lib/libdaxctl.c | 152 +++++++++++++++++++++++++++++++++++++++-
daxctl/lib/libdaxctl.sym | 9 +++
daxctl/libdaxctl.h | 16 +++++
util/json.c | 63 ++++++++++++++++-
util/json.h | 3 +
7 files changed, 396 insertions(+), 10 deletions(-)
--
1.8.3.1
2 months, 3 weeks
[PATCH ndctl v2 00/10] daxctl: Support for sub-dividing soft-reserved regions
by Joao Martins
Changes since v1:
* Add a Documentation/daxctl/ entry for each patch that adds commands or new
option.
* Fix functional test suite to only change region 0 and not touch others
* Fix reconfigure-device -s changes (third patch) for better bisection.
v1: https://lore.kernel.org/linux-nvdimm/20200403205900.18035-1-joao.m.martin...
---
This series introduces the daxctl support for sub-dividing soft-reserved
regions created by EFI/HMAT/efi_fake_mem. It's the userspace counterpart
of this recent patch series [0].
These new 'dynamic' regions can be partitioned into multiple different devices
which its subdivisions can consist of one or more ranges. This
is in contrast to static dax regions -- created with ndctl-create-namespace
-m devdax -- which can't be subdivided neither discontiguous.
See also cover-letter of [0].
The daxctl changes in these patches are depicted as:
* {create,destroy,disable,enable}-device:
These orchestrate/manage the sub-division devices.
It mimmics the same as namespaces equivalent commands.
* Allow reconfigure-device to change the size of an existing *dynamic* dax
device.
* Add test coverage (Tried to cover all range allocation code paths).
v2 of kernel patches now passes this test suite.
* Documentation regarding the new command additions.
[0] "device-dax: Support sub-dividing soft-reserved ranges",
https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
Dan Williams (1):
daxctl: Cleanup whitespace
Joao Martins (9):
libdaxctl: add daxctl_dev_set_size()
daxctl: add resize support in reconfigure-device
daxctl: add command to disable devdax device
daxctl: add command to enable devdax device
libdaxctl: add daxctl_region_create_dev()
daxctl: add command to create device
libdaxctl: add daxctl_region_destroy_dev()
daxctl: add command to destroy device
daxctl/test: Add tests for dynamic dax regions
Documentation/daxctl/Makefile.am | 6 +-
Documentation/daxctl/daxctl-create-device.txt | 105 +++++++
Documentation/daxctl/daxctl-destroy-device.txt | 63 +++++
Documentation/daxctl/daxctl-disable-device.txt | 58 ++++
Documentation/daxctl/daxctl-enable-device.txt | 59 ++++
Documentation/daxctl/daxctl-reconfigure-device.txt | 16 ++
daxctl/builtin.h | 4 +
daxctl/daxctl.c | 4 +
daxctl/device.c | 310 ++++++++++++++++++++-
daxctl/lib/libdaxctl.c | 67 +++++
daxctl/lib/libdaxctl.sym | 7 +
daxctl/libdaxctl.h | 3 +
test/Makefile.am | 1 +
test/daxctl-create.sh | 294 +++++++++++++++++++
util/filter.c | 2 +-
15 files changed, 993 insertions(+), 6 deletions(-)
create mode 100644 Documentation/daxctl/daxctl-create-device.txt
create mode 100644 Documentation/daxctl/daxctl-destroy-device.txt
create mode 100644 Documentation/daxctl/daxctl-disable-device.txt
create mode 100644 Documentation/daxctl/daxctl-enable-device.txt
create mode 100755 test/daxctl-create.sh
--
1.8.3.1
2 months, 3 weeks
Feedback requested: Exposing NVDIMM performance statistics in a generic way
by Vaibhav Jain
Hello,
I am looking for some community feedback on these two Problem-statements:
1.How to expose NVDIMM performance statistics in an arch or nvdimm vendor
agnostic manner ?
2. Is there a common set of performance statistics for NVDIMMs that all
vendors should provide ?
Problem context
===============
While working on bring up of PAPR SCM based NVDIMMs[1] for arch/powerpc
we want to expose certain dimm performance statistics like "Media
Read/Write Counts", "Power-on Seconds" etc to user-space [2]. These
performance statistics are similar to what ipmctl[3] reports for Intel®
Optane™ persistent memory via the '-show performance' command line
arg. However the reported set of performance stats doesn't cover the
entirety of all performance stats supported by PAPR SCM based NVDimms.
For example here is a subset of performance stats which are specific to
PAPR SCM NVDimms and that not reported by ipmctl:
* Controller Reset Count
* Controller Reset Elapsed Time
* Power-on Seconds
* Cache Read Hit Count
* Cache Write Hit Count
Possibility of updating ipmctl to add support for these performance
statistics is greatly hampered by no support for ACPI on Powerpc
arch. Secondly vendors who dont support ACPI/NFIT command set
similar to Intel® Optane™ (Example MSFT) are also left out in
lurch. Problem-statement#1 points to this specific problem.
Additionally in absence of any pre-agreed set of performance statistics
which all vendors should support, adding support for such a
functionality in ipmctl may not bode well of other nvdimm vendors. For
example if support for reporting "Controller Reset Count" is added to
ipmctl then it may not be applicable to other vendors such as Intel®
Optane™. This issue is what Problem-statement#2 refers to.
Possible Solution for Problem#1
===============================
One possible solution to Problem#1 can to add support for reporting
NVDIMM performance statistics in 'ndtcl'. 'libndctl' already has a layer
that abstracts underlying NVDIMM vendors (via struct ndctl_dimm_ops),
making supporting different NVDIMM vendors fairly easy. Also ndctl is
more widely used compared to 'ipmctl', hence adding such a functionality
to ndctl would make it more widely used.
Above solution was implemented as RFC patch-set[2] that exposes these
performance statistics through a generic abstraction in libndctl and
added a presentation layer for this data in ndctl[4]. It added a new
command line flags '--stat' to ndctl to report *all* nvdimm vendor
reported performance stats. The output is similar to one below:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Power-on Seconds":603931,
"Media Read Count":0,
"Media Write Count":6313,
}
}
]
This was done by adding two new dimm-ops callbacks that were
implemented by the papr_scm implementation within libndctl. These
callbacks are invoked by newly introduce code in 'util/json-smart.c'
that format the returned stats from these new dimm-ops and transform
them into a json-object to later presentation. I would request you to
look at RFC patch-set[2] to understand the implementation details.
Possibled Solution for Problem#2
================================
Solution to Problem-statement#2 is what eludes me though. If there is a
minimal set of performance stats (similar to what ndctl enforces for
health-stats) then implementation of such a functionality in
ndctl/ipmctl would be easy to implement. But is it really possible to
have such a common set of performance stats that NVDIMM vendors can
expose.
Patch-set[2] though tries to bypass this problem by letting the vendor
descide which performance stats to expose. This opens up a possibility
of this functionality to abused by dimm vendors to reports arbirary data
through this flag that may not be performance-stats.
Summing-up
==========
In light of above, requesting your feedback as to how
problem-statements#{1, 2} can be addressed within ndctl subsystem. Also
are these problems even worth solving.
References
==========
[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/papr_...
[2] "[ndctl RFC-PATCH 0/4] Add support for reporting PAPR NVDIMM
Statistics"
https://lore.kernel.org/linux-nvdimm/20200518110814.145644-1-vaibhav@linu...
[3] https://docs.pmem.io/ipmctl-user-guide/instrumentation/show-device-perfor...
[4] "[RFC-PATCH 1/4] ndctl,libndctl: Implement new dimm-ops 'new_stats'
and 'get_stat'"
https://lore.kernel.org/linux-nvdimm/20200514225258.508463-2-vaibhav@linu...
Thanks,
~ Vaibhav
3 months, 2 weeks
[RFC PATCH 0/4] powerpc/papr_scm: Add support for reporting NVDIMM performance statistics
by Vaibhav Jain
The patch-set proposes to add support for fetching and reporting
performance statistics for PAPR compliant NVDIMMs as described in
documentation for H_SCM_PERFORMANCE_STATS hcall Ref[1]. The patch-set
also implements mechanisms to expose NVDIMM performance stats via
sysfs and newly introduced PDSMs[2] for libndctl.
This patch-set combined with corresponding ndctl and libndctl changes
proposed at Ref[3] should enable user to fetch PAPR compliant NVDIMMs
using following command:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Controller Reset Count":2,
"Controller Reset Elapsed Time":603331,
"Power-on Seconds":603931,
"Life Remaining":"100%",
"Critical Resource Utilization":"0%",
"Host Load Count":5781028,
"Host Store Count":8966800,
"Host Load Duration":975895365,
"Host Store Duration":716230690,
"Media Read Count":0,
"Media Write Count":6313,
"Media Read Duration":0,
"Media Write Duration":9679615,
"Cache Read Hit Count":5781028,
"Cache Write Hit Count":8442479,
"Fast Write Count":8969912
}
}
]
The patchset is dependent on existing patch-set "[PATCH v7 0/5]
powerpc/papr_scm: Add support for reporting nvdimm health" available
at Ref[2] that adds support for reporting PAPR compliant NVDIMMs in
'papr_scm' kernel module.
Structure of the patch-set
==========================
The patch-set starts with implementing functionality in papr_scm
module to issue H_SCM_PERFORMANCE_STATS hcall, fetch & parse dimm
performance stats and exposing them as a PAPR specific libnvdimm
attribute named 'perf_stats'
Patch-2 introduces a new PDSM named FETCH_PERF_STATS that can be
issued by libndctl asking papr_scm to issue the
H_SCM_PERFORMANCE_STATS hcall using helpers introduced earlier and
storing the results in a dimm specific perf-stats-buffer.
Patch-3 introduces a new PDSM named READ_PERF_STATS that can be
issued by libndctl to read the perf-stats-buffer in an incremental
manner to workaround the 256-bytes envelop limitation of libnvdimm.
Finally Patch-4 introduces a new PDSM named GET_PERF_STAT that can be
issued by libndctl to read values of a specific NVDIMM performance
stat like "Life Remaining".
References
==========
[1] Documentation/powerpc/papr_hcals.rst
[2] https://lore.kernel.org/linux-nvdimm/20200508104922.72565-1-vaibhav@linux...
[3] https://github.com/vaibhav92/ndctl/tree/papr_scm_stats_v1
Vaibhav Jain (4):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for PAPR_SCM_PDSM_FETCH_PERF_STATS
powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_READ_PERF_STATS
powerpc/papr_scm: Add support for PDSM GET_PERF_STAT
Documentation/ABI/testing/sysfs-bus-papr-scm | 27 ++
arch/powerpc/include/uapi/asm/papr_scm_pdsm.h | 60 +++
arch/powerpc/platforms/pseries/papr_scm.c | 391 ++++++++++++++++++
3 files changed, 478 insertions(+)
--
2.26.2
4 months, 2 weeks
[PATCH 0/4] Remove nrexceptional tracking
by Matthew Wilcox (Oracle)
We actually use nrexceptional for very little these days. It's a constant
source of pain with the THP patches because we don't know how large a
shadow entry is, so either we have to ask the xarray how many indices
it covers, or store that information in the shadow entry (and reduce
the amount of other information in the shadow entry proportionally).
While tracking down the most recent case of "evict tells me I've got
the accounting wrong again", I wondered if it might not be simpler to
just remove it. So here's a patch set to do just that. I think each
of these patches is an improvement in isolation, but the combination of
all four is larger than the sum of its parts.
I'm running xfstests on this patchset right now. If one of the DAX
people could try it out, that'd be fantastic.
Matthew Wilcox (Oracle) (4):
mm: Introduce and use page_cache_empty
mm: Stop accounting shadow entries
dax: Account DAX entries as nrpages
mm: Remove nrexceptional from inode
fs/block_dev.c | 2 +-
fs/dax.c | 8 ++++----
fs/inode.c | 2 +-
include/linux/fs.h | 2 --
include/linux/pagemap.h | 5 +++++
mm/filemap.c | 15 ---------------
mm/truncate.c | 19 +++----------------
mm/workingset.c | 1 -
8 files changed, 14 insertions(+), 40 deletions(-)
--
2.27.0
5 months
[PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved
ranges
by Dan Williams
Changes since v3 [1]:
- Update x86 boot options documentation for 'nohmat' (Randy)
- Fixup a handful of kbuild robot reports, the most significant being
moving usage of PUD_SIZE and PMD_SIZE under
#ifdef CONFIG_TRANSPARENT_HUGEPAGE protection.
[1]: http://lore.kernel.org/r/159625229779.3040297.11363509688097221416.stgit@...
---
Merge notes:
Well, no v5.8-rc8 to line this up for v5.9, so next best is early
integration into -mm before other collisions develop.
Chatted with Justin offline and it currently appears that the missing
numa information is the fault of the platform firmware to populate all
the necessary NUMA data in the NFIT.
---
Cover:
The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.
[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@...
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@...
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
---
Dan Williams (19):
x86/numa: Cleanup configuration dependent command-line options
x86/numa: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
mm/memory_hotplug: Introduce default phys_to_target_node() implementation
ACPI: HMAT: Attach a device for each soft-reserved range
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce 'seed' devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
Joao Martins (4):
device-dax: Make align a per-device property
device-dax: Add an 'align' attribute
dax/hmem: Introduce dax_hmem.region_idle parameter
device-dax: Add a range mapping allocation attribute
Documentation/x86/x86_64/boot-options.rst | 4
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
arch/x86/include/asm/numa.h | 8
arch/x86/kernel/e820.c | 16
arch/x86/mm/numa.c | 11
arch/x86/mm/numa_emulation.c | 3
arch/x86/xen/enlighten_pv.c | 2
drivers/acpi/numa/hmat.c | 76 --
drivers/acpi/numa/srat.c | 9
drivers/base/core.c | 2
drivers/dax/Kconfig | 4
drivers/dax/Makefile | 3
drivers/dax/bus.c | 1046 +++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 -
drivers/dax/dax-private.h | 60 +-
drivers/dax/device.c | 134 ++--
drivers/dax/hmem.c | 56 --
drivers/dax/hmem/Makefile | 6
drivers/dax/hmem/device.c | 100 +++
drivers/dax/hmem/hmem.c | 65 ++
drivers/dax/kmem.c | 199 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 -
drivers/firmware/efi/x86_fake_mem.c | 12
drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 -
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/acpi/acpi_numa.h | 14
include/linux/dax.h | 8
include/linux/memory_hotplug.h | 5
include/linux/memremap.h | 11
include/linux/numa.h | 11
include/linux/range.h | 6
kernel/resource.c | 11
lib/test_hmm.c | 15
mm/memory_hotplug.c | 10
mm/memremap.c | 299 +++++---
tools/testing/nvdimm/dax-dev.c | 22 -
tools/testing/nvdimm/test/iomap.c | 2
44 files changed, 1825 insertions(+), 601 deletions(-)
delete mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
create mode 100644 drivers/dax/hmem/hmem.c
base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26
5 months, 2 weeks
[PATCH v8 0/2] Renovate memcpy_mcsafe with copy_mc_to_{user,
kernel}
by Dan Williams
Changes since v7 [1]:
- Rebased on v5.8-rc5 to resolve a conflict with commit eb25de276505
("tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench
mem memcpy'")
[1]: http://lore.kernel.org/r/159408043801.2272533.17485467640602344900.stgit@...
---
Vishal, since this patch set has experienced unprecedented silence from
x86 folks I expect you will need to send it to Linus directly during the
merge window. It merges cleanly with recent -next.
Thomas, Ingo, Boris, please chime in to save Vishal from that
awkwardness. I am only going to be sporadically online for the next few
weeks.
---
The primary motivation to go touch memcpy_mcsafe() is that the existing
benefit of doing slow and careful copies is obviated on newer CPUs. That
fact solves the problem of needing to detect machine-check recovery
capability. Now the old "mcsafe_key" opt-in to careful copying can be made
an opt-out from the default fast copy implementation.
The discussion with Linus further made clear that this facility had
already lost its x86-machine-check specificity starting with commit
2c89130a56a ("x86/asm/memcpy_mcsafe: Add write-protection-fault
handling"). The new changes to not require a "careful copy" further
de-emphasizes the role that x86-MCA plays in the implementation to just
one more source of recoverable trap during the operation.
With the above realizations the name "mcsafe" is no longer accurate and
copy_safe() is proposed as its replacement. x86 grows a copy_safe_fast()
implementation as a default implementation that is independent of
detecting the presence of x86-MCA.
---
Dan Williams (2):
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user,kernel}()
x86/copy_mc: Introduce copy_mc_generic()
arch/powerpc/Kconfig | 2
arch/powerpc/include/asm/string.h | 2
arch/powerpc/include/asm/uaccess.h | 40 +++--
arch/powerpc/lib/Makefile | 2
arch/powerpc/lib/copy_mc_64.S | 4
arch/x86/Kconfig | 2
arch/x86/Kconfig.debug | 2
arch/x86/include/asm/copy_mc_test.h | 75 +++++++++
arch/x86/include/asm/mcsafe_test.h | 75 ---------
arch/x86/include/asm/string_64.h | 32 ----
arch/x86/include/asm/uaccess.h | 21 +++
arch/x86/include/asm/uaccess_64.h | 20 --
arch/x86/kernel/cpu/mce/core.c | 8 -
arch/x86/kernel/quirks.c | 9 -
arch/x86/lib/Makefile | 1
arch/x86/lib/copy_mc.c | 64 ++++++++
arch/x86/lib/copy_mc_64.S | 165 ++++++++++++++++++++
arch/x86/lib/memcpy_64.S | 115 --------------
arch/x86/lib/usercopy_64.c | 21 ---
drivers/md/dm-writecache.c | 15 +-
drivers/nvdimm/claim.c | 2
drivers/nvdimm/pmem.c | 6 -
include/linux/string.h | 9 -
include/linux/uaccess.h | 9 +
include/linux/uio.h | 10 +
lib/Kconfig | 7 +
lib/iov_iter.c | 43 +++--
tools/arch/x86/include/asm/mcsafe_test.h | 13 --
tools/arch/x86/lib/memcpy_64.S | 115 --------------
tools/objtool/check.c | 5 -
tools/perf/bench/Build | 1
tools/perf/bench/mem-memcpy-x86-64-lib.c | 24 ---
tools/testing/nvdimm/test/nfit.c | 48 +++---
.../testing/selftests/powerpc/copyloops/.gitignore | 2
tools/testing/selftests/powerpc/copyloops/Makefile | 6 -
.../selftests/powerpc/copyloops/copy_mc_64.S | 1
.../selftests/powerpc/copyloops/memcpy_mcsafe_64.S | 1
37 files changed, 451 insertions(+), 526 deletions(-)
rename arch/powerpc/lib/{memcpy_mcsafe_64.S => copy_mc_64.S} (98%)
create mode 100644 arch/x86/include/asm/copy_mc_test.h
delete mode 100644 arch/x86/include/asm/mcsafe_test.h
create mode 100644 arch/x86/lib/copy_mc.c
create mode 100644 arch/x86/lib/copy_mc_64.S
delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h
delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c
create mode 120000 tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
delete mode 120000 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
base-commit: 11ba468877bb23f28956a35e896356252d63c983
5 months, 2 weeks
[PATCH] acpi/nfit: Use kobj_to_dev() instead
by Wang Qing
Use kobj_to_dev() instead of container_of()
Signed-off-by: Wang Qing <wangqing(a)vivo.com>
---
drivers/acpi/nfit/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index fa4500f..3bb350b
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1382,7 +1382,7 @@ static bool ars_supported(struct nvdimm_bus *nvdimm_bus)
static umode_t nfit_visible(struct kobject *kobj, struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
if (a == &dev_attr_scrub.attr && !ars_supported(nvdimm_bus))
@@ -1667,7 +1667,7 @@ static struct attribute *acpi_nfit_dimm_attributes[] = {
static umode_t acpi_nfit_dimm_attr_visible(struct kobject *kobj,
struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct nvdimm *nvdimm = to_nvdimm(dev);
struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
--
2.7.4
5 months, 3 weeks