[ndctl PATCH v13 0/5] ndctl, monitor: add ndctl monitor daemon
by QI Fuli
This is v13 of the ndctl monitor patch series, a small daemon that monitors
the SMART events of NVDIMM DIMMs. Since NVDIMMs have no mirroring-like
redundancy, data on a failed module cannot be restored. The ndctl monitor
daemon catches the SMART event notifications from firmware and writes
notifications to a logfile, so users can replace an NVDIMM before it fails
completely.
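The v4 changelog below notes the monitor's wait loop was moved from select() to epoll(). As a rough, self-contained userspace sketch of that loop only (not the daemon's actual code), with a pipe standing in for the real dimm event descriptor:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Toy model of the monitor's event loop: register a readable fd with epoll,
 * wait for one event, and read the payload. A pipe stands in for the dimm
 * event source the real monitor watches. Returns bytes read, or -1. */
static ssize_t monitor_once(char *buf, size_t len)
{
	int pipefd[2];
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event out;
	ssize_t got = -1;

	if (pipe(pipefd))
		return -1;
	int epfd = epoll_create1(0);
	if (epfd < 0)
		goto out_pipe;
	ev.data.fd = pipefd[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev))
		goto out_epoll;

	/* stand-in for firmware raising a smart event */
	if (write(pipefd[1], "smart", 5) != 5)
		goto out_epoll;
	if (epoll_wait(epfd, &out, 1, 1000 /* ms */) == 1)
		got = read(out.data.fd, buf, len);
out_epoll:
	close(epfd);
out_pipe:
	close(pipefd[0]);
	close(pipefd[1]);
	return got;
}
```

In the real daemon the payload would be a smart event, which monitor.c then filters and formats as a notification.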
Signed-off-by: QI Fuli <qi.fuli(a)jp.fujitsu.com>
---
Change log since v12:
- Fixing log_fn() to remove a trailing newline from the output
- Fixing the hard-coded default configuration file path
- Fixing RPM spec file for configuration file and systemd unit file
- Fixing man page
Change log since v11:
- Adding log_standard()
- Adding [-u | --human] option
- Fixing man page
- Refactoring unit test
- Updating configuration file and systemd unit file to RPM spec file
Change log since v10:
- Adding unit test
- Adding fflush to log_file()
Change log since v9:
- Replacing ndctl_cmd_smart_get_event_flags() with
ndctl_dimm_get_event_flags()
- Adding ndctl_dimm_get_health() api
- Adding ndctl_dimm_get_flags() api
- Adding ndctl_dimm_is_flag_supported() api
- Adding manpage
Change log since v8:
- Adding ndctl_cmd_smart_get_event_flags() api
- Adding monitor_filter_arg to the union in util_filter_ctx
- Removing is_dir()
- Replacing malloc + vsprintf with vasprintf() in log_file() and log_syslog()
- Adding parse_monitor_event()
- Refactoring util_dimm_event_filter()
- Adding event_flags to monitor
- Refactoring dimm_event_to_json()
- Adding check_dimm_supported_threshold_alarms()
- Fixing fail token
Change log since v7:
- Replacing logreport() with log_file() and log_syslog()
- Refactoring read_config_file()
- Replacing set_confile() with parse_config()
- Fixing the ndctl/ndct.conf file
Change log since v6:
- Changing License to GPL-2.0
- Adding event object to output notification
- Adding [--dimm-event] option to filter notification by event type
- Rewriting read_config_file()
- Replacing monitor_dimm_event() with monitor_event()
- Renaming some variables
Change log since v5:
- Fixing a bug where the systemd unit file could not be installed
- Adding license to ./util/abspath.c
Change log since v4:
- Adding OPTION_FILENAME to make sure filename is correct
- Adding configuration file
- Adding [--config-file] option to override the default configuration
- Making some options support multiple space-separated arguments
- Making systemctl enable ndctl-monitor.service command work
- Making systemctl restart ndctl-monitor.service command work
- Making the directory of systemd unit file to be configurable
- Changing log_file() and log_syslog() to logreport()
- Changing date format in notification to nanoseconds since epoch
- Changing select() to epoll()
- Adding filter_bus() and filter_region()
Change log since v3:
- Removing create-monitor, show-monitor, list-monitor, destroy-monitor
- Adding [--daemon] option to run ndctl monitor as a daemon
- Using systemd to manage ndctl monitor daemon
- Replacing filter_monitor_dimm() with filter_dimm()
Change log since v2:
- Changing the interface of daemon to the ndctl command line
- Changing the name of the daemon from "nvdimmd" to "monitor"
- Removing the config file, unit_file, nvdimmd dir
- Removing nvdimmd_test program
- Adding ndctl/monitor.c
Change log since v1:
- Adding a config file (/etc/nvdimmd/nvdimmd.conf)
- Using struct log_ctx instead of syslog()
- Using log_syslog() to save the notify messages to syslog
- Using log_file() to save the notify messages to special file
- Adding LOG_NOTICE level to log_priority
- Using automake instead of Makefile
- Adding a new util file (nvdimmd/util.c) including helper functions
needed for the nvdimm daemon
- Adding nvdimmd_test program
QI Fuli (5):
ndctl, monitor: add a new command - monitor
ndctl, monitor: add main ndctl monitor configuration file
ndctl, monitor: add the unit file of systemd for ndctl-monitor service
ndctl, documentation: add man page for monitor
ndctl, test: add a new unit test for monitor
.gitignore | 1 +
Documentation/ndctl/Makefile.am | 3 +-
Documentation/ndctl/ndctl-monitor.txt | 108 +++++
autogen.sh | 3 +-
builtin.h | 1 +
configure.ac | 23 +
ndctl.spec.in | 3 +
ndctl/Makefile.am | 12 +-
ndctl/lib/libndctl.c | 82 ++++
ndctl/lib/libndctl.sym | 4 +
ndctl/libndctl.h | 10 +
ndctl/monitor.c | 650 ++++++++++++++++++++++++++
ndctl/monitor.conf | 41 ++
ndctl/ndctl-monitor.service | 7 +
ndctl/ndctl.c | 1 +
test/Makefile.am | 14 +-
test/list-smart-dimm.c | 117 +++++
test/monitor.sh | 176 +++++++
util/filter.h | 9 +
19 files changed, 1260 insertions(+), 5 deletions(-)
create mode 100644 Documentation/ndctl/ndctl-monitor.txt
create mode 100644 ndctl/monitor.c
create mode 100644 ndctl/monitor.conf
create mode 100644 ndctl/ndctl-monitor.service
create mode 100644 test/list-smart-dimm.c
create mode 100755 test/monitor.sh
--
2.18.0
[PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
by Dan Williams
Changes since v4 [1]:
* Rework dax_lock_page() to reuse get_unlocked_mapping_entry() (Jan)
* Change the calling convention to take a 'struct page *' and return
success / failure instead of performing the pfn_to_page() internal to
the api (Jan, Ross).
* Rename dax_lock_page() to dax_lock_mapping_entry() (Jan)
* Account for the case that a given pfn can be fsdax mapped with
different sizes in different vmas (Jan)
* Update collect_procs() to determine the mapping size of the pfn for
each page given it can be variable in the dax case.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016279.html
---
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings; however, the primary motivation is to allow the system to
survive userspace consumption of hardware poison via dax. Specifically,
the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks.
---
Dan Williams (11):
device-dax: Convert to vmf_insert_mixed and vm_fault_t
device-dax: Enable page_mapping()
device-dax: Set page->index
filesystem-dax: Set page->index
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, memory_failure: Collect mapping size in collect_procs()
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
x86/memory_failure: Introduce {set,clear}_mce_nospec()
libnvdimm, pmem: Restore page attributes when clearing errors
arch/x86/include/asm/set_memory.h | 42 ++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 --
arch/x86/kernel/cpu/mcheck/mce.c | 38 -----
arch/x86/mm/pat.c | 16 ++
drivers/dax/device.c | 75 +++++++----
drivers/nvdimm/pmem.c | 26 ++++
drivers/nvdimm/pmem.h | 13 ++
fs/dax.c | 125 +++++++++++++++++-
include/linux/dax.h | 24 +++
include/linux/huge_mm.h | 5 -
include/linux/mm.h | 1
include/linux/set_memory.h | 14 ++
mm/huge_memory.c | 4 -
mm/madvise.c | 18 ++-
mm/memory-failure.c | 201 +++++++++++++++++++++++------
15 files changed, 483 insertions(+), 134 deletions(-)
[PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
by Ross Zwisler
Changes since v3:
* Added an ext4_break_layouts() call to ext4_insert_range() to ensure
that the {ext4,xfs}_break_layouts() calls have the same meaning.
(Dave, Darrick and Jan)
---
This series from Dan:
https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.
I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.
The timings in XFS are such that it's difficult to hit this race. Dan
was able to show the race by manually introducing delays in the direct
I/O path.
For ext4, though, it's trivial to hit this race, and a hit will trigger
this WARN_ON_ONCE() in dax_disassociate_entry():
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
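The break_layouts idea is to wait, before removing blocks, until no DAX dma references remain on the page. As a toy userspace model only (an int stands in for page_ref_count(), and the references simply drain in the loop where the real code would sleep and retry):

```c
#include <assert.h>

/* Toy model of the ext4_break_layouts() idea: block removal must wait until
 * the page is down to its map reference (page_ref_count(page) == 1). */
struct fake_page { int refcount; };

static void put_page_ref(struct fake_page *p)
{
	p->refcount--; /* models a dma transfer completing */
}

/* Returns 1 once only the map reference remains, 0 if we gave up. */
static int break_layouts(struct fake_page *p, int max_spins)
{
	while (p->refcount > 1 && max_spins-- > 0)
		put_page_ref(p); /* real code sleeps on the ref, not this */
	return p->refcount <= 1;
}
```

Truncate and hole-punch then proceed only after break_layouts() succeeds, which is why the WARN_ON_ONCE above fires when the wait is missing.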
I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(). Each of the four paths easily hits this race many
times in my test setup with the xfstest. You can find that test here:
https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
continue looking into these rarer hits.
Ross Zwisler (2):
dax: dax_layout_busy_page() warn on !exceptional
ext4: handle layout changes to pinned DAX mappings
fs/dax.c | 10 +++++++++-
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 17 +++++++++++++++++
fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/truncate.h | 4 ++++
5 files changed, 77 insertions(+), 1 deletion(-)
--
2.14.4
[PATCH] device-dax: avoid hang on error before devm_memremap_pages()
by Stefan Hajnoczi
dax_pmem_percpu_exit() waits for dax_pmem_percpu_release() to invoke the
dax_pmem->cmp completion. Unfortunately this approach to cleaning up
the percpu_ref only works after devm_memremap_pages() was successful.
If devm_add_action_or_reset() or devm_memremap_pages() fails,
dax_pmem_percpu_release() is not invoked. Therefore
dax_pmem_percpu_exit() hangs waiting for the completion:
rc = devm_add_action_or_reset(dev, dax_pmem_percpu_exit,
&dax_pmem->ref);
if (rc)
return rc;
dax_pmem->pgmap.ref = &dax_pmem->ref;
addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
Avoid the hang by calling percpu_ref_exit() in the error paths instead
of going through dax_pmem_percpu_exit().
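The shape of the fix can be sketched in plain userspace C (fake_ref, ref_init and ref_exit are illustrative names, not the percpu_ref API): the release callback only fires once devm_memremap_pages() has taken ownership of the ref, so early error paths must tear the ref down directly instead of waiting for it.

```c
#include <assert.h>

/* Model of the error-path ordering: on failure before the mapping step,
 * exit the ref directly rather than waiting for a release callback that
 * will never run. */
struct fake_ref { int inited; };

static void ref_init(struct fake_ref *r) { r->inited = 1; }
static void ref_exit(struct fake_ref *r) { r->inited = 0; }

/* Returns 0 on success; on failure tears the ref down without waiting. */
static int probe(struct fake_ref *r, int memremap_ok)
{
	ref_init(r);
	if (!memremap_ok) {
		/* Before the fix, this path waited on a completion the
		 * release callback would never signal -> hang. */
		ref_exit(r);
		return -1;
	}
	return 0;
}
```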
Signed-off-by: Stefan Hajnoczi <stefanha(a)redhat.com>
---
Found by code inspection. Compile-tested only.
---
drivers/dax/pmem.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index fd49b24fd6af..99e2aace8078 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -105,15 +105,19 @@ static int dax_pmem_probe(struct device *dev)
if (rc)
return rc;
- rc = devm_add_action_or_reset(dev, dax_pmem_percpu_exit,
- &dax_pmem->ref);
- if (rc)
+ rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
+ if (rc) {
+ percpu_ref_exit(&dax_pmem->ref);
return rc;
+ }
dax_pmem->pgmap.ref = &dax_pmem->ref;
addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
- if (IS_ERR(addr))
+ if (IS_ERR(addr)) {
+ devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
+ percpu_ref_exit(&dax_pmem->ref);
return PTR_ERR(addr);
+ }
rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill,
&dax_pmem->ref);
--
2.17.1
[PATCH v2 00/14] mm: Asynchronous + multithreaded memmap init for ZONE_DEVICE
by Dan Williams
Changes since v1 [1]:
* Teach memmap_sync() to take over a sub-set of memmap initialization in
the foreground. This foreground work still needs to await the
completion of vmemmap_populate_hugepages(), but it will otherwise
steal 1/1024th of the 'struct page' init work for the given range.
(Jan)
* Add kernel-doc for all the new 'async' structures.
* Split foreach_order_pgoff() to its own patch.
* Add Pavel and Daniel to the cc as they have been active in the memory
hotplug code.
* Fix a typo that prevented CONFIG_DAX_DRIVER_DEBUG=y from performing
early pfn retrieval at dax-filesystem mount time.
* Improve some of the changelogs
[1]: https://lwn.net/Articles/759117/
---
In order to keep pfn_to_page() a simple offset calculation the 'struct
page' memmap needs to be mapped and initialized in advance of any usage
of a page. This poses a problem for large memory systems as it delays
full availability of memory resources for 10s to 100s of seconds.
For typical 'System RAM' the problem is mitigated by the fact that large
memory allocations tend to happen after the kernel has fully initialized
and userspace services / applications are launched. A small amount, 2GB
of memory, is initialized up front. The remainder is initialized in the
background and freed to the page allocator over time.
Unfortunately, that scheme is not directly reusable for persistent
memory and dax because userspace has visibility to the entire resource
pool and can choose to access any offset directly at its choosing. In
other words there is no allocator indirection where the kernel can
satisfy requests with arbitrary pages as they become initialized.
That said, we can approximate the optimization by performing the
initialization in the background, allow the kernel to fully boot the
platform, start up pmem block devices, mount filesystems in dax mode,
and only incur delay at the first userspace dax fault. When that initial
fault occurs that process is delegated a portion of the memmap to
initialize in the foreground so that it need not wait for initialization
of resources that it does not immediately need.
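The foreground-steal arithmetic from the v1 changelog (the faulting thread takes over the 1/1024th slice of 'struct page' init work containing its pfn) can be sketched as plain math; the helpers and rounding here are illustrative, not the kernel's exact code:

```c
#include <assert.h>

/* Size of the slice each foreground faulter initializes: 1/1024th of the
 * range, with a floor of one pfn for tiny ranges. */
static unsigned long chunk_size(unsigned long total_pfns)
{
	unsigned long sz = total_pfns / 1024;
	return sz ? sz : 1;
}

/* First pfn offset of the slice containing the faulting offset, so the
 * faulter initializes exactly its own 1/1024th and nothing more. */
static unsigned long chunk_start(unsigned long pfn_off, unsigned long total_pfns)
{
	return (pfn_off / chunk_size(total_pfns)) * chunk_size(total_pfns);
}
```

Background threads then cover the remaining slices, which is how the first dax fault avoids waiting for the whole memmap.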
With this change an 8 socket system was observed to initialize pmem
namespaces in ~4 seconds whereas it was previously taking ~4 minutes.
These patches apply on top of the HMM + devm_memremap_pages() reworks:
https://marc.info/?l=linux-mm&m=153128668008585&w=2
---
Dan Williams (10):
mm: Plumb dev_pagemap instead of vmem_altmap to memmap_init_zone()
mm: Enable asynchronous __add_pages() and vmemmap_populate_hugepages()
mm: Teach memmap_init_zone() to initialize ZONE_DEVICE pages
mm: Multithread ZONE_DEVICE initialization
mm, memremap: Up-level foreach_order_pgoff()
mm: Allow an external agent to coordinate memmap initialization
filesystem-dax: Make mount time pfn validation a debug check
libnvdimm, pmem: Initialize the memmap in the background
device-dax: Initialize the memmap in the background
libnvdimm, namespace: Publish page structure init state / control
Huaisheng Ye (4):
libnvdimm, pmem: Allow a NULL-pfn to ->direct_access()
tools/testing/nvdimm: Allow a NULL-pfn to ->direct_access()
s390, dcssblk: Allow a NULL-pfn to ->direct_access()
filesystem-dax: Do not request a pfn when not required
arch/ia64/mm/init.c | 5 +
arch/powerpc/mm/mem.c | 5 +
arch/s390/mm/init.c | 8 +
arch/sh/mm/init.c | 5 +
arch/x86/mm/init_32.c | 8 +
arch/x86/mm/init_64.c | 27 ++--
drivers/dax/Kconfig | 10 +
drivers/dax/dax-private.h | 2
drivers/dax/device-dax.h | 2
drivers/dax/device.c | 16 ++
drivers/dax/pmem.c | 5 +
drivers/dax/super.c | 64 ++++++---
drivers/nvdimm/nd.h | 2
drivers/nvdimm/pfn_devs.c | 50 +++++--
drivers/nvdimm/pmem.c | 17 ++
drivers/nvdimm/pmem.h | 1
drivers/s390/block/dcssblk.c | 5 -
fs/dax.c | 10 -
include/linux/memmap_async.h | 110 ++++++++++++++++
include/linux/memory_hotplug.h | 18 ++-
include/linux/memremap.h | 31 ++++
include/linux/mm.h | 8 +
kernel/memremap.c | 85 ++++++------
mm/memory_hotplug.c | 73 ++++++++---
mm/page_alloc.c | 271 +++++++++++++++++++++++++++++++++++----
mm/sparse-vmemmap.c | 56 ++++++--
tools/testing/nvdimm/pmem-dax.c | 11 +-
27 files changed, 717 insertions(+), 188 deletions(-)
create mode 100644 include/linux/memmap_async.h
[PATCH v1] libnvdimm, namespace: Replace kmemdup() with kstrndup()
by Andy Shevchenko
kstrndup() takes care of the '\0' terminator for strings.
Use it here instead of kmemdup() plus explicitly terminating the input string.
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
---
drivers/nvdimm/namespace_devs.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 28afdd668905..19525f025539 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -270,11 +270,10 @@ static ssize_t __alt_name_store(struct device *dev, const char *buf,
if (dev->driver || to_ndns(dev)->claim)
return -EBUSY;
- input = kmemdup(buf, len + 1, GFP_KERNEL);
+ input = kstrndup(buf, len, GFP_KERNEL);
if (!input)
return -ENOMEM;
- input[len] = '\0';
pos = strim(input);
if (strlen(pos) + 1 > NSLABEL_NAME_LEN) {
rc = -EINVAL;
--
2.17.1
[RFC v3 0/2] kvm "fake DAX" device flushing
by Pankaj Gupta
This is RFC v3 of the 'fake DAX' flushing interface, shared
for review. This patchset has two parts:
- Guest virtio-pmem driver
The guest driver reads the persistent memory range from the paravirt
device and registers it with 'nvdimm_bus'. The 'nvdimm/pmem' driver uses
this information to allocate the persistent memory range. We have also
implemented the guest side of the VIRTIO flushing interface.
- Qemu virtio-pmem device
It exposes a persistent memory range to the KVM guest which, on the
host side, is file-backed memory and works as a persistent memory device.
In addition, it provides the virtio device handling of the flushing
interface; the KVM guest performs a Qemu-side asynchronous sync using
this interface.
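Conceptually, since the guest's pmem range is file-backed on the host, a guest flush request ultimately results in the host syncing that backing file. As a model only (the real handling lives in the Qemu device code, not in a helper like this):

```c
#include <fcntl.h>
#include <unistd.h>

/* Toy model of the host side of the flush interface: the "device" is a
 * plain file backing the guest's pmem range, and servicing a guest flush
 * request means syncing it. Returns 0 on success, like a completed
 * virtio request. */
static int host_handle_flush(int backing_fd)
{
	return fsync(backing_fd);
}
```

The point of doing this over virtio is that the guest gets durability for its "persistent" memory even though the host backing is an ordinary file.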
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Details of the project idea for the 'fake DAX' flushing interface are
shared in [2] & [3].
Pankaj Gupta (2):
Add virtio-pmem guest driver
pmem: device flush over VIRTIO
[1] https://marc.info/?l=linux-mm&m=150782346802290&w=2
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html
drivers/nvdimm/nd.h | 1
drivers/nvdimm/pmem.c | 4
drivers/nvdimm/region_devs.c | 24 +++-
drivers/virtio/Kconfig | 9 +
drivers/virtio/Makefile | 1
drivers/virtio/virtio_pmem.c | 190 +++++++++++++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 5 -
include/linux/virtio_pmem.h | 44 +++++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 40 ++++++++
10 files changed, 310 insertions(+), 9 deletions(-)
[PATCH v2 0/3] Add support for memcpy_mcsafe
by Balbir Singh
memcpy_mcsafe() is an API currently used by the pmem subsystem to convert
errors encountered while doing a memcpy (machine check exceptions) into a
return value. This patchset consists of three patches:
1. The first patch is a bug fix to handle machine check errors correctly
while walking the page tables in kernel mode, due to huge pmd/pud sizes
2. The second patch adds memcpy_mcsafe() support, this is largely derived
from existing code
3. The third patch registers for callbacks on machine check exceptions and
in them uses specialized knowledge of the type of page to decide whether
to handle the MCE as is or to return to a fixup address present in
memcpy_mcsafe(). If a fixup address is used, then we return an error
value of -EFAULT to the caller.
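A userspace program cannot take a real machine check, so the calling convention described above can only be modeled; the 'poisoned' flag below is a stand-in for the MCE fixup path, not real poison, and -1 stands in for the -EFAULT the series returns:

```c
#include <stddef.h>
#include <string.h>

/* Model of the memcpy_mcsafe() contract only: 0 when the whole copy
 * succeeded, nonzero when the fixup path reported a fault mid-copy. */
static int memcpy_mcsafe_model(void *dst, const void *src, size_t n,
			       int poisoned)
{
	if (poisoned)
		return -1; /* fixup address taken: report fault to caller */
	memcpy(dst, src, n);
	return 0;
}
```

The pmem driver checks this return value so that consuming a poisoned line becomes a recoverable I/O error instead of a kernel crash.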
Testing
A large part of the testing was done under a simulator by selectively
inserting machine check exceptions in a test driver doing memcpy_mcsafe
via ioctls.
Changelog v2
- Fix the logic of shifting in addr_to_pfn
- Use shift consistently instead of PAGE_SHIFT
- Fix a typo in patch1
Balbir Singh (3):
powerpc/mce: Bug fixes for MCE handling in kernel space
powerpc/memcpy: Add memcpy_mcsafe for pmem
powerpc/mce: Handle memcpy_mcsafe
arch/powerpc/include/asm/mce.h | 3 +-
arch/powerpc/include/asm/string.h | 2 +
arch/powerpc/kernel/mce.c | 77 ++++++++++++-
arch/powerpc/kernel/mce_power.c | 26 +++--
arch/powerpc/lib/Makefile | 2 +-
arch/powerpc/lib/memcpy_mcsafe_64.S | 212 ++++++++++++++++++++++++++++++++++++
6 files changed, 308 insertions(+), 14 deletions(-)
create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S
--
2.13.6
Help trying to use /dev/pmem for dax debugging?
by Theodore Y. Ts'o
In newer kernels, it looks like you can't use /dev/pmem0 for DAX
unless it's marked as being DAX capable. This appears to require
CONFIG_NVDIMM_PFN. But when I tried to build a kernel with that
configured, I get the following BUG:
[ 0.000000] Linux version 4.18.0-rc4-xfstests-00031-g7c2d77aa7d80 (tytso@cwcc) (gcc version 7.3.0 (Debian 7.3.0-27)) #460 SMP Mon Jul 30 19:38:44 EDT 2018
[ 0.000000] Command line: systemd.show_status=auto systemd.log_level=crit root=/dev/vda console=ttyS0,115200 cmd=maint fstesttz=America/New_York fstesttyp=ext4 fstestapi=1.4 memmap=4G!9G memmap=9G!14G
...
[ 16.544707] BUG: unable to handle kernel paging request at ffffed0048000000
[ 16.546132] PGD 6bffe9067 P4D 6bffe9067 PUD 6bfbec067 PMD 0
[ 16.547174] Oops: 0000 [#1] SMP KASAN PTI
[ 16.547923] CPU: 0 PID: 81 Comm: kworker/u8:1 Not tainted 4.18.0-rc4-xfstests-00031-g7c2d77aa7d80 #460
[ 16.549706] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[ 16.551285] Workqueue: events_unbound async_run_entry_fn
[ 16.552309] RIP: 0010:check_memory_region+0xdd/0x190
[ 16.553264] Code: 74 0b 41 80 38 00 74 f0 4d 85 c0 75 56 4c 01 c8 49 89 e8 49 29 c0 4d 8d 48 07 4d 85 c0 4d 0f 49 c8 49 c1 f9 03 45 85 c9 74 5b <48> 83 38 00 75 18 45 8d 41 ff 4e 8d 44 c0 08 48 83 c0 08 49 39 c0
[ 16.556872] RSP: 0000:ffff8806469b6bb8 EFLAGS: 00010202
[ 16.557861] RAX: ffffed0048000000 RBX: ffff880240000fff RCX: ffffffffa8a2f9bc
[ 16.559500] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff880240000000
[ 16.561255] RBP: ffffed0048000200 R08: 0000000000000200 R09: 0000000000000040
[ 16.563245] R10: 0000000000000200 R11: ffffed00480001ff R12: ffff880240000000
[ 16.565186] R13: dffffc0000000000 R14: fffffbfff5361562 R15: ffffea0015d34bd8
[ 16.567119] FS: 0000000000000000(0000) GS:ffff88064b600000(0000) knlGS:0000000000000000
[ 16.569331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.570927] CR2: ffffed0048000000 CR3: 0000000212416001 CR4: 0000000000360ef0
[ 16.572839] Call Trace:
[ 16.573493] memcpy+0x1f/0x50
[ 16.574050] pmem_do_bvec+0x1dc/0x670
[ 16.575086] ? pmem_release_pgmap_ops+0x10/0x10
[ 16.576392] ? rcu_read_lock_sched_held+0x110/0x130
[ 16.577785] ? generic_make_request_checks+0xf87/0x1520
[ 16.579310] ? do_read_cache_page+0x219/0x8b0
[ 16.580551] pmem_make_request+0x306/0x9e0
[ 16.581714] generic_make_request+0x565/0xd30
[ 16.582947] ? mempool_alloc+0xf7/0x2d0
[ 16.584032] ? blk_plug_queued_count+0x150/0x150
[ 16.585339] ? sched_clock_cpu+0x18/0x180
[ 16.586473] ? debug_show_all_locks+0x2d0/0x2d0
[ 16.587803] ? submit_bio+0x139/0x3a0
[ 16.588864] submit_bio+0x139/0x3a0
[ 16.589896] ? lock_downgrade+0x5e0/0x5e0
[ 16.591031] ? lock_acquire+0x106/0x3e0
[ 16.592123] ? direct_make_request+0x1e0/0x1e0
[ 16.593428] ? guard_bio_eod+0x19d/0x570
[ 16.594547] submit_bh_wbc.isra.12+0x409/0x5a0
[ 16.595804] block_read_full_page+0x526/0x800
[ 16.597032] ? block_llseek+0xd0/0xd0
[ 16.598072] ? block_page_mkwrite+0x270/0x270
[ 16.599317] ? add_to_page_cache_lru+0x119/0x210
[ 16.600621] ? add_to_page_cache_locked+0x40/0x40
[ 16.601943] ? pagecache_get_page+0x44/0x6b0
[ 16.603153] do_read_cache_page+0x219/0x8b0
[ 16.604338] ? blkdev_writepages+0x10/0x10
[ 16.605500] read_dev_sector+0xbb/0x390
[ 16.606606] read_lba.isra.0+0x2f0/0x5c0
[ 16.607735] ? compare_gpts+0x1500/0x1500
[ 16.608870] ? efi_partition+0x2bc/0x1bb0
[ 16.610021] ? rcu_read_lock_sched_held+0x110/0x130
[ 16.611387] efi_partition+0x2e6/0x1bb0
[ 16.612468] ? __isolate_free_page+0x530/0x530
[ 16.613717] ? rcu_read_lock_sched_held+0x110/0x130
[ 16.615103] ? is_gpt_valid.part.1+0xdc0/0xdc0
[ 16.616396] ? string+0x14c/0x220
[ 16.617344] ? string+0x14c/0x220
[ 16.618285] ? format_decode+0x3be/0x760
[ 16.619409] ? vsnprintf+0x1ff/0x10a0
[ 16.620439] ? num_to_str+0x220/0x220
[ 16.621472] ? snprintf+0x8f/0xc0
[ 16.622411] ? vscnprintf+0x30/0x30
[ 16.623402] ? is_gpt_valid.part.1+0xdc0/0xdc0
[ 16.624650] ? check_partition+0x308/0x660
[ 16.625818] check_partition+0x308/0x660
[ 16.626966] rescan_partitions+0x187/0x8d0
[ 16.628123] ? lock_acquire+0x106/0x3e0
[ 16.629219] ? up_write+0x1d/0x150
[ 16.630185] ? bd_set_size+0x24e/0x2e0
[ 16.631244] __blkdev_get+0x696/0xfd0
[ 16.632276] ? bd_set_size+0x2e0/0x2e0
[ 16.633337] ? kvm_sched_clock_read+0x21/0x30
[ 16.634570] ? sched_clock+0x5/0x10
[ 16.635563] ? sched_clock_cpu+0x18/0x180
[ 16.636706] blkdev_get+0x28f/0x850
[ 16.637714] ? lockdep_rcu_suspicious+0x150/0x150
[ 16.639032] ? __blkdev_get+0xfd0/0xfd0
[ 16.640144] ? refcount_sub_and_test+0xcd/0x160
[ 16.641415] ? refcount_inc+0x30/0x30
[ 16.642453] ? do_raw_spin_unlock+0x144/0x220
[ 16.643680] ? kobject_put+0x50/0x410
[ 16.644711] __device_add_disk+0xbe5/0xe40
[ 16.645916] ? bdget_disk+0x60/0x60
[ 16.646919] ? alloc_dax+0x2b2/0x5b0
[ 16.647939] ? kill_dax+0x140/0x140
[ 16.648928] ? nvdimm_badblocks_populate+0x47/0x360
[ 16.649904] ? __raw_spin_lock_init+0x2d/0x100
[ 16.650720] pmem_attach_disk+0x944/0xf90
[ 16.651477] ? nd_pmem_notify+0x4a0/0x4a0
[ 16.652233] ? kfree+0xd4/0x210
[ 16.652822] ? nd_dax_probe+0x1d0/0x240
[ 16.653526] nvdimm_bus_probe+0xd4/0x370
[ 16.654261] driver_probe_device+0x56d/0xbe0
[ 16.655432] ? __driver_attach+0x2c0/0x2c0
[ 16.656548] bus_for_each_drv+0x10d/0x1a0
[ 16.657414] ? subsys_find_device_by_id+0x2e0/0x2e0
[ 16.658385] __device_attach+0x19c/0x230
[ 16.659225] ? device_bind_driver+0xa0/0xa0
[ 16.660135] ? kobject_uevent_env+0x223/0xfb0
[ 16.661072] bus_probe_device+0x1ad/0x260
[ 16.661852] ? sysfs_create_groups+0x86/0x130
[ 16.662826] device_add+0x9fe/0x1340
[ 16.663814] ? device_private_init+0x180/0x180
[ 16.664786] nd_async_device_register+0xe/0x40
[ 16.665621] async_run_entry_fn+0xc3/0x630
[ 16.666400] process_one_work+0x767/0x1670
[ 16.667221] ? debug_show_all_locks+0x2d0/0x2d0
[ 16.668137] ? pwq_dec_nr_in_flight+0x2c0/0x2c0
[ 16.669011] worker_thread+0x87/0xb90
[ 16.669730] ? __kthread_parkme+0xb6/0x180
[ 16.670515] ? process_one_work+0x1670/0x1670
[ 16.671339] kthread+0x314/0x3d0
[ 16.671963] ? kthread_flush_work_fn+0x10/0x10
[ 16.672810] ret_from_fork+0x3a/0x50
[ 16.673502] CR2: ffffed0048000000
[ 16.674436] ---[ end trace ac6b16a57e0c48ad ]---
Does this ring any bells? Any suggestions about how I get ext4 dax
testing working again? Many thanks!!
(full log and config attached below, compressed for size reasons)
- Ted
[RFC PATCH 1/1] device-dax: check for vma range while dax_mmap.
by Zhang Yi
We should prevent userspace from mapping a vma range larger than the
dax device's physical resource, as we have no swap logic when page
faulting in a dax device.
Applications, especially qemu, map /dev/dax as the backend device for a
virtual NVDIMM, and the v-NVDIMM label area is defined at the end of the
mapped range. Using an illegal size that exceeds the physical resource
of /dev/dax will trigger a signal fault in qemu when it accesses that
label area.
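The overflow test the patch adds boils down to plain arithmetic, lifted here into a standalone helper (the function name and parameters are illustrative): the requested extent is the vma length plus the starting file offset, and it must fit inside the device's resource.

```c
#include <assert.h>

#define PAGE_SHIFT 12

/* The size check from check_vma_range() as pure arithmetic: returns 1 when
 * the mapping's end offset fits inside the dax resource, 0 on overflow. */
static int vma_range_ok(unsigned long vm_start, unsigned long vm_end,
			unsigned long vm_pgoff, unsigned long res_size)
{
	unsigned long size = vm_end - vm_start + (vm_pgoff << PAGE_SHIFT);

	return size <= res_size;
}
```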
Signed-off-by: Zhang Yi <yi.z.zhang(a)linux.intel.com>
---
drivers/dax/device.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index aff2c15..c9a50cd 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -177,6 +177,32 @@ static const struct attribute_group *dax_attribute_groups[] = {
NULL,
};
+static int check_vma_range(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+ const char *func)
+{
+ struct device *dev = &dev_dax->dev;
+ struct resource *res;
+ unsigned long size;
+ int ret, i;
+
+ if (!dax_alive(dev_dax->dax_dev))
+ return -ENXIO;
+
+ size = vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT);
+ ret = -EINVAL;
+ for (i = 0; i < dev_dax->num_resources; i++) {
+ res = &dev_dax->res[i];
+ if (size > resource_size(res)) {
+ dev_info(dev, "%s: %s: fail, vma range is overflow\n",
+ current->comm, func);
+ ret = -EINVAL;
+ continue;
+ } else
+ return 0;
+ }
+ return ret;
+}
+
static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
const char *func)
{
@@ -465,6 +491,8 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
*/
id = dax_read_lock();
rc = check_vma(dev_dax, vma, __func__);
+ if (!rc)
+ rc |= check_vma_range(dev_dax, vma, __func__);
dax_read_unlock(id);
if (rc)
return rc;
--
2.7.4