[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
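As a rough illustration of the byte-array point above: when an ID field is kept as an array of bytes in big-endian order, its value is assembled byte by byte instead of being read through a native integer type, so the result does not depend on host endianness. The following is a user-space sketch only. `struct nvdimm_id_sketch`, the helper names, and the "vendor-serial" output format are assumptions made for this example and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout: ID fields kept as raw byte arrays (big-endian order). */
struct nvdimm_id_sketch {
	uint8_t vendor_id[2];
	uint8_t serial_number[4];
};

/* Assemble a 16-bit value from two bytes, most significant byte first. */
static uint16_t be16_from_bytes(const uint8_t b[2])
{
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Assemble a 32-bit value from four bytes, most significant byte first. */
static uint32_t be32_from_bytes(const uint8_t b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Format "vvvv-ssssssss" into buf; this format is illustrative only. */
static int format_id_sketch(const struct nvdimm_id_sketch *id,
			    char *buf, size_t len)
{
	return snprintf(buf, len, "%04x-%08x",
			(unsigned int)be16_from_bytes(id->vendor_id),
			(unsigned int)be32_from_bytes(id->serial_number));
}
```

With the bytes 0x80 0x86 as vendor ID and 0x12 0x34 0x56 0x78 as serial number, the sketch yields the same string on little-endian and big-endian hosts.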
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
2 years, 7 months
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata is slow, and the metadata may be lost or corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
3 years, 2 months
[PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm
by Dan Williams
A couple weeks back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus subtly suggested that the pmem specific
memcpy_to_pmem() routine be moved to be implemented at the driver
level [1]:
"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.
As you point out, it's also fundamentally buggy crap.
Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.
If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of <linux/uaccess.h>."
This feedback also dovetails with another fs/dax.c design wart of being
hard coded to assume the backing device is pmem. We call the pmem
specific copy, clear, and flush routines even if the backing device
driver is one of the other 3 dax drivers (axonram, dccssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it is using volatile memory for storage.
Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power fail. This latter capability is a feature that has
appeared in embedded storage appliances ("legacy" / pre-NFIT nvdimm
platforms).
So, this series:
1/ moves what was previously named "the pmem api" out of the global
namespace and into "that *driver*" (libnvdimm / pmem).
2/ arranges for dax to stop abusing copy_user_nocache() and implements a
libnvdimm-local memcpy that uses movnt
3/ makes cache maintenance optional by arranging for dax to call driver
specific copy and flush operations only if the driver publishes them.
4/ adds a module parameter that can be used to inform libnvdimm of a
platform-level flush-cache-on-power-fail capability.
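Point 3/ above can be pictured as an operations table whose hooks are optional. The sketch below is a user-space model under stated assumptions: `struct dax_ops_sketch`, `dax_write_sketch()` and the two example tables are hypothetical names for illustration and are not the interfaces introduced by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-driver operations; a NULL hook means "not needed". */
struct dax_ops_sketch {
	void (*flush)(void *addr, size_t len);
};

static unsigned long flush_calls; /* counts flushes, for demonstration only */

static void pmem_flush_sketch(void *addr, size_t len)
{
	(void)addr;
	(void)len;
	flush_calls++; /* a real driver would write back cpu caches here */
}

/* Copy the data, then flush only if the driver published a flush hook. */
static void dax_write_sketch(const struct dax_ops_sketch *ops,
			     void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	if (ops && ops->flush)
		ops->flush(dst, len);
}

/* A persistent-memory style driver publishes flush; a ramdisk does not. */
static const struct dax_ops_sketch pmem_ops_sketch = { .flush = pmem_flush_sketch };
static const struct dax_ops_sketch brd_ops_sketch = { .flush = NULL };
```

In this model a write through `brd_ops_sketch` spends no cycles on cache maintenance, while a write through `pmem_ops_sketch` calls the flush hook once.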
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
These patches have a build success notification from the 0day kbuild robot
and pass the libnvdimm / ndctl unit tests. I am looking to take them
through the libnvdimm tree with acks from x86, block, dm etc...
---
Dan Williams (13):
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block, dax: introduce dax_operations
x86, dax, pmem: introduce 'copy_from_iter' dax operation
dax, pmem: introduce an optional 'flush' dax operation
x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
x86, libnvdimm, dax: stop abusing __copy_user_nocache
libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
libnvdimm, pmem: fix persistence warning
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem: disable dax flushing for 'cache flush on fail' platforms
MAINTAINERS | 2
arch/powerpc/sysdev/axonram.c | 6 +
arch/x86/Kconfig | 1
arch/x86/include/asm/pmem.h | 121 ----------------------------
arch/x86/include/asm/string_64.h | 1
drivers/acpi/nfit/core.c | 15 ++-
drivers/block/brd.c | 6 +
drivers/md/dm.c | 6 +
drivers/nvdimm/Kconfig | 5 +
drivers/nvdimm/Makefile | 2
drivers/nvdimm/bus.c | 10 +-
drivers/nvdimm/claim.c | 9 +-
drivers/nvdimm/core.c | 2
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/dimm_devs.c | 4 -
drivers/nvdimm/namespace_devs.c | 9 +-
drivers/nvdimm/nd-core.h | 9 ++
drivers/nvdimm/pfn_devs.c | 4 -
drivers/nvdimm/pmem.c | 46 ++++++++---
drivers/nvdimm/pmem.h | 20 +++++
drivers/nvdimm/region_devs.c | 52 ++++++++----
drivers/nvdimm/x86-asm.S | 71 ++++++++++++++++
drivers/nvdimm/x86.c | 84 +++++++++++++++++++
drivers/s390/block/dcssblk.c | 6 +
fs/block_dev.c | 6 +
fs/dax.c | 35 +++++++-
include/linux/blkdev.h | 10 ++
include/linux/libnvdimm.h | 9 ++
include/linux/pmem.h | 165 --------------------------------------
include/linux/string.h | 8 ++
include/linux/uio.h | 4 +
lib/Kconfig | 6 +
lib/iov_iter.c | 25 ++++++
tools/testing/nvdimm/Kbuild | 2
34 files changed, 405 insertions(+), 358 deletions(-)
delete mode 100644 arch/x86/include/asm/pmem.h
create mode 100644 drivers/nvdimm/x86-asm.S
create mode 100644 drivers/nvdimm/x86.c
delete mode 100644 include/linux/pmem.h
3 years, 9 months
[PATCH 0/6] introduce DAX tracepoint support
by Ross Zwisler
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. This series creates a tracepoint header for FS DAX and adds
the first few DAX tracepoints to the PMD fault handler. This allows the
tracing for DAX to be done in the same way as the filesystem tracing so
that developers can look at them together and get a coherent idea of what
the system is doing.
I do intend to add tracepoints to the normal 4k DAX fault path and to the
DAX I/O path, but I wanted to get feedback on the PMD tracepoints before I
went any further.
This series is based on Jan Kara's "dax: Clear dirty bits after flushing
caches" series:
https://lists.01.org/pipermail/linux-nvdimm/2016-November/007864.html
I've pushed a git tree with this work here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax...
Ross Zwisler (6):
dax: fix build breakage with ext4, dax and !iomap
dax: remove leading space from labels
dax: add tracepoint infrastructure, PMD tracing
dax: update MAINTAINERS entries for FS DAX
dax: add tracepoints to dax_pmd_load_hole()
dax: add tracepoints to dax_pmd_insert_mapping()
MAINTAINERS | 4 +-
fs/Kconfig | 1 +
fs/dax.c | 78 ++++++++++++++----------
fs/ext2/Kconfig | 1 -
include/linux/mm.h | 14 +++++
include/linux/pfn_t.h | 6 ++
include/trace/events/fs_dax.h | 135 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 206 insertions(+), 33 deletions(-)
create mode 100644 include/trace/events/fs_dax.h
--
2.7.4
3 years, 10 months
[LTP issues] MAP_LOCKED MS_INVALIDATE, dio rw odd count on DAX
by Xiong Zhou
Hi,
LTP tests on DAX show 2 issues.
msync03 and diotest4, both xfs and ext4,
non-DAX pass
DAX fail
1. MAP_LOCKED && msync with MS_INVALIDATE, which should fail.
The flag checking code in msync looks ok, but are the _LOCK vma flags
missing for a DAX mapped vma? I guess DAX does not support that now?
Tracked by LTP testcase "msync03".
2. O_DIRECT rw with odd counts on DAX
Reading/writing 1 byte on a file opened with O_DIRECT is expected to
fail with EINVAL, but it succeeds.
I'm not sure whether this is an issue, please enlighten :)
Tracked by LTP testcase "dio04 diotest4".
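For context on the second issue: on block-backed files, direct I/O typically requires the buffer address, file offset, and length to be aligned to the device's logical block size, which is why a 1-byte O_DIRECT request is expected to fail with EINVAL in the non-DAX case. The sketch below only models that rule in user space. The function name and the 512-byte block size used in the usage note are assumptions for illustration, and this is not the kernel's check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the alignment rule: returns 0 if buffer, offset and length
 * are all aligned to blksz (assumed to be a power of two), and
 * -EINVAL otherwise.
 */
static int dio_check_alignment_sketch(const void *buf, uint64_t offset,
				      size_t len, size_t blksz)
{
	uint64_t mask = blksz - 1;

	if (((uintptr_t)buf & mask) || (offset & mask) || (len & mask))
		return -EINVAL;
	return 0;
}
```

With an assumed block size of 512, a length of 1 is rejected and a length of 512 at offset 0 from an aligned buffer is accepted.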
BTW, I am testing DAX with xfstests, LTP and other fs test
cases. If the same case fails on DAX but passes on non-DAX,
I'll look into it and report it if it looks like a real issue
to me. I've been doing this for a while. Recently, I started
looking at cases that fail on non-DAX and pass on DAX, inspired
by Darrick in another thread.
For now, the test results look good. Except for the above 2
issues, which I've seen for a while and am not sure are really
issues, xfstests check -g auto has no major regressions
between DAX and non-DAX. generic/403 is a new case and its
failures are under investigation; I'll report if it is a real issue.
Thanks,
Xiong
3 years, 10 months
mmap dio write failure
by Xiong Zhou
Hi,
At first, I am not sure whether this is an issue.
mmap a file in a DAX mountpoint, open another file
in a non-DAX mountpoint with O_DIRECT, write the
mapped area to the other file.
This write succeeds on a pmem ramdisk (memmap=2G!20G style).
This write fails (Bad address) on nvdimm pmem devices.
This write fails (Bad address) on a brd based ramdisk.
If we skip the O_DIRECT flag, all tests pass.
If we write from DAX to DAX, all tests pass.
If we write from non-DAX to DAX, all tests pass.
Kernel version: Linus tree commit 44b4b46.
I have checked back to v4.6, testing on nvdimm devices,
with all the same results. I do remember that this test
passed on nvdimms back in May 2016 and I have some
notes for that. However, things have changed a lot since:
test scripts, kernel code, even the nvdimm and machine
firmware.
Thanks,
Xiong
sh-4.2# cat tbad.sh
#!/bin/bash
[ -z "$1" ] && exit 1
DEV="$1"
MNT=/tbdmnt
cc t_mmap_dio.c
mkdir -p $MNT
wipefs -af $DEV
mkfs.xfs -fq $DEV && \
mount -o dax $DEV $MNT && \
xfs_io -f -c "w -W 0 268435456" $MNT/ts > /dev/null && \
xfs_io -f -c "w -W 0 268435456" ./td > /dev/null
if ./a.out $MNT/ts ./td 16777216 "$DEV" ; then
echo PASS
else
echo FAIL
fi
umount $MNT
sh-4.2# cat t_mmap_dio.c
/*
 * This programme was originally written by
 * Jeff Moyer <jmoyer(a)redhat.com>
 */
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <libaio.h>
#include <errno.h>
#include <sys/time.h>

void usage(char *prog)
{
	fprintf(stderr,
		"usage: %s <src file> <dest file> <size> <msg>\n",
		prog);
	exit(1);
}

void err_exit(char *op, unsigned long len, char *s)
{
	fprintf(stderr, "%s(%s) len %lu %s\n",
		op, strerror(errno), len, s);
	exit(1);
}

int main(int argc, char **argv)
{
	int fd, fd2, ret;
	ssize_t nr;
	char *map;
	unsigned long len;

	/* All four arguments are required; argv[4] is used in error messages */
	if (argc < 5)
		usage(basename(argv[0]));

	/* Reset errno so a stale ERANGE is not mistaken for an overflow */
	errno = 0;
	len = strtoul(argv[3], NULL, 10);
	if (errno == ERANGE)
		err_exit("strtoul", 0, argv[4]);

	/* Open source file and mmap */
	fd = open(argv[1], O_RDWR, 0644);
	if (fd < 0)
		err_exit("open s", len, argv[4]);

	map = (char *)mmap(NULL, len,
		PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		err_exit("mmap", len, argv[4]);

	/* Open dest file with O_DIRECT */
	fd2 = open(argv[2], O_RDWR|O_DIRECT, 0644);
	if (fd2 < 0)
		err_exit("open d", len, argv[4]);

	/* First, test storing to dest file from source mapping */
	nr = write(fd2, map, len);
	if (nr < 0 || (unsigned long)nr != len)
		err_exit("write", len, argv[4]);

	if (lseek(fd2, 0, SEEK_SET) == (off_t)-1)
		err_exit("lseek", len, argv[4]);

	/* Next, test reading from dest file into source mapping */
	nr = read(fd2, map, len);
	if (nr < 0 || (unsigned long)nr != len)
		err_exit("read", len, argv[4]);

	ret = msync(map, len, MS_SYNC);
	if (ret < 0)
		err_exit("msync", len, argv[4]);

	ret = munmap(map, len);
	if (ret < 0)
		err_exit("munmap", len, argv[4]);

	ret = close(fd);
	if (ret < 0)
		err_exit("close fd", len, argv[4]);

	ret = close(fd2);
	if (ret < 0)
		err_exit("close fd2", len, argv[4]);

	exit(0);
}
sh-4.2# ndctl list -N
[
{
"dev":"namespace3.0",
"mode":"raw",
"size":8589934592,
"blockdev":"pmem3"
},
{
"dev":"namespace2.0",
"mode":"raw",
"size":8589934592,
"blockdev":"pmem2"
},
{
"dev":"namespace1.0",
"mode":"memory",
"size":2147483648,
"blockdev":"pmem1"
},
{
"dev":"namespace0.0",
"mode":"memory",
"size":2147483648,
"blockdev":"pmem0"
}
]
sh-4.2# modinfo brd
filename: /lib/modules/4.10.0-rc4-master-44b4b46+/kernel/drivers/block/brd.ko
alias: rd
alias: block-major-1-*
license: GPL
srcversion: 25AABF2EF57F6A37AFFEBA6
depends:
intree: Y
vermagic: 4.10.0-rc4-master-44b4b46+ SMP mod_unload modversions
parm: rd_nr:Maximum number of brd devices (int)
parm: rd_size:Size of each RAM disk in kbytes. (ulong)
parm: max_part:Num Minors to reserve between devices (int)
sh-4.2# uname -r
4.10.0-rc4-master-44b4b46+
sh-4.2# bash tbad.sh /dev/pmem0
/dev/pmem0: 4 bytes were erased at offset 0x00000000 (xfs): 58 46 53 42
PASS
sh-4.2# bash tbad.sh /dev/pmem2
/dev/pmem2: 4 bytes were erased at offset 0x00000000 (xfs): 58 46 53 42
write(Bad address) len 16777216 /dev/pmem2
FAIL
sh-4.2# bash tbad.sh /dev/ram0
/dev/ram0: 4 bytes were erased at offset 0x00000000 (xfs): 58 46 53 42
write(Bad address) len 16777216 /dev/ram0
FAIL
sh-4.2# df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhxxxxxxxxxxxxxxxxx-01-root 52399104 43658792 8740312 84% /
sh-4.2#
3 years, 11 months
[PATCH 0/4] mmap dio and DAX
by Xiong Zhou
common/rc : requires SCRATCH_DEV support DAX
src/t_mmap_dio.c : intro mmap and O_DIRECT rw through files
tests/generic/405 : IO between DAX/non-DAX mountpoints
tests/xfs/138 : IO between DAX/non-DAX xfs files(per-inode flag)
Xiong Zhou (4):
common/rc: add _require_scratch_dax
src/t_mmap_dio: add mmap dio test
xfs: test per-inode DAX flag by IO
generic: test mmap dio through DAX and non-DAX
.gitignore | 1 +
common/rc | 10 +++++
src/Makefile | 2 +-
src/t_mmap_dio.c | 81 ++++++++++++++++++++++++++++++++++++++
tests/generic/405 | 100 +++++++++++++++++++++++++++++++++++++++++++++++
tests/generic/405.out | 2 +
tests/generic/group | 1 +
tests/xfs/138 | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++
tests/xfs/138.out | 2 +
tests/xfs/group | 1 +
10 files changed, 304 insertions(+), 1 deletion(-)
create mode 100644 src/t_mmap_dio.c
create mode 100755 tests/generic/405
create mode 100644 tests/generic/405.out
create mode 100755 tests/xfs/138
create mode 100644 tests/xfs/138.out
--
1.8.3.1
3 years, 11 months
fix write synchronization for DAX
by Christoph Hellwig
While I've fixed both ext4 and XFS to not incorrectly allow parallel
writers when mounting with -o dax, ext4 still has this issue after the
iomap conversion.
Patch 1 fixes it, and patch 2 adds a lockdep assert to catch any new
file systems copy and pasting from the direct I/O path.
3 years, 11 months
[PATCH] mm, dax: clear PMD or PUD size flags when in fall through path
by Dave Jiang
Ross reported that:
Running xfstests generic/030 with XFS + DAX gives me the following kernel BUG,
which I bisected to this commit: mm,fs,dax: Change ->pmd_fault to ->huge_fault
[ 370.086205] ------------[ cut here ]------------
[ 370.087182] kernel BUG at arch/x86/mm/fault.c:1038!
[ 370.088336] invalid opcode: 0000 [#3] PREEMPT SMP
[ 370.089073] Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm
[ 370.090212] CPU: 0 PID: 12415 Comm: xfs_io Tainted: G D 4.10.0-rc5-mm1-00202-g7e90fc0 #10
[ 370.091648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[ 370.092946] task: ffff8800ac4f8000 task.stack: ffffc9001148c000
[ 370.093769] RIP: 0010:mm_fault_error+0x15e/0x190
[ 370.094410] RSP: 0000:ffffc9001148fe60 EFLAGS: 00010246
[ 370.095135] RAX: 0000000000000000 RBX: 0000000000000006 RCX: ffff8800ac4f8000
[ 370.096107] RDX: 00007f111c8e6400 RSI: 0000000000000006 RDI: ffffc9001148ff58
[ 370.097087] RBP: ffffc9001148fe88 R08: 0000000000000000 R09: ffff880510bd3300
[ 370.098072] R10: ffff8800ac4f8000 R11: 0000000000000000 R12: 00007f111c8e6400
[ 370.099057] R13: 00007f111c8e6400 R14: ffff880510bd3300 R15: 0000000000000055
[ 370.100135] FS: 00007f111d95e700(0000) GS:ffff880514800000(0000) knlGS:0000000000000000
[ 370.101238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 370.102021] CR2: 00007f111c8e6400 CR3: 00000000add00000 CR4: 00000000001406f0
[ 370.103189] Call Trace:
[ 370.103537] __do_page_fault+0x54e/0x590
[ 370.104090] trace_do_page_fault+0x58/0x2c0
[ 370.104675] do_async_page_fault+0x2c/0x90
[ 370.105342] async_page_fault+0x28/0x30
[ 370.106044] RIP: 0033:0x405e9a
[ 370.106470] RSP: 002b:00007fffb7f30590 EFLAGS: 00010287
[ 370.107185] RAX: 00000000004e6400 RBX: 0000000000000057 RCX: 00000000004e7000
[ 370.108155] RDX: 00007f111c400000 RSI: 00000000004e7000 RDI: 0000000001c35080
[ 370.109157] RBP: 00000000004e6400 R08: 0000000000000014 R09: 1999999999999999
[ 370.110158] R10: 00007f111d2dc200 R11: 0000000000000000 R12: 0000000001c32fc0
[ 370.111165] R13: 0000000000000000 R14: 0000000000000c00 R15: 0000000000000005
[ 370.112171] Code: 07 00 00 00 e8 a4 ee ff ff e9 11 ff ff ff 4c 89 ea 48 89 de 45 31 c0 31 c9 e8 8f f7 ff ff 48 83 c4 08 5b 41 5c 41 5d 41 5e 5d c3 <0f> 0b 41 8b 94 24 80 04 00 00 49 8d b4 24 b0 06 00 00 4c 89 e9
[ 370.114823] RIP: mm_fault_error+0x15e/0x190 RSP: ffffc9001148fe60
[ 370.115722] ---[ end trace 2ce10d930638254d ]---
It appears that there are 2 issues. First, the size bits used for vm_fault
need to be shifted over. Otherwise, FAULT_FLAG_SIZE_PMD clobbers
FAULT_FLAG_INSTRUCTION. Second, after create_huge_pmd() is called and
falls back to the pte fault handler, the FAULT_FLAG_SIZE_PMD flag remains
set, which causes the dax fault handler to go to the pmd fault handler
instead of the pte fault handler. Fixes are made for both the pud and pmd
fall through paths.
Reported-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
include/linux/mm.h | 8 ++++----
mm/memory.c | 4 ++++
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f50e730..6194aeb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -285,10 +285,10 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
-#define FAULT_FLAG_SIZE_MASK 0x700 /* Support up to 8-level page tables */
-#define FAULT_FLAG_SIZE_PTE 0x000 /* First level (eg 4k) */
-#define FAULT_FLAG_SIZE_PMD 0x100 /* Second level (eg 2MB) */
-#define FAULT_FLAG_SIZE_PUD 0x200 /* Third level (eg 1GB) */
+#define FAULT_FLAG_SIZE_MASK 0x7000 /* Support up to 8-level page tables */
+#define FAULT_FLAG_SIZE_PTE 0x0000 /* First level (eg 4k) */
+#define FAULT_FLAG_SIZE_PMD 0x1000 /* Second level (eg 2MB) */
+#define FAULT_FLAG_SIZE_PUD 0x2000 /* Third level (eg 1GB) */
#define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
diff --git a/mm/memory.c b/mm/memory.c
index d465806..bdf1661 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3663,6 +3663,8 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
+ /* fall through path, remove PUD flag */
+ vmf.flags &= ~FAULT_FLAG_SIZE_PUD;
} else {
pud_t orig_pud = *vmf.pud;
@@ -3693,6 +3695,8 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
+ /* fall through path, remove PMD flag */
+ vmf.flags &= ~FAULT_FLAG_SIZE_PMD;
} else {
pmd_t orig_pmd = *vmf.pmd;
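The two issues described in the commit message can be checked with plain bit arithmetic. The sketch below is a user-space model: the flag values are copied from the diff above, while the `OLD_` prefixes and the helper functions are names invented for this illustration.

```c
#include <assert.h>

#define FAULT_FLAG_INSTRUCTION   0x100  /* unchanged by the patch */

/* Old encoding: the size field overlaps FAULT_FLAG_INSTRUCTION. */
#define OLD_FAULT_FLAG_SIZE_MASK 0x700
#define OLD_FAULT_FLAG_SIZE_PMD  0x100

/* New encoding from the patch: the size field is moved up by four bits. */
#define FAULT_FLAG_SIZE_MASK     0x7000
#define FAULT_FLAG_SIZE_PMD      0x1000

/* Non-zero when two flag masks share at least one bit. */
static int flags_overlap(unsigned int a, unsigned int b)
{
	return (a & b) != 0;
}

/* Model of the fallback: drop the PMD size bit before the pte handler runs. */
static unsigned int fall_back_to_pte(unsigned int flags)
{
	return flags & ~FAULT_FLAG_SIZE_PMD;
}
```

In the old encoding, setting the PMD size bit is indistinguishable from setting FAULT_FLAG_INSTRUCTION. In the new encoding the fields are disjoint, and clearing the size bit on fallback leaves the other flags intact.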
3 years, 11 months
[RFC PATCH 00/17] introduce a dax_inode for dax_operations
by Dan Williams
Recently there was an effort to introduce dax_operations to unwind the
abuse of the user-copy api in the pmem api [1]. Christoph noted that we
should not add new block-dax operations as it is further abuse of struct
block_device [2].
The ->direct_access() method in block_device_operations was an expedient
way to get the filesystem-dax capability bootstrapped. However, looking
forward to native persistent memory filesystems, they can forgo the
block layer and mount directly on a provider of dax services, a dax
inode.
For the time being, since current dax capable filesystems are block
based, we need a facility to look up this dax object via the
block-device name. If this approach looks reasonable I'll follow up with
reworking the proposed ->copy_from_iter(), ->flush(), and ->clear() dax
operations into this new scheme.
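The lookup facility mentioned above can be pictured as a small registry keyed by the host block-device name. The following is a user-space sketch under stated assumptions only: `struct dax_inode_sketch`, the static table, and `dax_lookup_by_host_sketch()` are hypothetical and do not reflect the data structures or locking used in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dax object, keyed by the name of its host block device. */
struct dax_inode_sketch {
	const char *host;
	int alive; /* cleared when the provider goes away */
};

static struct dax_inode_sketch dax_table_sketch[] = {
	{ "pmem0", 1 },
	{ "ram0", 1 },
};

/* Return the live dax object registered for 'host', or NULL if none. */
static struct dax_inode_sketch *dax_lookup_by_host_sketch(const char *host)
{
	size_t i;
	size_t n = sizeof(dax_table_sketch) / sizeof(dax_table_sketch[0]);

	if (!host)
		return NULL;
	for (i = 0; i < n; i++) {
		struct dax_inode_sketch *d = &dax_table_sketch[i];

		if (d->alive && strcmp(d->host, host) == 0)
			return d;
	}
	return NULL;
}
```

A block-based filesystem would, in this model, pass the name of its backing device at mount time and treat a NULL result as "no dax support".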
These patches survive a run of the libnvdimm unit tests, but I have not
tested the non-libnvdimm dax drivers.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008638.html
---
Dan Williams (17):
dax: refactor dax-fs into a generic provider of dax inodes
dax: convert dax_inode locking to srcu
dax: add a facility to lookup a dax inode by 'host' device name
dax: introduce dax_operations
pmem: add dax_operations support
axon_ram: add dax_operations support
brd: add dax_operations support
dcssblk: add dax_operations support
block: kill bdev_dax_capable()
block: introduce bdev_dax_direct_access()
dm: add dax_operations support (producer)
dm: add dax_operations support (consumer)
fs: update mount_bdev() to lookup dax infrastructure
ext2, ext4, xfs: retrieve dax_inode through iomap operations
Revert "block: use DAX for partition table reads"
fs, dax: convert filesystem-dax to bdev_dax_direct_access
block: remove block_device_operations.direct_access and related infrastructure
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 37 +++
block/Kconfig | 1
block/partition-generic.c | 17 --
drivers/Makefile | 2
drivers/block/Kconfig | 1
drivers/block/brd.c | 48 +++-
drivers/dax/Kconfig | 9 +
drivers/dax/Makefile | 5
drivers/dax/dax.h | 19 +-
drivers/dax/device-dax.h | 25 ++
drivers/dax/device.c | 257 ++++-------------------
drivers/dax/pmem.c | 2
drivers/dax/super.c | 434 +++++++++++++++++++++++++++++++++++++++
drivers/md/Kconfig | 1
drivers/md/dm-core.h | 3
drivers/md/dm-linear.c | 15 +
drivers/md/dm-snap.c | 8 +
drivers/md/dm-stripe.c | 16 +
drivers/md/dm-table.c | 2
drivers/md/dm-target.c | 10 +
drivers/md/dm.c | 43 +++-
drivers/nvdimm/Kconfig | 1
drivers/nvdimm/pmem.c | 46 +++-
drivers/nvdimm/pmem.h | 7 -
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 41 +++-
fs/block_dev.c | 75 ++-----
fs/dax.c | 149 ++++++-------
fs/ext2/inode.c | 1
fs/ext4/inode.c | 1
fs/iomap.c | 3
fs/super.c | 32 +++
fs/xfs/xfs_aops.c | 13 +
fs/xfs/xfs_aops.h | 1
fs/xfs/xfs_buf.h | 1
fs/xfs/xfs_iomap.c | 1
fs/xfs/xfs_super.c | 3
include/linux/blkdev.h | 7 -
include/linux/dax.h | 29 ++-
include/linux/device-mapper.h | 16 +
include/linux/fs.h | 1
include/linux/iomap.h | 1
tools/testing/nvdimm/Kbuild | 6 -
tools/testing/nvdimm/pmem-dax.c | 12 -
45 files changed, 927 insertions(+), 477 deletions(-)
create mode 100644 drivers/dax/device-dax.h
rename drivers/dax/{dax.c => device.c} (74%)
create mode 100644 drivers/dax/super.c
3 years, 11 months