[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates the NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date are
  added, carved out of the previously reserved range. No change in the
  structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. effectively in
  big-endian format). The spec clarifies that they need to be handled
  as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show the NVDIMM ID defined in ACPI 6.1.
The patch set applies on top of the linux-pm.git 'acpica' branch.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
[PATCH v6 0/8] libnvdimm: add DMA supported blk-mq pmem driver
by Dave Jiang
v6:
- Put all common code for pmem drivers in pmem_core per Dan's suggestion.
- Added support code to get number of available DMA chans
- Fixed up Kconfig so that when pmem is built into the kernel, pmem_dma won't
show up.
v5:
- Added support to report descriptor transfer capability limit from dmaengine.
- Fixed up scatterlist support for dma_unmap_data per Dan's comments.
- Made the driver a separate pmem blk driver per Christoph's suggestion
and also fixed up all the issues pointed out by Christoph.
- Added pmem badblock checking/handling per Robert and also made DMA op to
be used by all buffer sizes.
v4:
- Addressed kbuild test bot issues. Passed kbuild test bot, 179 configs.
v3:
- Added patch to rename DMA_SG to DMA_SG_SG to make it explicit
- Added DMA_MEMCPY_SG transaction type to dmaengine
- Misc patch to add verification of DMA_MEMSET_SG that was missing
- Addressed all nd_pmem driver comments from Ross.
v2:
- Make dma_prep_memcpy_* into one function per Dan.
- Addressed various comments from Ross with code formatting and etc.
- Replaced open code with offset_in_page() macro per Johannes.
The following series implements a blk-mq pmem driver and also adds
infrastructure code to ioatdma and dmaengine in order to support copying
to and from scatterlists when processing block requests provided by
blk-mq. Using the DMA engines available on certain platforms allows us to
drastically reduce CPU utilization while maintaining good enough
performance. Experiments on a DRAM-backed pmem block device have shown
that using the DMA engine is beneficial. By default nd_pmem.ko will be
loaded; this can be overridden through module blacklisting in order to
load nd_pmem_dma.ko.
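To make that flow concrete, here is a rough sketch of the DMA copy path
described above. Only blk_rq_map_sg(), dma_map_sg(), dmaengine_submit() and
dma_async_issue_pending() are existing kernel APIs; the prep call and the
pmem_* names are placeholders for the interfaces added in patches 2 and 8,
not their actual signatures:

	/* in the blk-mq ->queue_rq() handler: turn the request into a
	 * scatterlist and DMA-map it for the channel's device */
	nents = blk_rq_map_sg(req->q, req, sgl);
	nents = dma_map_sg(dma_dev, sgl, nents, dir);

	/* hypothetical prep hook for the new DMA_MEMCPY_SG op (patch 2),
	 * copying between the request sgl and the pmem region's sgl */
	txd = dmaengine_prep_memcpy_sg(chan, sgl, nents,
			pmem_sgl, pmem_nents, DMA_PREP_INTERRUPT);
	txd->callback = pmem_dma_callback;	/* completes the request */
	dmaengine_submit(txd);
	dma_async_issue_pending(chan);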
---
Dave Jiang (8):
dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels
dmaengine: Add DMA_MEMCPY_SG transaction op
dmaengine: add verification of DMA_MEMSET_SG in dmaengine
dmaengine: ioatdma: dma_prep_memcpy_sg support
dmaengine: add function to provide per descriptor xfercap for dma engine
dmaengine: add SG support to dmaengine_unmap
dmaengine: provide number of available channels
libnvdimm: Add blk-mq pmem driver
Documentation/dmaengine/provider.txt | 3
drivers/dma/dmaengine.c | 76 ++++
drivers/dma/ioat/dma.h | 4
drivers/dma/ioat/init.c | 6
drivers/dma/ioat/prep.c | 57 +++
drivers/nvdimm/Kconfig | 21 +
drivers/nvdimm/Makefile | 6
drivers/nvdimm/pmem.c | 264 ---------------
drivers/nvdimm/pmem.h | 48 +++
drivers/nvdimm/pmem_core.c | 298 +++++++++++++++++
drivers/nvdimm/pmem_dma.c | 606 ++++++++++++++++++++++++++++++++++
include/linux/dmaengine.h | 49 +++
12 files changed, 1170 insertions(+), 268 deletions(-)
create mode 100644 drivers/nvdimm/pmem_core.c
create mode 100644 drivers/nvdimm/pmem_dma.c
Re: KVM "fake DAX" flushing interface - discussion
by Dan Williams
On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel <riel(a)redhat.com> wrote:
> On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
>> >
>> Just want to summarize here (high level):
>>
>> This will require implementing new 'virtio-pmem' device which
>> presents a DAX address range (like pmem) to guest with read/write
>> (direct access) & device flush functionality. Also, qemu should
>> implement corresponding support for flush using virtio.
>>
> Alternatively, the existing pmem code, with
> a flush-only block device on the side, which
> is somehow associated with the pmem device.
>
> I wonder which alternative leads to the least
> code duplication, and the least maintenance
> hassle going forward.
I'd much prefer to have another driver. I.e. a driver that refactors
out some common pmem details into a shared object and can attach to
ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems like
a recipe for confusion.
With a $new_driver in hand you can just do:
modprobe $new_driver
echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind
echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id
echo $namespace > /sys/bus/nd/drivers/$new_driver/bind
...and the guest can arrange for $new_driver to be the default, so you
don't need to do those steps each boot of the VM, by doing:
echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf
echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely.
Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing (a rough code sketch follows the list below). This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
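A rough sketch of that test flow using existing kernel APIs (error handling
omitted; 'dev' stands for the peer device that will perform the transfer,
and NPAGES and uaddr are assumed to be set up by the caller):

	struct page *pages[NPAGES];
	struct scatterlist sgl[NPAGES];
	int i, n, nents;

	/* resolves to the device-memory struct pages when uaddr points
	 * at memory registered as described above */
	n = get_user_pages_fast(uaddr, NPAGES, 1 /* write */, pages);

	sg_init_table(sgl, n);
	for (i = 0; i < n; i++)
		sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

	/* map for the peer device issuing the DMA */
	nents = dma_map_sg(dev, sgl, n, DMA_BIDIRECTIONAL);
	/* ... program the peer device with the mapped addresses ... */
	dma_unmap_sg(dev, sgl, n, DMA_BIDIRECTIONAL);

	for (i = 0; i < n; i++)
		put_page(pages[i]);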
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata is slow; it may be lost or corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
[RFC 00/16] NOVA: a new file system for persistent memory
by Steven Swanson
This is an RFC patch series that implements NOVA (NOn-Volatile memory
Accelerated file system), a new file system built for PMEM.
NOVA's goal is to provide a high-performance, full-featured, production-ready
file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
and Intel's soon-to-be-released 3DXpoint DIMMs). It combines design elements
from many other file systems to provide a combination of high-performance,
strong consistency guarantees, and comprehensive data protection. NOVA supports
DAX-style mmap, and making DAX perform well is a first-order priority in NOVA's
design.
NOVA was developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024(a)eng.ucsd.edu>, Lu Zhang
<luzh(a)eng.ucsd.edu>, and Steven Swanson <swanson(a)eng.ucsd.edu>.
NOVA is stable enough to run complex applications, but there is substantial
work left to do. This RFC is intended to gather feedback to guide its
development toward eventual inclusion upstream.
The patches are relative to Linux 4.12.
Overview
========
NOVA is primarily a log-structured file system, but rather than maintain a
single global log for the entire file system, it maintains separate logs for
each file (inode). NOVA breaks the logs into 4KB pages; they need not be
contiguous in memory. The logs only contain metadata.
File data pages reside outside the log, and log entries for write operations
point to data pages they modify. File modification uses copy-on-write (COW) to
provide atomic file updates.
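Purely for illustration, the per-inode log layout described above might look
like the following; the names are hypothetical and do not match fs/nova/log.h:

/* each inode owns a chain of 4KB log pages holding only metadata */
struct example_log_page {
	/* ... packed log entries ... */
	__le64 next_page;	/* log pages need not be contiguous */
};

/* a write entry points at data pages allocated copy-on-write */
struct example_write_entry {
	__le64 file_offset;	/* byte range covered by this write */
	__le32 num_pages;
	__le64 data_block;	/* new data pages live outside the log */
	__le32 csum;		/* metadata is checksummed and replicated */
};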
For file operations that involve multiple inodes, NOVA uses small, fixed-size
redo logs to atomically append log entries to the logs of the inodes involved.
This structure keeps logs small and makes garbage collection very fast. It also
enables enormous parallelism during recovery from an unclean unmount, since
threads can scan logs in parallel.
NOVA replicates and checksums all metadata structures and protects file data
with RAID-4-style parity. It supports checkpoints to facilitate backups.
Documentation/filesystems/nova.txt contains some lower-level implementation
and usage information. A more thorough discussion of NOVA's goals and design
is available in two papers:
NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf
Jian Xu and Steven Swanson
Published in FAST 2016
Hardening the NOVA File System
http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah,
Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson
UCSD-CSE Techreport CS2017-1018
-steve
---
Steven Swanson (16):
NOVA: Documentation
NOVA: Superblock and fs layout
NOVA: PMEM allocation system
NOVA: Inode operations and structures
NOVA: Log data structures and operations
NOVA: Lite-weight journaling for complex ops
NOVA: File and directory operations
NOVA: Garbage collection
NOVA: DAX code
NOVA: File data protection
NOVA: Snapshot support
NOVA: Recovery code
NOVA: Sysfs and ioctl
NOVA: Read-only pmem devices
NOVA: Performance measurement
NOVA: Build infrastructure
Documentation/filesystems/00-INDEX | 2
Documentation/filesystems/nova.txt | 771 +++++++++++++++++
MAINTAINERS | 8
README.md | 173 ++++
arch/x86/include/asm/io.h | 1
arch/x86/mm/fault.c | 11
arch/x86/mm/ioremap.c | 25 -
drivers/nvdimm/pmem.c | 14
fs/Kconfig | 2
fs/Makefile | 1
fs/nova/Kconfig | 15
fs/nova/Makefile | 9
fs/nova/balloc.c | 827 +++++++++++++++++++
fs/nova/balloc.h | 118 +++
fs/nova/bbuild.c | 1602 ++++++++++++++++++++++++++++++++++++
fs/nova/checksum.c | 912 ++++++++++++++++++++
fs/nova/dax.c | 1346 ++++++++++++++++++++++++++++++
fs/nova/dir.c | 760 +++++++++++++++++
fs/nova/file.c | 943 +++++++++++++++++++++
fs/nova/gc.c | 739 +++++++++++++++++
fs/nova/inode.c | 1467 +++++++++++++++++++++++++++++++++
fs/nova/inode.h | 389 +++++++++
fs/nova/ioctl.c | 185 ++++
fs/nova/journal.c | 474 +++++++++++
fs/nova/journal.h | 61 +
fs/nova/log.c | 1411 ++++++++++++++++++++++++++++++++
fs/nova/log.h | 333 +++++++
fs/nova/mprotect.c | 604 ++++++++++++++
fs/nova/mprotect.h | 190 ++++
fs/nova/namei.c | 919 +++++++++++++++++++++
fs/nova/nova.h | 1137 ++++++++++++++++++++++++++
fs/nova/nova_def.h | 154 +++
fs/nova/parity.c | 411 +++++++++
fs/nova/perf.c | 594 +++++++++++++
fs/nova/perf.h | 96 ++
fs/nova/rebuild.c | 847 +++++++++++++++++++
fs/nova/snapshot.c | 1407 ++++++++++++++++++++++++++++++++
fs/nova/snapshot.h | 98 ++
fs/nova/stats.c | 685 +++++++++++++++
fs/nova/stats.h | 218 +++++
fs/nova/super.c | 1222 +++++++++++++++++++++++++++
fs/nova/super.h | 216 +++++
fs/nova/symlink.c | 153 +++
fs/nova/sysfs.c | 543 ++++++++++++
include/linux/io.h | 2
include/linux/mm.h | 2
include/linux/mm_types.h | 3
kernel/memremap.c | 24 +
mm/memory.c | 2
mm/mmap.c | 1
mm/mprotect.c | 13
51 files changed, 22129 insertions(+), 11 deletions(-)
create mode 100644 Documentation/filesystems/nova.txt
create mode 100644 README.md
create mode 100644 fs/nova/Kconfig
create mode 100644 fs/nova/Makefile
create mode 100644 fs/nova/balloc.c
create mode 100644 fs/nova/balloc.h
create mode 100644 fs/nova/bbuild.c
create mode 100644 fs/nova/checksum.c
create mode 100644 fs/nova/dax.c
create mode 100644 fs/nova/dir.c
create mode 100644 fs/nova/file.c
create mode 100644 fs/nova/gc.c
create mode 100644 fs/nova/inode.c
create mode 100644 fs/nova/inode.h
create mode 100644 fs/nova/ioctl.c
create mode 100644 fs/nova/journal.c
create mode 100644 fs/nova/journal.h
create mode 100644 fs/nova/log.c
create mode 100644 fs/nova/log.h
create mode 100644 fs/nova/mprotect.c
create mode 100644 fs/nova/mprotect.h
create mode 100644 fs/nova/namei.c
create mode 100644 fs/nova/nova.h
create mode 100644 fs/nova/nova_def.h
create mode 100644 fs/nova/parity.c
create mode 100644 fs/nova/perf.c
create mode 100644 fs/nova/perf.h
create mode 100644 fs/nova/rebuild.c
create mode 100644 fs/nova/snapshot.c
create mode 100644 fs/nova/snapshot.h
create mode 100644 fs/nova/stats.c
create mode 100644 fs/nova/stats.h
create mode 100644 fs/nova/super.c
create mode 100644 fs/nova/super.h
create mode 100644 fs/nova/symlink.c
create mode 100644 fs/nova/sysfs.c
File copy to FAT FS on NVDIMM hits BUG_ON at fs/buffer.c:3305!
by Kani, Toshimitsu
Hi,
Copying files to vfat FS on an NVDIMM device hits
BUG_ON(!PageLocked(page)) in try_to_free_buffers(). It happens on
4.13-rc1, and happens on older kernels as well.
A simple reproducer is shown below. It is 100% reproducible on my
setup (8GB of regular memory and 16GB of NVDIMM). It usually hits in
the 3rd or 4th file copy and does not repeat with the while-loop.
Interestingly, it hits only when the NVDIMM device is set to raw or
memory mode. It does not hit with sector mode.
==
DEV=pmem0
set -x
mkfs.vfat /dev/$DEV
mount /dev/$DEV /mnt/$DEV
dd if=/dev/zero of=/mnt/$DEV/1Gfile bs=1M count=1024
while true; do
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-1
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-2
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-3
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-4
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-5
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-6
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-7
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-8
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-9
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-10
done
==
kernel BUG at fs/buffer.c:3305!
invalid opcode: 0000 [#1] SMP
:
Workqueue: writeback wb_workfn (flush-259:0)
task: ffff8d02595b8000 task.stack: ffffa22242400000
RIP: 0010:try_to_free_buffers+0xd2/0xe0
RSP: 0018:ffffa22242403830 EFLAGS: 00010246
RAX: 00afffc000001028 RBX: 0000000000000008 RCX: ffff8d012dcf19c0
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffc468e3b52b80
RBP: ffffa22242403858 R08: 0000000000000000 R09: 000000000002067c
R10: ffff8d027ffe6000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8d022fccdbe0 R14: ffffc468e3b52b80 R15: ffffa22242403ad0
FS: 0000000000000000(0000) GS:ffff8d027fd40000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9d2bb80b70 CR3: 000000084fe09000 CR4: 00000000007406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
clean_buffers+0x5d/0x70
__mpage_writepage+0x567/0x760
? page_mkclean+0x6a/0xb0
write_cache_pages+0x205/0x580
? clean_buffers+0x70/0x70
? fat_add_cluster+0x80/0x80 [fat]
mpage_writepages+0x7c/0x100
? fat_add_cluster+0x80/0x80 [fat]
? __set_page_dirty+0x9b/0xc0
? fprop_fraction_percpu+0x2f/0x80
fat_writepages+0x15/0x20 [fat]
? fat_writepages+0x15/0x20 [fat]
do_writepages+0x25/0x80
__writeback_single_inode+0x45/0x350
writeback_sb_inodes+0x25e/0x610
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x29b/0x340
wb_workfn+0x195/0x3d0
? wb_workfn+0x195/0x3d0
process_one_work+0x193/0x3d0
worker_thread+0x4e/0x3d0
kthread+0x114/0x150
? process_one_work+0x3d0/0x3d0
? kthread_park+0x60/0x60
? kthread_park+0x60/0x60
ret_from_fork+0x25/0x30
:
RIP: try_to_free_buffers+0xd2/0xe0 RSP: ffffa22242403830
Thanks,
-Toshi
[PATCH] nvdimm: fix potential double-fetch bug
by Meng Xu
From: Meng Xu <mengxu.gatech(a)gmail.com>
While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race condition bug) where
the same userspace memory region is fetched twice into the kernel, with
sanity checks after the first fetch but no checks after the second fetch.
In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:
1. The first fetch happens in line 935: copy_from_user(&pkg, p, sizeof(pkg))
2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
(line 984 to 986).
3. The second fetch happens in line 1022: copy_from_user(buf, p, buf_len)
4. Given that `p` can be fully controlled in userspace, an attacker can
race to overwrite the header part of `p`, say,
`((struct nd_cmd_pkg *)p)->nd_reserved2`, with an arbitrary value
(say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the
second fetch. The changed value will be copied to `buf`.
5. There are no checks on the second fetch until it is used in
line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc),
which means that the assumed relation, `p->nd_reserved2` being all zeroes,
might not hold after the second fetch. And once control reaches these
functions we lose the context to assert the assumed relation.
6. Based on my manual analysis, `p->nd_reserved2` is not used in function
`nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl`
so there is no working exploit against it right now. However, this could
easily turn into an exploitable one if careless developers start to use
`p->nd_reserved2` later and assume that it is all zeroes (a simplified
view of the two fetches follows below).
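Simplified view of the two fetches in __nd_ioctl() (control flow
abbreviated; variable names as in drivers/nvdimm/bus.c):

	if (copy_from_user(&pkg, p, sizeof(pkg)))	/* first fetch */
		return -EFAULT;
	/* ... pkg.nd_reserved2 is verified to be all zeroes ... */

	if (copy_from_user(buf, p, buf_len))		/* second fetch */
		return -EFAULT;
	/* buf re-reads the header from userspace, so a concurrent writer
	 * may have changed nd_reserved2 between the two copies */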
Proposed patch:
The patch explicitly overwrites `buf->nd_reserved2` after the second fetch
with the value of `pkg.nd_reserved2` from the first fetch. In this way, it is
assured that the relation, `buf->nd_reserved2` being all zeroes, holds after
the second fetch.
Signed-off-by: Meng Xu <mengxu.gatech(a)gmail.com>
---
drivers/nvdimm/bus.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 937fafa..20c4d0f 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -1024,6 +1024,12 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
 		goto out;
 	}
 
+	if (cmd == ND_CMD_CALL) {
+		struct nd_cmd_pkg *hdr = (struct nd_cmd_pkg *)buf;
+		memcpy(hdr->nd_reserved2, pkg.nd_reserved2,
+			sizeof(pkg.nd_reserved2));
+	}
+
 	nvdimm_bus_lock(&nvdimm_bus->dev);
 	rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, func, buf);
 	if (rc)
--
2.7.4
[PATCH v4 0/3] MAP_DIRECT and block-map sealed files
by Dan Williams
Changes since v3 [1]:
* Move from an fallocate(2) interface to a new mmap(2) flag and rename
'immutable' to 'sealed'.
* Do not record the sealed state in permanent metadata; it is now purely
a temporary state for as long as a MAP_DIRECT vma is referencing the
inode (Christoph)
* Drop the CAP_IMMUTABLE requirement, but do require a PROT_WRITE
mapping.
[1]: https://lwn.net/Articles/730570/
---
This is the next revision of a patch series that aims to enable
applications that otherwise need to resort to DAX mapping a raw device
file to instead move to a filesystem.
In the course of reviewing a previous posting, Christoph said:
That being said I think we absolutely should support RDMA memory
registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE
helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all
the blocks are populated and all ptes are set up. Second we need to
make sure get_user_page works, which for now means we'll need a struct
page mapping for the region (which will be really annoying for PCIe
mappings, like the upcoming NVMe persistent memory region), and we need
to guarantee that the extent mapping won't change while the
get_user_pages holds the pages inside it. I think that is true due to
side effects even with the current DAX code, but we'll need to make it
explicit. And maybe that's where we need to converge - "sealing" the
extent map makes sense as such a temporary measure that is not persisted
on disk, which automatically gets released when the holding process
exits, because we sort of already do this implicitly. It might also
make sense to have explicitly breakable seals similar to what I do for
the pNFS blocks kernel server, as any userspace RDMA file server would
also need those semantics.
So, this is an attempt to converge on the idea that we need an explicit
and process-lifetime-temporary mechanism for a process to be able to
make assumptions about the physical-page-to-dax-file-offset mapping.
The "explicitly breakable seals" aspect is not addressed in these
patches, but I wonder if it might be a voluntary mechanism that can be
implemented via userfaultfd.
These pass a basic smoke test and are meant to just gauge 'right track'
/ 'wrong track'. The main question it seems is whether the pinning done
in this patchset is too early (applies before get_user_pages()) and too
coarse (applies to the whole file). Perhaps this is where I discarded
too easily Jan's suggestion to look at Peter Z's mm_mpin() syscall [2]? On
the other hand, the coarseness and simple lifetime rules of MAP_DIRECT
make it an easy mechanism to implement and explain.
Another reason I kept the scope of S_IOMAP_SEALED coarsely defined was
to support Dave's desired use case of sealing for operating on reflinked
files [3].
Suggested mmap(2) man page edits are included in the changelog of patch
3.
[2]: https://lwn.net/Articles/600502/
[3]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1467677.html
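For orientation, a sketch of how an application might request a sealed block
map under this proposal. PROT_WRITE is required per the changelog above; the
MAP_SHARED | MAP_VALIDATE | MAP_DIRECT combination is my reading of patches 2
and 3 and the flags are not in any released uapi header, so treat this as
pseudocode rather than ABI:

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_VALIDATE | MAP_DIRECT, fd, 0);
	if (addr == MAP_FAILED)
		err(1, "mmap(MAP_DIRECT)");	/* e.g. fs cannot seal the block map */

	/* the seal lasts as long as this vma references the inode, so an
	 * RDMA memory registration over [addr, addr + len) can assume a
	 * stable block map for that lifetime */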
---
Dan Williams (3):
fs, xfs: introduce S_IOMAP_SEALED
mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags
fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges
fs/attr.c | 10 +++
fs/dax.c | 2 +
fs/open.c | 6 ++
fs/read_write.c | 3 +
fs/xfs/libxfs/xfs_bmap.c | 5 +
fs/xfs/xfs_bmap_util.c | 3 +
fs/xfs/xfs_file.c | 107 ++++++++++++++++++++++++++++++++
fs/xfs/xfs_inode.h | 1
fs/xfs/xfs_ioctl.c | 6 ++
fs/xfs/xfs_super.c | 1
include/linux/fs.h | 9 +++
include/linux/mm.h | 2 -
include/linux/mm_types.h | 1
include/linux/mman.h | 3 +
include/uapi/asm-generic/mman-common.h | 2 +
mm/filemap.c | 5 +
mm/mmap.c | 22 ++++++-
17 files changed, 183 insertions(+), 5 deletions(-)
[PATCH v3 0/2] dax, dm: stop requiring dax for device-mapper
by Dan Williams
Changes since v2 [1]:
* rebase on -next to integrate with commit 273752c9ff03 "dm, dax: Make
sure dm_dax_flush() is called if device supports it" (kbuild robot)
* fix CONFIG_DAX dependencies to upgrade CONFIG_DAX=m to CONFIG_DAX=y
(kbuild robot)
[1]: https://www.spinics.net/lists/kernel/msg2570522.html
---
Bart points out that the DAX core is unconditionally enabled if
device-mapper is enabled. Add some config machinery and some
stub-static-inline routines to allow dax infrastructure to be deleted
from device-mapper at compile time.
Since this depends on commit 273752c9ff03 that's already in -next, this
should go through the device-mapper tree.
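The stub-static-inline approach amounts to something like the following in
include/linux/dax.h (a sketch only; the exact guard symbol and the set of
stubbed helpers are what the two patches define):

#if IS_ENABLED(CONFIG_DAX)
struct dax_device *alloc_dax(void *private, const char *host,
		const struct dax_operations *ops);
void put_dax(struct dax_device *dax_dev);
#else
static inline struct dax_device *alloc_dax(void *private, const char *host,
		const struct dax_operations *ops)
{
	/* device-mapper built without dax support: no dax_device */
	return NULL;
}
static inline void put_dax(struct dax_device *dax_dev)
{
}
#endif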
---
Dan Williams (2):
dax: introduce CONFIG_DAX_DRIVER
dm: allow device-mapper to operate without dax support
arch/powerpc/platforms/Kconfig | 1 +
drivers/block/Kconfig | 1 +
drivers/dax/Kconfig | 4 +++-
drivers/md/Kconfig | 2 +-
drivers/md/dm-linear.c | 6 ++++++
drivers/md/dm-stripe.c | 6 ++++++
drivers/md/dm.c | 10 ++++++----
drivers/nvdimm/Kconfig | 1 +
drivers/s390/block/Kconfig | 1 +
include/linux/dax.h | 30 ++++++++++++++++++++++++------
10 files changed, 50 insertions(+), 12 deletions(-)