[PATCH v2 00/25] replace ioremap_{cache|wt} with memremap
by Dan Williams
Changes since v1 [1]:
1/ Drop the attempt at unifying ioremap() prototypes, just focus on
converting ioremap_cache and ioremap_wt over to memremap (Christoph)
2/ Drop the unrelated cleanups to use %pa in __ioremap_caller (Thomas)
3/ Add support for memremap() attempts on "System RAM" to simply return
the kernel virtual address for that range. ARM depends on this
functionality in ioremap_cache() and ACPI was open coding a similar
solution. (Mark)
4/ Split the conversions of ioremap_{cache|wt} into separate patches per
driver / arch.
5/ Fix bisection breakage and other reports from 0day-kbuild
---
While developing the pmem driver we noticed that the __iomem annotation
on the return value from ioremap_cache() was being mishandled by several
callers. We also observed that all of the call sites expected to be
able to treat the return value from ioremap_cache() as a normal
(non-__iomem) pointer to memory.
This patchset takes the opportunity to clean up the above confusion as
well as a few issues with the ioremap_{cache|wt} interface, including:
1/ Eliminating the possibility of function prototypes differing between
architectures by defining a central memremap() prototype that takes
flags to determine the mapping type.
2/ Returning NULL rather than falling back silently to a different
mapping-type. This allows drivers to be stricter about the
mapping-type fallbacks that are permissible (see the usage sketch below).
[1]: http://marc.info/?l=linux-arm-kernel&m=143735199029255&w=2
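For illustration, a minimal sketch of how a caller is expected to consume
the new interface (not taken from the patches; the flag and helper names
follow the memremap()/memunmap() API as merged, and the resource range is
illustrative):

        /* ask for a cacheable mapping only; no silent fallback */
        void *addr = memremap(res->start, resource_size(res), MEMREMAP_WB);

        if (!addr)
                return -ENOMEM; /* requested mapping type is not possible */

        /* ... use addr as a plain (non-__iomem) pointer ... */

        memunmap(addr);

Note that per change 3/ above, a request against "System RAM" simply
returns the existing kernel virtual address for that range.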
---
Dan Williams (22):
mm: enhance region_is_ram() to distinguish 'unknown' vs 'mixed'
arch, drivers: don't include <asm/io.h> directly, use <linux/io.h> instead
cleanup IORESOURCE_CACHEABLE vs ioremap()
intel_iommu: fix leaked ioremap mapping
arch: introduce memremap()
arm: switch from ioremap_cache to memremap
x86: switch from ioremap_cache to memremap
gma500: switch from acpi_os_ioremap to ioremap
i915: switch from acpi_os_ioremap to ioremap
acpi: switch from ioremap_cache to memremap
toshiba laptop: replace ioremap_cache with ioremap
memconsole: fix __iomem mishandling, switch to memremap
visorbus: switch from ioremap_cache to memremap
intel-iommu: switch from ioremap_cache to memremap
libnvdimm, pmem: switch from ioremap_cache to memremap
pxa2xx-flash: switch from ioremap_cache to memremap
sfi: switch from ioremap_cache to memremap
fbdev: switch from ioremap_wt to memremap
pmem: switch from ioremap_wt to memremap
arch: remove ioremap_cache, replace with arch_memremap
arch: remove ioremap_wt, replace with arch_memremap
pmem: convert to generic memremap
Toshi Kani (3):
mm, x86: Fix warning in ioremap RAM check
mm, x86: Remove region_is_ram() call from ioremap
mm: Fix bugs in region_is_ram()
arch/arc/include/asm/io.h | 1
arch/arm/Kconfig | 1
arch/arm/include/asm/io.h | 13 +++-
arch/arm/include/asm/xen/page.h | 4 +
arch/arm/mach-clps711x/board-cdb89712.c | 2 -
arch/arm/mach-shmobile/pm-rcar.c | 2 -
arch/arm/mm/ioremap.c | 12 +++-
arch/arm/mm/nommu.c | 11 ++-
arch/arm64/Kconfig | 1
arch/arm64/include/asm/acpi.h | 10 +--
arch/arm64/include/asm/dmi.h | 8 +--
arch/arm64/include/asm/io.h | 8 ++-
arch/arm64/kernel/efi.c | 9 ++-
arch/arm64/kernel/smp_spin_table.c | 19 +++---
arch/arm64/mm/ioremap.c | 20 ++----
arch/avr32/include/asm/io.h | 1
arch/frv/Kconfig | 1
arch/frv/include/asm/io.h | 17 ++---
arch/frv/mm/kmap.c | 6 ++
arch/ia64/Kconfig | 1
arch/ia64/include/asm/io.h | 11 +++
arch/ia64/kernel/cyclone.c | 2 -
arch/m32r/include/asm/io.h | 1
arch/m68k/Kconfig | 1
arch/m68k/include/asm/io_mm.h | 14 +---
arch/m68k/include/asm/io_no.h | 12 ++--
arch/m68k/include/asm/raw_io.h | 4 +
arch/m68k/mm/kmap.c | 17 +++++
arch/m68k/mm/sun3kmap.c | 6 ++
arch/metag/include/asm/io.h | 3 -
arch/microblaze/include/asm/io.h | 1
arch/mn10300/include/asm/io.h | 1
arch/nios2/include/asm/io.h | 1
arch/powerpc/kernel/pci_of_scan.c | 2 -
arch/s390/include/asm/io.h | 1
arch/sh/Kconfig | 1
arch/sh/include/asm/io.h | 20 ++++--
arch/sh/mm/ioremap.c | 10 +++
arch/sparc/include/asm/io_32.h | 1
arch/sparc/include/asm/io_64.h | 1
arch/sparc/kernel/pci.c | 3 -
arch/tile/include/asm/io.h | 1
arch/x86/Kconfig | 1
arch/x86/include/asm/efi.h | 3 +
arch/x86/include/asm/io.h | 17 +++--
arch/x86/kernel/crash_dump_64.c | 6 +-
arch/x86/kernel/kdebugfs.c | 8 +--
arch/x86/kernel/ksysfs.c | 28 ++++-----
arch/x86/mm/ioremap.c | 76 ++++++++++--------------
arch/xtensa/Kconfig | 1
arch/xtensa/include/asm/io.h | 9 ++-
drivers/acpi/apei/einj.c | 9 ++-
drivers/acpi/apei/erst.c | 6 +-
drivers/acpi/nvs.c | 6 +-
drivers/acpi/osl.c | 70 ++++++----------------
drivers/char/toshiba.c | 2 -
drivers/firmware/google/memconsole.c | 7 +-
drivers/gpu/drm/gma500/opregion.c | 2 -
drivers/gpu/drm/i915/intel_opregion.c | 2 -
drivers/iommu/intel-iommu.c | 10 ++-
drivers/iommu/intel_irq_remapping.c | 4 +
drivers/isdn/icn/icn.h | 2 -
drivers/mtd/devices/slram.c | 2 -
drivers/mtd/maps/pxa2xx-flash.c | 4 +
drivers/mtd/nand/diskonchip.c | 2 -
drivers/mtd/onenand/generic.c | 2 -
drivers/nvdimm/Kconfig | 2 -
drivers/pci/probe.c | 3 -
drivers/pnp/manager.c | 2 -
drivers/scsi/aic94xx/aic94xx_init.c | 7 --
drivers/scsi/arcmsr/arcmsr_hba.c | 5 --
drivers/scsi/mvsas/mv_init.c | 15 +----
drivers/scsi/sun3x_esp.c | 2 -
drivers/sfi/sfi_core.c | 4 +
drivers/staging/comedi/drivers/ii_pci20kc.c | 1
drivers/staging/unisys/visorbus/visorchannel.c | 16 +++--
drivers/staging/unisys/visorbus/visorchipset.c | 17 +++--
drivers/tty/serial/8250/8250_core.c | 2 -
drivers/video/fbdev/Kconfig | 2 -
drivers/video/fbdev/amifb.c | 5 +-
drivers/video/fbdev/atafb.c | 5 +-
drivers/video/fbdev/hpfb.c | 6 +-
drivers/video/fbdev/ocfb.c | 1
drivers/video/fbdev/s1d13xxxfb.c | 3 -
drivers/video/fbdev/stifb.c | 1
include/acpi/acpi_io.h | 6 +-
include/asm-generic/io.h | 8 ---
include/asm-generic/iomap.h | 4 -
include/linux/io-mapping.h | 2 -
include/linux/io.h | 9 +++
include/linux/mtd/map.h | 2 -
include/linux/pmem.h | 26 +++++---
include/video/vga.h | 2 -
kernel/Makefile | 2 +
kernel/memremap.c | 74 +++++++++++++++++++++++
kernel/resource.c | 43 +++++++-------
lib/Kconfig | 5 +-
lib/devres.c | 13 +---
lib/pci_iomap.c | 7 +-
tools/testing/nvdimm/Kbuild | 4 +
tools/testing/nvdimm/test/iomap.c | 34 ++++++++---
101 files changed, 482 insertions(+), 398 deletions(-)
create mode 100644 kernel/memremap.c
[PATCH v2 00/20] get_user_pages() for dax mappings
by Dan Williams
Changes since v1 [1]:
1/ Rebased on the accepted cleanups to the memremap() api and the NUMA
hints for devm allocations. (see libnvdimm-for-next [2]).
2/ Rebased on DAX fixes from Ross [3], currently in -mm, and Dave [4],
applied locally for now.
3/ Renamed __pfn_t to pfn_t and converted KVM and UM accordingly (Dave
Hansen)
4/ Make pfn-to-pfn_t conversions a nop (binary identical) for typical
mapped pfns (Dave Hansen)
5/ Fixed up the devm_memremap_pages() api to require passing in a
percpu_ref object. Addresses a crash reported by Logan.
6/ Moved the back pointer from a page to its hosting 'struct
dev_pagemap' to share storage with the 'lru' field rather than
'mapping'. Enables us to revoke mappings at devm_memunmap_pages()
time and addresses a crash reported by Logan.
7/ Rework dax_map_bh() into dax_map_atomic() to avoid proliferating
buffer_head usage deeper into the dax implementation. Also addresses
a crash reported by Logan (Dave Chinner)
8/ Include an initial, only lightly tested, implementation of revoking
usages of ZONE_DEVICE pages when the driver disables the pmem device.
This coordinates with blk_cleanup_queue() for the pmem gendisk, see
patch 19.
9/ Include a cleaned up version of the vmem_altmap infrastructure
allowing the struct page memmap to optionally be allocated from pmem
itself.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/log/?h=lib...
[3]: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/commit/?h=...
[4]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002286.html
---
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 to flag pages that are owned and dynamically mapped by a device
driver. The pmem driver, after mapping a persistent memory range into
the system memmap via devm_memremap_pages(), arranges for DAX to
distinguish pfn-only versus page-backed pmem-pfns via flags in the new
pfn_t type. The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn,
flags the resulting pte(s) inserted into the process page tables with a
new _PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it
keys off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
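A rough sketch of that flag plumbing (illustrative only; the helper names
follow the pfn_t API this series introduces, and 'phys' / 'pte' are assumed
to come from the pmem driver and the page-table walk respectively):

        /* pmem driver side: hand DAX a pfn that is device-owned and page-backed */
        pfn_t pfn = phys_to_pfn_t(phys, PFN_DEV | PFN_MAP);

        /* gup side: a _PAGE_DEVMAP pte leads back to the hosting device */
        if (pte_devmap(pte)) {
                struct dev_pagemap *pgmap = get_dev_pagemap(pte_pfn(pte), NULL);

                if (!pgmap)
                        return 0;       /* device is being torn down */
                /* ... put_dev_pagemap(pgmap) drops the reference after the walk */
        }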
This series is available via git here:
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending
---
Dan Williams (20):
block: generic request_queue reference counting
dax: increase granularity of dax_clear_blocks() operations
block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
mm: introduce __get_dev_pagemap()
x86, mm: introduce vmem_altmap to augment vmemmap_populate()
libnvdimm, pfn, pmem: allocate memmap array in persistent memory
avr32: convert to asm-generic/memory_model.h
hugetlb: fix compile error on tile
frv: fix compiler warning from definition of __pmd()
um: kill pfn_t
kvm: rename pfn_t to kvm_pfn_t
mips: fix PAGE_MASK definition
mm, dax, pmem: introduce pfn_t
mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP
mm, dax: convert vmf_insert_pfn_pmd() to pfn_t
list: introduce list_poison() and LIST_POISON3
mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
block: notify queue death confirmation
mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages
mm, x86: get_user_pages() for dax mappings
arch/alpha/include/asm/pgtable.h | 1
arch/arm/include/asm/kvm_mmu.h | 5 -
arch/arm/kvm/mmu.c | 10 +
arch/arm64/include/asm/kvm_mmu.h | 3
arch/avr32/include/asm/page.h | 8 -
arch/frv/include/asm/page.h | 2
arch/ia64/include/asm/pgtable.h | 1
arch/m68k/include/asm/page_mm.h | 1
arch/m68k/include/asm/page_no.h | 1
arch/mips/include/asm/kvm_host.h | 6 -
arch/mips/include/asm/page.h | 2
arch/mips/kvm/emulate.c | 2
arch/mips/kvm/tlb.c | 14 +
arch/parisc/include/asm/pgtable.h | 1
arch/powerpc/include/asm/kvm_book3s.h | 4
arch/powerpc/include/asm/kvm_ppc.h | 2
arch/powerpc/include/asm/pgtable.h | 1
arch/powerpc/kvm/book3s.c | 6 -
arch/powerpc/kvm/book3s_32_mmu_host.c | 2
arch/powerpc/kvm/book3s_64_mmu_host.c | 2
arch/powerpc/kvm/e500.h | 2
arch/powerpc/kvm/e500_mmu_host.c | 8 -
arch/powerpc/kvm/trace_pr.h | 2
arch/powerpc/sysdev/axonram.c | 8 -
arch/sparc/include/asm/pgtable_64.h | 2
arch/tile/include/asm/pgtable.h | 1
arch/um/include/asm/page.h | 6 -
arch/um/include/asm/pgtable-3level.h | 5 -
arch/um/include/asm/pgtable.h | 2
arch/x86/include/asm/pgtable.h | 24 ++
arch/x86/include/asm/pgtable_types.h | 7 +
arch/x86/kvm/iommu.c | 11 +
arch/x86/kvm/mmu.c | 37 ++--
arch/x86/kvm/mmu_audit.c | 2
arch/x86/kvm/paging_tmpl.h | 6 -
arch/x86/kvm/vmx.c | 2
arch/x86/kvm/x86.c | 2
arch/x86/mm/gup.c | 56 +++++-
arch/x86/mm/init_64.c | 32 +++
arch/x86/mm/pat.c | 4
block/blk-core.c | 79 +++++++-
block/blk-mq-sysfs.c | 6 -
block/blk-mq.c | 87 +++------
block/blk-sysfs.c | 3
block/blk.h | 12 +
drivers/block/brd.c | 4
drivers/gpu/drm/exynos/exynos_drm_gem.c | 3
drivers/gpu/drm/gma500/framebuffer.c | 3
drivers/gpu/drm/msm/msm_gem.c | 3
drivers/gpu/drm/omapdrm/omap_gem.c | 6 -
drivers/gpu/drm/ttm/ttm_bo_vm.c | 3
drivers/nvdimm/pfn_devs.c | 3
drivers/nvdimm/pmem.c | 128 +++++++++----
drivers/s390/block/dcssblk.c | 10 -
fs/block_dev.c | 2
fs/dax.c | 199 +++++++++++++--------
include/asm-generic/pgtable.h | 6 -
include/linux/blk-mq.h | 1
include/linux/blkdev.h | 12 +
include/linux/huge_mm.h | 2
include/linux/hugetlb.h | 1
include/linux/io.h | 17 --
include/linux/kvm_host.h | 37 ++--
include/linux/kvm_types.h | 2
include/linux/list.h | 14 +
include/linux/memory_hotplug.h | 3
include/linux/mm.h | 300 +++++++++++++++++++++++++++++--
include/linux/mm_types.h | 5 +
include/linux/pfn.h | 9 +
include/linux/poison.h | 1
kernel/memremap.c | 187 +++++++++++++++++++
lib/list_debug.c | 2
mm/gup.c | 11 +
mm/huge_memory.c | 10 +
mm/hugetlb.c | 18 ++
mm/memory.c | 17 +-
mm/memory_hotplug.c | 66 +++++--
mm/page_alloc.c | 10 +
mm/sparse-vmemmap.c | 37 ++++
mm/sparse.c | 8 +
mm/swap.c | 15 ++
virt/kvm/kvm_main.c | 47 ++---
82 files changed, 1264 insertions(+), 418 deletions(-)
[PATCH v3 0/2] Hotplug support for libnvdimm
by Vishal Verma
This series adds support for hotplug of NVDIMMs. Upon hotplug, the ACPI
core calls the .notify callback we register. From this, we evaluate the
_FIT method which returns an updated NFIT. This is scanned for any new
tables, and any new regions found from it are registered and made
available for use.
The series is tested with nfit_test (tools/testing/nvdimm) only, which
means the parts of getting a notification from the acpi core, and calling
_FIT are untested.
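As a rough sketch of that notify path (illustrative only, not the patch
itself; the handler name is made up, the locking follows the notes below,
and the hand-off to the NFIT parser is elided):

static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
{
        struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL };
        acpi_status status;

        device_lock(&adev->dev);        /* serialize against .add/.remove */
        if (!adev->dev.driver)
                goto out;               /* raced with a prior removal */

        /* re-evaluate _FIT to obtain the updated NFIT after a hot-add */
        status = acpi_evaluate_object(adev->handle, "_FIT", NULL, &buf);
        if (ACPI_FAILURE(status))
                goto out;

        /* hand buf.pointer to the NFIT parser so new regions get registered */
        kfree(buf.pointer);
out:
        device_unlock(&adev->dev);
}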
Changes from v2->v3:
- in acpi_nfit_init, splice off the old contents to a "prev" list and
only check for duplicates when "prev" is not empty (Dan)
- in acpi_nfit_init, error out if tables are found to be deleted
- locking changes: Use device_lock for .add and .notify. Check if
dev->driver is valid during notify to protect against a prior
removal (Dan)
- Change IS_ERR_OR_NULL to IS_ERR for acpi_nfit_desc_init (Dan)
- nfit_test: for the hot-plug DIMM, add a flush hint table too for
completeness
Changes from v1->v2:
- If a 0-length header is found in the nfit (patch 1), also spew a
warning (Jeff)
- Don't make a new acpi_evaluate_fit helper - open code a call to
acpi_evaluate_object in nfit.c (Dan/Rafael)
- Remove a warning for duplicate DCRs (Toshi)
- Add an init_lock to protect the notify handler from racing with an
'add' or 'remove' (Dan)
- The only NVDIMM in a system *could* potentially come from a hotplug,
esp in the virtualization case. Refactor how acpi_nfit_desc is
initialized to account for this. For the same reason, don't fail when
a valid NFIT is not found at driver load time. A by-product of this
change is that we need to initialize lists and mutexes manually in
nfit_test. (Dan)
- Remove acpi_nfit_merge (added in v1) as it is now essentially
the same as acpi_nfit_init
- Reword the commit message for patch 2/2 to say 'hot add' instead of
hotplug, making it clearer that hot removal support is not being added
Vishal Verma (2):
nfit: in acpi_nfit_init, break on a 0-length table
acpi: nfit: Add support for hot-add
drivers/acpi/nfit.c | 307 +++++++++++++++++++++++++++++++--------
drivers/acpi/nfit.h | 2 +
tools/testing/nvdimm/test/nfit.c | 164 ++++++++++++++++++++-
3 files changed, 413 insertions(+), 60 deletions(-)
--
2.4.3
[PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Ross Zwisler
This series implements the very slow but correct handling for
blkdev_issue_flush() with DAX mappings, as discussed here:
https://lkml.org/lkml/2015/10/26/116
I don't think that we can actually do the

        on_each_cpu(sync_cache, ...);

...where sync_cache is something like:

        cache_disable();
        wbinvd();
        pcommit();
        cache_enable();
solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
your writes actually make it durably onto the DIMMs. I believe you really do
need to loop through the cache lines, flush them with CLWB, then fence and
PCOMMIT.
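Roughly, the per-cacheline path looks like this (a sketch only, loosely
what the wb_cache_pmem() helper in patch 1 amounts to; the function name is
illustrative and the boundary math assumes x86's clwb() helper):

        static void flush_pmem_range(void *addr, size_t size)
        {
                unsigned long clsize = boot_cpu_data.x86_clflush_size;
                void *end = addr + size;
                void *p;

                /* write back every cache line covering the range... */
                for (p = (void *)((unsigned long)addr & ~(clsize - 1));
                                p < end; p += clsize)
                        clwb(p);

                /* ...then fence before PCOMMIT makes the writes durable */
                wmb();
        }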
I do worry that the cost of blindly flushing the entire PMEM namespace on each
fsync or msync will be prohibitively expensive, and that we'll be very
incentivized to move to the radix tree based dirty page tracking as soon as
possible. :)
Ross Zwisler (2):
pmem: add wb_cache_pmem() to the PMEM API
pmem: Add simple and slow fsync/msync support
arch/x86/include/asm/pmem.h | 11 ++++++-----
drivers/nvdimm/pmem.c | 10 +++++++++-
include/linux/pmem.h | 22 +++++++++++++++++++++-
3 files changed, 36 insertions(+), 7 deletions(-)
--
2.1.0
[RFC 00/11] DAX fsync/msync support
by Ross Zwisler
This patch series adds support for fsync/msync to DAX.
Patches 1 through 8 add various utilities that the DAX code will eventually
need, and the DAX code itself is added by patch 9. Patches 10 and 11 are
filesystem changes that are needed after the DAX code is added, but these
patches may change slightly as the filesystem fault handling for DAX is
being modified ([1] and [2]).
I've marked this series as RFC because I'm still testing, but I wanted to
get this out there so people would see the direction I was going and
hopefully comment on any big red flags sooner rather than later.
I realize that we are getting pretty dang close to the v4.4 merge window,
but I think that if we can get this reviewed and working it's a much better
solution than the "big hammer" approach that blindly flushes entire PMEM
namespaces [3].
[1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
[2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
[3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
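For a feel of the direction, a sketch of the radix-tree based dirty
tracking (illustrative only, not code from this series; the helper name is
hypothetical, and it assumes an entry for the offset was inserted into the
radix tree at fault time as patch 4 arranges):

        /* at DAX fault time: remember that this file offset is dirty */
        static void dax_tag_dirty(struct address_space *mapping, pgoff_t index)
        {
                spin_lock_irq(&mapping->tree_lock);
                radix_tree_tag_set(&mapping->page_tree, index,
                                PAGECACHE_TAG_DIRTY);
                spin_unlock_irq(&mapping->tree_lock);
        }

At fsync/msync time, only the tagged entries in the affected range are
looked up (find_get_entries_tag() in patch 7) and flushed, instead of the
whole namespace.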
Ross Zwisler (11):
pmem: add wb_cache_pmem() to the PMEM API
mm: add pmd_mkclean()
pmem: enable REQ_FLUSH handling
dax: support dirty DAX entries in radix tree
mm: add follow_pte_pmd()
mm: add pgoff_mkclean()
mm: add find_get_entries_tag()
fs: add get_block() to struct inode_operations
dax: add support for fsync/sync
xfs, ext2: call dax_pfn_mkwrite() on write fault
ext4: add ext4_dax_pfn_mkwrite()
arch/x86/include/asm/pgtable.h | 5 ++
arch/x86/include/asm/pmem.h | 11 +--
drivers/nvdimm/pmem.c | 3 +-
fs/dax.c | 161 +++++++++++++++++++++++++++++++++++++++--
fs/ext2/file.c | 5 +-
fs/ext4/file.c | 23 +++++-
fs/inode.c | 1 +
fs/xfs/xfs_file.c | 9 ++-
fs/xfs/xfs_iops.c | 1 +
include/linux/dax.h | 6 ++
include/linux/fs.h | 5 +-
include/linux/mm.h | 2 +
include/linux/pagemap.h | 3 +
include/linux/pmem.h | 22 +++++-
include/linux/radix-tree.h | 3 +
include/linux/rmap.h | 5 ++
mm/filemap.c | 73 ++++++++++++++++++-
mm/huge_memory.c | 14 ++--
mm/memory.c | 41 +++++++++--
mm/page-writeback.c | 9 +++
mm/rmap.c | 53 ++++++++++++++
mm/truncate.c | 5 +-
22 files changed, 418 insertions(+), 42 deletions(-)
--
2.1.0
[PATCH v2 UPDATE-2 3/3] ACPI/APEI/EINJ: Allow memory error injection to NVDIMM
by Toshi Kani
In the case of memory error injection, einj_error_inject() checks
if a target address is regular RAM. Update this check to add a call
to region_intersects_pmem() to verify if a target address range is
NVDIMM. This allows injecting a memory error to both RAM and NVDIMM
for testing.
Also, the current RAM check, page_is_ram(), is replaced with
region_intersects_ram() so that it can verify a target address
range with the requested size.
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
---
UPDATE:
- Add a blank line before if-statement. (Borislav Petkov)
- Check the param2 value before target memory type. (Tony Luck)
---
drivers/acpi/apei/einj.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 0431883..5d7c0b4 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -519,7 +519,7 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
u64 param3, u64 param4)
{
int rc;
- unsigned long pfn;
+ u64 base_addr, size;
/* If user manually set "flags", make sure it is legal */
if (flags && (flags &
@@ -545,10 +545,15 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
/*
* Disallow crazy address masks that give BIOS leeway to pick
* injection address almost anywhere. Insist on page or
- * better granularity and that target address is normal RAM.
+ * better granularity and that target address is normal RAM or
+ * NVDIMM.
*/
- pfn = PFN_DOWN(param1 & param2);
- if (!page_is_ram(pfn) || ((param2 & PAGE_MASK) != PAGE_MASK))
+ base_addr = param1 & param2;
+ size = (~param2) + 1;
+
+ if (((param2 & PAGE_MASK) != PAGE_MASK) ||
+ ((region_intersects_ram(base_addr, size) != REGION_INTERSECTS) &&
+ (region_intersects_pmem(base_addr, size) != REGION_INTERSECTS)))
return -EINVAL;
inject:
[PATCH v2 0/5] block, dax: updates for 4.4
by Dan Williams
Changes since v1: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002538.html
1/ Rename file_bd_inode to bdev_file_inode (Jan Kara)
2/ Clarify sb_start_pagefault() comment (Jan Kara)
3/ Collect Reviewed-by's
---
As requested [1], break out the block specific updates from the dax-gup
series [2], to merge via the block tree.
1/ Enable dax mappings for raw block devices (see the sketch below).
This addresses the review comments (from Ross and Honza) from the RFC [3].
2/ Introduce dax_map_atomic() to fix races between device teardown and
new mapping requests. This depends on commit 2a9067a91825 "block:
generic request_queue reference counting" in for-4.4/integrity branch
of the block tree.
3/ Cleanup clear_pmem() and its usage in dax. This depends on commit
0f90cc6609c7 "mm, dax: fix DAX deadlocks" that was merged into v4.3-rc6.
These pass the nvdimm unit tests and have passed a 0day-kbuild-robot run.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002531.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html
[3]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002512.html
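As a userspace-level illustration of what item 1/ enables (the device node
and mapping length are assumptions, not part of the series):

        #include <fcntl.h>
        #include <sys/mman.h>

        int main(void)
        {
                int fd = open("/dev/pmem0", O_RDWR);    /* assumed device node */
                size_t len = 2UL << 20;                 /* illustrative length */

                /* with patch 5, MAP_SHARED on the raw device maps pmem
                 * directly, with no page cache in between */
                void *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

                return pmem == MAP_FAILED;
        }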
---
Dan Williams (5):
pmem, dax: clean up clear_pmem()
dax: increase granularity of dax_clear_blocks() operations
block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
block: introduce bdev_file_inode()
block: enable dax for raw block devices
arch/x86/include/asm/pmem.h | 7 --
block/blk.h | 2
fs/block_dev.c | 79 ++++++++++++++++-
fs/dax.c | 196 +++++++++++++++++++++++++++----------------
include/linux/blkdev.h | 2
5 files changed, 197 insertions(+), 89 deletions(-)
6 years, 8 months
Re: [PATCH 5/5] block: enable dax for raw block devices
by Dan Williams
On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david(a)fromorbit.com> wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
>> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david(a)fromorbit.com> wrote:
>> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
>> >> filesystems since there's nothing which writeprotects pages that are
>> >> writeably mapped. In the normal path, page writeback does this but that doesn't
>> >> happen for DAX. I remember we once talked about this but it got lost.
>> >> We need something like walk all filesystem inodes during fs freeze and
>> >> writeprotect all pages that are mapped. But that's going to be slow...
>> >
>> > fsync() has the same problem - we have no record of the pages that
>> > need to be committed and then write protected when fsync() is called
>> > after write()...
>>
>> I know Ross is still working on that implementation. However, I had a
>> thought on the flight to ksummit that maybe we shouldn't worry about
>> tracking dirty state on a per-page basis. For small / frequent
>> synchronizations an application really should be using the nvml
>> library [1] to issue cache flushes and pcommit from userspace on a
>> per-cacheline basis. That leaves unmodified apps that want to be
>> correct in the presence of dax mappings. Two things we can do to
>> mitigate that case:
>>
>> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
>> flag. Applications shouldn't silently become incorrect simply because
>> the fs is mounted with -o dax. If an app doesn't understand DAX
>> mappings it should get page-cache semantics. This also protects apps
>> that are not expecting DAX semantics on raw block device mappings.
>
> Which is the complete opposite of what we are trying to achieve with
> DAX. i.e. that existing applications "just work" with DAX without
> modification. So this is a non-starter.
The list of things DAX breaks is getting shorter, but certainly the
page-less paths will not be without surprises for quite a while yet...
> Also, DAX access isn't a property of mmap - it's a property
> of the inode. We cannot do DAX access via mmap while mixing page
> cache based access through file descriptor based interfaces. This
> I why I'm adding an inode attribute (on disk) to enable per-file DAX
> capabilities - either everything is via the DAX paths, or nothing
> is.
>
Per-inode control sounds very useful, I'll look at a similar mechanism
for the raw block case.
However, I'm still not quite convinced page-cache control is an inode-only
property, especially when direct-i/o is not an inode property. That
said, I agree the complexity of handling mixed mappings of the same
file is prohibitive.
>> 2/ Even if we get a new flag that lets the kernel know the app
>> understands DAX mappings, we shouldn't leave fsync broken. Can we
>> instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to its required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this is
> needed for journal based filesystems to work correctly, as they are
> assuming that their journal writes are being treated correctly as
> REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.
>
> However, to avoid having to flush the entire block device range on
> fsync we need a much more complex solution that tracks the dirty
> ranges of the file and hence what needs committing when fsync is
> run....
>
>> Disruptive, yes, but if an app cares about efficient persistent memory
>> synchronization fsync is already the wrong api.
>
> I don't really care about efficiency right now - correctness comes
> first. Fundamentally, the app should not care whether it is writing to
> persistent memory or spinning rust - the filesystem needs to
> provide the application with exactly the same integrity guarantees
> regardless of the underlying storage.
>
Sounds good, get blkdev_issue_flush() functional first and then worry
about building a more efficient solution on top.
[PATCH v2 UPDATE 3/3] ACPI/APEI/EINJ: Allow memory error injection to NVDIMM
by Toshi Kani
In the case of memory error injection, einj_error_inject() checks
if a target address is regular RAM. Update this check to add a call
to region_intersects_pmem() to verify if a target address range is
NVDIMM. This allows injecting a memory error to both RAM and NVDIMM
for testing.
Also, the current RAM check, page_is_ram(), is replaced with
region_intersects_ram() so that it can verify a target address
range with the requested size.
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
---
Add a blank line before if-statement. (Borislav Petkov)
---
drivers/acpi/apei/einj.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 0431883..db21efe 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -519,7 +519,7 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
u64 param3, u64 param4)
{
int rc;
- unsigned long pfn;
+ u64 base_addr, size;
/* If user manually set "flags", make sure it is legal */
if (flags && (flags &
@@ -545,10 +545,15 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
/*
* Disallow crazy address masks that give BIOS leeway to pick
* injection address almost anywhere. Insist on page or
- * better granularity and that target address is normal RAM.
+ * better granularity and that target address is normal RAM or
+ * NVDIMM.
*/
- pfn = PFN_DOWN(param1 & param2);
- if (!page_is_ram(pfn) || ((param2 & PAGE_MASK) != PAGE_MASK))
+ base_addr = param1 & param2;
+ size = (~param2) + 1;
+
+ if (((region_intersects_ram(base_addr, size) != REGION_INTERSECTS) &&
+ (region_intersects_pmem(base_addr, size) != REGION_INTERSECTS)) ||
+ ((param2 & PAGE_MASK) != PAGE_MASK))
return -EINVAL;
inject: