[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
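To illustrate why the byte-array representation matters, here is a minimal userspace sketch of building an ID string from such fields. The struct, its field names, the `mfg_info_valid` flag, and the dash-separated layout are assumptions made for this example; they are not the ACPICA definitions or the exact sysfs output of patch 2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical subset of the control region fields, NOT the ACPICA layout. */
struct dimm_id_fields {
	uint8_t vendor_id[2];
	uint8_t manufacturing_location;
	uint8_t manufacturing_date[2];
	uint8_t serial_number[4];
	int mfg_info_valid;	/* stands in for the Valid Fields bit */
};

/*
 * Print the ID byte by byte, in storage order. Because the fields are
 * byte arrays, the output is the same on little- and big-endian hosts;
 * reading vendor_id as a native u16 on x86 would swap the two bytes.
 */
static int format_dimm_id(const struct dimm_id_fields *f, char *buf, size_t len)
{
	assert(f != NULL && buf != NULL);

	if (f->mfg_info_valid)
		return snprintf(buf, len,
				"%02x%02x-%02x-%02x%02x-%02x%02x%02x%02x",
				f->vendor_id[0], f->vendor_id[1],
				f->manufacturing_location,
				f->manufacturing_date[0],
				f->manufacturing_date[1],
				f->serial_number[0], f->serial_number[1],
				f->serial_number[2], f->serial_number[3]);

	/* Without valid manufacturing info, fall back to vendor and serial. */
	return snprintf(buf, len, "%02x%02x-%02x%02x%02x%02x",
			f->vendor_id[0], f->vendor_id[1],
			f->serial_number[0], f->serial_number[1],
			f->serial_number[2], f->serial_number[3]);
}
```

With vendor bytes 0x80 0x2c, location 0x01, date bytes 0x16 0x25 and serial bytes 0x12 0x34 0x56 0x78, this sketch yields "802c-01-1625-12345678".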
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
2 years, 7 months
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata is slow; the metadata may be lost or corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
3 years, 3 months
[PATCH 0/6] introduce DAX tracepoint support
by Ross Zwisler
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. This series creates a tracepoint header for FS DAX and adds
the first few DAX tracepoints to the PMD fault handler. This allows the
tracing for DAX to be done in the same way as the filesystem tracing so
that developers can look at them together and get a coherent idea of what
the system is doing.
I do intend to add tracepoints to the normal 4k DAX fault path and to the
DAX I/O path, but I wanted to get feedback on the PMD tracepoints before I
went any further.
This series is based on Jan Kara's "dax: Clear dirty bits after flushing
caches" series:
https://lists.01.org/pipermail/linux-nvdimm/2016-November/007864.html
I've pushed a git tree with this work here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax...
Ross Zwisler (6):
dax: fix build breakage with ext4, dax and !iomap
dax: remove leading space from labels
dax: add tracepoint infrastructure, PMD tracing
dax: update MAINTAINERS entries for FS DAX
dax: add tracepoints to dax_pmd_load_hole()
dax: add tracepoints to dax_pmd_insert_mapping()
MAINTAINERS | 4 +-
fs/Kconfig | 1 +
fs/dax.c | 78 ++++++++++++++----------
fs/ext2/Kconfig | 1 -
include/linux/mm.h | 14 +++++
include/linux/pfn_t.h | 6 ++
include/trace/events/fs_dax.h | 135 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 206 insertions(+), 33 deletions(-)
create mode 100644 include/trace/events/fs_dax.h
--
2.7.4
3 years, 10 months
multi-threads libvmmalloc fork test hang
by Xiong Zhou
# description
The vmmalloc_fork test in the nvml test suite hangs.
$ ps -eo stat,comm | grep vmma
S+ vmmalloc_fork
Sl+ vmmalloc_fork
Z+ vmmalloc_fork <defunct>
Sl+ vmmalloc_fork
Z+ vmmalloc_fork <defunct>
Z+ vmmalloc_fork <defunct>
Sl+ vmmalloc_fork
Z+ vmmalloc_fork <defunct>
Z+ vmmalloc_fork <defunct>
Z+ vmmalloc_fork <defunct>
dmesg:
[ 250.499097] INFO: task vmmalloc_fork:9805 blocked for more than 120 seconds.
[ 250.530667] Not tainted 4.9.09fe68ca+ #27
[ 250.550901] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 250.585752] vmmalloc_fork D
[ 250.598362] ffffffff8171813c 0 9805 9765 0x00000080
[ 250.623445] ffff88075dc68f80
[ 250.636052] 0000000000000000 ffff88076058db00 ffff88017c5b0000 ffff880763b19340
[ 250.668510] ffffc9000fe1bbb0 ffffffff8171813c ffffc9000fe1bc20 ffffc9000fe1bbe0
[ 250.704220] ffffffff82248898 ffff88076058db00 ffffffff82248898
Call Trace:
[ 250.738382] [<ffffffff8171813c>] ? __schedule+0x21c/0x6a0
[ 250.763404] [<ffffffff817185f6>] schedule+0x36/0x80
[ 250.786177] [<ffffffff81284471>] get_unlocked_mapping_entry+0xc1/0x120
[ 250.815869] [<ffffffff81283810>] ? iomap_dax_rw+0x110/0x110
[ 250.841350] [<ffffffff81284c0a>] grab_mapping_entry+0x4a/0x220
[ 250.868442] [<ffffffff812851e9>] iomap_dax_fault+0xa9/0x3b0
[ 250.894437] [<ffffffffa02b15fe>] xfs_filemap_fault+0xce/0xf0 [xfs]
[ 250.922805] [<ffffffff811d3159>] __do_fault+0x79/0x100
[ 250.947035] [<ffffffff811d7a2b>] do_fault+0x49b/0x690
[ 250.970964] [<ffffffffa02b146c>] ? xfs_filemap_pmd_fault+0x9c/0x160 [xfs]
[ 251.001812] [<ffffffff811d94ba>] handle_mm_fault+0x61a/0xa50
[ 251.027736] [<ffffffff8106c3da>] __do_page_fault+0x22a/0x4a0
[ 251.053700] [<ffffffff8106c680>] do_page_fault+0x30/0x80
[ 251.077962] [<ffffffff81003b55>] ? do_syscall_64+0x175/0x180
[ 251.103835] [<ffffffff8171e208>] page_fault+0x28/0x30
# kernel versions:
v4.6 pass in seconds
v4.7 hang
v4.9-rc1 hang
Linus tree to commit 9fe68ca hang
bisect points to
first bad commit: [ac401cc782429cc8560ce4840b1405d603740917] dax: New fault locking
v4.7 with these 3 commits reverted pass:
4d9a2c8 - Jan Kara, 6 months ago : dax: Remove i_mmap_lock protection
bc2466e - Jan Kara, 6 months ago : dax: Use radix tree entry lock to protect cow faults
ac401cc - Jan Kara, 6 months ago : dax: New fault locking
# nvml version:
https://github.com/pmem/nvml.git
to commit:
feab4d6f65102139ce460890c898fcad09ce20ae
# How reproducible:
always
# Test steps:
<git clone and pmem0 setup>
$ cd nvml
$ make install -j64
$ cat > src/test/testconfig.sh <<EOF
PMEM_FS_DIR=/daxmnt
NON_PMEM_FS_DIR=/tmp
EOF
$ mkfs.xfs /dev/pmem0
$ mkdir -p /daxmnt/
$ mount -o dax /dev/pmem0 /daxmnt/
$ make -C src/test/vmmalloc_fork/ TEST_TIME=60m clean
$ make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
$ umount /daxmnt
4 years
[PATCH v2 0/3] use nocache copy in copy_from_iter_nocache()
by Brian Boylston
Currently, copy_from_iter_nocache() uses "nocache" copies only for
iovecs; bvecs and kvecs use normal copies. This requires
x86's arch_copy_from_iter_pmem() to issue flushes for bvecs and kvecs,
which has a negative impact on performance when splice()ing from a pipe
to a pmem-backed file on a DAX-mounted file system.
This patch set enables nocache copies in copy_from_iter_nocache() for
bvecs and kvecs for arches that support it (x86 initially). This provides
a 2-3X improvement in splice() pipe-to-DAX-file throughput.
The first patch introduces memcpy_nocache(), which defaults to just
memcpy(), but for which an x86-specific implementation is provided.
For this patch, I sought to use a static inline function for x86, but
I could not find an obvious header file to put it in.
The build seemed to work when I put it in arch/x86/include/asm/uaccess.h,
but that didn't feel completely right. I also tried
arch/x86/include/asm/pmem.h, but that doesn't feel right either and it
didn't build. So, I offer it here in arch/x86/lib/misc.c for discussion.
The second patch updates copy_from_iter_nocache() to use the new
memcpy_nocache().
The third patch removes the flushes from x86's arch_copy_from_iter_pmem().
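As a rough userspace analogue of the idea behind memcpy_nocache(), the sketch below copies with SSE2 non-temporal stores when they are available and falls back to plain memcpy() otherwise. The function name and structure are assumptions for illustration only; this is not the code in arch/x86/lib/misc.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical userspace analogue of memcpy_nocache(); not the kernel code. */
static void *memcpy_nocache_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
	/* Copy head bytes until the destination is 16-byte aligned. */
	while (len > 0 && ((uintptr_t)d & 15) != 0) {
		*d++ = *s++;
		len--;
	}
	/* Stream 16-byte chunks with non-temporal (cache-bypassing) stores. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}
	/* Order the streamed stores before any later stores. */
	_mm_sfence();
#endif
	/* Remaining tail bytes, or the whole copy on non-SSE2 builds. */
	memcpy(d, s, len);
	return dst;
}
```

Note that the head and tail bytes in this sketch still go through the cache, so a caller that needs every byte to bypass it would have to handle unaligned edges separately.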
For testing, I ran fio with the posixaio, mmap, sync, psync, vsync, pvsync,
and splice engines, against both ext4 and xfs. Only the splice engine
showed any change in performance. For example, for xfs:
Unpatched 4.8:
Run status group 2 (all jobs):
WRITE: io=37602MB, aggrb=641724KB/s, minb=641724KB/s, maxb=641724KB/s, mint=60001msec, maxt=60001msec
Run status group 3 (all jobs):
WRITE: io=36244MB, aggrb=618553KB/s, minb=618553KB/s, maxb=618553KB/s, mint=60001msec, maxt=60001msec
With this patch set:
Run status group 2 (all jobs):
WRITE: io=128055MB, aggrb=2134.3MB/s, minb=2134.3MB/s, maxb=2134.3MB/s, mint=60001msec, maxt=60001msec
Run status group 3 (all jobs):
WRITE: io=122586MB, aggrb=2043.8MB/s, minb=2043.8MB/s, maxb=2043.8MB/s, mint=60001msec, maxt=60001msec
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: <x86(a)kernel.org>
Cc: Al Viro <viro(a)ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Brian Boylston <brian.boylston(a)hpe.com>
Reviewed-by: Toshi Kani <toshi.kani(a)hpe.com>
Reported-by: Oliver Moreno <oliver.moreno(a)hpe.com>
Changes in v2:
- Split into multiple patches (Toshi Kani)
- Introduce memcpy_nocache() (Al Viro)
- Use nocache for kvecs as well
Brian Boylston (3):
introduce memcpy_nocache()
use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
x86: remove unneeded flush in arch_copy_from_iter_pmem()
arch/x86/include/asm/pmem.h | 19 +------------------
arch/x86/include/asm/string_32.h | 3 +++
arch/x86/include/asm/string_64.h | 3 +++
arch/x86/lib/misc.c | 12 ++++++++++++
include/linux/string.h | 15 +++++++++++++++
lib/iov_iter.c | 14 +++++++++++---
6 files changed, 45 insertions(+), 21 deletions(-)
--
2.8.3
4 years
[PATCH] x86: fix kaslr and memmap collision
by Dave Jiang
CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
However it does not take into account the memmap= parameter passed in from
the kernel commandline. This results in the kernel sometimes being put in
the middle of the user memmap. Add a check to KASLR so that it avoids
the region marked by memmap.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
arch/x86/boot/boot.h | 2 ++
arch/x86/boot/compressed/kaslr.c | 45 ++++++++++++++++++++++++++++++++++++++
arch/x86/boot/string.c | 25 +++++++++++++++++++++
3 files changed, 72 insertions(+)
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index e5612f3..0d5fe5b 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -332,6 +332,8 @@ int strncmp(const char *cs, const char *ct, size_t count);
size_t strnlen(const char *s, size_t maxlen);
unsigned int atou(const char *s);
unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
+unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
+long simple_strtol(const char *cp, char **endp, unsigned int base);
size_t strlen(const char *s);
/* tty.c */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index a66854d..6fb8f1ec 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -11,6 +11,7 @@
*/
#include "misc.h"
#include "error.h"
+#include "../boot.h"
#include <generated/compile.h>
#include <linux/module.h>
@@ -61,6 +62,7 @@ enum mem_avoid_index {
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
+ MEM_AVOID_MEMMAP,
MEM_AVOID_MAX,
};
@@ -77,6 +79,37 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
return true;
}
+#include "../../../../lib/cmdline.c"
+
+static int
+parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
+{
+ char *oldp;
+
+ if (!p)
+ return -EINVAL;
+
+ /* we don't care about this option here */
+ if (!strncmp(p, "exactmap", 8))
+ return -EINVAL;
+
+ oldp = p;
+ *size = memparse(p, &p);
+ if (p == oldp)
+ return -EINVAL;
+
+ switch (*p) {
+ case '@':
+ case '#':
+ case '$':
+ case '!':
+ *start = memparse(p+1, &p);
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
/*
* In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
* The mem_avoid array is used to store the ranges that need to be avoided
@@ -158,6 +191,8 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
u64 initrd_start, initrd_size;
u64 cmd_line, cmd_line_size;
char *ptr;
+ char arg[38];
+ unsigned long long memmap_start, memmap_size;
/*
* Avoid the region that is unsafe to overlap during
@@ -195,6 +230,16 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
mem_avoid[MEM_AVOID_BOOTPARAMS].size);
+ /* see if we have any memmap areas */
+ if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
+ int rc = parse_memmap(arg, &memmap_start, &memmap_size);
+
+ if (!rc) {
+ mem_avoid[MEM_AVOID_MEMMAP].start = memmap_start;
+ mem_avoid[MEM_AVOID_MEMMAP].size = memmap_size;
+ }
+ }
+
/* We don't need to set a mapping for setup_data. */
#ifdef CONFIG_X86_VERBOSE_BOOTUP
diff --git a/arch/x86/boot/string.c b/arch/x86/boot/string.c
index cc3bd58..7a376c1 100644
--- a/arch/x86/boot/string.c
+++ b/arch/x86/boot/string.c
@@ -122,6 +122,31 @@ unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int bas
}
/**
+ * simple_strtoul - convert a string to an unsigned long
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base)
+{
+ return simple_strtoull(cp, endp, base);
+}
+
+/**
+ * simple_strtol - convert a string to a signed long
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+long simple_strtol(const char *cp, char **endp, unsigned int base)
+{
+ if (*cp == '-')
+ return -simple_strtoul(cp + 1, endp, base);
+
+ return simple_strtoul(cp, endp, base);
+}
+
+/**
* strlen - Find the length of a string
* @s: The string to be sized
*/
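The parsing and avoidance logic in this patch can be tried out in userspace with a sketch like the following. It is only an approximation under stated assumptions: parse_size() is a simplified stand-in for the kernel's memparse() (K/M/G suffixes only), and the struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct mem_range {
	unsigned long long start;
	unsigned long long size;
};

/* Simplified stand-in for memparse(): a number plus optional K/M/G suffix. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse "size@start" ('#', '$' and '!' also accepted); 0 on success, -1 otherwise. */
static int parse_memmap_sketch(const char *p, struct mem_range *out)
{
	char *end;

	assert(out != NULL);
	if (!p || !strncmp(p, "exactmap", 8))
		return -1;

	out->size = parse_size(p, &end);
	if (end == p)
		return -1;

	switch (*end) {
	case '@': case '#': case '$': case '!':
		out->start = parse_size(end + 1, &end);
		return 0;
	default:
		return -1;
	}
}

/* Overlap test on half-open ranges [start, start + size). */
static bool ranges_overlap(const struct mem_range *a, const struct mem_range *b)
{
	return a->start < b->start + b->size && b->start < a->start + a->size;
}
```

For example, "4G!12G" parses to a 4 GiB region starting at 12 GiB, so a candidate kernel placement at 13 GiB would be rejected while one at 1 GiB would be accepted.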
4 years
[PATCH v3] libnvdimm: clear poison in mem map metadata
by Dave Jiang
Clear the poison in the metadata block of the namespace before
we use it. The range runs from start + 8K to pfn_sb->dataoff.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
drivers/nvdimm/pfn_devs.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index cea8350..7fa428e 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -527,11 +527,39 @@ static struct vmem_altmap *__nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
.base_pfn = init_altmap_base(base),
.reserve = init_altmap_reserve(base),
};
+ sector_t sector;
+ resource_size_t meta_start, meta_size;
+ long cleared;
+ unsigned int sz_align;
memcpy(res, &nsio->res, sizeof(*res));
res->start += start_pad;
res->end -= end_trunc;
+ meta_start = res->start + SZ_8K;
+ meta_size = offset - meta_start + 1;
+
+ if (meta_start + meta_size > offset)
+ return ERR_PTR(-EINVAL);
+
+ sector = meta_start >> 9;
+ sz_align = ALIGN(meta_size + (meta_start & (512 - 1)), 512);
+
+ if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) {
+ if (!IS_ALIGNED(meta_start, 512) ||
+ !IS_ALIGNED(meta_size, 512))
+ return ERR_PTR(-EIO);
+
+ cleared = nvdimm_clear_poison(&nd_pfn->dev,
+ meta_start, meta_size);
+ if (cleared <= 0)
+ return ERR_PTR(-EIO);
+
+ badblocks_clear(&nsio->bb, sector, cleared >> 9);
+ if (cleared != meta_size)
+ return ERR_PTR(-EIO);
+ }
+
if (nd_pfn->mode == PFN_MODE_RAM) {
if (offset < SZ_8K)
return ERR_PTR(-EINVAL);
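The sector arithmetic used in the hunk above (first 512-byte sector, plus a byte length rounded up to whole sectors) can be checked in isolation with a small sketch. The helper and struct names below are made up for this example and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define SECTOR_SIZE  (1u << SECTOR_SHIFT)

/* Round x up to the next multiple of a, where a is a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

struct sector_span {
	uint64_t first_sector;	/* index of the first 512-byte sector touched */
	uint64_t bytes;		/* length in bytes, rounded up to whole sectors */
};

/* Compute the sector-aligned span that covers [start, start + len). */
static struct sector_span covering_sectors(uint64_t start, uint64_t len)
{
	struct sector_span s;

	s.first_sector = start >> SECTOR_SHIFT;
	/* Add the offset into the first sector before rounding up. */
	s.bytes = align_up(len + (start & (SECTOR_SIZE - 1)), SECTOR_SIZE);
	return s;
}
```

A range that starts 100 bytes into sector 16 and is 1000 bytes long covers 1100 bytes from the sector boundary, which rounds up to 1536 bytes (three sectors).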
4 years
DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
by Dan Williams
[ adding linux-fsdevel and linux-nvdimm ]
On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
<guangrong.xiao(a)linux.intel.com> wrote:
[..]
> However, it is not easy to handle the case that the new VMA overlays with
> the old VMA already got by userspace. I think we have some choices:
> 1: One way is completely skipping the new VMA region as current kernel code
>    does but i do not think this is good as the later VMAs will be dropped.
>
> 2: show the un-overlayed portion of new VMA. In your case, we just show the
>    region (0x2000 -> 0x3000), however, it can not work well if the VMA is a
>    new created region with different attributions.
>
> 3: completely show the new VMA as this patch does.
>
> Which one do you prefer?
>
>
I don't have a preference, but perhaps this breakage and uncertainty
is a good opportunity to propose a more reliable interface for NVML to
get the information it needs?
My understanding is that it is looking for the VM_MIXEDMAP flag which
is already ambiguous for determining if DAX is enabled even if this
dynamic listing issue is fixed. XFS has arranged for DAX to be a
per-inode capability and has an XFS-specific inode flag. We can make
that a common inode flag, but it seems we should have a way to
interrogate the mapping itself in the case where the inode is unknown
or unavailable. I'm thinking extensions to mincore to have flags for
DAX and possibly whether the page is part of a pte, pmd, or pud
mapping. Just floating that idea before starting to look into the
implementation, comments or other ideas welcome...
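For context, this is roughly how mincore(2) is used today on Linux: each byte of the result vector describes one page, and only bit 0 (resident) is defined. The DAX and pte/pmd/pud flags floated above do not exist; the helper name below is made up for this sketch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count the resident pages in [addr, addr + len). addr must be page
 * aligned. Returns -1 if the vector cannot be allocated or mincore() fails.
 */
static long count_resident_pages(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages, i;
	unsigned char *vec;
	long resident = 0;

	assert(page > 0);
	npages = (len + (size_t)page - 1) / (size_t)page;
	vec = malloc(npages ? npages : 1);
	if (!vec)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;	/* only bit 0 is defined today */

	free(vec);
	return resident;
}
```

An extension along the lines suggested above would presumably report extra per-page information through the currently unused bits of each vector byte, but that is speculation about an interface that has not been designed.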
4 years, 1 month
[PATCH v2] x86: fix kaslr and memmap collision
by Dave Jiang
CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
However it does not take into account the memmap= parameter passed in from
the kernel cmdline. This results in the kernel sometimes being put in
the middle of the user memmap. Teach KASLR not to insert the kernel in
memmap-defined regions. Up to 4 memmap regions are supported; any
additional regions cause KASLR to be disabled. The mem_avoid set has
been augmented with up to 4 user-provided memmap regions so that they
are excluded from the set of valid address ranges in which to place
the uncompressed kernel image.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
arch/x86/boot/boot.h | 3 +
arch/x86/boot/compressed/kaslr.c | 82 ++++++++++++++++++++++++++++++++++++++
arch/x86/boot/string.c | 38 ++++++++++++++++++
3 files changed, 123 insertions(+)
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index e5612f3..59c2075 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -332,7 +332,10 @@ int strncmp(const char *cs, const char *ct, size_t count);
size_t strnlen(const char *s, size_t maxlen);
unsigned int atou(const char *s);
unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
+unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
+long simple_strtol(const char *cp, char **endp, unsigned int base);
size_t strlen(const char *s);
+char *strchr(const char *s, int c);
/* tty.c */
void puts(const char *);
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index a66854d..915509f 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -11,6 +11,7 @@
*/
#include "misc.h"
#include "error.h"
+#include "../boot.h"
#include <generated/compile.h>
#include <linux/module.h>
@@ -61,9 +62,16 @@ enum mem_avoid_index {
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
+ MEM_AVOID_MEMMAP1,
+ MEM_AVOID_MEMMAP2,
+ MEM_AVOID_MEMMAP3,
+ MEM_AVOID_MEMMAP4,
MEM_AVOID_MAX,
};
+/* only supporting at most 4 memmap regions with kaslr */
+#define MAX_MEMMAP_REGIONS 4
+
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
@@ -77,6 +85,72 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
return true;
}
+#include "../../../../lib/cmdline.c"
+
+static int
+parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
+{
+ char *oldp;
+
+ if (!p)
+ return -EINVAL;
+
+ /* we don't care about this option here */
+ if (!strncmp(p, "exactmap", 8))
+ return -EINVAL;
+
+ oldp = p;
+ *size = memparse(p, &p);
+ if (p == oldp)
+ return -EINVAL;
+
+ switch (*p) {
+ case '@':
+ case '#':
+ case '$':
+ case '!':
+ *start = memparse(p + 1, &p);
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+static int mem_avoid_memmap(void)
+{
+ char arg[128];
+ int rc = 0;
+
+ /* see if we have any memmap areas */
+ if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
+ int i = 0;
+ char *str = arg;
+
+ while (str && (i < MAX_MEMMAP_REGIONS)) {
+ unsigned long long start, size;
+ char *k = strchr(str, ',');
+
+ if (k)
+ *k++ = 0;
+
+ rc = parse_memmap(str, &start, &size);
+ if (rc < 0)
+ break;
+ str = k;
+
+ mem_avoid[MEM_AVOID_MEMMAP1 + i].start = start;
+ mem_avoid[MEM_AVOID_MEMMAP1 + i].size = size;
+ i++;
+ }
+
+ /* more than 4 memmaps, fail kaslr */
+ if ((i >= MAX_MEMMAP_REGIONS) && str)
+ rc = -EINVAL;
+ }
+
+ return rc;
+}
+
/*
* In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
* The mem_avoid array is used to store the ranges that need to be avoided
@@ -429,6 +503,7 @@ void choose_random_location(unsigned long input,
unsigned long *virt_addr)
{
unsigned long random_addr, min_addr;
+ int rc;
/* By default, keep output position unchanged. */
*virt_addr = *output;
@@ -438,6 +513,13 @@ void choose_random_location(unsigned long input,
return;
}
+ /* Mark the memmap regions we need to avoid */
+ rc = mem_avoid_memmap();
+ if (rc < 0) {
+ warn("KASLR disabled: memmap exceeds limit of 4, giving up.");
+ return;
+ }
+
boot_params->hdr.loadflags |= KASLR_FLAG;
/* Prepare to add new identity pagetables on demand. */
diff --git a/arch/x86/boot/string.c b/arch/x86/boot/string.c
index cc3bd58..0464aaa 100644
--- a/arch/x86/boot/string.c
+++ b/arch/x86/boot/string.c
@@ -122,6 +122,31 @@ unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int bas
}
/**
+ * simple_strtoul - convert a string to an unsigned long
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base)
+{
+ return simple_strtoull(cp, endp, base);
+}
+
+/**
+ * simple_strtol - convert a string to a signed long
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+long simple_strtol(const char *cp, char **endp, unsigned int base)
+{
+ if (*cp == '-')
+ return -simple_strtoul(cp + 1, endp, base);
+
+ return simple_strtoul(cp, endp, base);
+}
+
+/**
* strlen - Find the length of a string
* @s: The string to be sized
*/
@@ -155,3 +180,16 @@ char *strstr(const char *s1, const char *s2)
}
return NULL;
}
+
+/**
+ * strchr - Find the first occurrence of the character c in the string s.
+ * @s: the string to be searched
+ * @c: the character to search for
+ */
+char *strchr(const char *s, int c)
+{
+ while (*s != (char)c)
+ if (*s++ == '\0')
+ return NULL;
+ return (char *)s;
+}
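The comma-splitting loop with its four-region cap can be exercised outside the boot code with a sketch like this one. To keep it short it accepts only plain "size@start" numbers without K/M/G suffixes, and all names in it are assumptions for the example rather than the patch's identifiers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_REGIONS 4

struct region {
	unsigned long long start;
	unsigned long long size;
};

/* Parse one "size@start" item with plain numbers (no K/M/G suffixes here). */
static int parse_item(const char *s, struct region *r)
{
	char *end;

	r->size = strtoull(s, &end, 0);
	if (end == s || *end != '@')
		return -1;

	s = end + 1;
	r->start = strtoull(s, &end, 0);
	return (end == s || *end != '\0') ? -1 : 0;
}

/*
 * Split a writable "a,b,c" list in place and fill up to MAX_REGIONS
 * entries. Returns the number of regions parsed, or -1 on a malformed
 * item or when more than MAX_REGIONS items are present.
 */
static int parse_region_list(char *list, struct region out[MAX_REGIONS])
{
	int n = 0;
	char *str = list;

	assert(list != NULL && out != NULL);
	while (str) {
		char *next = strchr(str, ',');

		if (next)
			*next++ = '\0';
		if (n == MAX_REGIONS)
			return -1;	/* a fifth item exists */
		if (parse_item(str, &out[n]) < 0)
			return -1;
		n++;
		str = next;
	}
	return n;
}
```

Unlike the patch, this sketch reports a malformed item and an over-long list with the same return value; a caller that wants distinct messages would need separate error codes.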
4 years, 1 month
[PATCH 0/6 v2] dax: Page invalidation fixes
by Jan Kara
Hello,
this is the second revision of my fixes for races when invalidating hole pages in
DAX mappings. See changelogs for details. The series is based on my patches to
write-protect DAX PTEs which are currently carried in mm tree. This is a hard
dependency because we really need to closely track dirtiness (and cleanness!)
of radix tree entries in DAX mappings in order to avoid discarding valid dirty
bits leading to missed cache flushes on fsync(2).
The patches have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
I'd like to get some review of the patches (MM/FS people, please check whether
you like the direction changes in mm/truncate.c take in patch 2/6 - added
Johannes to CC since he was touching related code recently) so that these
patches can land in some tree once DAX write-protection patches are merged.
I'm hoping to get at least the first three patches merged for 4.10-rc2... Thanks!
Changes since v1:
* Rebased on top of patches in mm tree
* Added some Reviewed-by tags
* renamed some functions based on review feedback
Honza
4 years, 1 month