[PATCH v2 0/4] Remove nrexceptional tracking
by Matthew Wilcox (Oracle)
We actually use nrexceptional for very little these days. It's a minor
pain to keep in sync with nrpages, but the pain becomes much bigger
with the THP patches because we don't know how many indices a shadow
entry occupies. It's easier to just remove it than keep it accurate.
Also, we save 8 bytes per inode which is nothing to sneeze at; on my
laptop, it would improve shmem_inode_cache from 22 to 23 objects per
16kB, and inode_cache from 26 to 27 objects. Combined, that saves
a megabyte of memory from a combined usage of 25MB for both caches.
Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
any memory for ext4.
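
The objects-per-slab arithmetic above can be sketched as follows. The object sizes used in the note below (720, 608 and 1080 bytes) are assumptions picked only to reproduce the quoted 22 -> 23 and 26 -> 27 counts; they are not measured values, and the sketch ignores slab metadata and alignment padding.

```c
#include <stddef.h>

/* Objects of obj_size bytes that fit in one slab of slab_bytes bytes.
 * Simplification: ignores slab metadata and alignment padding. */
static size_t objs_per_slab(size_t slab_bytes, size_t obj_size)
{
	return slab_bytes / obj_size;
}

/* Extra objects per slab gained by shrinking each object by `saved` bytes. */
static size_t objs_gained(size_t slab_bytes, size_t obj_size, size_t saved)
{
	return objs_per_slab(slab_bytes, obj_size - saved) -
	       objs_per_slab(slab_bytes, obj_size);
}
```

With a 16384-byte slab, objs_gained(16384, 720, 8) and objs_gained(16384, 608, 8) both return 1, while a hypothetical 1080-byte object gains nothing from the same 8 bytes because the division does not cross an integer boundary.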
Matthew Wilcox (Oracle) (4):
mm: Introduce and use mapping_empty
mm: Stop accounting shadow entries
dax: Account DAX entries as nrpages
mm: Remove nrexceptional from inode
fs/block_dev.c | 2 +-
fs/dax.c | 8 ++++----
fs/gfs2/glock.c | 3 +--
fs/inode.c | 2 +-
include/linux/fs.h | 2 --
include/linux/pagemap.h | 5 +++++
mm/filemap.c | 16 ----------------
mm/swap_state.c | 4 ----
mm/truncate.c | 19 +++----------------
mm/workingset.c | 1 -
10 files changed, 15 insertions(+), 47 deletions(-)
--
2.28.0
1 year, 2 months
[RFC 0/2] virtio-pmem: Asynchronous flush
by Pankaj Gupta
Jeff reported a preflush ordering issue with the existing implementation
of virtio pmem preflush. Dan suggested[1] implementing asynchronous flush
for virtio pmem using a work queue, as done in md/RAID. This patch series
intends to solve the preflush ordering issue and also makes the flush
asynchronous from the submitting thread's POV.
I am submitting this patch series for feedback; it is still a WIP. I have
done basic testing and am currently doing more testing.
Pankaj Gupta (2):
pmem: make nvdimm_flush asynchronous
virtio_pmem: Async virtio-pmem flush
drivers/nvdimm/nd_virtio.c | 66 ++++++++++++++++++++++++++----------
drivers/nvdimm/pmem.c | 15 ++++----
drivers/nvdimm/region_devs.c | 3 +-
drivers/nvdimm/virtio_pmem.c | 9 +++++
drivers/nvdimm/virtio_pmem.h | 12 +++++++
5 files changed, 78 insertions(+), 27 deletions(-)
[1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2
--
2.20.1
1 year, 2 months
[RFC PATCH 1/3] fs: dax.c: move fs hole signifier from DAX_ZERO_PAGE to XA_ZERO_ENTRY
by Amy Parker
DAX uses the DAX_ZERO_PAGE bit to represent holes in files. It could also use
a single entry, such as XArray's XA_ZERO_ENTRY. This distinguishes zero pages
and allows us to shift DAX_EMPTY down (see patch 2/3).
Signed-off-by: Amy Parker <enbyamy(a)gmail.com>
---
fs/dax.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..fa8ca1a71bbd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -77,9 +77,14 @@ fs_initcall(init_dax_wait_table);
#define DAX_SHIFT (4)
#define DAX_LOCKED (1UL << 0)
#define DAX_PMD (1UL << 1)
-#define DAX_ZERO_PAGE (1UL << 2)
#define DAX_EMPTY (1UL << 3)
+/*
+ * A zero entry, XA_ZERO_ENTRY, is used to represent a zero page. This
+ * definition helps with checking if an entry is a PMD size.
+ */
+#define XA_ZERO_PMD_ENTRY (DAX_PMD | (unsigned long)XA_ZERO_ENTRY)
+
static unsigned long dax_to_pfn(void *entry)
{
return xa_to_value(entry) >> DAX_SHIFT;
@@ -114,7 +119,7 @@ static bool dax_is_pte_entry(void *entry)
static int dax_is_zero_entry(void *entry)
{
- return xa_to_value(entry) & DAX_ZERO_PAGE;
+ return xa_to_value(entry) & (unsigned long)XA_ZERO_ENTRY;
}
static int dax_is_empty_entry(void *entry)
@@ -738,7 +743,7 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
- if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+ if (dax_is_zero_entry(entry) && !(flags & (unsigned long)XA_ZERO_ENTRY)) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -1047,7 +1052,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
vm_fault_t ret;
*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_ZERO_PAGE, false);
+ XA_ZERO_ENTRY, false);
ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
trace_dax_load_hole(inode, vmf, ret);
@@ -1434,7 +1439,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
pfn = page_to_pfn_t(zero_page);
*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_PMD | DAX_ZERO_PAGE, false);
+ XA_ZERO_PMD_ENTRY, false);
if (arch_needs_pgtable_deposit()) {
pgtable = pte_alloc_one(vma->vm_mm);
--
2.29.2
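
A note on macros whose body is an OR of flags, such as a combined PMD/zero value: the body is safest wrapped in parentheses, because & binds tighter than | once the macro is expanded inside a larger expression. A minimal sketch of the difference, using hypothetical flag names and values (FLAG_PMD, FLAG_ZERO) that are not the kernel's definitions:

```c
/* Hypothetical flag values for illustration only; they are not the
 * kernel's definitions. */
#define FLAG_PMD   (1UL << 1)
#define FLAG_ZERO  (1UL << 2)

/* Body without outer parentheses: `x & ZERO_PMD_BARE` parses as
 * `(x & FLAG_PMD) | FLAG_ZERO`, because & binds tighter than |. */
#define ZERO_PMD_BARE   FLAG_PMD | FLAG_ZERO
/* Parenthesized body: behaves as a single value in any expression. */
#define ZERO_PMD_SAFE   (FLAG_PMD | FLAG_ZERO)

static unsigned long mask_bare(unsigned long x) { return x & ZERO_PMD_BARE; }
static unsigned long mask_safe(unsigned long x) { return x & ZERO_PMD_SAFE; }
```

mask_bare(0) evaluates to FLAG_ZERO even though no bit of the input is set, whereas mask_safe(0) is 0. Passing the macro as a plain function argument happens to be unaffected, but any later use in a bitwise test would not be.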
1 year, 3 months
[PATCH v2] nvdimm: Avoid race between probe and reading device attributes
by Richard Palethorpe
It is possible to cause a division error and use-after-free by querying the
nmem device before the driver data is fully initialised in nvdimm_probe, e.g.
by doing:
(while true; do
cat /sys/bus/nd/devices/nmem*/available_slots 2>&1 > /dev/null
done) &
while true; do
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/bind
done
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/unbind
done
done
On 5.7-rc3 this causes:
[ 12.711578] divide error: 0000 [#1] SMP KASAN PTI
[ 12.712321] CPU: 0 PID: 231 Comm: cat Not tainted 5.7.0-rc3 #48
[ 12.713188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.714857] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.715772] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.718311] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.719030] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.720005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.720970] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.721889] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.722744] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.723602] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.724600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.725308] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.726268] Call Trace:
[ 12.726633] available_slots_show+0x4e/0x120 [libnvdimm]
[ 12.727380] dev_attr_show+0x42/0x80
[ 12.727891] ? memset+0x20/0x40
[ 12.728341] sysfs_kf_seq_show+0x218/0x410
[ 12.728923] seq_read+0x389/0xe10
[ 12.729415] vfs_read+0x101/0x2d0
[ 12.729891] ksys_read+0xf9/0x1d0
[ 12.730361] ? kernel_write+0x120/0x120
[ 12.730915] do_syscall_64+0x95/0x4a0
[ 12.731435] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.732163] RIP: 0033:0x7fd16f2fe4be
[ 12.732685] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 2e 12 0a 00 e8 69 e9 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[ 12.735207] RSP: 002b:00007ffd3177b838 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 12.736261] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fd16f2fe4be
[ 12.737233] RDX: 0000000000020000 RSI: 00007fd16f1ed000 RDI: 0000000000000003
[ 12.738203] RBP: 00007fd16f1ed000 R08: 00007fd16f1ec010 R09: 0000000000000000
[ 12.739172] R10: 00007fd16f3f4f70 R11: 0000000000000246 R12: 00007ffd3177ce23
[ 12.740144] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[ 12.741139] Modules linked in: nfit libnvdimm
[ 12.741783] ---[ end trace 99532e4b82410044 ]---
[ 12.742452] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.743167] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.745709] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.746340] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.747209] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.748081] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.748977] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.749849] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.750729] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.751708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.752441] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.821357] ==================================================================
[ 12.822284] BUG: KASAN: use-after-free in __mutex_lock+0x111c/0x11a0
[ 12.823084] Read of size 4 at addr ffff888065c26238 by task reproducer/218
[ 12.823968]
[ 12.824183] CPU: 2 PID: 218 Comm: reproducer Tainted: G D 5.7.0-rc3 #48
[ 12.825167] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.826595] Call Trace:
[ 12.826926] dump_stack+0x97/0xe0
[ 12.827362] print_address_description.constprop.0+0x1b/0x210
[ 12.828111] ? __mutex_lock+0x111c/0x11a0
[ 12.828645] __kasan_report.cold+0x37/0x92
[ 12.829179] ? __mutex_lock+0x111c/0x11a0
[ 12.829706] kasan_report+0x38/0x50
[ 12.830158] __mutex_lock+0x111c/0x11a0
[ 12.830666] ? ftrace_graph_stop+0x10/0x10
[ 12.831193] ? is_nvdimm_bus+0x40/0x40 [libnvdimm]
[ 12.831820] ? mutex_trylock+0x2b0/0x2b0
[ 12.832333] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.832975] ? mutex_trylock+0x2b0/0x2b0
[ 12.833500] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.834122] ? prepare_ftrace_return+0xa1/0xf0
[ 12.834724] ? ftrace_graph_caller+0x6b/0xa0
[ 12.835269] ? acpi_label_write+0x390/0x390 [nfit]
[ 12.835909] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.836558] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837179] nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837802] nvdimm_bus_probe+0x110/0x6b0 [libnvdimm]
[ 12.838470] really_probe+0x212/0x9a0
[ 12.838954] driver_probe_device+0x1cd/0x300
[ 12.839511] ? driver_probe_device+0x5/0x300
[ 12.840063] device_driver_attach+0xe7/0x120
[ 12.840623] bind_store+0x18d/0x230
[ 12.841075] kernfs_fop_write+0x200/0x420
[ 12.841606] vfs_write+0x154/0x450
[ 12.842047] ksys_write+0xf9/0x1d0
[ 12.842497] ? __ia32_sys_read+0xb0/0xb0
[ 12.843010] do_syscall_64+0x95/0x4a0
[ 12.843495] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.844140] RIP: 0033:0x7f5b235d3563
[ 12.844607] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[ 12.846877] RSP: 002b:00007fff1c3bc578 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 12.847822] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f5b235d3563
[ 12.848717] RDX: 0000000000000006 RSI: 000055f9576710d0 RDI: 0000000000000001
[ 12.849594] RBP: 000055f9576710d0 R08: 000000000000000a R09: 0000000000000000
[ 12.850470] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
[ 12.851333] R13: 00007f5b236a3500 R14: 0000000000000006 R15: 00007f5b236a3700
[ 12.852247]
[ 12.852466] Allocated by task 225:
[ 12.852893] save_stack+0x1b/0x40
[ 12.853310] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 12.853918] kmem_cache_alloc_node+0xef/0x270
[ 12.854475] copy_process+0x485/0x6130
[ 12.854945] _do_fork+0xf1/0xb40
[ 12.855353] __do_sys_clone+0xc3/0x100
[ 12.855843] do_syscall_64+0x95/0x4a0
[ 12.856302] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.856939]
[ 12.857140] Freed by task 0:
[ 12.857522] save_stack+0x1b/0x40
[ 12.857940] __kasan_slab_free+0x12c/0x170
[ 12.858464] kmem_cache_free+0xb0/0x330
[ 12.858945] rcu_core+0x55f/0x19f0
[ 12.859385] __do_softirq+0x228/0x944
[ 12.859869]
[ 12.860075] The buggy address belongs to the object at ffff888065c26200
[ 12.860075] which belongs to the cache task_struct of size 6016
[ 12.861638] The buggy address is located 56 bytes inside of
[ 12.861638] 6016-byte region [ffff888065c26200, ffff888065c27980)
[ 12.863084] The buggy address belongs to the page:
[ 12.863702] page:ffffea0001970800 refcount:1 mapcount:0 mapping:0000000021ee3712 index:0x0 head:ffffea0001970800 order:3 compound_mapcount:0 compound_pincount:0
[ 12.865478] flags: 0x80000000010200(slab|head)
[ 12.866039] raw: 0080000000010200 0000000000000000 0000000100000001 ffff888066c0f980
[ 12.867010] raw: 0000000000000000 0000000080050005 00000001ffffffff 0000000000000000
[ 12.867986] page dumped because: kasan: bad access detected
[ 12.868696]
[ 12.868900] Memory state around the buggy address:
[ 12.869514] ffff888065c26100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.870414] ffff888065c26180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.871318] >ffff888065c26200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.872238] ^
[ 12.872870] ffff888065c26280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.873754] ffff888065c26300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.874640]
==================================================================
This can be prevented by setting the driver data after initialisation is
complete.
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure")
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-kernel(a)vger.kernel.org
Cc: Coly Li <colyli(a)suse.com>
Signed-off-by: Richard Palethorpe <rpalethorpe(a)suse.com>
---
V2:
+ Reviewed by Coly and removed unnecessary lock
drivers/nvdimm/dimm.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 7d4ddc4d9322..3d3988e1d9a0 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -43,7 +43,6 @@ static int nvdimm_probe(struct device *dev)
if (!ndd)
return -ENOMEM;
- dev_set_drvdata(dev, ndd);
ndd->dpa.name = dev_name(dev);
ndd->ns_current = -1;
ndd->ns_next = -1;
@@ -106,6 +105,8 @@ static int nvdimm_probe(struct device *dev)
if (rc)
goto err;
+ dev_set_drvdata(dev, ndd);
+
return 0;
err:
--
2.26.2
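
The idea behind the fix, publishing the driver data only once it is fully initialised, can be sketched in userspace C. This is an analogy under stated assumptions, not the kernel code: struct drv_state, probe() and read_slots() are hypothetical names, and the C11 atomics belong to this sketch only, the patch itself simply moves the dev_set_drvdata() call.

```c
#include <stdatomic.h>
#include <stddef.h>

struct drv_state {            /* hypothetical stand-in for driver data */
	unsigned int nslot;   /* used as a divisor by readers */
};

static _Atomic(struct drv_state *) published;   /* NULL until probe is done */

/* Finish every field first, then publish with a release store. */
static void probe(struct drv_state *st, unsigned int nslot)
{
	st->nslot = nslot;
	atomic_store_explicit(&published, st, memory_order_release);
}

/* Readers see either NULL or a fully initialised object. */
static int read_slots(unsigned int total, unsigned int *out)
{
	struct drv_state *st =
		atomic_load_explicit(&published, memory_order_acquire);

	if (!st)
		return -1;          /* not probed yet: no division attempted */
	*out = total / st->nslot;
	return 0;
}
```

A reader that runs before probe() gets -1 instead of dividing by a zero-initialised field, which is the userspace counterpart of the divide error shown in the log above.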
1 year, 3 months
[PATCH 1/1] ndctl/namespace: Fix disable-namespace accounting relative to seed devices
by Redhairer Li
Seed namespaces are included in "ndctl disable-namespace all". However
since the user never "creates" them it is surprising to see
"disable-namespace" report 1 more namespace relative to the number that
have been created. Catch attempts to disable a zero-sized namespace:
Before:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
{
"dev":"namespace1.2",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.2"
}
disabled 4 namespaces
After:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.3",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.3"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
disabled 3 namespaces
Signed-off-by: Redhairer Li <redhairer.li(a)intel.com>
---
ndctl/lib/libndctl.c | 11 ++++++++---
ndctl/region.c | 4 +++-
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index ee737cb..49f362b 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -4231,6 +4231,7 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
const char *bdev = NULL;
char path[50];
int fd;
+ unsigned long long size = ndctl_namespace_get_size(ndns);
if (pfn && ndctl_pfn_is_enabled(pfn))
bdev = ndctl_pfn_get_block_device(pfn);
@@ -4260,9 +4261,13 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
devname, bdev, strerror(errno));
return -errno;
}
- } else
- ndctl_namespace_disable_invalidate(ndns);
-
+ } else {
+ if (size == 0)
+ /* Don't try to disable idle namespace (no capacity allocated) */
+ return -ENXIO;
+ else
+ ndctl_namespace_disable_invalidate(ndns);
+ }
return 0;
}
diff --git a/ndctl/region.c b/ndctl/region.c
index 7945007..0014bb9 100644
--- a/ndctl/region.c
+++ b/ndctl/region.c
@@ -72,6 +72,7 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
{
struct ndctl_namespace *ndns;
int rc = 0;
+ unsigned long long size;
switch (mode) {
case ACTION_ENABLE:
@@ -80,7 +81,8 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
case ACTION_DISABLE:
ndctl_namespace_foreach(region, ndns) {
rc = ndctl_namespace_disable_safe(ndns);
- if (rc)
+ size = ndctl_namespace_get_size(ndns);
+ if (rc && size != 0)
return rc;
}
rc = ndctl_region_disable_invalidate(region);
--
2.20.1.windows.1
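
The accounting change can be modelled in a few lines. This is a sketch under assumptions: struct ns, ns_disable() and disable_all() are hypothetical stand-ins and do not reflect how ndctl actually tracks or counts namespaces.

```c
#include <errno.h>
#include <stddef.h>

struct ns { unsigned long long size; int enabled; };  /* hypothetical model */

/* Mirrors the idea of the patch: refuse to disable an idle
 * (zero-sized) namespace and report -ENXIO instead. */
static int ns_disable(struct ns *n)
{
	if (n->size == 0)
		return -ENXIO;
	n->enabled = 0;
	return 0;
}

/* Count only namespaces that were actually disabled. */
static int disable_all(struct ns *v, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (ns_disable(&v[i]) == 0)
			count++;
	return count;
}
```

With three 492 MiB namespaces and one zero-sized seed, disable_all() reports 3 rather than 4, matching the "After" output above.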
1 year, 3 months
[PATCH V3 00/10] PKS: Add Protection Keys Supervisor (PKS) support V3
by ira.weiny@intel.com
From: Ira Weiny <ira.weiny(a)intel.com>
Changes from V2 [4]
Rebased on tip-tree/core/entry
From Thomas Gleixner
Address bisectability
Drop Patch:
x86/entry: Move nmi entry/exit into common code
From Greg KH
Remove WARN_ON's
From Dan Williams
Add __must_check to pks_key_alloc()
New patch: x86/pks: Add PKS defines and config options
Split from Enable patch to build on through the series
Fix compile errors
Changes from V1
Rebase to TIP master; resolve conflicts and test
Clean up some kernel docs updates missed in V1
Add irqentry_state_t kernel doc for PKRS field
Removed redundant irq_state->pkrs
This is only needed when we add the global state and somehow
ended up in this patch series. That will come back when we add
the global functionality in.
From Thomas Gleixner
Update commit messages
Add kernel doc for struct irqentry_state_t
From Dave Hansen add flags to pks_key_alloc()
Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed
Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).
2 use cases for PKS are being developed, trusted keys and PMEM. Trusted keys
is a newer use case which is still being explored. PMEM was submitted as part
of the RFC (v2) series[1]. However, since then it was found that some callers
of kmap() require a global implementation of PKS. Specifically some users of
kmap() expect mappings to be available to all kernel threads. While global use
of PKS is rare it needs to be included for correctness. Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites so that patch set has been split out. Because the
global PKS feature is only required for that use case it will be deferred to
that set as well.[2] This patch set is being submitted as a precursor to both
of the use cases.
For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/
https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/
PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections. PKS works in
a similar fashion to user space pkeys, PKU. As with PKU, supervisor pkeys are
checked in addition to normal paging protections and Access or Writes can be
disabled via an MSR update without TLB flushes when permissions change. Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.
Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
XSAVE is not supported for the PKRS MSR. Therefore the implementation
saves/restores the MSR across context switches and during exceptions. Nested
exceptions are supported by each exception getting a new PKS state.
For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.
Other keys (1-15) are allocated by an allocator which prepares us for key
contention from day one. Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.
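
What callers should expect from such an allocator can be sketched as below. This is a hypothetical userspace model, not the kernel's pks_key_alloc(): struct key_pool and the key_* helpers are invented names, and returning -ENOSPC on exhaustion is an assumption of this sketch (only -EOPNOTSUPP for the unsupported case is stated in the changelog above).

```c
#include <errno.h>

#define NR_KEYS 16                /* keys 0..15; key 0 is reserved */

struct key_pool {                 /* hypothetical model, not the kernel's allocator */
	unsigned int used;        /* bit i set means key i is taken */
	int supported;            /* 0 models a CPU/arch without PKS */
};

static void key_pool_init(struct key_pool *p, int supported)
{
	p->used = 1u;             /* reserve key 0 up front */
	p->supported = supported;
}

/* Returns a key in 1..15, -EOPNOTSUPP without PKS, -ENOSPC when exhausted. */
static int key_alloc(struct key_pool *p)
{
	if (!p->supported)
		return -EOPNOTSUPP;
	for (int k = 1; k < NR_KEYS; k++) {
		if (!(p->used & (1u << k))) {
			p->used |= 1u << k;
			return k;
		}
	}
	return -ENOSPC;
}

/* Return a key to the pool; key 0 and out-of-range values are ignored. */
static void key_free(struct key_pool *p, int k)
{
	if (k > 0 && k < NR_KEYS)
		p->used &= ~(1u << k);
}
```

A caller has to handle both failure modes, since only 15 keys exist and some systems offer none at all.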
The following are key attributes of PKS.
1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
2) Works on a per thread basis
PKS is available with 4 and 5 level paging. Like PKRU it consumes 4 bits from
the PTE to store the pkey within the entry.
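
The bit layout can be sketched as follows. The positions reflect my understanding of the x86-64 architecture (protection key in PTE bits 62:59, two permission bits per key in a 32-bit rights register: access-disable, then write-disable); the macro and function names are hypothetical and are not taken from this series.

```c
#include <stdint.h>

#define PKEY_SHIFT 59                       /* assumed: pkey in PTE bits 62:59 */
#define PKEY_MASK  (UINT64_C(0xf) << PKEY_SHIFT)
#define PKR_AD 0x1u                         /* access disable */
#define PKR_WD 0x2u                         /* write disable */

/* Store a 4-bit key in a page table entry value. */
static uint64_t pte_set_pkey(uint64_t pte, unsigned int key)
{
	return (pte & ~PKEY_MASK) | ((uint64_t)(key & 0xf) << PKEY_SHIFT);
}

/* Read the 4-bit key back out of a page table entry value. */
static unsigned int pte_get_pkey(uint64_t pte)
{
	return (unsigned int)((pte & PKEY_MASK) >> PKEY_SHIFT);
}

/* Replace the two permission bits of `key` in a 32-bit rights value. */
static uint32_t pkr_set(uint32_t pkr, unsigned int key, uint32_t bits)
{
	unsigned int shift = (key & 0xf) * 2;

	return (pkr & ~(UINT32_C(3) << shift)) | ((bits & 3u) << shift);
}
```

Changing permissions for a whole domain is then a single register update via pkr_set(), with no page table walk, which is where the "no TLB flushes required" property comes from.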
[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.weiny@intel.com/
[3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.weiny@intel.com/
[4] https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
Fenghua Yu (2):
x86/pks: Add PKS kernel API
x86/pks: Enable Protection Keys Supervisor (PKS)
Ira Weiny (8):
x86/pkeys: Create pkeys_common.h
x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
x86/pks: Add PKS defines and Kconfig options
x86/pks: Preserve the PKRS MSR on context switch
x86/entry: Pass irqentry_state_t by reference
x86/entry: Preserve PKRS MSR across exceptions
x86/fault: Report the PKRS state on fault
x86/pks: Add PKS test code
Documentation/core-api/protection-keys.rst | 103 ++-
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 46 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 25 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable.h | 13 +-
arch/x86/include/asm/pgtable_types.h | 12 +
arch/x86/include/asm/pkeys.h | 15 +
arch/x86/include/asm/pkeys_common.h | 40 ++
arch/x86/include/asm/processor.h | 18 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/cpu/common.c | 15 +
arch/x86/kernel/cpu/mce/core.c | 4 +-
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/kvm.c | 6 +-
arch/x86/kernel/nmi.c | 4 +-
arch/x86/kernel/process.c | 26 +
arch/x86/kernel/traps.c | 21 +-
arch/x86/mm/fault.c | 87 ++-
arch/x86/mm/pkeys.c | 196 +++++-
include/linux/entry-common.h | 31 +-
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 24 +
kernel/entry/common.c | 44 +-
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 692 ++++++++++++++++++++
mm/Kconfig | 2 +
tools/testing/selftests/x86/Makefile | 3 +-
tools/testing/selftests/x86/test_pks.c | 66 ++
33 files changed, 1410 insertions(+), 140 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c
--
2.28.0.rc0.12.gb6a658bd00c9
1 year, 4 months
[PATCH 1/1] device-dax: avoid an unnecessary check in alloc_dev_dax_range()
by Zhen Lei
Swap the calling order of krealloc() and __request_region() so that the
latter is called first. This way, the value of dev_dax->nr_range does not
need to be considered when __request_region() fails.
Signed-off-by: Zhen Lei <thunder.leizhen(a)huawei.com>
---
drivers/dax/bus.c | 29 ++++++++++++-----------------
1 file changed, 12 insertions(+), 17 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 27513d311242..1efae11d947a 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -763,23 +763,15 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
return 0;
}
- ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
- * (dev_dax->nr_range + 1), GFP_KERNEL);
- if (!ranges)
- return -ENOMEM;
-
alloc = __request_region(res, start, size, dev_name(dev), 0);
- if (!alloc) {
- /*
- * If this was an empty set of ranges nothing else
- * will release @ranges, so do it now.
- */
- if (!dev_dax->nr_range) {
- kfree(ranges);
- ranges = NULL;
- }
- dev_dax->ranges = ranges;
+ if (!alloc)
return -ENOMEM;
+
+ ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
+ * (dev_dax->nr_range + 1), GFP_KERNEL);
+ if (!ranges) {
+ rc = -ENOMEM;
+ goto err;
}
for (i = 0; i < dev_dax->nr_range; i++)
@@ -808,11 +800,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
&alloc->start, &alloc->end);
dev_dax->nr_range--;
- __release_region(res, alloc->start, resource_size(alloc));
- return rc;
+ goto err;
}
return 0;
+
+err:
+ __release_region(res, alloc->start, resource_size(alloc));
+ return rc;
}
static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
--
2.26.0.106.g9fadedd
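
The reordering pattern, reserve the resource first and grow the array second, with one unwind label, can be sketched in userspace C. The names here (region_request(), region_release(), add_range()) are hypothetical stand-ins for illustration, not the kernel functions.

```c
#include <errno.h>
#include <stdlib.h>

static int regions_held;      /* hypothetical stand-in for a reserved resource */

static int region_request(int ok)
{
	if (!ok)
		return -1;
	regions_held++;
	return 0;
}

static void region_release(void)
{
	regions_held--;
}

/* Append one value to *arr (length *n). Reserve the region first so that a
 * failed reservation leaves the array completely untouched. */
static int add_range(int **arr, size_t *n, int value, int region_ok)
{
	int *tmp, rc;

	if (region_request(region_ok))
		return -ENOMEM;            /* nothing to undo yet */

	tmp = realloc(*arr, sizeof(*tmp) * (*n + 1));
	if (!tmp) {
		rc = -ENOMEM;
		goto err;
	}
	tmp[*n] = value;
	*arr = tmp;
	(*n)++;
	return 0;
err:
	region_release();                  /* single unwind point */
	return rc;
}
```

Because the array is only touched after the reservation succeeds, the early failure path no longer needs to know whether the array was previously empty.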
1 year, 4 months
[PATCH 1/1] device-dax: delete a redundancy check in dev_dax_validate_align()
by Zhen Lei
After we have done the alignment check for the length of each range, the
alignment check for dev_dax_size(dev_dax) is no longer needed, because it
is the sum of the lengths of all ranges.
Signed-off-by: Zhen Lei <thunder.leizhen(a)huawei.com>
---
drivers/dax/bus.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1efae11d947a..35f9238c0139 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1109,16 +1109,9 @@ static ssize_t align_show(struct device *dev,
static ssize_t dev_dax_validate_align(struct dev_dax *dev_dax)
{
- resource_size_t dev_size = dev_dax_size(dev_dax);
struct device *dev = &dev_dax->dev;
int i;
- if (dev_size > 0 && !alloc_is_aligned(dev_dax, dev_size)) {
- dev_dbg(dev, "%s: align %u invalid for size %pa\n",
- __func__, dev_dax->align, &dev_size);
- return -EINVAL;
- }
-
for (i = 0; i < dev_dax->nr_range; i++) {
size_t len = range_len(&dev_dax->ranges[i].range);
--
2.26.0.106.g9fadedd
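
The reasoning is that if every range length is k_i * align, the total is (k_1 + ... + k_n) * align and is therefore aligned too. A small sketch with hypothetical helper names (all_aligned(), total_len()), assuming a non-zero alignment and no overflow of the sum:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every length is a multiple of align (align must be non-zero). */
static int all_aligned(const uint64_t *len, size_t n, uint64_t align)
{
	for (size_t i = 0; i < n; i++)
		if (len[i] % align)
			return 0;
	return 1;
}

/* Sum of all range lengths (overflow is not handled in this sketch). */
static uint64_t total_len(const uint64_t *len, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += len[i];
	return sum;
}
```

The converse does not hold: lengths of 1 MiB and 3 MiB sum to an aligned 4 MiB for a 2 MiB alignment although neither range is aligned, so the per-range check is the stronger of the two.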
1 year, 4 months
[PATCH ndctl v1 0/8] daxctl: Add device align and range mapping allocation
by Joao Martins
Hey,
This series builds on top of this one[0] and makes the following improvements
to the Soft-Reserved subdivision:
1) Support for {create,reconfigure}-device for selecting @align (hugepage size).
Here we add a '-a|--align 4K|2M|1G' option to the existing commands;
2) Listing improvements for device alignment and mappings;
Note: Perhaps it is better to hide the mappings by default, and only
print with -v|--verbose. This would align with ndctl, as the mappings
info can be quite large.
3) Allow creating devices from selected ranges. This allows keeping the
same GPA->HPA mapping as before we kexec the hypervisor with running guests:
daxctl list -d dax0.1 > /var/log/dax0.1.json
kexec -d -l bzImage
systemctl kexec
daxctl create -u --restore /var/log/dax0.1.json
JSON was what I thought would be easiest for a user, given that it is
the data format daxctl outputs. Alternatives could be adding multiple:
--mapping <pgoff>:<start>-<end>
But that could end up in a gigantic line and a little more
unmanageable I think.
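
For item 1, turning an '--align' argument such as 4K, 2M or 1G into bytes could look like the sketch below. parse_align() is a hypothetical helper written for illustration; it is not the parser daxctl uses, and it does not guard against overflow.

```c
#include <stdlib.h>

/* Parse "4K", "2M" or "1G" into bytes; returns 0 on malformed input. */
static unsigned long long parse_align(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 10);

	if (end == s || v == 0)
		return 0;              /* no digits, or a zero alignment */
	switch (*end) {
	case 'K': case 'k': v <<= 10; break;
	case 'M': case 'm': v <<= 20; break;
	case 'G': case 'g': v <<= 30; break;
	default: return 0;             /* a unit suffix is required here */
	}
	return end[1] == '\0' ? v : 0; /* reject trailing characters */
}
```

Returning 0 for anything malformed keeps the caller's check simple, since 0 is never a valid alignment.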
This series requires this series[0] on top of Dan's patches[1]:
[0] https://lore.kernel.org/linux-nvdimm/20200716172913.19658-1-joao.m.martin...
[1] https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
The only TODO here is docs and improving tests to validate mappings, and test
the restore path.
Suggestions/comments are welcome.
Joao
Joao Martins (8):
daxctl: add daxctl_dev_{get,set}_align()
util/json: Print device align
daxctl: add align support in reconfigure-device
daxctl: add align support in create-device
libdaxctl: add mapping iterator APIs
daxctl: include mappings when listing
libdaxctl: add daxctl_dev_set_mapping()
daxctl: Allow restore devices from JSON metadata
daxctl/device.c | 154 +++++++++++++++++++++++++++++++++++++++--
daxctl/lib/libdaxctl-private.h | 9 +++
daxctl/lib/libdaxctl.c | 152 +++++++++++++++++++++++++++++++++++++++-
daxctl/lib/libdaxctl.sym | 9 +++
daxctl/libdaxctl.h | 16 +++++
util/json.c | 63 ++++++++++++++++-
util/json.h | 3 +
7 files changed, 396 insertions(+), 10 deletions(-)
--
1.8.3.1
1 year, 4 months
[ndctl PATCH V2 0/8] fix several issues reported by Coverity
by Zhiqiang Liu
Changes: V1->V2
- add one empty line in 1/8 patch as suggested by Jeff Moyer <jmoyer(a)redhat.com>.
Recently, we used Coverity to analyze the ndctl package.
Several issues should be resolved to make Coverity happy.
Zhiqiang Liu (8):
namespace: check whether pfn|dax|btt is NULL in setup_namespace
lib/libndctl: fix memory leakage problem in add_bus
libdaxctl: fix memory leakage in add_dax_region()
dimm: fix potential fd leakage in dimm_action()
util/help: check whether strdup returns NULL in exec_man_konqueror
lib/inject: check whether cmd is created successfully
libndctl: check whether ndctl_btt_get_namespace returns NULL in
callers
namespace: check whether seed is NULL in validate_namespace_options
daxctl/lib/libdaxctl.c | 3 +++
ndctl/dimm.c | 12 +++++++-----
ndctl/lib/inject.c | 8 ++++++++
ndctl/lib/libndctl.c | 1 +
ndctl/namespace.c | 23 ++++++++++++++++++-----
test/libndctl.c | 16 +++++++++++-----
test/parent-uuid.c | 2 +-
util/help.c | 8 +++++++-
util/json.c | 3 +++
9 files changed, 59 insertions(+), 17 deletions(-)
--
1.8.3.1
1 year, 5 months