Re: Detecting NUMA per pmem
by Oren Berman
Hi Ross
Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.
We have tried to dump the SPA table and this is what we get:
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20160108-64
* Copyright (c) 2000 - 2016 Intel Corporation
*
* Disassembly of NFIT, Sun Oct 22 10:46:19 2017
*
* ACPI Data Table [NFIT]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/
[000h 0000 4] Signature : "NFIT" [NVDIMM Firmware
Interface Table]
[004h 0004 4] Table Length : 00000028
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : B2
[00Ah 0010 6] Oem ID : "SUPERM"
[010h 0016 8] Oem Table ID : "SMCI--MB"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : " "
[020h 0032 4] Asl Compiler Revision : 00000001
[024h 0036 4] Reserved : 00000000
Raw Table Data: Length 40 (0x28)
0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D // NFIT(.....SUPERM
0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00 // SMCI--MB........
0020: 01 00 00 00 00 00 00 00
As you can see the memory region info is missing.
This specific check was done on a supermicro server.
We also performed a bios update but the results were the same.
As said before, the pmem devices are detected correctly and we verified
that they correspond to different NUMA nodes using the PCM utility. However,
Linux still reports both pmem devices to be on the same NUMA node - node 0.
If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?
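For reference, this is roughly how we read the NUMA placement back from
sysfs (a sketch; the region/namespace numbering is just from our setup and
may differ on other machines):
# cat /sys/bus/nd/devices/region0/numa_node
# cat /sys/bus/nd/devices/region1/numa_node
# cat /sys/block/pmem0/device/numa_node
All of them report 0 on this machine.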
I also ran dmidecode and the NVDIMMs are listed (we tested with
Netlist NVDIMMs). I can also see the bank locator showing P0 and P1, which I
think indicates the NUMA node. Here is an example:
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Did you encounter such a case? We would appreciate any insight you might
have.
BR
Oren Berman
On 20 October 2017 at 19:22, Ross Zwisler <ross.zwisler(a)linux.intel.com>
wrote:
> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> > Hi Ross
> > My name is Oren Berman and I am a senior developer at lightbitslabs.
> > We are working with NVDIMMs but we encountered a problem that the
> kernel
> > does not seem to detect the numa id per PMEM device.
> > It always reports numa 0 although we have NVDIMM devices on both
> nodes.
> > We checked that it always returns 0 from sysfs and also from
> retrieving
> > the device of pmem in the kernel and calling dev_to_node.
> > The result is always 0 for both pmem0 and pmem1.
> > In order to make sure that indeed both numa sockets are used we ran
> > Intel's pcm utility. We verified that writing to pmem0 increases
> socket 0
> > utilization and writing to pmem1 increases socket 1 utilization so
> the hw
> > works properly.
> > Only the detection seems to be invalid.
> > Did you encounter such a problem?
> > We are using kernel version 4.9 - are you aware of any fix for this
> issue
> > or workaround that we can use.
> > Are we missing something?
> > Thanks for any help you can give us.
> > BR
> > Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity
> domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
> # cd /tmp
> # cp /sys/firmware/acpi/tables/NFIT .
> # iasl NFIT
>
> Intel ACPI Component Architecture
> ASL+ Optimizing Compiler version 20160831-64
> Copyright (c) 2000 - 2016 Intel Corporation
>
> Binary file appears to be a valid ACPI table, disassembling
> Input file NFIT, Length 0xE0 (224) bytes
> ACPI: NFIT 0x0000000000000000 0000E0 (v01 BOCHS BXPCNFIT 00000001 BXPC
> 00000001)
> Acpi Data Table [NFIT] decoded
> Formatted output: NFIT.dsl - 5191 bytes
>
> This will give you an NFIT.dsl file which you can look at. Here is what my
> SPA table looks like for an emulated QEMU NVDIMM:
>
> [028h 0040 2] Subtable Type : 0000 [System Physical
> Address Range]
> [02Ah 0042 2] Length : 0038
>
> [02Ch 0044 2] Range Index : 0002
> [02Eh 0046 2] Flags (decoded below) : 0003
> Add/Online Operation Only : 1
> Proximity Domain Valid : 1
> [030h 0048 4] Reserved : 00000000
> [034h 0052 4] Proximity Domain : 00000000
> [038h 0056 16] Address Range GUID :
> 66F0D379-B4F3-4074-AC43-0D3318B78CDB
> [048h 0072 8] Address Range Base : 0000000240000000
> [050h 0080 8] Address Range Length : 0000000440000000
> [058h 0088 8] Memory Map Attribute : 0000000000008008
>
> So, the "Proximity Domain" field is 0, and this lets the system know which
> NUMA node to associate with this memory region.
>
> BTW, in the future it's best to CC our public list,
> linux-nvdimm(a)lists.01.org,
> as a) someone else might have the same question and b) someone else might
> know
> the answer.
>
> Thanks,
> - Ross
>
[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
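For illustration, once patch 2 is applied the ID becomes readable from the
per-DIMM nfit sysfs group - a sketch, the path mirrors the existing
nmemX/nfit attributes and the value shown is made up:
# cat /sys/bus/nd/devices/nmem0/nfit/id
8089-a2-1540-00000123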
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
Re: [PATCH] nvdimm: Remove minimum size requirement
by Soccer Liu
Hi:
As part of setting up the environment for running unit tests, I was able to work through the instructions in https://github.com/pmem/ndctl/tree/0a628fdf4fe58a283b16c1bbaa49bb28b1842b... all the way until I hit the following build error (Segmentation fault) when building libnvdimm.o.
Anyone hit this before?
root@ubuntu:/home/soccerl/nvdimm# make M=tools/testing/nvdimm
  AR      tools/testing/nvdimm/built-in.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/core.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/bus.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/dimm_devs.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/dimm.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/region_devs.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/region.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/namespace_devs.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/label.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/claim.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/btt_devs.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/pfn_devs.o
  CC [M]  tools/testing/nvdimm/../../../drivers/nvdimm/dax_devs.o
  CC [M]  tools/testing/nvdimm/config_check.o
  LD [M]  tools/testing/nvdimm/libnvdimm.o
Segmentation fault
scripts/Makefile.build:548: recipe for target 'tools/testing/nvdimm/libnvdimm.o' failed
make[1]: *** [tools/testing/nvdimm/libnvdimm.o] Error 139
Makefile:1511: recipe for target '_module_tools/testing/nvdimm' failed
make: *** [_module_tools/testing/nvdimm] Error 2
My devbox has 4.13 Linux in it. I am not sure whether it has anything to do with the fact that I didn't do anything with ndctl/ndctl.spec.in (because I am not sure how to apply those dependencies to my test box).
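For reference, my understanding of the nfit_test setup from the README is roughly
(a sketch; the module is built out of tools/testing/nvdimm against the running
kernel and loaded before the tests):
# make M=tools/testing/nvdimm
# make M=tools/testing/nvdimm modules_install
# modprobe nfit_test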
Any idea?
Thanks,
Cheng-mean
On Thursday, August 31, 2017 3:31 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
On Mon, Aug 7, 2017 at 11:13 AM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> On Mon, Aug 7, 2017 at 11:09 AM, Cheng-mean Liu (SOCCER)
> <soccerl(a)microsoft.com> wrote:
>> Hi Dan:
>>
>> I am wondering if failing on those unittests is still an issue for this minimum size requirement change.
>
> Yes, I just haven't had a chance to circle back and get this fixed up.
>
> You can reproduce by running:
>
> make TESTS=dpa-alloc check
>
> ...in a checkout of the ndctl project: https://github.com/pmem/ndctl
>
> If you attempt that, note the required setup of the nfit_test modules
> documented in README.md in that same repository.
I have not had any time to fix up the unit test for this. Soccer, can
you take a look?
[PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions
by Dan Williams
This is hopefully the uncontroversial lead-in set of changes that lay
the groundwork for solving the dax-dma vs truncate problem. The overview
of the changes is:
1/ Disable DAX when we do not have struct page entries backing dax
mappings, or otherwise allow limited DAX support for axonram and
dcssblk. Is anyone actually using the DAX capability of axonram or
dcssblk?
2/ Disable code paths that establish potentially long lived DMA
access to a filesystem-dax memory mapping, i.e. RDMA and V4L2. In the
4.16 timeframe the plan is to introduce a "register memory for DMA
with a lease" mechanism for userspace to establish mappings but also
be responsible for tearing down the mapping when the kernel needs to
invalidate the mapping due to truncate or hole-punch.
3/ Add a wakeup mechanism for awaiting DAX pages to be released
from DMA access.
This overall effort started when Christoph noted during the review of
the MAP_DIRECT proposal:
get_user_pages on DAX doesn't give the same guarantees as on
pagecache or anonymous memory, and that is the problem we need to
fix. In fact I'm pretty sure if we try hard enough (and we might
have to try very hard) we can see the same problem with plain direct
I/O and without any RDMA involved, e.g. do a larger direct I/O write
to memory that is mmap()ed from a DAX file, then truncate the DAX
file and reallocate the blocks, and we might corrupt that new file.
We'll probably need a special setup where there is little other
chance but to reallocate those used blocks.
So what we need to do first is to fix get_user_pages vs unmapping
DAX mmap()ed blocks, be that from a hole punch, truncate, COW
operation, etc.
Included in the changes is an nfit_test mechanism to trivially trigger
this collision by delaying the put_page() that the block layer performs
after performing direct-I/O to a filesystem-DAX page.
Given the ongoing coordination of this set across multiple sub-systems
and the dax core my proposal is to manage this as a branch in the nvdimm
tree with acks from mm, rdma, v4l2, ext4, and xfs.
---
Dan Williams (15):
dax: quiet bdev_dax_supported()
mm, dax: introduce pfn_t_special()
dax: require 'struct page' by default for filesystem dax
brd: remove dax support
dax: stop using VM_MIXEDMAP for dax
dax: stop using VM_HUGEPAGE for dax
dax: stop requiring a live device for dax_flush()
dax: store pfns in the radix
tools/testing/nvdimm: add 'bio_delay' mechanism
IB/core: disable memory registration of fileystem-dax vmas
[media] v4l2: disable filesystem-dax mapping support
mm, dax: enable filesystems to trigger page-idle callbacks
mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES
dax: associate mappings with inodes, and warn if dma collides with truncate
wait_bit: introduce {wait_on,wake_up}_devmap_idle
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 3 -
drivers/block/Kconfig | 12 ---
drivers/block/brd.c | 65 --------------
drivers/dax/device.c | 1
drivers/dax/super.c | 113 +++++++++++++++++++++----
drivers/infiniband/core/umem.c | 49 ++++++++---
drivers/media/v4l2-core/videobuf-dma-sg.c | 39 ++++++++-
drivers/nvdimm/pmem.c | 13 +++
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 4 +
fs/Kconfig | 8 ++
fs/dax.c | 131 +++++++++++++++++++----------
fs/ext2/file.c | 1
fs/ext2/super.c | 6 +
fs/ext4/file.c | 1
fs/ext4/super.c | 6 +
fs/xfs/xfs_file.c | 2
fs/xfs/xfs_super.c | 20 ++--
include/linux/dax.h | 17 ++--
include/linux/memremap.h | 24 +++++
include/linux/mm.h | 47 ++++++----
include/linux/mm_types.h | 20 +++-
include/linux/pfn_t.h | 13 +++
include/linux/vma.h | 33 +++++++
include/linux/wait_bit.h | 10 ++
kernel/memremap.c | 36 ++++++--
kernel/sched/wait_bit.c | 64 ++++++++++++--
mm/Kconfig | 5 +
mm/hmm.c | 13 ---
mm/huge_memory.c | 8 +-
mm/ksm.c | 3 +
mm/madvise.c | 2
mm/memory.c | 22 ++++-
mm/migrate.c | 3 -
mm/mlock.c | 5 +
mm/mmap.c | 8 +-
mm/swap.c | 3 -
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/iomap.c | 62 ++++++++++++++
tools/testing/nvdimm/test/nfit.c | 34 ++++++++
tools/testing/nvdimm/test/nfit_test.h | 1
42 files changed, 650 insertions(+), 260 deletions(-)
create mode 100644 include/linux/vma.h
[PATCH 1/2] dm log writes: Add support for inline data buffers
by Ross Zwisler
Currently dm-log-writes supports writing filesystem data via BIOs, and
writing internal metadata from a flat buffer via write_metadata().
For DAX writes, though, we won't have a BIO, but will instead have an
iterator that we'll want to use to fill a flat data buffer.
So, create write_inline_data() which allows us to write filesystem data
using a flat buffer as a source, and wire it up in log_one_block().
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
drivers/md/dm-log-writes.c | 90 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 8b80a9c..c65f9d1 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -246,27 +246,109 @@ static int write_metadata(struct log_writes_c *lc, void *entry,
return -1;
}
+static int write_inline_data(struct log_writes_c *lc, void *entry,
+ size_t entrylen, void *data, size_t datalen,
+ sector_t sector)
+{
+ int num_pages, bio_pages, pg_datalen, pg_sectorlen, i;
+ struct page *page;
+ struct bio *bio;
+ size_t ret;
+ void *ptr;
+
+ while (datalen) {
+ num_pages = ALIGN(datalen, PAGE_SIZE) >> PAGE_SHIFT;
+ bio_pages = min(num_pages, BIO_MAX_PAGES);
+
+ atomic_inc(&lc->io_blocks);
+
+ bio = bio_alloc(GFP_KERNEL, bio_pages);
+ if (!bio) {
+ DMERR("Couldn't alloc inline data bio");
+ goto error;
+ }
+
+ bio->bi_iter.bi_size = 0;
+ bio->bi_iter.bi_sector = sector;
+ bio_set_dev(bio, lc->logdev->bdev);
+ bio->bi_end_io = log_end_io;
+ bio->bi_private = lc;
+ bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+
+ for (i = 0; i < bio_pages; i++) {
+ pg_datalen = min(datalen, PAGE_SIZE);
+ pg_sectorlen = ALIGN(pg_datalen, lc->sectorsize);
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page) {
+ DMERR("Couldn't alloc inline data page");
+ goto error_bio;
+ }
+
+ ptr = kmap_atomic(page);
+ memcpy(ptr, data, pg_datalen);
+ if (pg_sectorlen > pg_datalen)
+ memset(ptr + pg_datalen, 0,
+ pg_sectorlen - pg_datalen);
+ kunmap_atomic(ptr);
+
+ ret = bio_add_page(bio, page, pg_sectorlen, 0);
+ if (ret != pg_sectorlen) {
+ DMERR("Couldn't add page of inline data");
+ __free_page(page);
+ goto error_bio;
+ }
+
+ datalen -= pg_datalen;
+ data += pg_datalen;
+ }
+ submit_bio(bio);
+
+ sector += bio_pages * PAGE_SECTORS;
+ }
+ return 0;
+error_bio:
+ bio_free_pages(bio);
+ bio_put(bio);
+error:
+ put_io_block(lc);
+ return -1;
+}
+
static int log_one_block(struct log_writes_c *lc,
struct pending_block *block, sector_t sector)
{
struct bio *bio;
struct log_write_entry entry;
- size_t ret;
+ size_t metadlen, ret;
int i;
entry.sector = cpu_to_le64(block->sector);
entry.nr_sectors = cpu_to_le64(block->nr_sectors);
entry.flags = cpu_to_le64(block->flags);
entry.data_len = cpu_to_le64(block->datalen);
- if (write_metadata(lc, &entry, sizeof(entry), block->data,
- block->datalen, sector)) {
+
+ metadlen = (block->flags & LOG_MARK_FLAG) ? block->datalen : 0;
+ if (write_metadata(lc, &entry, sizeof(entry), block->data, metadlen,
+ sector)) {
free_pending_block(lc, block);
return -1;
}
+ sector += dev_to_bio_sectors(lc, 1);
+
+ if (block->datalen && metadlen == 0) {
+ if (write_inline_data(lc, &entry, sizeof(entry), block->data,
+ block->datalen, sector)) {
+ free_pending_block(lc, block);
+ return -1;
+ }
+ /* we don't support both inline data & bio data */
+ goto out;
+ }
+
if (!block->vec_cnt)
goto out;
- sector += dev_to_bio_sectors(lc, 1);
atomic_inc(&lc->io_blocks);
bio = bio_alloc(GFP_KERNEL, min(block->vec_cnt, BIO_MAX_PAGES));
--
2.9.5
[PATCH v6 0/8] libnvdimm: add DMA supported blk-mq pmem driver
by Dave Jiang
v6:
- Put all common code for pmem drivers in pmem_core per Dan's suggestion.
- Added support code to get number of available DMA chans
- Fixed up Kconfig so that when pmem is built into the kernel, pmem_dma won't
show up.
v5:
- Added support to report descriptor transfer capability limit from dmaengine.
- Fixed up scatterlist support for dma_unmap_data per Dan's comments.
- Made the driver a separate pmem blk driver per Christoph's suggestion
and also fixed up all the issues pointed out by Christoph.
- Added pmem badblock checking/handling per Robert and also made DMA op to
be used by all buffer sizes.
v4:
- Addressed kbuild test bot issues. Passed kbuild test bot, 179 configs.
v3:
- Added patch to rename DMA_SG to DMA_SG_SG to make it explicit
- Added DMA_MEMCPY_SG transaction type to dmaengine
- Misc patch to add verification of DMA_MEMSET_SG that was missing
- Addressed all nd_pmem driver comments from Ross.
v2:
- Make dma_prep_memcpy_* into one function per Dan.
- Addressed various comments from Ross with code formatting and etc.
- Replaced open code with offset_in_page() macro per Johannes.
The following series implements a blk-mq pmem driver and also adds
infrastructure code to ioatdma and dmaengine to support copying to and from
scatterlists when processing block requests provided by blk-mq. The use of
DMA engines available on certain platforms allows us to drastically reduce
CPU utilization while maintaining performance that is good enough.
Experiments on a DRAM-backed pmem block device showed that using the DMA
engine is beneficial. By default nd_pmem.ko will be loaded. This can be
overridden through module blacklisting in order to load nd_pmem_dma.ko.
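For example, the override can look roughly like this (a sketch; the conf file
name is arbitrary, and the nd:t4*/nd:t5* aliases cover the pmem namespace
device types):
# echo "blacklist nd_pmem" > /etc/modprobe.d/pmem-dma.conf
# echo "alias nd:t4* nd_pmem_dma" >> /etc/modprobe.d/pmem-dma.conf
# echo "alias nd:t5* nd_pmem_dma" >> /etc/modprobe.d/pmem-dma.conf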
---
Dave Jiang (8):
dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels
dmaengine: Add DMA_MEMCPY_SG transaction op
dmaengine: add verification of DMA_MEMSET_SG in dmaengine
dmaengine: ioatdma: dma_prep_memcpy_sg support
dmaengine: add function to provide per descriptor xfercap for dma engine
dmaengine: add SG support to dmaengine_unmap
dmaengine: provide number of available channels
libnvdimm: Add blk-mq pmem driver
Documentation/dmaengine/provider.txt | 3
drivers/dma/dmaengine.c | 76 ++++
drivers/dma/ioat/dma.h | 4
drivers/dma/ioat/init.c | 6
drivers/dma/ioat/prep.c | 57 +++
drivers/nvdimm/Kconfig | 21 +
drivers/nvdimm/Makefile | 6
drivers/nvdimm/pmem.c | 264 ---------------
drivers/nvdimm/pmem.h | 48 +++
drivers/nvdimm/pmem_core.c | 298 +++++++++++++++++
drivers/nvdimm/pmem_dma.c | 606 ++++++++++++++++++++++++++++++++++
include/linux/dmaengine.h | 49 +++
12 files changed, 1170 insertions(+), 268 deletions(-)
create mode 100644 drivers/nvdimm/pmem_core.c
create mode 100644 drivers/nvdimm/pmem_dma.c
--
Signature
[PATCH] acpi/nfit: export read_only attribute of dimms
by Lijun Pan
Though the flags attribute provides enough information about
the DIMM, it is nice to export a read_only attribute indicating
whether bit 3 of the NVDIMM state flags is set.
If an error is injected by the BIOS, bit 3 and bit 1 are both set.
If the DIMM is set to read-only by the BIOS, bit 3 is set.
Hence bit 3 is good enough to tell whether the DIMM is in
read-only mode or not.
Signed-off-by: Lijun Pan <Lijun.Pan(a)dell.com>
---
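Usage note (not part of the patch): with this applied, the new file should
show up next to the existing per-DIMM nfit attributes, e.g. (a sketch; path
and value are illustrative):
# cat /sys/bus/nd/devices/nmem0/nfit/read_only
0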
drivers/acpi/nfit/core.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ebe0857ac346..f96e65aa29dd 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1480,6 +1480,16 @@ static ssize_t flags_show(struct device *dev,
}
static DEVICE_ATTR_RO(flags);
+static ssize_t read_only_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u16 flags = to_nfit_memdev(dev)->flags;
+
+ return sprintf(buf, "%d\n",
+ flags & ACPI_NFIT_MEM_NOT_ARMED ? 1 : 0);
+}
+static DEVICE_ATTR_RO(read_only);
+
static ssize_t id_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1512,6 +1522,7 @@ static struct attribute *acpi_nfit_dimm_attributes[] = {
&dev_attr_format1.attr,
&dev_attr_serial.attr,
&dev_attr_flags.attr,
+ &dev_attr_read_only.attr,
&dev_attr_id.attr,
&dev_attr_family.attr,
&dev_attr_dsm_mask.attr,
--
2.13.6
Re: KVM "fake DAX" flushing interface - discussion
by Dan Williams
On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel <riel(a)redhat.com> wrote:
> On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
>> >
>> Just want to summarize here (high level):
>>
>> This will require implementing new 'virtio-pmem' device which
>> presents
>> a DAX address range (like pmem) to guest with read/write (direct
>> access)
>> & device flush functionality. Also, qemu should implement
>> corresponding
>> support for flush using virtio.
>>
> Alternatively, the existing pmem code, with
> a flush-only block device on the side, which
> is somehow associated with the pmem device.
>
> I wonder which alternative leads to the least
> code duplication, and the least maintenance
> hassle going forward.
I'd much prefer to have another driver. I.e. a driver that refactors
out some common pmem details into a shared object and can attach to
ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems like
a recipe for confusion.
With a $new_driver in hand you can just do:
modprobe $new_driver
echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind
echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id
echo $namespace > /sys/bus/nd/drivers/$new_driver/bind
...and the guest can arrange for $new_driver to be the default, so you
don't need to do those steps each boot of the VM, by doing:
echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf
echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
[PATCH v2 0/2] acpi, nfit: support for new NVDIMM_FAMILY_INTEL commands
by Dan Williams
Changes since v1 [1]:
* Introduce NVDIMM_STANDARD_CMDMASK and NVDIMM_INTEL_CMDMASK to replace
magic number usage in the driver. (Dave)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/013082.html
---
The latest version of the NVDIMM_FAMILY_INTEL command set adds support
for firmware updates and setting SMART health alarms / thresholds among
other things. Given that these are command payloads that will only ever
be issued by userspace we only wire up the command numbers and revision
ids for use through the ND_CMD_CALL interface.
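For reference, a userspace caller would package such a command roughly as
below - a sketch only, with struct/ioctl names as I recall them from
include/uapi/linux/ndctl.h, and a purely illustrative command number and
payload sizes:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ndctl.h>
int main(void)
{
	size_t in = 4, out = 128;		/* illustrative payload sizes */
	struct nd_cmd_pkg *pkg;
	int fd, rc;
	pkg = calloc(1, sizeof(*pkg) + in + out);
	if (!pkg)
		return 1;
	pkg->nd_family = NVDIMM_FAMILY_INTEL;
	pkg->nd_command = 0x1f;			/* illustrative DSM function number */
	pkg->nd_size_in = in;
	pkg->nd_size_out = out;
	fd = open("/dev/nmem0", O_RDWR);	/* per-DIMM device node */
	if (fd < 0)
		return 1;
	rc = ioctl(fd, ND_IOCTL_CALL, pkg);	/* ND_CMD_CALL passthrough */
	printf("rc %d, fw returned %u bytes\n", rc, pkg->nd_fw_size);
	close(fd);
	free(pkg);
	return rc ? 1 : 0;
}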
---
Dan Williams (2):
acpi, nfit: hide unknown commands from nmemX/commands
acpi, nfit: add support for NVDIMM_FAMILY_INTEL v1.6 DSMs
drivers/acpi/nfit/core.c | 55 ++++++++++++++++++++++++++++++++++++++++------
drivers/acpi/nfit/nfit.h | 32 ++++++++++++++++++++++++++-
2 files changed, 79 insertions(+), 8 deletions(-)
[PATCH 0/17 v5] dax, ext4, xfs: Synchronous page faults
by Jan Kara
Hello,
here is the fifth version of my patches to implement synchronous page faults
for DAX mappings to make flushing of DAX mappings possible from userspace so
that they can be flushed on finer than page granularity and also avoid the
overhead of a syscall.
We use a new mmap flag MAP_SYNC to indicate that page faults for the mapping
should be synchronous. The guarantee provided by this flag is: While a block
is writeably mapped into page tables of this mapping, it is guaranteed to be
visible in the file at that offset also after a crash.
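For clarity, the userspace side of this then looks roughly as below (a sketch;
the mapped path is made up, and the fallback defines mirror the proposed uapi
values in case libc headers do not carry them yet):
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC		0x80000
#endif
int main(void)
{
	int fd = open("/mnt/dax/file", O_RDWR);	/* file on a DAX mount */
	char *p;
	if (fd < 0)
		return 1;
	/* MAP_SYNC is only accepted together with MAP_SHARED_VALIDATE */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	strcpy(p, "durable after a crash");
	/*
	 * The application still flushes CPU caches for its stores (clwb /
	 * clflushopt + sfence, or a libpmem helper); MAP_SYNC only
	 * guarantees that the block mapping metadata is already stable, so
	 * no msync()/fdatasync() call is needed for that.
	 */
	munmap(p, 4096);
	close(fd);
	return 0;
}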
How I implement this is that ->iomap_begin() indicates by a flag that inode
block mapping metadata is unstable and may need flushing (using the same test
as whether fdatasync() has metadata to write). If so, the DAX fault handler
refrains from inserting / write-enabling the page table entry and returns the
special flag VM_FAULT_NEEDDSYNC together with a PFN to map to the filesystem
fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls DAX code to update the page table entry appropriately.
I did some basic performance testing on the patches over ramdisk - timed
latency of page faults when faulting 512 pages. I did several tests: with file
preallocated / with file empty, with background file copying going on / without
it, with / without MAP_SYNC (so that we get comparison). The results are
(numbers are in microseconds):
File preallocated, no background load no MAP_SYNC:
min=9 avg=10 max=46
8 - 15 us: 508
16 - 31 us: 3
32 - 63 us: 1
File preallocated, no background load, MAP_SYNC:
min=9 avg=10 max=47
8 - 15 us: 508
16 - 31 us: 2
32 - 63 us: 2
File empty, no background load, no MAP_SYNC:
min=21 avg=22 max=70
16 - 31 us: 506
32 - 63 us: 5
64 - 127 us: 1
File empty, no background load, MAP_SYNC:
min=40 avg=124 max=242
32 - 63 us: 1
64 - 127 us: 333
128 - 255 us: 178
File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1
File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23
So here we can see that the difference between MAP_SYNC and non-MAP_SYNC is
about 100-200 us when we need to wait for a transaction commit in this setup.
Anyway, here are the patches and since Ross already posted his patches to test
the functionality, I think we are ready to get this merged. I've talked with
Dan and he said he could take the patches through his tree, I'd just like to
get a final ack from Christoph on the patch modifying mmap(2). Comments are
welcome.
Changes since v4:
* fixed couple of minor things in the manpage
* make legacy mmap flags always supported, remove them from mask declared
to be supported by ext4 and xfs
Changes since v3:
* updated some changelogs
* folded fs support for VM_SYNC flag into patches implementing the
functionality
* removed ->mmap_validate, use ->mmap_supported_flags instead
* added some Reviewed-by tags
* added manpage patch
Changes since v2:
* avoid unnecessary flushing of faulted page (Ross) - I've realized it makes no
sense to remeasure my benchmark results (after actually doing that and seeing
no difference, sigh) since I use ramdisk and not real PMEM HW and so flushes
are ignored.
* handle nojournal mode of ext4
* other smaller cleanups & fixes (Ross)
* factor larger part of finishing of synchronous fault into a helper (Christoph)
* reorder pfnp argument of dax_iomap_fault() (Christoph)
* add XFS support from Christoph
* use proper MAP_SYNC support in mmap(2)
* rebased on top of 4.14-rc4
Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers
Honza
3 years, 2 months