Re: Detecting NUMA per pmem
by Oren Berman
Hi Ross
Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.
We have tried to dump the SPA table and this is what we get:
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20160108-64
* Copyright (c) 2000 - 2016 Intel Corporation
*
* Disassembly of NFIT, Sun Oct 22 10:46:19 2017
*
* ACPI Data Table [NFIT]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/
[000h 0000 4] Signature : "NFIT" [NVDIMM Firmware Interface Table]
[004h 0004 4] Table Length : 00000028
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : B2
[00Ah 0010 6] Oem ID : "SUPERM"
[010h 0016 8] Oem Table ID : "SMCI--MB"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : " "
[020h 0032 4] Asl Compiler Revision : 00000001
[024h 0036 4] Reserved : 00000000
Raw Table Data: Length 40 (0x28)
0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D // NFIT(.....SUPERM
0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00 // SMCI--MB........
0020: 01 00 00 00 00 00 00 00
As you can see, the memory region info is missing.
This specific check was done on a Supermicro server.
We also performed a BIOS update but the results were the same.
As said before, the pmem devices are detected correctly and we verified
that they correspond to different NUMA nodes using the PCM utility. However,
Linux still reports both pmem devices to be on the same NUMA node - node 0.
If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?
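For reference, this is my rough understanding of how the nfit driver is
supposed to derive the NUMA node from the SPA table - a simplified sketch from
skimming the 4.9-era sources, so the exact code may differ:

/*
 * Simplified sketch (not the exact kernel code) of how a pmem region's
 * NUMA node is derived from the NFIT SPA sub-table at registration time.
 */
static int nfit_spa_to_node(struct acpi_nfit_system_address *spa)
{
	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
		return acpi_map_pxm_to_node(spa->proximity_domain);
	return NUMA_NO_NODE;	/* nothing to map when SPA info is absent */
}

If that is roughly right, then with no SPA sub-tables at all there is nothing
to map a proximity domain from, which would match what we observe.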
I also ran dmidecode and the NVDIMMs are listed (we tested with
Netlist NVDIMMs). I can also see the bank locator showing P0 and P1, which I
think indicates the NUMA node. Here is an example:
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Did you encounter such a case? We would appreciate any insight you might
have.
BR
Oren Berman
On 20 October 2017 at 19:22, Ross Zwisler <ross.zwisler(a)linux.intel.com>
wrote:
> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> > Hi Ross
> > My name is Oren Berman and I am a senior developer at lightbitslabs.
> > We are working with NVDIMMs but we encountered a problem that the
> > kernel does not seem to detect the NUMA id per PMEM device.
> > It always reports NUMA 0 although we have NVDIMM devices on both nodes.
> > We checked that it always returns 0 from sysfs and also from retrieving
> > the device of pmem in the kernel and calling dev_to_node.
> > The result is always 0 for both pmem0 and pmem1.
> > In order to make sure that indeed both NUMA sockets are used we ran
> > Intel's PCM utility. We verified that writing to pmem0 increases socket 0
> > utilization and writing to pmem1 increases socket 1 utilization, so the hw
> > works properly.
> > Only the detection seems to be invalid.
> > Did you encounter such a problem?
> > We are using kernel version 4.9 - are you aware of any fix for this issue
> > or workaround that we can use?
> > Are we missing something?
> > Thanks for any help you can give us.
> > BR
> > Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
> # cd /tmp
> # cp /sys/firmware/acpi/tables/NFIT .
> # iasl NFIT
>
> Intel ACPI Component Architecture
> ASL+ Optimizing Compiler version 20160831-64
> Copyright (c) 2000 - 2016 Intel Corporation
>
> Binary file appears to be a valid ACPI table, disassembling
> Input file NFIT, Length 0xE0 (224) bytes
> ACPI: NFIT 0x0000000000000000 0000E0 (v01 BOCHS BXPCNFIT 00000001 BXPC 00000001)
> Acpi Data Table [NFIT] decoded
> Formatted output: NFIT.dsl - 5191 bytes
>
> This will give you an NFIT.dsl file which you can look at. Here is what my
> SPA table looks like for an emulated QEMU NVDIMM:
>
> [028h 0040 2] Subtable Type : 0000 [System Physical Address Range]
> [02Ah 0042 2] Length : 0038
>
> [02Ch 0044 2] Range Index : 0002
> [02Eh 0046 2] Flags (decoded below) : 0003
> Add/Online Operation Only : 1
> Proximity Domain Valid : 1
> [030h 0048 4] Reserved : 00000000
> [034h 0052 4] Proximity Domain : 00000000
> [038h 0056 16] Address Range GUID : 66F0D379-B4F3-4074-AC43-0D3318B78CDB
> [048h 0072 8] Address Range Base : 0000000240000000
> [050h 0080 8] Address Range Length : 0000000440000000
> [058h 0088 8] Memory Map Attribute : 0000000000008008
>
> So, the "Proximity Domain" field is 0, and this lets the system know which
> NUMA node to associate with this memory region.
>
> BTW, in the future it's best to CC our public list, linux-nvdimm(a)lists.01.org,
> as a) someone else might have the same question and b) someone else might
> know the answer.
>
> Thanks,
> - Ross
>
[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
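For illustration only, the new "id" attribute is expected to be composed from
these big-endian (byte-array) fields roughly as follows; the exact format is
whatever patch 2/2 implements:

/*
 * Rough sketch of building an NVDIMM id string from the big-endian SPD
 * fields described above (vendor, manufacturing location/date, serial
 * number); not the literal patch code.
 */
static ssize_t nvdimm_id_show(struct acpi_nfit_control_region *dcr, char *buf)
{
	return sprintf(buf, "%04x-%02x-%04x-%08x\n",
			be16_to_cpu(dcr->vendor_id),
			dcr->manufacturing_location,
			be16_to_cpu(dcr->manufacturing_date),
			be32_to_cpu(dcr->serial_number));
}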
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
[PATCH v2] acpi: nfit: document sysfs interface
by Aishwarya Pant
This is an attempt to document the nfit sysfs interface. The
descriptions have been collected from git commit logs and the ACPI
specification 6.2.
Signed-off-by: Aishwarya Pant <aishpant(a)gmail.com>
---
Changes in v2:
- Add descriptions for range_index and ecc_unit_size
- Edit various descriptions as suggested
Documentation/ABI/testing/sysfs-bus-nfit | 233 +++++++++++++++++++++++++++++++
1 file changed, 233 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-nfit
diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit
new file mode 100644
index 000000000000..619eb8ca0f99
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-nfit
@@ -0,0 +1,233 @@
+For all of the nmem device attributes under nfit/*, see the 'NVDIMM Firmware
+Interface Table (NFIT)' section in the ACPI specification
+(http://www.uefi.org/specifications) for more details.
+
+What: /sys/bus/nd/devices/nmemX/nfit/serial
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Serial number of the NVDIMM (non-volatile dual in-line
+ memory module), assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/handle
+Date: Apr, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The address (given by the _ADR object) of the device on its
+ parent bus of the NVDIMM device containing the NVDIMM region.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/device
+Date: Apr, 2015
+KernelVersion: v4.1
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Device id for the NVDIMM, assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/rev_id
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Revision of the NVDIMM, assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/phys_id
+Date: Apr, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Handle (i.e., instance number) for the SMBIOS (system
+ management BIOS) Memory Device structure describing the NVDIMM
+ containing the NVDIMM region.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/flags
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The flags in the NFIT memory device sub-structure indicate
+ the state of the data on the nvdimm relative to its energy
+ source or last "flush to persistence".
+
+ The attribute is a translation of the 'NVDIMM State Flags' field
+ in section 5.2.25.3 'NVDIMM Region Mapping' Structure of the
+ ACPI specification 6.2.
+
+ The health states are "save_fail", "restore_fail", "flush_fail",
+ "not_armed", "smart_event", "map_fail" and "smart_notify".
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/format
+What: /sys/bus/nd/devices/nmemX/nfit/format1
+What: /sys/bus/nd/devices/nmemX/nfit/formats
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The interface codes indicate support for persistent memory
+ mapped directly into system physical address space and / or a
+ block aperture access mechanism to the NVDIMM media.
+ The 'formats' attribute displays the number of supported
+ interfaces.
+
+ This layout is compatible with existing libndctl binaries that
+ only expect one code per-dimm as they will ignore
+ nmemX/nfit/formats and nmemX/nfit/formatN.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/vendor
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Vendor id of the NVDIMM.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/dsm_mask
+Date: May, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The bitmask indicates the supported device specific control
+ functions relative to the NVDIMM command family supported by the
+ device
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/family
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Displays the NVDIMM family command sets. Values
+ 0, 1, 2 and 3 correspond to NVDIMM_FAMILY_INTEL,
+ NVDIMM_FAMILY_HPE1, NVDIMM_FAMILY_HPE2 and NVDIMM_FAMILY_MSFT
+ respectively.
+
+ See the specifications for these command families here:
+ http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
+ https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/
+ https://msdn.microsoft.com/library/windows/hardware/mt604741
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/id
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) ACPI specification 6.2 section 5.2.25.9, defines an
+ identifier for an NVDIMM, which reflects the id attribute.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_vendor
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system vendor id of the NVDIMM non-volatile memory
+ subsystem controller.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_rev_id
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system revision id of the NVDIMM non-volatile memory subsystem
+ controller, assigned by the non-volatile memory subsystem
+ controller vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_device
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system device id for the NVDIMM non-volatile memory
+ subsystem controller, assigned by the non-volatile memory
+ subsystem controller vendor.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/revision
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) ACPI NFIT table revision number.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/scrub
+Date: Sep, 2016
+KernelVersion: v4.9
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RW) This shows the number of full Address Range Scrubs (ARS)
+ that have been completed since driver load time. Userspace can
+ wait on this using select/poll etc. A '+' at the end indicates
+ an ARS is in progress.
+
+ Writing a value of 1 triggers an ARS scan.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/hw_error_scrub
+Date: Sep, 2016
+KernelVersion: v4.9
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RW) Provides a way to toggle the behavior between just adding
+ the address (cache line) where the MCE happened to the poison
+ list and doing a full scrub. The former (selective insertion of
+ the address) is done unconditionally.
+
+ This attribute can have the following values written to it:
+
+ '0': Switch to the default mode where an exception will only
+ insert the address of the memory error into the poison and
+ badblocks lists.
+ '1': Enable a full scrub to happen if an exception for a memory
+ error is received.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/dsm_mask
+Date: Jun, 2017
+KernelVersion: v4.13
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The bitmask indicates the supported bus specific control
+ functions. See the section named 'NVDIMM Root Device _DSMs' in
+ the ACPI specification.
+
+
+What: /sys/bus/nd/devices/regionX/nfit/range_index
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) A unique number provided by the BIOS to identify an address
+ range. Used by NVDIMM Region Mapping Structure to uniquely refer
+ to this structure. Value of 0 is reserved and not used as an
+ index.
+
+
+What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
+Date: Aug, 2017
+KernelVersion: v4.14
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Size of a write request to a DIMM that will not incur a
+ read-modify-write cycle at the memory controller.
+
+ When the nfit driver initializes it runs an ARS (Address Range
+ Scrub) operation across every pmem range. Part of that process
+ involves determining the ARS capabilities of a given address
+ range. One of the capabilities that is reported is the 'Clear
+ Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
+ section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
+ This property indicates the boundary at which the NVDIMM may
+ need to perform read-modify-write cycles to maintain ECC (Error
+ Correcting Code) blocks.
--
2.16.2
[PATCH 0/18 v6] dax, ext4, xfs: Synchronous page faults
by Jan Kara
Hello,
here is the sixth version of my patches to implement synchronous page faults
for DAX mappings to make flushing of DAX mappings possible from userspace so
that they can be flushed on finer than page granularity and also avoid the
overhead of a syscall.
I think we are ready to get this merged - I've talked to Dan and he said he
could take the patches through his tree. It would just be nice to get a final
ack from Christoph for the first patch implementing MAP_SHARED_VALIDATE and
from someone on the XFS side to check patch 17 (make xfs_filemap_pfn_mkwrite
use __xfs_filemap_fault()).
---
We use a new mmap flag MAP_SYNC to indicate that page faults for the mapping
should be synchronous. The guarantee provided by this flag is: While a block
is writeably mapped into page tables of this mapping, it is guaranteed to be
visible in the file at that offset also after a crash.
The implementation is that ->iomap_begin() indicates via a flag that the inode's
block mapping metadata is unstable and may need flushing (using the same test as
whether fdatasync() has metadata to write). If so, the DAX fault handler refrains
from inserting / write-enabling the page table entry and returns the special flag
VM_FAULT_NEEDDSYNC together with the PFN to map to the filesystem fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls the DAX code to update the page table entry appropriately.
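For reference, here is a minimal userspace sketch of what a MAP_SYNC consumer
looks like (the file path is made up, and the flag values are defined here only
in case the installed uapi headers predate this series):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(void)
{
	int fd = open("/mnt/dax/file", O_RDWR);
	char *p;

	if (fd < 0)
		return 1;
	/* Fails (EOPNOTSUPP) unless the fs/device can honor MAP_SYNC */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * After a synchronous fault the block allocation backing this
	 * page is already durable; the store itself still needs a CPU
	 * cache flush (e.g. libpmem's pmem_persist()) to be persistent.
	 */
	strcpy(p, "hello");
	munmap(p, 4096);
	close(fd);
	return 0;
}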
I did some basic performance testing on the patches over ramdisk - timed
latency of page faults when faulting 512 pages. I did several tests: with file
preallocated / with file empty, with background file copying going on / without
it, with / without MAP_SYNC (so that we get comparison). The results are
(numbers are in microseconds):
File preallocated, no background load no MAP_SYNC:
min=9 avg=10 max=46
8 - 15 us: 508
16 - 31 us: 3
32 - 63 us: 1
File preallocated, no background load, MAP_SYNC:
min=9 avg=10 max=47
8 - 15 us: 508
16 - 31 us: 2
32 - 63 us: 2
File empty, no background load, no MAP_SYNC:
min=21 avg=22 max=70
16 - 31 us: 506
32 - 63 us: 5
64 - 127 us: 1
File empty, no background load, MAP_SYNC:
min=40 avg=124 max=242
32 - 63 us: 1
64 - 127 us: 333
128 - 255 us: 178
File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1
File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23
So here we can see the difference between MAP_SYNC vs non MAP_SYNC is about
100-200 us when we need to wait for transaction commit in this setup.
Changes since v5:
* really updated the manpage
* improved comment describing IOMAP_F_DIRTY
* fixed XFS handling of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite()
Changes since v4:
* fixed couple of minor things in the manpage
* make legacy mmap flags always supported, remove them from mask declared
to be supported by ext4 and xfs
Changes since v3:
* updated some changelogs
* folded fs support for VM_SYNC flag into patches implementing the
functionality
* removed ->mmap_validate, use ->mmap_supported_flags instead
* added some Reviewed-by tags
* added manpage patch
Changes since v2:
* avoid unnecessary flushing of faulted page (Ross) - I've realized it makes no
sense to remeasure my benchmark results (after actually doing that and seeing
no difference, sigh) since I use ramdisk and not real PMEM HW and so flushes
are ignored.
* handle nojournal mode of ext4
* other smaller cleanups & fixes (Ross)
* factor larger part of finishing of synchronous fault into a helper (Christoph)
* reorder pfnp argument of dax_iomap_fault() (Christoph)
* add XFS support from Christoph
* use proper MAP_SYNC support in mmap(2)
* rebased on top of 4.14-rc4
Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers
Honza
[PATCH 0/5] Teach EDAC about non-volatile DIMMs and add partial support to skx_edac
by Tony Luck
This series was posted as RFC and got some review back in December:
https://marc.info/?l=linux-edac&m=151257213825419&w=2
It couldn't go upstream back then because I was waiting for some updates
to the ACPICA headers for NFIT support. Those patches are complete now,
so here is the series again.
The "partial support" part is that the changes here only do the detection of
the non-volatile DIMMs. Coming later will be another patch/series to teach
skx_edac how to decode system addresses to NVDIMM addresses. But this part
stands on its own and is a useful first step.
Tony Luck (5):
EDAC: Drop duplicated array of strings for memory type names
edac: Add new memory type for non-volatile DIMMs
acpi, nfit: Add function to look up nvdimm device and provide SMBIOS
handle
firmware: dmi: Add function to look up a handle and return DIMM size
EDAC, skx_edac: Detect non-volatile DIMMs
drivers/acpi/nfit/core.c | 27 ++++++++++++++++++
drivers/edac/Kconfig | 5 +++-
drivers/edac/edac_mc.c | 41 +++++++++++++-------------
drivers/edac/edac_mc_sysfs.c | 26 ++---------------
drivers/edac/skx_edac.c | 68 ++++++++++++++++++++++++++++++++++++++++----
drivers/firmware/dmi_scan.c | 31 ++++++++++++++++++++
include/acpi/nfit.h | 26 +++++++++++++++++
include/linux/dmi.h | 2 ++
include/linux/edac.h | 3 ++
9 files changed, 178 insertions(+), 51 deletions(-)
create mode 100644 include/acpi/nfit.h
--
2.14.1
xfstests 344 deadlock on NOVA
by Andiry Xu
Hi,
I encountered a problem when running xfstests on NOVA and would appreciate
your help very much.
Some background: NOVA adopts a per-inode logging design. Metadata
changes are appended to the log and persisted before returning to the
user space. For example, a write() in NOVA works like this:
1. Allocate new pmem blocks and fill them with user data
2. Append the metadata that describes this write to the end of the inode log
3. Update the log tail pointer atomically to commit the write
4. Update the in-DRAM radix tree to point to the metadata (for fast lookup)
The log appending and radix tree update are protected by inode_lock().
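In (simplified) pseudo-C the write path looks roughly like this - names are
approximate, and this is only meant to illustrate where inode_lock() sits:

/*
 * Simplified pseudo-C of the NOVA write path described above; function
 * names are approximate (NOVA is out of tree).
 */
static ssize_t nova_dax_write(struct inode *inode, const char *buf,
			      size_t len, loff_t pos)
{
	ssize_t ret;

	inode_lock(inode);
	ret = nova_fill_new_blocks(inode, buf, len, pos); /* copy user data */
	nova_append_write_entry(inode, pos, len);         /* persist metadata */
	nova_update_log_tail(inode);                      /* atomic commit */
	nova_update_radix_tree(inode, pos, len);          /* DRAM lookup index */
	inode_unlock(inode);
	return ret;
}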
For mmap, nova_dax_get_blocks (similar to ext2_get_blocks) needs to
persist the metadata for new allocations. So it has to append the new
allocation metadata to the log, and hence needs to lock the inode.
This causes deadlock in xfstests 344 with concurrent pwrite and mmap
write:
Thread 1:
pwrite
-> nova_dax_file_write()
---> inode_lock()
-----> invalidate_inode_pages2_range()
-------> schedule()
Thread 2:
dax_fault
-> nova_dax_get_blocks()
---> inode_lock() // deadlock
If I remove invalidate_inode_pages2_range() from the write path, xfstests
344 passes, but 428 fails.
It appears to me that ext2/ext4 do not have this issue because they
don't persist metadata immediately and hence do not take the inode lock.
But nova_dax_get_blocks() has to persist the metadata and needs to
lock the inode to access the log. Is there a way to work around this?
Thank you very much.
Thanks,
Andiry
[PATCH v4 00/18] dax: fix dma vs truncate/hole-punch
by Dan Williams
Changes since v3 [1]:
* Kill the i_daxdma_lock, and do not impose any new locking constraints
on filesystem implementations (Dave)
* Reuse the existing i_mmap_lock for synchronizing against
get_user_pages() by unmapping and causing punch-hole/truncate to
re-fault the page before get_user_pages() can elevate the page reference
count (Jan)
* Create a dax-specifc address_space_operations instance for each
filesystem. This allows page->mapping to be set for dax pages. (Jan).
* Change the ext4 and ext2 policy of 'mount -o dax' vs a device that
does not support dax. This converts any environments that may have
been using 'page-less' dax back to using page cache.
* Rename wait_on_devmap_idle() to wait_on_atomic_one(), a generic
facility for waiting for an atomic counter to reach a value of '1'.
[1]: https://lwn.net/Articles/737273/
---
Background:
get_user_pages() pins file backed memory pages for access by dma
devices. However, it only pins the memory pages not the page-to-file
offset association. If a file is truncated the pages are mapped out of
the file and dma may continue indefinitely into a page that is owned by
a device driver. This breaks coherency of the file vs dma, but the
assumption is that if userspace wants the file-space truncated it does
not matter what data is inbound from the device, it is not relevant
anymore. The only expectation is that dma can safely continue while the
filesystem reallocates the block(s).
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_flush_dma() routine is called by filesystems with a lock held
against mm faults (i_mmap_lock). It then invalidates all mappings to
trigger any subsequent get_user_pages() to block on i_mmap_lock. Finally
it scans/rescans all pages in the mapping until it observes all pages
idle.
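For illustration, a rough pseudo-C sketch of that filesystem-side sequence
(the dax_flush_dma() signature here is a guess; the actual patches are more
involved):

/*
 * Rough sketch of the hole-punch/truncate preparation described above;
 * dax_flush_dma()'s signature is guessed for illustration.
 */
static int fs_break_dax_dma(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	i_mmap_lock_write(mapping);
	/*
	 * dax_flush_dma() invalidates the mappings so new get_user_pages()
	 * callers fault and block on i_mmap_lock, then rescans the mapping
	 * until every page is idle.
	 */
	ret = dax_flush_dma(mapping);
	i_mmap_unlock_write(mapping);
	return ret;	/* the caller may now change the block map */
}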
So far this solution only targets xfs since it already implements
xfs_break_layouts in all the locations that would need this
synchronization. It applies on top of the vmem_altmap / dev_pagemap
reworks from Christoph.
---
Dan Williams (18):
mm, dax: introduce pfn_t_special()
ext4: auto disable dax instead of failing mount
ext2: auto disable dax instead of failing mount
dax: require 'struct page' by default for filesystem dax
dax: stop using VM_MIXEDMAP for dax
dax: stop using VM_HUGEPAGE for dax
dax: store pfns in the radix
tools/testing/nvdimm: add 'bio_delay' mechanism
mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
fs, dax: introduce DEFINE_FSDAX_AOPS
xfs: use DEFINE_FSDAX_AOPS
ext4: use DEFINE_FSDAX_AOPS
ext2: use DEFINE_FSDAX_AOPS
mm, fs, dax: use page->mapping to warn if dma collides with truncate
wait_bit: introduce {wait_on,wake_up}_atomic_one
mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 2
drivers/dax/device.c | 1
drivers/dax/super.c | 100 ++++++++++-
drivers/nvdimm/pmem.c | 3
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 3
fs/Kconfig | 8 +
fs/dax.c | 295 ++++++++++++++++++++++++++++-----
fs/ext2/ext2.h | 1
fs/ext2/file.c | 1
fs/ext2/inode.c | 23 ++-
fs/ext2/namei.c | 18 --
fs/ext2/super.c | 13 +
fs/ext4/file.c | 1
fs/ext4/inode.c | 6 +
fs/ext4/super.c | 15 +-
fs/xfs/Makefile | 3
fs/xfs/xfs_aops.c | 2
fs/xfs/xfs_aops.h | 1
fs/xfs/xfs_dma.c | 81 +++++++++
fs/xfs/xfs_dma.h | 24 +++
fs/xfs/xfs_file.c | 8 -
fs/xfs/xfs_ioctl.c | 7 -
fs/xfs/xfs_iops.c | 12 +
fs/xfs/xfs_super.c | 20 +-
include/linux/dax.h | 70 +++++++-
include/linux/memremap.h | 28 +--
include/linux/mm.h | 62 +++++--
include/linux/pfn_t.h | 13 +
include/linux/vma.h | 23 +++
include/linux/wait_bit.h | 13 +
kernel/memremap.c | 30 +++
kernel/sched/wait_bit.c | 59 ++++++-
mm/Kconfig | 5 +
mm/gup.c | 5 +
mm/hmm.c | 13 -
mm/huge_memory.c | 6 -
mm/ksm.c | 3
mm/madvise.c | 2
mm/memory.c | 22 ++
mm/migrate.c | 3
mm/mlock.c | 5 -
mm/mmap.c | 8 -
mm/swap.c | 3
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/iomap.c | 62 +++++++
tools/testing/nvdimm/test/nfit.c | 34 ++++
tools/testing/nvdimm/test/nfit_test.h | 1
49 files changed, 918 insertions(+), 203 deletions(-)
create mode 100644 fs/xfs/xfs_dma.c
create mode 100644 fs/xfs/xfs_dma.h
create mode 100644 include/linux/vma.h
[PATCH] loop: Fix lost writes caused by missing flag
by Ross Zwisler
The following commit:
commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC")
replaced __do_lo_send_write(), which used ITER_KVEC iterators, with
lo_write_bvec() which uses ITER_BVEC iterators. In this change, though,
the WRITE flag was lost:
- iov_iter_kvec(&from, ITER_KVEC | WRITE, &kvec, 1, len);
+ iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len);
This flag is necessary for the DAX case because we make decisions based on
whether or not the iterator is a READ or a WRITE in dax_iomap_actor() and
in dax_iomap_rw().
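For reference, the copy direction in the DAX path keys off the iterator roughly
like this (paraphrased from dax_iomap_actor(), not quoted verbatim):

/*
 * Paraphrased: the iov_iter direction decides whether dax_iomap_actor()
 * copies into or out of the pmem block.
 */
if (iov_iter_rw(iter) == WRITE)
	map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter);
else
	map_len = copy_to_iter(kaddr, map_len, iter);

With the WRITE bit missing from the bvec iterator, iov_iter_rw() reports READ
and the intended write takes the copy_to_iter() path instead.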
We end up going through this path in configurations where we combine a PMEM
device with 4k sectors, a loopback device and DAX. The consequence of this
missed flag is that what we intend as a write actually turns into a read in
the DAX code, so no data is ever written.
The very simplest test case is to create a loopback device and try and
write a small string to it, then hexdump a few bytes of the device to see
if the write took. Without this patch you read back all zeros, with this
you read back the string you wrote.
For XFS this causes us to fail or panic during the following xfstests:
xfs/074 xfs/078 xfs/216 xfs/217 xfs/250
For ext4 we have a similar issue where writes never happen, but we don't
currently have any xfstests that use loopback and show this issue.
Fix this by restoring the WRITE flag argument to iov_iter_bvec(). This
causes the xfstests to all pass.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Fixes: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC")
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: stable(a)vger.kernel.org
---
drivers/block/loop.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d5fe720cf149..89d2ee00cced 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -266,7 +266,7 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
struct iov_iter i;
ssize_t bw;
- iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len);
+ iov_iter_bvec(&i, ITER_BVEC | WRITE, bvec, 1, bvec->bv_len);
file_start_write(file);
bw = vfs_iter_write(file, &i, ppos, 0);
--
2.14.3
[PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
by Logan Gunthorpe
Hi Everyone,
Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3 which already
includes Christoph's devpagemap work the previous version was based
off as well as a couple of the cleanup patches that were in v1.
Additionally, we've made the following changes based on feedback:
* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
as a bunch of cleanup and spelling fixes he pointed out in the last
series.
* To address Alex's ACS concerns, we change to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA.
* We also reject using devices that employ 'dma_virt_ops' which should
fairly simply handle Jason's concerns that this work might break with
the HFI, QIB and rxe drivers that use the virtual ops to implement
their own special DMA operations.
Thanks,
Logan
--
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). However, the trade-off is currently a
reduction in overall throughput (largely due to hardware issues that would
certainly improve in the future).
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We chose to go this route based on feedback we
received at the last LSF). Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a white list of known good
root complexes.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transactions and then
use that list to find any P2P memory that is supported by all the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests are
marked to not be merged, since a non-homogeneous request would complicate
the DMA mapping requirements.
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function, which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.
Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
Logan Gunthorpe (10):
PCI/P2PDMA: Support peer to peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
block: Introduce PCI P2P flags for request and request queue
IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 ++
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 21 +-
drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 118 ++++--
drivers/nvme/target/configfs.c | 29 ++
drivers/nvme/target/core.c | 95 ++++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 10 +
drivers/nvme/target/rdma.c | 43 +-
drivers/pci/Kconfig | 20 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 713 ++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 4 +
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 19 +
include/linux/pci-p2pdma.h | 105 +++++
include/linux/pci.h | 4 +
include/rdma/rw.h | 7 +-
net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
24 files changed, 1204 insertions(+), 67 deletions(-)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
--
2.11.0
[PATCH 1/3] nfit_test: improve structure offset handling
by Ross Zwisler
In nfit_test0_setup() and nfit_test1_setup() we keep an 'offset' value
which we use to calculate where in our 'nfit_buf' we will place our next
structure. The handling of 'offset' and the calculation of the placement
of the next structure are a bit inconsistent, though. We don't update
'offset' after we insert each structure, sometimes causing us to update it
for multiple structures' sizes at once. When calculating the position of
the next structure we aren't always able to just use 'offset', but
sometimes have to add in other structure sizes as well.
Fix this by updating 'offset' after each structure insertion in a
consistent way, allowing us to always calculate the position of the next
structure to be inserted by just using 'nfit_buf + offset'.
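To illustrate the resulting idiom with a generic, standalone sketch (not the
nfit_test code itself):

/*
 * Generic illustration of the pattern: place each variable-length
 * structure at buf + offset, fill it in, and immediately bump offset by
 * that structure's length so the next insertion just uses buf + offset.
 */
struct hdr { unsigned short type; unsigned short length; };

static unsigned int append_struct(void *buf, unsigned int offset,
				  unsigned short type, unsigned short length)
{
	struct hdr *h = (struct hdr *)((char *)buf + offset);

	h->type = type;
	h->length = length;
	/* type-specific fields would be filled in here */
	return offset + h->length;
}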
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
tools/testing/nvdimm/test/nfit.c | 183 +++++++++++++++++++++++----------------
1 file changed, 109 insertions(+), 74 deletions(-)
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 620fa78b3b1b..1376fc95c33a 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -1366,7 +1366,7 @@ static void nfit_test0_setup(struct nfit_test *t)
struct acpi_nfit_data_region *bdw;
struct acpi_nfit_flush_address *flush;
struct acpi_nfit_capabilities *pcap;
- unsigned int offset, i;
+ unsigned int offset = 0, i;
/*
* spa0 (interleave first half of dimm0 and dimm1, note storage
@@ -1380,93 +1380,102 @@ static void nfit_test0_setup(struct nfit_test *t)
spa->range_index = 0+1;
spa->address = t->spa_set_dma[0];
spa->length = SPA0_SIZE;
+ offset += spa->header.length;
/*
* spa1 (interleave last half of the 4 DIMMS, note storage
* does not actually alias the related block-data-window
* regions)
*/
- spa = nfit_buf + sizeof(*spa);
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
spa->range_index = 1+1;
spa->address = t->spa_set_dma[1];
spa->length = SPA1_SIZE;
+ offset += spa->header.length;
/* spa2 (dcr0) dimm0 */
- spa = nfit_buf + sizeof(*spa) * 2;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
spa->range_index = 2+1;
spa->address = t->dcr_dma[0];
spa->length = DCR_SIZE;
+ offset += spa->header.length;
/* spa3 (dcr1) dimm1 */
- spa = nfit_buf + sizeof(*spa) * 3;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
spa->range_index = 3+1;
spa->address = t->dcr_dma[1];
spa->length = DCR_SIZE;
+ offset += spa->header.length;
/* spa4 (dcr2) dimm2 */
- spa = nfit_buf + sizeof(*spa) * 4;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
spa->range_index = 4+1;
spa->address = t->dcr_dma[2];
spa->length = DCR_SIZE;
+ offset += spa->header.length;
/* spa5 (dcr3) dimm3 */
- spa = nfit_buf + sizeof(*spa) * 5;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
spa->range_index = 5+1;
spa->address = t->dcr_dma[3];
spa->length = DCR_SIZE;
+ offset += spa->header.length;
/* spa6 (bdw for dcr0) dimm0 */
- spa = nfit_buf + sizeof(*spa) * 6;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
spa->range_index = 6+1;
spa->address = t->dimm_dma[0];
spa->length = DIMM_SIZE;
+ offset += spa->header.length;
/* spa7 (bdw for dcr1) dimm1 */
- spa = nfit_buf + sizeof(*spa) * 7;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
spa->range_index = 7+1;
spa->address = t->dimm_dma[1];
spa->length = DIMM_SIZE;
+ offset += spa->header.length;
/* spa8 (bdw for dcr2) dimm2 */
- spa = nfit_buf + sizeof(*spa) * 8;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
spa->range_index = 8+1;
spa->address = t->dimm_dma[2];
spa->length = DIMM_SIZE;
+ offset += spa->header.length;
/* spa9 (bdw for dcr3) dimm3 */
- spa = nfit_buf + sizeof(*spa) * 9;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
spa->range_index = 9+1;
spa->address = t->dimm_dma[3];
spa->length = DIMM_SIZE;
+ offset += spa->header.length;
- offset = sizeof(*spa) * 10;
/* mem-region0 (spa0, dimm0) */
memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
@@ -1481,9 +1490,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 2;
+ offset += memdev->header.length;
/* mem-region1 (spa0, dimm1) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map);
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[1];
@@ -1497,9 +1507,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->interleave_index = 0;
memdev->interleave_ways = 2;
memdev->flags = ACPI_NFIT_MEM_HEALTH_ENABLED;
+ offset += memdev->header.length;
/* mem-region2 (spa1, dimm0) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 2;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[0];
@@ -1513,9 +1524,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->interleave_index = 0;
memdev->interleave_ways = 4;
memdev->flags = ACPI_NFIT_MEM_HEALTH_ENABLED;
+ offset += memdev->header.length;
/* mem-region3 (spa1, dimm1) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 3;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[1];
@@ -1528,9 +1540,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = SPA0_SIZE/2;
memdev->interleave_index = 0;
memdev->interleave_ways = 4;
+ offset += memdev->header.length;
/* mem-region4 (spa1, dimm2) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 4;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[2];
@@ -1544,9 +1557,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->interleave_index = 0;
memdev->interleave_ways = 4;
memdev->flags = ACPI_NFIT_MEM_HEALTH_ENABLED;
+ offset += memdev->header.length;
/* mem-region5 (spa1, dimm3) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 5;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[3];
@@ -1559,9 +1573,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = SPA0_SIZE/2;
memdev->interleave_index = 0;
memdev->interleave_ways = 4;
+ offset += memdev->header.length;
/* mem-region6 (spa/dcr0, dimm0) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 6;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[0];
@@ -1574,9 +1589,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region7 (spa/dcr1, dimm1) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 7;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[1];
@@ -1589,9 +1605,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region8 (spa/dcr2, dimm2) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 8;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[2];
@@ -1604,9 +1621,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region9 (spa/dcr3, dimm3) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 9;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[3];
@@ -1619,9 +1637,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region10 (spa/bdw0, dimm0) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 10;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[0];
@@ -1634,9 +1653,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region11 (spa/bdw1, dimm1) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 11;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[1];
@@ -1649,9 +1669,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region12 (spa/bdw2, dimm2) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 12;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[2];
@@ -1664,9 +1685,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
/* mem-region13 (spa/dcr3, dimm3) */
- memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 13;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[3];
@@ -1680,12 +1702,12 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
memdev->flags = ACPI_NFIT_MEM_HEALTH_ENABLED;
+ offset += memdev->header.length;
- offset = offset + sizeof(struct acpi_nfit_memory_map) * 14;
/* dcr-descriptor0: blk */
dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
- dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->header.length = sizeof(*dcr);
dcr->region_index = 0+1;
dcr_common_init(dcr);
dcr->serial_number = ~handle[0];
@@ -1696,11 +1718,12 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->command_size = 8;
dcr->status_offset = 8;
dcr->status_size = 4;
+ offset += dcr->header.length;
/* dcr-descriptor1: blk */
- dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region);
+ dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
- dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->header.length = sizeof(*dcr);
dcr->region_index = 1+1;
dcr_common_init(dcr);
dcr->serial_number = ~handle[1];
@@ -1711,11 +1734,12 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->command_size = 8;
dcr->status_offset = 8;
dcr->status_size = 4;
+ offset += dcr->header.length;
/* dcr-descriptor2: blk */
- dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 2;
+ dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
- dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->header.length = sizeof(*dcr);
dcr->region_index = 2+1;
dcr_common_init(dcr);
dcr->serial_number = ~handle[2];
@@ -1726,11 +1750,12 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->command_size = 8;
dcr->status_offset = 8;
dcr->status_size = 4;
+ offset += dcr->header.length;
/* dcr-descriptor3: blk */
- dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 3;
+ dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
- dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->header.length = sizeof(*dcr);
dcr->region_index = 3+1;
dcr_common_init(dcr);
dcr->serial_number = ~handle[3];
@@ -1741,8 +1766,8 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->command_size = 8;
dcr->status_offset = 8;
dcr->status_size = 4;
+ offset += dcr->header.length;
- offset = offset + sizeof(struct acpi_nfit_control_region) * 4;
/* dcr-descriptor0: pmem */
dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
@@ -1753,10 +1778,10 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->serial_number = ~handle[0];
dcr->code = NFIT_FIC_BYTEN;
dcr->windows = 0;
+ offset += dcr->header.length;
/* dcr-descriptor1: pmem */
- dcr = nfit_buf + offset + offsetof(struct acpi_nfit_control_region,
- window_size);
+ dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
dcr->header.length = offsetof(struct acpi_nfit_control_region,
window_size);
@@ -1765,10 +1790,10 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->serial_number = ~handle[1];
dcr->code = NFIT_FIC_BYTEN;
dcr->windows = 0;
+ offset += dcr->header.length;
/* dcr-descriptor2: pmem */
- dcr = nfit_buf + offset + offsetof(struct acpi_nfit_control_region,
- window_size) * 2;
+ dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
dcr->header.length = offsetof(struct acpi_nfit_control_region,
window_size);
@@ -1777,10 +1802,10 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->serial_number = ~handle[2];
dcr->code = NFIT_FIC_BYTEN;
dcr->windows = 0;
+ offset += dcr->header.length;
/* dcr-descriptor3: pmem */
- dcr = nfit_buf + offset + offsetof(struct acpi_nfit_control_region,
- window_size) * 3;
+ dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
dcr->header.length = offsetof(struct acpi_nfit_control_region,
window_size);
@@ -1789,54 +1814,56 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->serial_number = ~handle[3];
dcr->code = NFIT_FIC_BYTEN;
dcr->windows = 0;
+ offset += dcr->header.length;
- offset = offset + offsetof(struct acpi_nfit_control_region,
- window_size) * 4;
/* bdw0 (spa/dcr0, dimm0) */
bdw = nfit_buf + offset;
bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
- bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->header.length = sizeof(*bdw);
bdw->region_index = 0+1;
bdw->windows = 1;
bdw->offset = 0;
bdw->size = BDW_SIZE;
bdw->capacity = DIMM_SIZE;
bdw->start_address = 0;
+ offset += bdw->header.length;
/* bdw1 (spa/dcr1, dimm1) */
- bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region);
+ bdw = nfit_buf + offset;
bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
- bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->header.length = sizeof(*bdw);
bdw->region_index = 1+1;
bdw->windows = 1;
bdw->offset = 0;
bdw->size = BDW_SIZE;
bdw->capacity = DIMM_SIZE;
bdw->start_address = 0;
+ offset += bdw->header.length;
/* bdw2 (spa/dcr2, dimm2) */
- bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 2;
+ bdw = nfit_buf + offset;
bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
- bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->header.length = sizeof(*bdw);
bdw->region_index = 2+1;
bdw->windows = 1;
bdw->offset = 0;
bdw->size = BDW_SIZE;
bdw->capacity = DIMM_SIZE;
bdw->start_address = 0;
+ offset += bdw->header.length;
/* bdw3 (spa/dcr3, dimm3) */
- bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 3;
+ bdw = nfit_buf + offset;
bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
- bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->header.length = sizeof(*bdw);
bdw->region_index = 3+1;
bdw->windows = 1;
bdw->offset = 0;
bdw->size = BDW_SIZE;
bdw->capacity = DIMM_SIZE;
bdw->start_address = 0;
+ offset += bdw->header.length;
- offset = offset + sizeof(struct acpi_nfit_data_region) * 4;
/* flush0 (dimm0) */
flush = nfit_buf + offset;
flush->header.type = ACPI_NFIT_TYPE_FLUSH_ADDRESS;
@@ -1845,48 +1872,52 @@ static void nfit_test0_setup(struct nfit_test *t)
flush->hint_count = NUM_HINTS;
for (i = 0; i < NUM_HINTS; i++)
flush->hint_address[i] = t->flush_dma[0] + i * sizeof(u64);
+ offset += flush->header.length;
/* flush1 (dimm1) */
- flush = nfit_buf + offset + flush_hint_size * 1;
+ flush = nfit_buf + offset;
flush->header.type = ACPI_NFIT_TYPE_FLUSH_ADDRESS;
flush->header.length = flush_hint_size;
flush->device_handle = handle[1];
flush->hint_count = NUM_HINTS;
for (i = 0; i < NUM_HINTS; i++)
flush->hint_address[i] = t->flush_dma[1] + i * sizeof(u64);
+ offset += flush->header.length;
/* flush2 (dimm2) */
- flush = nfit_buf + offset + flush_hint_size * 2;
+ flush = nfit_buf + offset;
flush->header.type = ACPI_NFIT_TYPE_FLUSH_ADDRESS;
flush->header.length = flush_hint_size;
flush->device_handle = handle[2];
flush->hint_count = NUM_HINTS;
for (i = 0; i < NUM_HINTS; i++)
flush->hint_address[i] = t->flush_dma[2] + i * sizeof(u64);
+ offset += flush->header.length;
/* flush3 (dimm3) */
- flush = nfit_buf + offset + flush_hint_size * 3;
+ flush = nfit_buf + offset;
flush->header.type = ACPI_NFIT_TYPE_FLUSH_ADDRESS;
flush->header.length = flush_hint_size;
flush->device_handle = handle[3];
flush->hint_count = NUM_HINTS;
for (i = 0; i < NUM_HINTS; i++)
flush->hint_address[i] = t->flush_dma[3] + i * sizeof(u64);
+ offset += flush->header.length;
/* platform capabilities */
- pcap = nfit_buf + offset + flush_hint_size * 4;
+ pcap = nfit_buf + offset;
pcap->header.type = ACPI_NFIT_TYPE_CAPABILITIES;
pcap->header.length = sizeof(*pcap);
pcap->highest_capability = 1;
pcap->capabilities = ACPI_NFIT_CAPABILITY_CACHE_FLUSH |
ACPI_NFIT_CAPABILITY_MEM_FLUSH;
+ offset += pcap->header.length;
if (t->setup_hotplug) {
- offset = offset + flush_hint_size * 4 + sizeof(*pcap);
/* dcr-descriptor4: blk */
dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
- dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->header.length = sizeof(*dcr);
dcr->region_index = 8+1;
dcr_common_init(dcr);
dcr->serial_number = ~handle[4];
@@ -1897,8 +1928,8 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->command_size = 8;
dcr->status_offset = 8;
dcr->status_size = 4;
+ offset += dcr->header.length;
- offset = offset + sizeof(struct acpi_nfit_control_region);
/* dcr-descriptor4: pmem */
dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
@@ -1909,21 +1940,20 @@ static void nfit_test0_setup(struct nfit_test *t)
dcr->serial_number = ~handle[4];
dcr->code = NFIT_FIC_BYTEN;
dcr->windows = 0;
+ offset += dcr->header.length;
- offset = offset + offsetof(struct acpi_nfit_control_region,
- window_size);
/* bdw4 (spa/dcr4, dimm4) */
bdw = nfit_buf + offset;
bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
- bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->header.length = sizeof(*bdw);
bdw->region_index = 8+1;
bdw->windows = 1;
bdw->offset = 0;
bdw->size = BDW_SIZE;
bdw->capacity = DIMM_SIZE;
bdw->start_address = 0;
+ offset += bdw->header.length;
- offset = offset + sizeof(struct acpi_nfit_data_region);
/* spa10 (dcr4) dimm4 */
spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
@@ -1932,30 +1962,32 @@ static void nfit_test0_setup(struct nfit_test *t)
spa->range_index = 10+1;
spa->address = t->dcr_dma[4];
spa->length = DCR_SIZE;
+ offset += spa->header.length;
/*
* spa11 (single-dimm interleave for hotplug, note storage
* does not actually alias the related block-data-window
* regions)
*/
- spa = nfit_buf + offset + sizeof(*spa);
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
spa->range_index = 11+1;
spa->address = t->spa_set_dma[2];
spa->length = SPA0_SIZE;
+ offset += spa->header.length;
/* spa12 (bdw for dcr4) dimm4 */
- spa = nfit_buf + offset + sizeof(*spa) * 2;
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
spa->range_index = 12+1;
spa->address = t->dimm_dma[4];
spa->length = DIMM_SIZE;
+ offset += spa->header.length;
- offset = offset + sizeof(*spa) * 3;
/* mem-region14 (spa/dcr4, dimm4) */
memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
@@ -1970,10 +2002,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
- /* mem-region15 (spa0, dimm4) */
- memdev = nfit_buf + offset +
- sizeof(struct acpi_nfit_memory_map);
+ /* mem-region15 (spa11, dimm4) */
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[4];
@@ -1987,10 +2019,10 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
memdev->flags = ACPI_NFIT_MEM_HEALTH_ENABLED;
+ offset += memdev->header.length;
/* mem-region16 (spa/bdw4, dimm4) */
- memdev = nfit_buf + offset +
- sizeof(struct acpi_nfit_memory_map) * 2;
+ memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
memdev->device_handle = handle[4];
@@ -2003,8 +2035,8 @@ static void nfit_test0_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ offset += memdev->header.length;
- offset = offset + sizeof(struct acpi_nfit_memory_map) * 3;
/* flush3 (dimm4) */
flush = nfit_buf + offset;
flush->header.type = ACPI_NFIT_TYPE_FLUSH_ADDRESS;
@@ -2014,6 +2046,7 @@ static void nfit_test0_setup(struct nfit_test *t)
for (i = 0; i < NUM_HINTS; i++)
flush->hint_address[i] = t->flush_dma[4]
+ i * sizeof(u64);
+ offset += flush->header.length;
}
post_ars_status(&t->ars_state, &t->badrange, t->spa_set_dma[0],
@@ -2061,17 +2094,18 @@ static void nfit_test1_setup(struct nfit_test *t)
spa->range_index = 0+1;
spa->address = t->spa_set_dma[0];
spa->length = SPA2_SIZE;
+ offset += spa->header.length;
/* virtual cd region */
- spa = nfit_buf + sizeof(*spa);
+ spa = nfit_buf + offset;
spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
spa->header.length = sizeof(*spa);
memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_VCD), 16);
spa->range_index = 0;
spa->address = t->spa_set_dma[1];
spa->length = SPA_VCD_SIZE;
+ offset += spa->header.length;
- offset += sizeof(*spa) * 2;
/* mem-region0 (spa0, dimm0) */
memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
@@ -2089,8 +2123,8 @@ static void nfit_test1_setup(struct nfit_test *t)
memdev->flags = ACPI_NFIT_MEM_SAVE_FAILED | ACPI_NFIT_MEM_RESTORE_FAILED
| ACPI_NFIT_MEM_FLUSH_FAILED | ACPI_NFIT_MEM_HEALTH_OBSERVED
| ACPI_NFIT_MEM_NOT_ARMED;
+ offset += memdev->header.length;
- offset += sizeof(*memdev);
/* dcr-descriptor0 */
dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
@@ -2101,8 +2135,8 @@ static void nfit_test1_setup(struct nfit_test *t)
dcr->serial_number = ~handle[5];
dcr->code = NFIT_FIC_BYTE;
dcr->windows = 0;
-
offset += dcr->header.length;
+
memdev = nfit_buf + offset;
memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
memdev->header.length = sizeof(*memdev);
@@ -2117,9 +2151,9 @@ static void nfit_test1_setup(struct nfit_test *t)
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
memdev->flags = ACPI_NFIT_MEM_MAP_FAILED;
+ offset += memdev->header.length;
/* dcr-descriptor1 */
- offset += sizeof(*memdev);
dcr = nfit_buf + offset;
dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
dcr->header.length = offsetof(struct acpi_nfit_control_region,
@@ -2129,6 +2163,7 @@ static void nfit_test1_setup(struct nfit_test *t)
dcr->serial_number = ~handle[6];
dcr->code = NFIT_FIC_BYTE;
dcr->windows = 0;
+ offset += dcr->header.length;
post_ars_status(&t->ars_state, &t->badrange, t->spa_set_dma[0],
SPA2_SIZE);
--
2.14.3