On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) <nmeeramohide(a)micron.com> wrote:
> On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig
> > I don't think this belongs into the kernel. It is a classic case for
> > infrastructure that should be built in userspace. If anything is
> > missing to implement it in userspace with equivalent performance we
> > need to improve our interfaces, although io_uring should cover pretty
> > much everything you need.
> Hi Christoph,
> We previously considered moving the mpool object store code to user-space.
> However, by implementing mpool as a device driver, we get several benefits
> in terms of scalability, performance, and functionality. In doing so, we relied
> only on standard interfaces and did not make any changes to the kernel.
> (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> a collection of logically related objects with a single system call. The objects in
> such a collection are created at different times, physically disparate, and may
> even reside on different media class volumes.
> For our HSE storage engine application, there are commonly 10's to 100's of
> objects in a given mcache map, and 75,000 total objects mapped at a given time.
> Compared to memory-mapping objects individually, the mcache map facility
> scales well because it requires only a single system call and a single vm_area_struct
> to memory-map a complete collection of objects.
Why can't that be a batch of mmap calls on io_uring?
Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
system call overhead of memory-mapping individual objects, versus our mcache map
mechanism. However, there is still the scalability issue of having a vm_area_struct
for each object (versus one for each mcache map).
We ran YCSB workload C in two different configurations -
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map
- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab -
24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.
- Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2,
though we are not sure whether that is due to the reduced complexity of searching
VMAs during page faults.
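As a quick sanity check on the slab figures above, the arithmetic can be redone in a few lines. This is only a sketch: the variable names are ours, and we assume "MB" in the slab report means MiB (2^20 bytes).

```python
# Figures quoted above for the vm_area_struct slab in each configuration.
MIB = 1024 * 1024  # assumption: "MB" in the slab report means MiB

config1 = {"slab_mb": 24.8, "objects": 127188}  # one mapping per object
config2 = {"slab_mb": 7.3, "objects": 37482}    # one mapping per mcache map


def bytes_per_vma(cfg):
    """Average slab bytes consumed per vm_area_struct."""
    return cfg["slab_mb"] * MIB / cfg["objects"]


# How much more slab memory and how many more VMAs config 1 needs.
memory_ratio = config1["slab_mb"] / config2["slab_mb"]
object_ratio = config1["objects"] / config2["objects"]
saved_mb = config1["slab_mb"] - config2["slab_mb"]

print(f"slab memory ratio (config 1 / config 2): {memory_ratio:.2f}")
print(f"vm_area_struct count ratio:              {object_ratio:.2f}")
print(f"slab memory saved by config 2:           {saved_mb:.1f} MB")
print(f"bytes per vm_area_struct (config 1):     {bytes_per_vma(config1):.0f}")
print(f"bytes per vm_area_struct (config 2):     {bytes_per_vma(config2):.0f}")
```

The memory ratio and the object-count ratio track each other closely, which is what one would expect if the slab growth is driven purely by the number of vm_area_struct instances.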
> (2) The mcache map reaper mechanism proactively evicts object data from the page
> cache based on object-level metrics. This provides significant performance benefits
> for many workloads.
> For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> against our HSE storage engine using the mpool driver in a 5.9 kernel.
> For each workload, we ran with the reaper turned-on and turned-off.
> For workload B, the reaper increased throughput 1.77x, while reducing 99.99%
> latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> improvements are even more dramatic with earlier kernels.
What metrics proved useful and can the vanilla page cache / page
reclaim mechanism be augmented with those metrics?
The mcache map facility is designed to cache a collection of related immutable objects
with similar lifetimes. It is best suited for storage applications that run queries against
organized collections of immutable objects, such as storage engines and DBs based on
Each mcache map is associated with a temperature (pinned, hot, warm, cold), and it is left
to the application to tag each map appropriately. For our HSE storage engine application,
the SSTables in the root/intermediate levels act as a routing table to redirect queries to
an appropriate leaf level SSTable, in which case the mcache map corresponding to the
root/intermediate level SSTables can be tagged as pinned/hot.
The mcache reaper tracks the access time of each object in an mcache map. Under memory
pressure, the access time is compared to a time-to-live (ttl) metric that’s set based on the
map’s temperature, how close free memory is to the low and high watermarks, and so on.
If the object was last accessed outside the ttl window, its pages are evicted from the
page cache.
We also apply a few other techniques like throttling the readaheads and adding a delay
in the page fault handler to not overwhelm the page cache during memory pressure.
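The time-to-live check described above can be sketched roughly as follows. This is an illustrative user-space model, not the driver code: the base TTL values, the linear pressure scaling, and every name here (`McacheMap`, `pressure`, `effective_ttl`, `reap`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical base TTLs in seconds per temperature; None means "never reaped".
BASE_TTL = {"pinned": None, "hot": 600.0, "warm": 120.0, "cold": 30.0}


@dataclass
class McacheMap:
    temperature: str
    last_access: dict = field(default_factory=dict)  # object id -> last access time

    def touch(self, obj_id, now):
        """Record an access to an object (e.g. from the page fault path)."""
        self.last_access[obj_id] = now


def pressure(free, low_wmark, high_wmark):
    """Map free memory to [0, 1]: 0 at/above the high watermark, 1 at/below the low."""
    if free >= high_wmark:
        return 0.0
    if free <= low_wmark:
        return 1.0
    return (high_wmark - free) / (high_wmark - low_wmark)


def effective_ttl(temperature, p):
    """Shrink the base TTL as pressure rises (assumed: down to 10% at full pressure)."""
    base = BASE_TTL[temperature]
    if base is None:
        return None
    return base * (1.0 - 0.9 * p)


def reap(mcache_map, now, free, low_wmark, high_wmark):
    """Return the ids of objects whose pages would be evicted, and forget them."""
    p = pressure(free, low_wmark, high_wmark)
    if p == 0.0:
        return []  # no memory pressure: leave everything cached
    ttl = effective_ttl(mcache_map.temperature, p)
    if ttl is None:
        return []  # pinned maps are never reaped
    victims = [o for o, t in mcache_map.last_access.items() if now - t > ttl]
    for o in victims:
        del mcache_map.last_access[o]
    return victims


# Usage: a cold map under heavy pressure loses its stale object only.
cold = McacheMap("cold")
cold.touch("kblock-1", now=0.0)
cold.touch("kblock-2", now=99.0)
print(reap(cold, now=100.0, free=10, low_wmark=10, high_wmark=100))
```

The point of the sketch is only that the eviction decision is made per object and per map temperature, rather than per page as in generic reclaim.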
In the workloads that we run, we have noticed stalls when kswapd does the reclaim and
that impacts throughput and tail latencies as described in our last email. The mcache
reaper runs proactively and can make better reclaim decisions as it is designed to
address a specific class of workloads.
We doubt whether the same mechanisms can be employed in the vanilla page cache as
it is designed to work for a wide variety of workloads.
> (4) mpool's immutable object model allows the driver to
> of object data directly and memory-mapped without a performance penalty to
> coherence. This allows background operations, such as LSM-tree compaction, to
> operate efficiently and without polluting the page cache.
How is this different than existing background operations / defrag
that filesystems perform today? Where are the opportunities to improve
We haven’t measured the benefit of eliminating the coherence check, which isn’t needed
in our case because objects are immutable. However, the open(2) documentation makes
the statement that “applications should avoid mixing mmap(2) of files with direct I/O to
the same files”, which is what we are effectively doing when we directly read from an
object that is also in an mcache map.
> (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> convenient mechanism for controlling access to and managing the multiple storage
> volumes, and in the future pmem devices, that may comprise a logical mpool.
Christoph and I have talked about replacing the pmem driver's
dependence on device-mapper for pooling. What extensions would be
needed for the existing driver arch?
mpool doesn’t extend any of the existing driver arch to manage multiple storage volumes.
mpool implements the concept of media classes, where each media class corresponds
to a different storage volume. Clients specify a media class when creating an object in
an mpool. mpool currently supports only two media classes: “capacity”, for storing the bulk
of the objects, backed by, for instance, QLC SSDs; and “staging”, for storing objects
requiring lower latency/higher throughput, backed by, for instance, 3DXP SSDs.
An mpool is accessed via the /dev/mpool/<mpool-name> device file and the
mpool descriptor attached to this device file instance tracks all its associated media
class volumes. mpool relies on device mapper to provide physical device aggregation
within a media class volume.