Here's the IRC log from today's Swift community meeting about SPDK, prompted by
Wewe's blobstore proposal. Thanks, Wewe!
For those not wanting to read through the transcript, my takeaways are:
* SSDs are not widely used in Swift deployments, mainly because of cost, so any SW
effort to optimize for flash is a relatively low priority for the Swift community
* Where SSDs are used (container storage), there are no significant performance
issues
That said:
* The Swift project technical lead (notmyname) is always looking for new ways to
differentiate and wants to keep an open dialogue, but is not willing to endorse
activities supporting this effort (i.e. merging any related code into Swift)
So at this point anyone is, of course, free to work on a proof of concept without
much help from the Swift folks and, given some compelling data, they'd surely
reconsider. I can't speak for the SPDK maintainers, but I think that without Swift
community interest, using Swift as a vehicle to introduce SPDK into a more
object-oriented system is probably not the best route to go.
Let me know if there are any questions, these guys are always willing to talk.
Thx
Paul
<notmyname> today peluse is back with us (yay) to talk about something he's been
working on
<peluse> rock n roll
<notmyname> peluse: take it away
<peluse> I'm thinking I should have typed some shit up in advance to avoid all
the typos I'm about to introduce :)
<peluse> anyways...
<peluse> http://spdk.io
<notmyname> #link http://spdk.io
<peluse> is the URL as I mentioned before. Quick high level overview then I'll
bring up a proposal someone in our community has made
<peluse> that we haven't spent a whole lot of time thinking about TBH
<notmyname> ok
<peluse> Also, here's a SNIA talk I did last month about SPDK in general and one
relevant component called blobstore
https://www.snia.org/sites/default/files/SDC/2017/presentations/Solid_Sta...
<peluse> So SPDK is a set of user space components that is all BSD licensed
<peluse> it's used in a whole bunch of ways but mainly by storage appliances to
optimize SSD performance in what swift would call the storage node
<peluse> FYI it's in Ceph already but not the default driver
<peluse> and when I say "it" I mean whatever component the system has
chosen to take on, in Ceph it's the user space polled mode NVMe driver
<peluse> there are some basic perf marketing hype type slides in that deck I pasted
in for anyone interested
<peluse> pretty huge gains when you consider latency and CPU sensitive apps running
with latest SSDs
<notmyname> so the basic idea is a fast/efficient way to talk to fast storage media
that might potentially be useful in swift's object server?
<peluse> anyway, the real trick is that it's all user space, direct access
to HW, no interrupts and no locking
<peluse> yup
<peluse> but there are a ton of components, well not a ton, but a bunch that would
not be relevant
<timburke> could it be useful for the account/container servers, too, or are we just
looking at object servers (and diskfile in particular)?
<notmyname> what are the integration points? I doubt it's as simple as mmaping a
file and you're done
<peluse> and some are libraries and some are applications.
<peluse> I think since it's SSD only (well not technically but it wouldn't make
sense to use on spinning media) most likely container
<rledisez> so we are talking about object servers on SSD. is it a real use case? (i
would think it's the target of ceph, very low latency)
<peluse> if you used object servers there are probably some limitations wrt what we
call blobstore
<peluse> I'll get to the integration question in a sec
<peluse> so, assuming a node takes on the user space NVMe driver and the driver
talks directly to HW you can see there's no kernel and no FS
<peluse> so... unless the storage application talks in blocks it doesn't make
much sense
<notmyname> ok
<peluse> blobstore is SPDK's answer to this but it's not a FS
<peluse> it's a super simple way for apps that don't talk blocks to use a
really simple file-ish, object-ish interface to take advantage of SPDK
<peluse> so for example, RocksDB
<peluse> in that slide deck I mention some work we did there to bolt blobstore up to
RocksDB as a back end
<notmyname> so ... as you know swift likes to be HW and driver agnostic. what does
this tie in to? is it possible to write stuff in a way that works if you have fast media
or not?
<peluse> it's that kind of idea that might make sense for Swift
* jungleboyj looks in late
<notmyname> or is the idea that swift would engage spdk mode if it detects flash?
<peluse> so there are lots of things that can be done there
<peluse> but yeah I think anything more aggressive than NVMe only would not be worth
it
<peluse> SPDK doesn't automatically do any of that kind of detection
<peluse> so that would have to be considered
<notmyname> that makes sense
<notmyname> I could imagine swift detecting that
<peluse> and blobstore itself is pretty immature, need to point that out. We just
now added code to recover from a dirty shutdown if that gives you an idea
<notmyname> ok, so tell me (us) more about the blobstore. would that be a diskfile
thing?
<peluse> so this whole thing would be a proof of concept type activity for sure
<notmyname> how does this make rledisez's LOSF work awesomer?
<peluse> so yeah, I think diskfile would make sense
<peluse> but I don't remember the details there of course. my brain is pretty
small :)
<peluse> In that slide deck you can see a super simple example of the interface
<peluse> blobstore basically takes over an entire disk, writes its own private
metadata and then the app creates "blobs" and does basic LBA sized reads and
writes to them
<notmyname> ah, ok
<peluse> it can't handle sub-LBA access (by design)
<peluse> well, we call them pages in blobstore but they're 4K
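
To make the page-granular model described above concrete, here is a minimal in-memory sketch. It is not the SPDK blobstore API (which is a C library); the class and method names (ToyBlobStore, create_blob, read, write) are hypothetical and only illustrate the constraint that blobs are read and written in whole 4K pages, with no sub-page access.

```python
PAGE_SIZE = 4096  # 4K pages, as mentioned in the discussion


class ToyBlobStore:
    """Hypothetical in-memory stand-in for a store that owns a whole device."""

    def __init__(self):
        self._blobs = {}   # blob id -> bytearray holding whole pages
        self._next_id = 0

    def create_blob(self, num_pages):
        """Allocate a zero-filled blob of num_pages pages and return its id."""
        blob_id = self._next_id
        self._next_id += 1
        self._blobs[blob_id] = bytearray(num_pages * PAGE_SIZE)
        return blob_id

    def _check_range(self, blob_id, page_offset, num_pages):
        size_pages = len(self._blobs[blob_id]) // PAGE_SIZE
        if page_offset < 0 or num_pages <= 0 or page_offset + num_pages > size_pages:
            raise ValueError("I/O outside blob bounds")

    def write(self, blob_id, page_offset, data):
        """Write whole pages starting at page_offset; partial pages are rejected."""
        if len(data) == 0 or len(data) % PAGE_SIZE != 0:
            raise ValueError("writes must be a whole number of pages")
        self._check_range(blob_id, page_offset, len(data) // PAGE_SIZE)
        start = page_offset * PAGE_SIZE
        self._blobs[blob_id][start:start + len(data)] = data

    def read(self, blob_id, page_offset, num_pages):
        """Read num_pages whole pages starting at page_offset."""
        self._check_range(blob_id, page_offset, num_pages)
        start = page_offset * PAGE_SIZE
        return bytes(self._blobs[blob_id][start:start + num_pages * PAGE_SIZE])
```

An application that stores small or oddly sized records would have to pad or pack them into pages itself, which is part of why the integration effort discussed below is not trivial.
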
<notmyname> that sounds like a haystack-in-a-library thing. or something similar to
what you're working on rledisez
<rledisez> yes, blobstore would be what we call volume. and I guess it embeds its own
k/v indexing. so it looks similar in some ways
<peluse> yeah, I think the integration effort w/Swift for production would be a
decent sized lift but for a POC may be worth it provided, maybe for container SSDs, the
latency and CPU usage benefit made sense
<notmyname> peluse: is there any spdk component that could replace sqlite? eg some
kv store that does transactions?
<notmyname> eg to replace the container layer
<peluse> rocksDB would be the closest match, using blobstore as a backing component
<peluse> but that's really what Wewe's proposal was - to add a k/v interface
on blobstore
<notmyname> ah ok. so a 3rd party db that works with spdk
<peluse> yeah, maybe that's the best first step
<notmyname> any questions from anyone, so far?
<peluse> I can't remember what sqlite guts look like, can you easily replace the
storage engine as its called in like MariaDB, anyone know?
<notmyname> no
<peluse> yeah, OK didn't think so
<notmyname> sqlite is "just" a DB library
<tdasilva> dumb question from me, but can you explain the difference between spdk and
the intel cas tech?
<notmyname> ^ not a dumb question
<peluse> sure, good question
<peluse> they are totally different for one thing
<peluse> CAS is a caching project/product that works between an app and the FS.
<peluse> SPDK is a whole bunch of stuff, but not caching layers. It has to be
integrated with an application unless you use one of the things like the compiled iSCSI
target
<peluse> dunno if that's enough explanation - block cache vs library of stuff
for integration, mainly polled mode device driver for NVMe
<peluse> so Q for you guys, is there any urgency with container SSDs and latency
and/or using a bunch of CPU?
<tdasilva> peluse, so spdk provides performance improvements by substituting the FS
and writing directly to block storage
<rledisez> do you handle caching in bdev or blobstore? or do you assume the
underlying device is fast enough
<peluse> tdasilva, yup
<peluse> rledisez, there's no data caching at all right now
<tdasilva> peluse: very similar to bluestore?
<peluse> bdev is a layer for abstracting different types of block devices. For
example we can have an NVMe at the bottom of the stack or a RAM disk and for layers above
bdev they don't care. it's super lightweight
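
The layering idea behind bdev can be sketched as a small block-device interface with interchangeable backends. This is an illustration only, not SPDK's bdev API: BlockDevice, RamDisk, FileDisk, copy_block and the 512-byte block size are all assumptions made for the sketch, and FileDisk is a temporary file standing in for real hardware.

```python
import tempfile
from abc import ABC, abstractmethod

BLOCK_SIZE = 512  # assumed block size for this sketch


class BlockDevice(ABC):
    """Hypothetical minimal interface that upper layers program against."""

    @abstractmethod
    def read_block(self, lba):
        """Return the BLOCK_SIZE bytes stored at logical block address lba."""

    @abstractmethod
    def write_block(self, lba, data):
        """Store exactly BLOCK_SIZE bytes at logical block address lba."""


class RamDisk(BlockDevice):
    """Backend that keeps all blocks in memory."""

    def __init__(self, num_blocks):
        self._buf = bytearray(num_blocks * BLOCK_SIZE)

    def read_block(self, lba):
        start = lba * BLOCK_SIZE
        return bytes(self._buf[start:start + BLOCK_SIZE])

    def write_block(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        start = lba * BLOCK_SIZE
        self._buf[start:start + BLOCK_SIZE] = data


class FileDisk(BlockDevice):
    """Backend backed by a temporary file (a stand-in for a real device)."""

    def __init__(self, num_blocks):
        self._file = tempfile.TemporaryFile()
        self._file.truncate(num_blocks * BLOCK_SIZE)

    def read_block(self, lba):
        self._file.seek(lba * BLOCK_SIZE)
        return self._file.read(BLOCK_SIZE)

    def write_block(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._file.seek(lba * BLOCK_SIZE)
        self._file.write(data)


def copy_block(dev, src_lba, dst_lba):
    """Upper-layer code: works the same whichever backend sits underneath."""
    dev.write_block(dst_lba, dev.read_block(src_lba))
```

The point of the sketch is only that copy_block never needs to know which backend it was handed.
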
<peluse> tdasilva, yeah, bluestore and blobstore are a lot alike but bluestore was
done of course just for Ceph and I think is more mature/feature rich right now
<peluse> but Sage mentioned in his keynote at SNIA SDC about looking at maybe using
rocksdb w/blobstore at some point in the future (don't quote me though)
<peluse> that would be in addition to bluestore as backing FS though, not instead of
<notmyname> peluse: what questions do you have for us?
<tdasilva> peluse: ack, thanks
<peluse> just the one above about pain points wrt latency and/or CPU utilization
around SSDs
<peluse> well, and if anyone is interested enough to work with someone from the SPDK
community to try and see if there's some sort of proof of concept worth messing with
here
<notmyname> only pain points I've seen recently with the container layer are
drive fullness and the container replicator not having all the goodness we've added to
the object replicator for when drives fill up
<notmyname> rledisez: how about you? any latency or cpu issues on containers or
accounts?
<rledisez> peluse: from my experience, there is not really a pain point about
storage speed on containers. having a lot of containers slows down some processes (like
replicator) as they need to scan all dbs. not sure yet if blobstore would help here
<peluse> when I say CPU util, there's more in that deck I referenced, using SPDK
(nvme + blobstore) greatly reduces CPU utilization while at the same time greatly
improving perf
<peluse> so you get kinda a two fer one thing
<peluse> so for containers you'll get more CPU available for other things
happening on the storage node, and the IOs will be faster and more responsive
<peluse> (or your money back)
<notmyname> heh
<rledisez> how can you measure that CPU usage related to kernel/fs? i don't
think i see any, but i would like to check
<rledisez> most of the cpu usage comes from replicator or container-server
<peluse> There's a perf blog on spdk.io that may have some good info in it,
honestly I haven't read it :(
<peluse> but we have some folks in our comm that live for that kinda stuff so I can
ask there and get back to y'all
<peluse> rledisez, yeah unless used for object storage wouldn't help
w/replicator
<rledisez> if you have a magic command to get the cpu usage i would be interested (i
guess it would be something related to perf command)
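
One rough way to approach the kernel/fs CPU question above is to compare user-mode and system-mode CPU time for a piece of work. The sketch below uses Python's standard resource module on a Unix-like host; cpu_split and fs_heavy_work are hypothetical helper names, and system time covers all kernel work done on behalf of the process, not only the filesystem, so treat it as an upper bound rather than a precise answer.

```python
import os
import resource
import tempfile


def cpu_split(func):
    """Run func() and return (user_seconds, system_seconds) it consumed."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    func()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def fs_heavy_work(num_writes=200):
    """Hypothetical workload: small synchronous writes through the filesystem."""
    with tempfile.TemporaryFile() as f:
        for _ in range(num_writes):
            f.write(b"\0" * 4096)
            f.flush()
            os.fsync(f.fileno())


if __name__ == "__main__":
    user, system = cpu_split(fs_heavy_work)
    print(f"user={user:.4f}s system={system:.4f}s")
```

A high system share relative to user time would suggest that kernel work such as syscalls and filesystem code is a meaningful part of the cost; for an already running daemon, the same user/system split is reported per process by the operating system.
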
<notmyname> honestly, spdk sounds really cool. it seems like something that would be
great for an all-flash future. (but I'm not sure if anyone deploying swift is there
yet)
<peluse> rledisez, yeah I dunno the details of the various measurements but the team
has looked at every metric known to man using a variety of tools
<notmyname> peluse: do you have people in the spdk community who are interested in
swift? if so, are they interested because they just want to integrate spdk everywhere or
because they are using swift already?
<peluse> Wewe is the only person I know that's brought it up and he wasn't
able to get connected today due to network issues
<peluse> right now there's more demand on features/integration than there is
anything else so I don't think the former is driving anyone
<notmyname> ok
<peluse> which is one of the reasons I wanted to chat w/you guys about this - if it
doesn't make a lot of sense to investigate from your perspective we certainly have
enough work on our plate :)
<peluse> that's all I got for ya, other questions?
<notmyname> I think it makes sense when looking a few years into the future and
preparing for that. it doesn't make sense in the sense that all of our current
employers have a huge amount of stuff we need to do in swift way before we get to needing
spdk
<peluse> yup yup
<notmyname> (my opinion)
<peluse> what is the current split of SSD usage, still mostly containers?
<notmyname> definitely something I want to keep an eye on
<notmyname> yeah
<peluse> cool
<notmyname> flash still too expensive for interesting-sized object server
deployments
<peluse> makes sense
<notmyname> people these days are going for bigger nodes. 80 10TB in a single
chassis
<notmyname> (and getting all the eww that implies)
<peluse> well, that's not to say nobody on this end will work on a proof of
concept anyways and if so I'll encourage them to check in with the Swift comm frequently of
course...
<rledisez> i like the idea, and we can surely share some stuff between
LOSF/blobstore but i think that people looking for really low latency object store will
check ceph as by its design/implem, it looks more suited
<notmyname> let's move on so we can give m_kazuhiro appropriate time :-)
<notmyname> peluse: that's great!
<peluse> thanks for the time guys!!
<notmyname> and thanks for stopping by to give an update
<peluse> my pleasure... ping me later if anyone has followup questions. take care!
<notmyname> rledisez: I can get you in contact with peluse if you can't find him
on IRC late