Notes on building and manipulating large IPLD graphs in JavaScript and LevelDB. · Issue #1416 · ipfs/js-ipfs

Because I need to use the data, and as a good project to learn the
ecosystem, I took on the task of loading gharchive into IPFS/IPLD.

A few things about gharchive: it publishes the public GitHub event timeline as hourly gzipped JSON files, and each hourly file holds tens of thousands of event objects.

Because of what I need to do with the data, I need each event
object to live at three paths:

/repos/:owner/:repo/:year/:month/:day/:time
/actors/:login/:year/:month/:day/:time
/timestamps/:year/:month/:day/:hour/:minute/:hash

So an hour of data will translate into 60K-210K inserts deep
into the graph.
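
To make that layout concrete, here's a rough sketch of deriving the three paths for a single event. The field names (actor.login, repo.name, created_at) come from the GitHub events schema; the exact formatting of :time and :hash is an assumption, not the actual loader code.

// Rough sketch (not the actual loader code): derive the three paths for one
// gharchive event. Field names follow the GitHub events schema; the exact
// :time and :hash formatting is an assumption.
const pathsForEvent = (event, blockHash) => {
  const [owner, repo] = event.repo.name.split('/')
  const d = new Date(event.created_at)
  const year = d.getUTCFullYear()
  const month = d.getUTCMonth() + 1
  const day = d.getUTCDate()
  const hour = d.getUTCHours()
  const minute = d.getUTCMinutes()
  const time = `${hour}:${minute}:${d.getUTCSeconds()}`
  return [
    `/repos/${owner}/${repo}/${year}/${month}/${day}/${time}`,
    `/actors/${event.actor.login}/${year}/${month}/${day}/${time}`,
    `/timestamps/${year}/${month}/${day}/${hour}/${minute}/${blockHash}`
  ]
}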

Because I'll eventually want to pull parts of this data into the browser
I decided to compress the JSON objects with gzip. So, each event becomes a
raw node of gzipped json.
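
That step is just Node's built-in zlib; roughly:

// Each parsed event gets stringified and gzipped into a buffer that is
// then stored as a raw block.
const zlib = require('zlib')

const eventToRawBlock = (event) =>
  zlib.gzipSync(Buffer.from(JSON.stringify(event)))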

I started at the top of the IPFS stack and worked my way down as each layer
couldn't satisfy the performance requirements (eventually each layer would
take longer than an hour to insert an hour of data from the archive).

Before I explain the libraries and methods I finally ended up at, here's
the current performance while parsing through the 2018 archive on my laptop.

{ filename: '2018-01-07-14.json.gz' }
{ cid: 'zdpuAt6Hs61V1NPn7sudZeXmRsHC6RcKP7XSer4BcDvP7T5ta',
  events: 46886,
  walked: 106402,
  totalTime: '173s',
  processTime: '78s',
  graphBuildTime: '80s' }

Note: the times for "processing" vs. "graph build" are not entirely accurate since, as you'll read later, a lot of the pre-optimization for the graph-build phase happens during processing.

So, with 7 days of activity already in the graph, it takes 173 seconds to load an
hour of data. The increase in time is also leveling off as more data gets
inserted. This run did 153288 writes (46886 gzipped JSON blocks + 106402 new
dag-cbor nodes) and probably ~70K reads of dag-cbor nodes.

Performance

While there are some performance limitations of merkle graphs compared to
other common database structures there's one key advantage:
predictability.

Typically in a database you take the set of writes for a transaction and have
to commit them in a strict order, dependencies first; often you need
information returned by one write (an auto-generated ID, say) in order to
perform the next. But with a merkle graph you can compute the entire graph
before a single block is written to disk. This means that optimizing write
performance on disk comes down to writing batches of key/value pairs as fast
as the underlying store can handle.
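
As a minimal illustration of that property (using SHA-256 hex digests as stand-ins for real CIDs, purely to keep the sketch self-contained), a child and its parent can both be content-addressed in memory and only then handed to the store as a single batch:

// Minimal sketch: content-address a child and its parent entirely in memory,
// then write both in one batch. Plain SHA-256 hex stands in for real CIDs.
const crypto = require('crypto')

const address = (buf) => crypto.createHash('sha256').update(buf).digest('hex')

const child = Buffer.from(JSON.stringify({ hello: 'world' }))
const childKey = address(child)

// The parent links to the child by its hash, so the whole structure can be
// built and addressed before anything touches disk.
const parent = Buffer.from(JSON.stringify({ link: childKey }))
const parentKey = address(parent)

const batch = [
  { type: 'put', key: childKey, value: child },
  { type: 'put', key: parentKey, value: parent }
]
// db.batch(batch) can now flush both blocks in whatever order is fastest.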

The basic strategy I took for performance falls into a few categories.

ipld-store

I needed a new bulk interface closer to leveldb so I built a
simple key/value store for ipld.

The most important part here is the bulk interface it exposes. Rather
than wait to commit pending transactions until flush() is called, this
bulk interface is continuously writing a queue of transactions using the
level.batch() interface. I've spent enough time in the level ecosystem
to know that the batch interface is significantly more performant than trying
to send a lot of concurrent writes to leveldb. However, I still don't want
to queue too many transactions as Node.js will get very slow, so I have a 1GB
limit on the total size of the pending blocks. The bulk.put() interface
returns a promise that immediately resolves unless this limit is reached so
that a consumer can still implement some back-pressure.
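
Here's a hypothetical sketch of that kind of bulk writer; the names and the exact flushing policy are mine, not ipld-store's, and it assumes a level instance whose batch() returns a promise (recent versions of the level package do):

// Hypothetical sketch of a continuously-flushing bulk writer with a size cap.
class BulkWriter {
  constructor (db, limit = 1024 * 1024 * 1024) { // ~1GB cap on pending block data
    this.db = db
    this.limit = limit
    this.pending = []
    this.pendingSize = 0
    this.flushing = false
    this.draining = null
    this._drained = null
  }

  put (key, value) {
    this.pending.push({ type: 'put', key, value })
    this.pendingSize += value.length
    this._flush()
    // Resolve immediately unless the cap is hit, so callers only apply
    // back-pressure when they actually need to.
    if (this.pendingSize < this.limit) return Promise.resolve()
    if (!this.draining) {
      this.draining = new Promise(resolve => { this._drained = resolve })
    }
    return this.draining
  }

  async _flush () {
    if (this.flushing) return
    this.flushing = true
    while (this.pending.length) {
      const ops = this.pending
      this.pending = []
      this.pendingSize = 0
      await this.db.batch(ops) // batch() is far faster than many concurrent puts
    }
    this.flushing = false
    if (this._drained) {
      this._drained()
      this.draining = this._drained = null
    }
  }
}

A consumer then just awaits put(key, block) in its loop; the await is effectively free until the cap is reached.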

ipld-complex-graph-builder

I really liked js-ipld-graph-builder but I needed something
a little better suited to working with such an incredibly large and
complicated graph, and I needed something that would take advantage of
the ipld-store's bulk interface.

It currently only supports inserts, cause that's all I need right now :)

All inserts into the graph have their block sent immediately to the bulk
writer and the CID for the link put into a nested Map object for the graph build later on.

Additionally, all the reads the graph build will need later, on flush(),
are "primed": by the time the build actually happens, every dag-cbor node
it needs to read is already sitting in a Set() as a promise, either resolved
or still pending (if the read hasn't returned yet).
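
An illustrative sketch of that shape (names are mine, not the ipld-complex-graph-builder API; resolvePath() is an assumed helper that walks dag-cbor nodes from a root CID):

// Illustrative sketch only, not the actual ipld-complex-graph-builder internals.
class GraphBuilder {
  constructor (store, root) {
    this.store = store        // exposes bulk.put() and block reads
    this.root = root          // CID of the last flushed root, if any
    this.staged = new Map()   // nested Map: path segment -> Map ... -> leaf CID
    this.primed = new Map()   // intermediate path -> Promise of its dag-cbor node
  }

  insert (path, cid, block) {
    // The raw block goes straight to the bulk writer; only the link is staged.
    this.store.bulk.put(cid.toString(), block)

    const segments = path.split('/').filter(Boolean)
    const leaf = segments.pop()
    let node = this.staged
    for (const segment of segments) {
      if (!node.has(segment)) node.set(segment, new Map())
      node = node.get(segment)
    }
    node.set(leaf, cid)

    // "Prime" the read of the existing intermediate node this path touches,
    // so the promise is already resolving (or resolved) by flush() time.
    const parent = segments.join('/')
    if (this.root && !this.primed.has(parent)) {
      this.primed.set(parent, resolvePath(this.store, this.root, parent).catch(() => null))
    }
  }
}
// resolvePath(store, rootCid, path) is assumed to walk dag-cbor nodes from the
// root and resolve to the node at `path` (or reject if it doesn't exist yet).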

I also needed to find an efficient way to shard a few parts of the graph.

Rather than using hamt, which would require occasional re-balancing, I
went with a "static" approach. I already have a rough idea of the total
number of keys I'll need at a few layers (a couple million actors and owners),
so I can just create a two-level shard up-front using simple numeric hashing
(sketched after the path layout below).

The final structure means, rather than the previous paths, you end up with
this:

/repos/:shard1/:shard2/:owner/:repo/:year/:month/:day/:time
/actors/:shard1/:shard2/:login/:year/:month/:day/:time
/timestamps/:year/:month/:day/:hour/:minute/:hash
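
A minimal sketch of that kind of static two-level shard; the bucket count and hash function are assumptions, since the exact parameters aren't given here:

// Minimal sketch of a static two-level shard using simple numeric hashing.
// Bucket count and hash function are assumptions, not the actual values.
const BUCKETS = 256

const numericHash = (str) => {
  let h = 0
  for (let i = 0; i < str.length; i++) {
    h = (h * 31 + str.charCodeAt(i)) >>> 0   // cheap, stable 32-bit hash
  }
  return h
}

const shard = (key) => {
  const h = numericHash(key)
  const shard1 = h % BUCKETS
  const shard2 = Math.floor(h / BUCKETS) % BUCKETS
  return [shard1.toString(), shard2.toString()]
}

// e.g. /actors/:shard1/:shard2/:login/...
const [s1, s2] = shard('mikeal')
const path = `/actors/${s1}/${s2}/mikeal/2018/01/07/14:00:00`

With 256 buckets at each level that's ~65K leaf buckets, so a couple million keys work out to a few dozen entries per bucket without any re-balancing.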

Additionally, the occasional edge node ends up being larger than the current
serializers can handle. This is rare and the node doesn't mutate, so hamt
would be overkill, and there isn't a simple off-the-shelf way to support it
with dag-cbor yet. This is my current workaround, which transparently breaks
up and re-assembles nodes, limiting each to 500 keys. It's not ideal and it
complicates the block writer a bit, but it works fine for now and doesn't
break any of the linking, so the whole graph is still visible/pinnable by any
other IPLD implementation (IPFS).
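
A sketch of that split/re-assemble idea under the same 500-key limit; the shape of the wrapper node and the helper names are assumptions, not the linked workaround itself:

// Hypothetical sketch: break a node with too many keys into chunks of 500,
// store each chunk as its own node, and leave a small wrapper node that links
// to the chunks so other IPLD implementations can still walk the graph.
const LIMIT = 500

const splitNode = async (node, putNode /* (obj) -> Promise<CID> */) => {
  const keys = Object.keys(node).sort()
  if (keys.length <= LIMIT) return node

  const wrapper = { _chunked: [] }      // wrapper shape is an assumption
  for (let i = 0; i < keys.length; i += LIMIT) {
    const chunk = {}
    for (const k of keys.slice(i, i + LIMIT)) chunk[k] = node[k]
    wrapper._chunked.push(await putNode(chunk))   // CID link to the chunk
  }
  return wrapper
}

const assembleNode = async (wrapper, getNode /* (cid) -> Promise<obj> */) => {
  if (!wrapper._chunked) return wrapper
  const node = {}
  for (const cid of wrapper._chunked) Object.assign(node, await getNode(cid))
  return node
}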

Conclusions

I found this experiment incredibly validating of IPLD and the
primitives we're using. Without using any of the current implementations,
without any code from IPFS I was able to build a fully compatible graph.
Once I write a small library to share the ipld-store on bitswap I can
tell IPFS to pin it.

However, messaging is sort of a mess. IPFS has some history now, and the
older something is in this ecosystem, the faster you find it. I found the old
files API before MFS, the object API before the dag API, and dag-pb before
dag-cbor. Each of these translated into time spent using an interface or
library I would eventually discard.

I'll be writing and thinking about this a lot more in the coming weeks, but
I'm starting to wonder if the concepts and features we present that are
familiar to users (like MFS) are actually making it more difficult for
people to learn our stack in the long run.

Most developers are used to client/server development. There are a lot of
assumptions they carry around as part of that which have to be unwound in
order to write decentralized applications. The more familiar the interface we
present, the less their mental model adapts as they use it.

I end up talking to developers all the time who have these assumptions.
In the coming months, as I talk to them about a decentralized web, I'm going
to do it through IPLD/CID and see if it clicks a little better than a
filesystem metaphor.

I don't think concepts like merkle trees are actually more difficult to
understand than many of the concepts developers have to learn for
client/server development; they're just different, and information about them
is less abundant.

I also think there are a lot of use cases these concepts solve, like offline
support, that we can present for people to learn without needing to also
explain the whole p2p stack.


You can see the final code for pulling the data out of gharchive and
building the graph here.