
Bacalhau project report 20220823


⚡️⚡️⚡️ 1000 node cluster with performance wins ⚡️⚡️⚡️

(Video: 1000-nodes-bacalhau-omg.mp4)

As shown in this video, Bacalhau can now run on 1000 nodes and execute jobs with just 200ms of overhead versus running the same job natively in Docker, all with significantly improved CPU usage and dramatically reduced memory usage.

This compares to last month's limit of 750 nodes, which came with very high latencies and instability at that scale.

Hashing algorithm to reduce job scheduling from quadratic to linear in cluster size

A series of changes was needed to achieve these new scale limits. The most interesting and important one is a change to how we decide which nodes bid on a job. We randomly distribute node IDs in a hash space, map job IDs onto the same hash space, and then calculate the distance from each node to the job, dividing the space up into a number of chunks determined by the number of nodes in the cluster and the number of nodes that we want to bid on a job. With n nodes and a target of k bidders, splitting the space into n/k chunks means the expected number of nodes falling within the first chunk is exactly k, the constant number of nodes that we want to bid on the job. It's described in more detail in this well-commented function:

(Screenshot: the well-commented bidding function)
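For readers without the screenshot, here is a rough sketch of the idea in Go. This is an illustrative reconstruction, not the actual Bacalhau function: the 32-bit hash space, the SHA-256 hashing, and the function and parameter names are all assumptions made for the sketch.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashToSpace maps an arbitrary ID into a 32-bit hash space.
func hashToSpace(id string) uint32 {
	sum := sha256.Sum256([]byte(id))
	return binary.BigEndian.Uint32(sum[:4])
}

// shouldBid reports whether this node falls into the first chunk of the hash
// space around the job, where the chunk size is chosen so that, on average,
// exactly targetBidders of the clusterSize nodes land inside it.
func shouldBid(nodeID, jobID string, clusterSize, targetBidders int) bool {
	nodeHash := hashToSpace(nodeID)
	jobHash := hashToSpace(jobID)

	// Distance from the job to the node in the wrapping (ring) hash space.
	distance := nodeHash - jobHash // unsigned subtraction wraps around

	// Divide the space into clusterSize/targetBidders chunks, so the expected
	// number of nodes in the first chunk is targetBidders.
	numChunks := clusterSize / targetBidders
	if numChunks < 1 {
		numChunks = 1
	}
	chunkSize := ^uint32(0) / uint32(numChunks) // ^uint32(0) is the top of the 32-bit space

	return distance < chunkSize
}

func main() {
	// Hypothetical IDs, purely for illustration.
	fmt.Println(shouldBid("node-42", "QmExampleJobID", 1000, 3))
}
```

Because every node can evaluate this check locally from the job ID alone, no extra coordination messages are needed to pick the bidders.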

We then never bid on a job if the delay (derived from this hash distance) is above a certain threshold. The key observation is that the number of messages sent on the libp2p network per job is no longer O(size of network) but O(1). So while the overall CPU needed to run a larger libp2p network is higher, the CPU used to schedule jobs is no longer quadratic, O(number of nodes * number of messages), but just linear in the number of nodes, O(number of nodes * constant).

Understanding memory and CPU usage

Go's built-in pprof tooling came in very handy for the profiling and optimization work. For example, we were able to track down where the memory usage was going:

(Screenshot: pprof memory profile showing where memory was going)

And reduce it:

(Screenshot: memory profile after the first round of reductions)

And then finally by implementing the hashing algorithm above, reduce it even further:

(Screenshot: memory profile after implementing the hashing algorithm)
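For reference, exposing pprof in a long-running Go service only takes a few lines; this sketch shows one common way to do it (the localhost:6060 port and the wiring are illustrative choices, not how Bacalhau itself is configured):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a local port in the background.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service would run here ...
	select {}
}
```

Heap and CPU profiles can then be pulled with standard commands such as `go tool pprof http://localhost:6060/debug/pprof/heap`.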

We also did some manual timing work to look at reducing actual job latencies:

(Screenshot: manual timing measurements of job latency)
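The timing itself was nothing sophisticated; in Go it amounts to something like the following trivial sketch, with runStage standing in for whichever stage of job execution is being measured:

```go
package main

import (
	"log"
	"time"
)

// runStage is a stand-in for the stage of job execution being measured.
func runStage() {
	time.Sleep(50 * time.Millisecond)
}

func main() {
	start := time.Now()
	runStage()
	log.Printf("stage took %s", time.Since(start))
}
```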

And we found lots of mutex contention using a mutex logging library, which logs whenever a mutex is waited for, or held, for longer than a given interval.

This allowed us to narrow down and fix all sorts of subtle mutex bugs, like this one:

(Screenshot: the offending locking code)

(spot the bug!)
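The mutex logging technique itself is simple to sketch: wrap sync.Mutex so that acquisition wait time and hold time are measured, and log when either exceeds a threshold. This is an illustrative wrapper, not the specific library used here:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// loggedMutex wraps sync.Mutex and logs whenever the lock is waited for, or
// held, longer than threshold. Illustrative only.
type loggedMutex struct {
	mu        sync.Mutex
	name      string
	threshold time.Duration
	lockedAt  time.Time
}

func (m *loggedMutex) Lock() {
	start := time.Now()
	m.mu.Lock()
	if wait := time.Since(start); wait > m.threshold {
		log.Printf("mutex %q: waited %s to acquire", m.name, wait)
	}
	m.lockedAt = time.Now()
}

func (m *loggedMutex) Unlock() {
	if held := time.Since(m.lockedAt); held > m.threshold {
		log.Printf("mutex %q: held for %s", m.name, held)
	}
	m.mu.Unlock()
}

func main() {
	m := &loggedMutex{name: "scheduler", threshold: 10 * time.Millisecond}
	m.Lock()
	time.Sleep(20 * time.Millisecond) // simulate a slow critical section
	m.Unlock()                        // logs that the mutex was held too long
}
```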

All of which gets us down to few-second latencies in 1000-node clusters and, in the best case, just 200ms of latency over the native docker run latency :-)

Filecoin support!

Kai has finished splitting the publisher interface out from the verifier interface (a rough sketch of the publisher side follows the list below):

  • verification
    • compute nodes submit their "proposals" to be verified
    • the requester node collates all proposals
    • it triggers the verification when enough have been collected
    • accepted and rejected results are broadcast to the network
  • publishing
    • the compute node has a collection of "publish" drivers
    • once accepted and rejected events are seen, the publisher is called
    • the first driver publishes to IPFS
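As promised above, here is a rough sketch of what a publisher driver looks like. The method names and signatures are illustrative assumptions, not the exact interface in the Bacalhau codebase:

```go
package publisher

import "context"

// Publisher is an illustrative version of a "publish" driver: given the
// verified results of one shard of a job, it publishes them somewhere
// (IPFS, Filecoin via lotus, estuary, ...) and returns a reference to the
// published data.
type Publisher interface {
	// IsInstalled reports whether the driver's backing service is reachable.
	IsInstalled(ctx context.Context) (bool, error)

	// PublishShardResult publishes the local results folder for one shard of
	// a job and returns a content reference (e.g. a CID) for the result.
	PublishShardResult(ctx context.Context, jobID string, shardIndex int, resultsPath string) (string, error)
}
```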

And we are now writing a Filecoin lotus publisher as well as an estuary publisher. We're on track to have publishing to Filecoin done by the end of the month.

Changes to the plan, switching DAGs for Filecoin+

Rather than working on DAGs next month, we're planning instead to work on Filecoin+.

We plan to provide an API for attesting that certain jobs ran on CIDs that are Filecoin+ verified (by humans, as being datasets that are valuable to humanity). We will then publish a list of CIDs, along with the jobspecs that produced them, which are derivative datasets (results of Bacalhau jobs) of those Filecoin+ datasets and thus candidates to themselves be deemed valuable to humanity. Hooking this all up will incentivise SPs (storage providers) to run Bacalhau.
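As a rough indication of the data we expect to publish, one entry in that list might look something like the following; this API has not been designed yet, so the package, type, and field names here are all hypothetical:

```go
package filecoinplus // hypothetical package name

// DerivativeDatasetAttestation is a hypothetical shape for one entry in the
// published list of derivative datasets; none of these field names are final.
type DerivativeDatasetAttestation struct {
	SourceCID  string `json:"source_cid"`  // Filecoin+ verified input dataset
	ResultCID  string `json:"result_cid"`  // derivative dataset produced by the job
	JobSpec    string `json:"job_spec"`    // the Bacalhau jobspec that produced the result
	AttestedAt string `json:"attested_at"` // when the run was attested via the API
}
```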

CoD WG meeting - Bacalhau as a toolkit for makers, and sharding demo

We also presented Bacalhau in the CoD working group meeting. Skip to 28 minutes in this video to see an overview of the project, in particular with reference to how we are looking to build a toolkit for makers in the distributed compute space to reuse, much like libp2p is reused in many projects now.

There's also a nice demo, at 34 minutes in the same video, of running video processing locally and then in parallel on the production Bacalhau network, showing the performance improvements gained through parallelism.

Team growth

We have secured another two developers to start work in September, and another two to start in January!

What's next?

  • Finish Filecoin (lotus + estuary) publishing support
  • Work on verification towards our 10% BFT target
  • Work on Filecoin+