
Tips for adding large datasets into ipfs #212

whyrusleeping opened this issue Jan 7, 2017 · 8 comments

whyrusleeping (Member) commented Jan 7, 2017

Some of my thoughts on adding lots of data to ipfs:

go-ipfs is currently still alpha software. It is designed to handle absurdly
huge amounts of data across vast expanses of spacetime, but our current
implementation has its fair share of inefficiencies. This guide will serve as
a collection of optimization notes and best practices for efficiently storing
large amounts of information in ipfs.

Daemon Configuration

This section discusses configurations to apply before starting the process of
ingesting data into ipfs.

Set flatfs 'NoSync'

ipfs config --json Datastore.NoSync true

ipfs currently stores all data blocks in flat files on disk. There is still a long
way to go in optimizing this storage engine, but one quick optimization for
now is to disable some excess fsync calls made by the code. The drawback is
that if the machine running ipfs crashes unexpectedly (without proper disk
unmounting), some recently added data may be lost.

Disable Reproviding

ipfs config Reprovider.Interval "0"

By default, the ipfs daemon will announce all of its content to the dht once a
day. This works great for small to medium sized datasets, but for huge datasets
this becomes incredibly costly. Until we optimize the content routing system
(see: #162), it's best to disable this
feature.

Directory sharding

ipfs config --json Experimental.ShardingEnabled true

If your dataset contains huge directories (1k+ entries), sharding will enable ipfs to handle them better (without it, ipfs might hang or crash without any notice).

The Add Process

The primary way to get data into ipfs is through the ipfs add command.
There are a few optimizations and other things to note here that will aid in
getting data ingested efficiently.

'Local' adding

When content is added to ipfs in this way, we automatically start announcing
the content to the dht as it is added. For huge masses of data, we would prefer
not to do that given the cost. To avoid this, pass the --local flag when
invoking ipfs add. For example:

ipfs add -r --local /data/some_huge_dataset

Raw Leaves

All file data that goes into ipfs is broken into chunks and built into a
merkledag. Initially, the leaf nodes of the dag had some amount of framing.
Recently (still in master at time of writing, should ship in 0.4.5) we added an
option to ipfs add that allows us to create leaf data nodes without that framing.
This cuts off roughly 12 bytes per 256 KiB chunk, but the real benefit is
that it makes the blocks stored on disk evenly divisible by 4096, resulting in
fewer wasted disk blocks.

Example:

ipfs add --raw-leaves -r /data/some_huge_dataset

Breaking Up Adds

ipfs add calls are not currently interruptible; if something happens during the
add, you will have to restart from the beginning (though previously added
segments will progress much more quickly). To mitigate this risk, it is
generally advisable to add smaller amounts of data and then patch the pieces
together afterwards. This process might look something like this:

# add the individual pieces
$ ipfs add part1
QmPartOne
$ ipfs add part2
QmPartTwo
$ ipfs add part3
QmPartThree
# now patch them all together
$ ipfs object new unixfs-dir # get an empty directory
QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn
$ ipfs object patch QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn add-link part1 QmPartOne
QmStuff1
$ ipfs object patch QmStuff1 add-link part2 QmPartTwo
QmStuff2
$ ipfs object patch QmStuff2 add-link part3 QmPartThree
QmFinalDir
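
A similar result can also be achieved with the mutable files API ('mfs'), which later comments in this thread mention. A minimal sketch, assuming the parts were already added as above (the /dataset path is just an illustrative name):

# copy each previously added part into an mfs directory
$ ipfs files mkdir /dataset
$ ipfs files cp /ipfs/QmPartOne /dataset/part1
$ ipfs files cp /ipfs/QmPartTwo /dataset/part2
$ ipfs files cp /ipfs/QmPartThree /dataset/part3
# show the hash of the assembled directory
$ ipfs files stat /dataset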
ghost commented Jan 8, 2017

I like it 👍

Can we just move it to the docs/ directory in go-ipfs? I'm afraid here it'll get lost quickly.

victorb (Member) commented Jan 8, 2017

How about making a blogpost about it for extra discoverability?

victorb (Member) commented Apr 6, 2017

Not duplicating files when adding (experimental)

From: ipfs/kubo#3397 (comment)

Change the config to set Experimental.FilestoreEnabled to true

ipfs config --json Experimental.FilestoreEnabled true

Before adding files, create a new repository and run the daemon at the same level as the directory you want to add:

IPFS_PATH=$(pwd)/.ipfs ipfs daemon --init

How the directory hierarchy should look now:

..
.
./my-directory-to-be-added
./.ipfs # < this is your new IPFS repository
./.ipfs/blocks
./.ipfs/datastore
./.ipfs/keystore

Then when adding files, pass the --nocopy flag

IPFS_PATH=$(pwd)/.ipfs ipfs add -r --nocopy ./my-directory-to-be-added
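
As a follow-up sketch (assuming the same repository layout as above), the experimental filestore subcommands can be used to check what was added without copying:

# list blocks whose data is backed by the original files rather than copied into the blockstore
IPFS_PATH=$(pwd)/.ipfs ipfs filestore ls
# verify that the referenced files are still present and unchanged
IPFS_PATH=$(pwd)/.ipfs ipfs filestore verify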

ghost commented Apr 26, 2017

Fetching large files

Much of the opening comment applies, but you'll also want to start the receiving daemon with --routing=none and connect it to the transmitting daemon using ipfs swarm connect. This will keep the receiving daemon from sending out provider records for every new block it receives.
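
A minimal sketch of that setup; the multiaddr and hashes below are placeholders for the transmitting node and the dataset:

# on the receiving node: start the daemon without content routing
ipfs daemon --routing=none &
# connect directly to the transmitting daemon
ipfs swarm connect /ip4/203.0.113.7/tcp/4001/ipfs/QmTransmitterPeerID
# then fetch as usual
ipfs get QmSomeLargeDatasetHash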

hleb-albau commented Jan 17, 2019

Hi, I need to add about 1 billion tiny objects (about 100 bytes each) using IPFS and create a single CID so that other nodes can easily pin the whole set.
First, I tried creating a directory of files containing those objects and adding it recursively via the CLI. The main problem is that the minimum OS file block size is 4k, so for 1 billion objects I would need 4 TB of disk space.

Could you please advise me on how to load the objects in my case?

whyrusleeping (Member, Author) commented Jan 17, 2019 via email

I would make sure to use badger as the datastore. Then I would use 'ipfs files' to create the virtual directory one item at a time.

hleb-albau commented Jan 17, 2019

Hi, thanks for the fast reply.

One item at a time - do you mean using a single command to create the final unixfs folder containing all objects, or adding objects one by one to the unixfs folder? If the first, what command should I use?
Thanks!

whyrusleeping (Member, Author) commented

Take a look at ipfs files write; it should allow you to write data into ‘mfs’ one file at a time. Then, once you’re done, you can use ipfs files stat to find the hash of the resultant directory.
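
A minimal sketch of that workflow, combined with the badger suggestion above (the repository path and directory names are only placeholders):

# use a fresh repository with the badger datastore
export IPFS_PATH=$(pwd)/.ipfs
ipfs init --profile=badgerds
# write each tiny object into mfs, one file at a time
ipfs files mkdir /objects
for f in ./my-objects/*; do
  ipfs files write --create "/objects/$(basename "$f")" "$f"
done
# show the hash of the assembled directory
ipfs files stat /objects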

Also, I would ask people on IRC for faster help on this. It’s likely that I will miss a notification here and leave you hanging for a long time.
