This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

IPFS and Gentoo Portage (distfiles) #296

Open

NiKiZe opened this issue Jul 21, 2018 · 23 comments

Comments

@NiKiZe

NiKiZe commented Jul 21, 2018

Just as with #84, Gentoo could use the same concepts. Creating this issue to inform about and track progress.

Relevant Gentoo forum thread
Some info about Gentoo distfile mirrors: https://wiki.gentoo.org/wiki/Project:Infrastructure/Mirrors/Source
I don't know yet, but I expect everything to be around 400-500 GB of data.

Currently working on creating a Gentoo distfiles mirror, using similar concepts to VictorBjelkholm/arch-mirror.

Updates will follow once I have succeeded with the initial sync and started testing.
Current WIP repo NiKiZe/Gentoo-distfiles-IPFS

@NiKiZe
Author

NiKiZe commented Jul 22, 2018

A few things that I have hit so far:

  • Using ipfs add -w -r --nocopy --local ${DSTBASE}/* takes a long time for ~400GB of data. What is the best way to do this? The best would be to have files added during the rsync process; alternatively, a flag that reuses existing hashes (--assume-clean-if-exists, which would work with --nocopy). There are a few small files (timestamp files and symlinks) that do get updated, so those need handling. (A rough sketch of the incremental idea follows after this list.)
  • The generated directories are large and don't work, example:
    QmY6jEuqY3U5N9yxeBs6uoM63QnV2gjewXop9U4cQKauX8 has a distfiles directory with hash QmWvMmUjrH6neokv6RTqdgaYZSgxzr6hrjqaHosnDh6o7C which is 4821289 bytes in one block; I have republished that raw block as QmNzePsWN7kCF9fvbFBAnoVERWbCiVFjVEmr5EFqXFLtxL. Another example is QmfSAWwPTx5Kmre4bDF3zbkDuYXeNo9X7eTLvRUxyfr52B which can be found in QmPctnSoGCY8j19Lhxez3wqH2Ld55DzSqjnrbvNUuxoYgw. Using sharding as described in Tips for adding large datasets into ipfs #212 might solve this, but the client should still give an error (or at least a warning) when this happens.
    I guess the size is the issue here? But should go-ipfs allow creating such large blocks? (It would be better for it to throw an error until there is a fix.) I had a feeling an issue about this must already exist, but I can't find a good match (that is still open), so I created Adding large blocks/directories (4MiB+) should cause error kubo#5282
  • If I would like to create a file list for publishing (to get actual file hashes), what is the best way? The output from the add command is one alternative, but is there any way to re-display those hashes in a filestore other than rerunning a full add? (The idea is that it could be used as a file-to-hash mapping by clients.) Something similar to filestore ls --file-order | grep " 0$", but returning the file hash instead of the hash of the first block. A recursive option for ipfs ls would also be useful.
  • Symlinks are not working. This might not be major, but it is still worth a mention. Also documented at Stop rewriting symlinks victorb/arch-mirror#1
  • While running the above ipfs add, the node seems to be unresponsive
  • "Error: cannot manually publish while IPNS is mounted" - it is not possible to do ipfs name publish, which removes the possibility of running the mirror and using it on the same machine. (disable 'ipfs name publish' while /ipns is mounted kubo#964 (comment))

@lidel
Member

lidel commented Jul 25, 2018

Hi, thank you for pushing this! 👍

  • FYI go-ipfs 0.4.17-rc1 fixes some issues with sharding, give it a try!
    • it also ships with an experiment called URLStore, which (if I understand it correctly) may be useful if you already have HTTP-based data repositories.
  • Tips for adding large datasets into ipfs #212

@NiKiZe
Author

NiKiZe commented Jul 25, 2018

Please remember to read the below as things to make something awesome even better

FYI go-ipfs 0.4.17-rc1 fixes some issues with sharding, give it a try!

Still using 0.4.16, so it works with sharding locally, but the gateways are currently "broken" (ipfs/kubo#5270) - the main issue I had here was that ipfs add allows creating a directory block that is larger than 4 MiB and is then inaccessible by other nodes. This should probably cause an error instead.

Yes, URLStore seems interesting indeed; I just need to hook it into the existing sync process somehow.
Another interesting question is whether shards can be used/created on different servers but all end up in one large browsable directory listing.

Most of #212 is followed so far. I'm still having major issues, however, with files that are deleted, or even more so with symlinks that are modified (this is due to using filestore).

It's up and "running" on /ipns/QmescA7sGoc4yZEe3Gof7dYt2qkkxDEXQPT2z84MpjVu8o/
In total ~485GB of data.
Running add on the full set takes 1-2 hours (this is needed since rsync is used as the source).
ipfs filestore verify has been running for 12h+ and is not done.
Either filestore needs an update here, or maybe there will be something available in ipfs files/mfs
(a writable fuse mount that rsync writes directly to, with all changes published directly to the configured IPNS).

One observation while exploring ipfs is that there are multiple commands that do almost the same thing; this is confusing to a degree, but even more so a time thief.

@Stebalien
Member

Still using 0.4.16, so it works with sharding locally, but the gateways are currently "broken" - the main issue I had here was that ipfs add allows creating a directory block that is larger than 4 MiB and is then inaccessible by other nodes. This should probably cause an error instead.

You are absolutely correct. Please file a bug in go-ipfs.

ipfs filestore verify has been running for 12h+ and is not done.

Please file a bug in go-ipfs.

One observation while exploring ipfs is that there are multiple commands that do almost the same thing; this is confusing to a degree, but even more so a time thief.

That often happens due to backwards-compatibility concerns. Eventually, we'd like to release a new command with an entirely new, thought-out API (and make that the default for the 1.0 release). The current thinking is to make everything use something like "mfs". This should significantly reduce confusion, as all files will get names and can be managed through a file system.

@NiKiZe
Author

NiKiZe commented Jul 25, 2018

Really thank you for your feedback!

ipfs filestore verify has been running for 12h+ and is not done.

Please file a bug in go-ipfs.

It simply takes time because it is a huge dataset (I will try to collect actual figures), and I think that ipfs/kubo#4260 (comment), which already mentions --overwrite and rm, is appropriate for this, since we just need something smarter to handle changes in the filestore.

Thinking about it right now, I could use an ipfs filestore verify <filename>; this would be called with all files that were modified in the last x hours. But that is also just a hack around the actual issue of keeping large filestores synced (and might be a bad feature to have).

OK, an update:
(time (ipfs filestore verify --local --file-order | grep -v ^ok)) now takes around 1 hour, so that is acceptable, but still high when it is run in conjunction with the add as well.

@Stebalien
Member

It simply takes time because it is a huge dataset

I'd expect it to take time, but it shouldn't be slower than an add. Could you try ipfs/kubo#5286? Is that any faster? Note: It may actually be slower as I haven't benchmarked it.

@NiKiZe
Author

NiKiZe commented Jul 26, 2018

Started update at Thu Jul 26 06:57:20 CEST 2018
#rsync runs here
End: Thu Jul 26 06:58:15 CEST 2018 - spent on stage 0:00:55
#another rsync
Delete dl done: Thu Jul 26 06:58:16 CEST 2018 - spent on stage 0:00:01
ipfs filestore verify --local --file-order | grep -v ^ok
verify done: Thu Jul 26 07:49:53 CEST 2018 - spent on stage 0:51:43
ipfs repo gc
gc done: Thu Jul 26 07:50:14 CEST 2018 - spent on stage 0:00:11
ipfs add QmPrbM7rqrUVbf9Guhdhs2yo6WRAGrbE21N75RyttfyQWY done: Thu Jul 26 08:54:55 CEST 2018 - spent on stage 1:04:41

With verify --file-order this is on the same level in time as the add.
I think a big reason for it being slow without --file-order is BTRFS, which is used on these disks (it's just horrible with random seeks).
I must also add that I have not done controlled tests here, just "random trying to get the mirror up and running".
I will try out that branch and see how it behaves, but I will need a better test case for that to be useful.

One weird thing IMHO: old versions of the files, those that have changed or are missing, do not get removed by the verify command (logs of the verify command can be found at QmZ35fVbUUMoTcz5a24f17qHvguZVwBiZ5nxz3pUwnhRjq). When I run verify again I get the same output; isn't it supposed to remove those links?
This also causes the daemon to spew

16:57:42.928 ERROR engine: tried to execute a task and errored fetching block: data in file did not match. gentoo-distfiles/releases/ia64/autobuilds/latest-install-ia64-minimal.txt offset 0 engine.go:141

and similar lines

@NiKiZe
Author

NiKiZe commented Jul 26, 2018

Just a quick note about the time taken for verify:
with 0.4.15, running with --file-order, it is done in around 56 minutes (guesstimating 70-120 MB/s disk usage);
with 0.4.15, running without --file-order, I canceled after 2 hours 30 minutes (guesstimating ~16-30 MB/s disk usage);
after the 6-to-7 repo migration I'm now running the branch commit 23f5cd4f0, which was done after 3 hours 36 minutes (seeing 25-33 MB/s disk usage). Will leave it running overnight.

But the whole reason to run verify was to clean out deleted files and/or files whose contents have changed (timestamps and symlinks), which it seems verify does not do, so I have misunderstood how to deal with this. (For this to be viable, the ipfs filestore clean mentioned in ipfs/kubo#4260 (comment) is what we/I want.)

@NiKiZe
Author

NiKiZe commented Jul 29, 2018

A smallish update.
Just using add with --nocopy on an rsynced directory is currently not an option.
Among other things, it has the issue of not being able to list files, and it is too slow (with verify it takes at least 2 hours).
However, there is mfs, which handles this quite nicely.

I have rewritten the sync script to loop over recently modified files, add them one by one with --nocopy, and then update the relevant mfs nodes.
The next stage is to find all removed files; this is done by a separate rsync that gives a list of deleted files, and then a similar loop is done over these, removing the files from mfs.
NiKiZe/Gentoo-distfiles-IPFS@bc71758
With this I'm down to 2-3 minutes for doing a normal update of the repo mirror.

Since everything is now in mfs, it is easy to get file lists as well.
Now ipfs files ls just needs a --recursive option for that file list to be easily created (but I can probably hack that together in bash as well; a rough sketch of both loops follows below).
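
A rough sketch of the loops described above; the paths, the three-hour window, the deleted-files list and the recursive-ls helper are my own assumptions, not the actual script (which lives in NiKiZe/Gentoo-distfiles-IPFS):

DSTBASE=/srv/mirror/gentoo-distfiles
MFSROOT=/gentoo-distfiles

# 1) re-add recently modified files and splice them into mfs
find "$DSTBASE" -type f -mmin -180 | while read -r f; do
  rel=${f#"$DSTBASE"/}
  hash=$(ipfs add --nocopy -Q "$f")
  dir=$(dirname "$rel")
  [ "$dir" = "." ] || ipfs files mkdir -p "$MFSROOT/$dir"
  ipfs files rm "$MFSROOT/$rel" 2>/dev/null || true   # drop the old entry if present
  ipfs files cp "/ipfs/$hash" "$MFSROOT/$rel"
done

# 2) remove files that rsync reported as deleted (one relative path per line)
while read -r rel; do
  ipfs files rm "$MFSROOT/$rel"
done < deleted-files.txt

# 3) poor man's recursive "ipfs files ls": print "<hash> <path>" for every file
lsr() {
  ipfs files ls "$1" | while read -r entry; do
    p="$1/$entry"
    if ipfs files stat --format '<type>' "$p" | grep -q directory; then
      lsr "$p"
    else
      echo "$(ipfs files stat --hash "$p") $p"
    fi
  done
}
lsr "$MFSROOT" > filelist.txt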

@Stebalien
Member

after the 6-to-7 repo migration I'm now running the branch commit 23f5cd4f0, which was done after 3 hours 36 minutes (seeing 25-33 MB/s disk usage). Will leave it running overnight.

Are you saying it regressed in 0.4.16?

@NiKiZe
Author

NiKiZe commented Jul 30, 2018

running the branch commit 23f5cd4f0, which was done after 3 hours 36 minutes

Are you saying it regressed in 0.4.16?

The test was done with commit 23f5cd4f0, in PR ipfs/kubo#5286.
And I guess that it slowed down mainly due to the file-sort change (that is discussed in the PR).
I'm happy to run more benchmark tests on it, but preferably with the file-sort changes reverted.

@jcaesar

jcaesar commented Sep 25, 2018

I've been working on something in this area; it's currently coming up here.

The initial add is rather slow and takes days, which is probably caused by the fact that I'm using a 1TB storage instance from time4vps. A recheck with no new files added can complete in seconds, since all it does is list the folder structure in mfs and compare it against the files existing on disk based on file change time. (I would have liked to use xattrs, but nothing like that is supported there: Linux 2.6.32-042stab133.2 and all other kinds of weirdness.)

So all in all: this might work, but it likely requires somewhat better hardware.

@jcaesar

jcaesar commented Sep 27, 2020

(Just played around a bit with mirroring again. 1 2 3)

@jcaesar

jcaesar commented Oct 13, 2020

Okay, so I am able to put all the Gentoo distfiles onto IPFS in about half a day and even have a way to deal with the symlinks (not that that is necessary for a usable mirror).
But essentially, resolution wouldn't work unless I force-connect to or ping the node. aschmahmann finally explained to me why:
the mirror consists of roughly 2.2 million blocks. Providing takes about 60 seconds per block and has to be done at least once per day. One can't provide 1500 blocks in parallel.

This could be optimized by increasing the block size to the maximum of 1 MB; that would leave a bit more than 400,000 blocks for the 400 GB. That's still too much.
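
A back-of-envelope check of those figures (my own arithmetic, not from the thread): at ~60 s per provide, keeping every block announced once per day needs roughly blocks * 60 / 86400 concurrent provide operations.

echo "2.2M blocks: $((2200000 * 60 / 86400)) concurrent provides"   # ~1527
echo "400k blocks: $(( 400000 * 60 / 86400)) concurrent provides"   # ~277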

One further step of optimization would be to not provide all the blocks, but only provide files and folders. That might infrequently lead to situations where one has the root block of a file but not the actual content blocks, and can't find them either. But I suppose that would be rare. The problem is: there are 66941 files and 542 folders (today).
"That's not impossible, but it's definitely not happening by default." -- aschmahmann

@NiKiZe
Author

NiKiZe commented Oct 13, 2020

I think it is important not to forget to try to use standard settings if possible, to reuse blocks between distros/mirrors.

@aschmahmann

Problem is: There are 66941 files and 542 folders (today). "That's not impossible, but it's definitely not happening by default." -- aschmahmann

To clarify - given the large number of files and the request that they each be individually findable (i.e. finding the data as /ipfs/QmFile instead of as /ipfs/QmRoot/folder1/folder2/file), you're going to have trouble in the short run if you just leave everything at the default.

Some things you can do about this now:

  • If you have an application that is involved in this (e.g. some Gentoo package manager), then you can set up the application to try to find other relevant nodes (e.g. it could use the DHT to find "people who have Gentoo data" instead of people who have a given CID - this is how IPNS over PubSub works).
  • Build your own massively parallel provider (a rough sketch follows after this list)
  • Decrease the number of "hooks" (i.e. user footholds into the data, such as QmRoot above) required by your dataset
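
A minimal sketch of the "massively parallel provider" idea on a stock go-ipfs node of that era; QmRootCID is a placeholder and the parallelism factor is arbitrary:

# enumerate every block under the root and announce them 32 at a time
ROOT=QmRootCID   # placeholder: root CID of the dataset
ipfs refs -r --unique "$ROOT" | xargs -P 32 -n 1 ipfs dht provide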

This problem (making huge numbers of files accessible over the network) is IMO a pretty important one to deal with and has a lot of moving parts. I'm hoping go-ipfs will make some progress here next year, but given some of the complexities we'll just have to see how it goes 😃.

@likewhoa

Problem is: There are 66941 files and 542 folders (today). "That's not impossible, but it's definitely not happening by default." -- aschmahmann


thanks for putting in the effort and time to make this happen!

@jcaesar

jcaesar commented Oct 16, 2020

To clarify - given the large number of files and the request that they each be individually findable (i.e. finding the data as /ipfs/QmFile instead of as /ipfs/QmRoot/folder1/folder2/file), you're going to have trouble in the short run if you just leave everything at the default.

I actually only need to find things by one root, e.g. as /ipns/gentoo.liftm.de/distfilesfolder/hashfolder/filename (ping 12D3KooWAivw38E5ohfpYo66FJTqYuWCsrBPPXjpHiCHh7fB5YYn if you want to have a look at the structure yourself). But if I then just provide the root folder, couldn't this happen:

  • some peer A may access /ipns/gentoo.liftm.de/distfiles/6d/go-ipfs_v0.4.22_linux-amd64.tar.gz and gets that from my node
  • A has .../distfiles and provides it
  • some other peer B might try to access /ipns/gentoo.liftm.de/distfiles/fc/go-ipfs_v0.4.20_linux-arm.tar.gz
    • gets distfiles from A
    • then has no idea where to get fc or go-ipfs_v0.4.20_linux-arm.tar.gz

[Edit:]
I set the reproviding strategy to roots and added

for f in / /gentoo /gentoo/data /gentoo/data/distfiles /gentoo/data/distfiles/layout.conf; do
	ipfs dht provide "$(ipfs files stat --format '<hash>' "$f")"
done

to be run after the sync but before the publish. It does seem to work, with some oddities: e.g. I can't seem to ipfs get /ipns/gentoo.liftm.de/releases/arm64/autobuilds/current-stage3-arm64/stage3-arm64-systemd-20201004T190540Z.tar.xz (it doesn't even start), even though I can ipfs object get the same thing.
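
For reference, the reproviding strategy mentioned above is a go-ipfs config option; setting it so that only root blocks are announced looks roughly like this (a sketch; the daemon has to be restarted for it to take effect):

ipfs config Reprovider.Strategy roots
ipfs config Reprovider.Interval 12h   # optional: how often a full reprovide pass runs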

@aschmahmann

couldn't this happen:

@jcaesar Yes, that's absolutely a problem. I'd like to make Bitswap sessions a little more complex so that a session can track exactly why the request is being made. For example, I'm not just asking for QmFoo, I'm trying to download QmFoo because I was asked to get QmBar/field1. This means that if the DHT fails to find QmFoo, it could go up the path and eventually hit QmBar.

This would mean that at the very least we'd ask the peer with QmBar if they have QmFoo which seems totally plausible. This isn't the be-all-end-all solution, but IMO it would be a useful step forward here.

I suspect people will start putting together issues describing proposals for how we deal with this type of provider records problem over the next couple months and if/when that happens I'll do my best to remember to link to this issue 🤞.

@mrusme

mrusme commented May 15, 2022

I came across this and was wondering if any of the previous authors have re-tried their experiments using the latest IPFS version @NiKiZe @jcaesar? I would be interested to know whether the issues described here are still relevant or whether this is something that could actually be done. :)

//edit: The issue regarding symlinks still seems to be open.
//edit2: The NixOS folks seem to have something cooking along these lines.

@NiKiZe
Author

NiKiZe commented May 15, 2022

I think they are still relevant. And the best (scalable) approach would be if ipfs links could be included in the ebuilds.

@jcaesar

jcaesar commented May 15, 2022

I was running a Gentoo mirror with my ftp2mfs thingie on a time4vps storage VPS. I deleted it three months ago because:

  • Nobody besides me was using it
  • Since a single IPFS node isn't able to keep all those files visible on the DHT, I occasionally had trouble with "the file won't download until I ipfs ping my mirror" (I mostly got that in check by selectively providing only folders; the recent DHT improvements didn't seem to change much there.)
  • ftp2mfs makes direct writes to mfs, which creates lots of tiny IPFS objects. Combined with the bad performance of the VPS (well, it really isn't designed for this), this made some IPFS operations ridiculously slow (a several-hour startup, for example).

I think it should be possible to run a working Gentoo mirror on IPFS if you spend a bit more on hardware for it. If you'd like to try ftp2mfs yourself, you roughly have to:

  • cargo install ftp2mfs
  • Hand it a configuration file à la { source: "rsync://ftp.gwdg.de/pub/linux/gentoo/", target: "/gentoo", ignore: ["/distfiles/*","!/distfiles/layout.conf","!/distfiles/*/"] } (please pick a mirror close to you; the distfiles directory contains too many symlinks, which would degrade performance, and portage doesn't need those anyway, hence the ignore)
  • Give it a day or two for the initial sync
  • ipfs files stat --hash /gentoo and either ipfs name publish that or put it into some dnslink (see the sketch after this list)
  • Set GENTOO_MIRRORS="http://127.0.0.1:8080/ipns/$mirror-ipns/" in your /etc/portage/make.conf
  • Set sync-type = webrsync in /etc/portage/repos.conf/gentoo.conf
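
Steps four to six condensed into a rough sketch; <mirror-ipns> is a placeholder for your IPNS key or dnslink name, and the gateway address assumes the default local plain-HTTP gateway:

CID=$(ipfs files stat --hash /gentoo)
ipfs name publish "/ipfs/$CID"        # or point a dnslink TXT record at /ipfs/$CID

# /etc/portage/make.conf
GENTOO_MIRRORS="http://127.0.0.1:8080/ipns/<mirror-ipns>/"

# /etc/portage/repos.conf/gentoo.conf
[gentoo]
sync-type = webrsync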

@NiKiZe: Having the IPFS CID for each distfile in the ebuilds actually worsens the problem with the number of hashes you have to keep available to the DHT / makes the "provide only the folder" trick impossible. And you can absolutely run a successful mirror without having IPFS information in the package files, as the Arch mirror demonstrates.

(As for the symlink issue: ftp2mfs does have a mechanism to resolve symlinks with copies (all copies are shallow in IPFS after all) and keep those copies up to date, but that isn't really necessary for a functioning distfiles mirror.)

@aheadoftrends

Hello, how active is this project? Are the packages still up to date?
