Support for remote repo on S3 and others for upload after committing #2237
Comments
What's stopping you from pulling from S3? You can set up an archive-type repo and serve ostree files over HTTP. Pushing is a different story, but there are tools such as ostree-releng-scripts or s3fs-fuse.
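As a sketch of the pull side: a bucket (or any static host) serving an archive-mode repo over HTTP can be configured as an ordinary ostree remote. The remote name and bucket URL below are hypothetical placeholders.

```ini
# /etc/ostree/remotes.d/example.conf -- hypothetical bucket URL
[remote "example"]
url=https://example-bucket.s3.amazonaws.com/repo
gpg-verify=true
```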
Thanks @AdrianVovk. I had looked at
The thinking is that something a little more integrated with ostree would be able to better address or avoid situations that can break a remote repository or leave it in a partial state. Or is corrupting the metadata not really a concern in practice?
Pushing things out of order will definitely cause problems, but if you only have controlled clients it might be fine. Look at the ordering in the releng script and try to match that. Essentially, everything in objects and deltas is probably fine to push concurrently, but once you're dealing with refs and the summary you're potentially corrupting the repo. I'd probably try to have an out-of-band locking mechanism to make sure there was only one publisher at a time.

As for using S3, it's going to be fine for pulling so long as concurrent pushes are done in the correct order, but the only way you're going to get something like a native push is to use one of the FUSE options. Ostree really expects to be working with a real filesystem.

We've used s3fs and goofys for other purposes and I doubt they'd be up to the requirements. It's why our repos are stored in a massive EBS volume 😛
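A minimal sketch of the push ordering described above, using local directory copies to stand in for uploads. The `push_repo` helper and `PUSH_ORDER` list are assumptions for illustration, not an ostree API.

```python
import os
import shutil

# Push phases in a safe order: content first, metadata last. If an upload
# dies midway, clients still see a consistent (if stale) repo, because
# refs and the summary are only replaced after every object they point
# at is already present remotely.
PUSH_ORDER = ["objects", "deltas", "refs", "summary", "summary.sig"]

def push_repo(local_repo, remote_repo):
    """Copy a repo phase by phase (local copies standing in for S3 uploads)."""
    pushed = []
    for entry in PUSH_ORDER:
        src = os.path.join(local_repo, entry)
        dst = os.path.join(remote_repo, entry)
        if os.path.isdir(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)
            pushed.append(entry)
        elif os.path.isfile(src):
            shutil.copy2(src, dst)
            pushed.append(entry)
    return pushed
```

A real publisher would additionally hold the out-of-band lock mentioned above for the whole duration, so only one writer ever touches refs and the summary.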
Pitching in my 2 cents for onlookers who stumble on this issue. I'm currently trying to use goofys and catfs together to make this work with

Doing something simple like

bash-5.1# time tree
.
├── appcenter
│ ├── config
│ ├── extensions
│ ├── objects
│ ├── refs
│ │ ├── heads
│ │ ├── mirrors
│ │ └── remotes
│ ├── state
│ └── tmp
│ └── cache
└── elementary
├── config
├── extensions
├── objects
├── refs
│ ├── heads
│ ├── mirrors
│ └── remotes
├── state
└── tmp
└── cache
20 directories, 2 files
real 0m0.249s
user 0m0.004s
sys 0m0.000s

But calculating the diff will cause a timeout due to all of the lookups :(

- <-- getattr 16 "elementary/objects" 0 bytes
- queue size is 0
- LOOKUP(124) parent 0x0000000000000010, name "e9"
- <-- !lookup "elementary/objects/e9" = No such file or directory (os error 2)
- queue size is 0
- LOOKUP(125) parent 0x0000000000000010, name "34"
- <-- !lookup "elementary/objects/34" = No such file or directory (os error 2)
- queue size is 0
- LOOKUP(126) parent 0x0000000000000010, name "c2"
- <-- !lookup "elementary/objects/c2" = No such file or directory (os error 2)
- queue size is 0
- LOOKUP(127) parent 0x0000000000000010, name "ac"
- <-- !lookup "elementary/objects/ac" = No such file or directory (os error 2)
- queue size is 0
- LOOKUP(128) parent 0x0000000000000010, name "7f"
- <-- !lookup "elementary/objects/7f" = No such file or directory (os error 2)
- queue size is 0

I think supporting things like S3 would make it much easier to host things such as flatpak repositories in cloud-native environments. But I would expect it to require a pretty massive rewrite, and mixing things like S3 into a low-level library like ostree seems... overextending. I'll probably end up rolling a large volume attached to a server with
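The lookup storm in the log above follows from how ostree lays out its object store: the first two hex characters of an object's checksum become a fan-out directory, so `objects/` can contain up to 256 subdirectories that a diff or pull probes one by one. A small sketch (the checksum below is made up):

```python
import os

def object_path(checksum, objtype):
    """ostree-style object path: the first two hex characters of the
    checksum form a fan-out directory under objects/ (e9, 34, c2, ...)."""
    return os.path.join("objects", checksum[:2], checksum[2:] + "." + objtype)

# Hypothetical checksum; real ones are 64-character SHA-256 hex strings.
# Each LOOKUP in the log above is one of these fan-out directories being
# probed; on a FUSE-backed S3 mount every probe is a network round trip
# rather than a cheap local stat.
example = object_path("e9" + "0" * 62, "commit")
```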
@btkostner ostree does a lot of stats of the object tree, so it might be worth trying goofys without catfs. That way you get the metadata caching without the content caching.

We use S3 a lot and it's just not a filesystem, as much as people would like it to be. I think to truly support it you'd essentially have to have a separate backend that converts the POSIX APIs to S3 HTTP calls, or some other abstraction layer. Or just keep banging on goofys until it's close enough to a real filesystem.

If you're already in AWS you could also try using EFS, which is an NFS volume. It's more expensive than EBS and far more expensive than S3, but that's probably the easiest way to get something working with containers in AWS.
This is more of a design question: what would you think about an abstraction layer in ostree that allows using blob storage? Maybe something more simplified than POSIX (to address issues with S3 and others) but generic enough to let us implement remote storage.

I'm thinking it could be something similar to the way Kopia works. It has a different use case (backup), but I see a lot of parallels in the way they de-duplicate files coming from different machines' filesystems (which conceptually in ostree could be compared to different branches of the same OS image): a uniquely and globally addressable version of each file is stored exactly once in a remote repository. On top of that they use a relatively simple blob storage API to pull/push objects and metadata. Through that API they provide integrations with several cloud storage services. In addition, they also implemented their blob API using rclone, which means they now support any cloud storage that rclone supports. Here's also their architecture diagram.

I wanted to brainstorm this thought here. The ask wouldn't be to implement cloud storage support, but an API that would allow others to do so.
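To make the idea concrete, here is a hedged sketch of what such a kopia-style blob API might look like, with an in-memory backend for testing. All names here are invented for illustration; this is not a proposed ostree interface.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Hypothetical minimal blob API in the spirit of kopia's: just enough
    to put/get immutable blobs and enumerate them, nothing POSIX-shaped
    (no rename, no hardlinks, no mutable files)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def list(self, prefix: str = "") -> list:
        """Return sorted keys starting with prefix."""

    @abstractmethod
    def delete(self, key: str) -> None: ...

class MemoryBlobStore(BlobStore):
    """In-memory backend; an S3 backend would map these four calls to
    PutObject/GetObject/ListObjectsV2/DeleteObject."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]

    def list(self, prefix=""):
        return sorted(k for k in self._blobs if k.startswith(prefix))

    def delete(self, key):
        del self._blobs[key]
```

The point of keeping the surface this small is that it maps cleanly onto S3, GCS, rclone backends, and plain directories alike; anything richer (atomic rename, stat-heavy traversal) is where object stores start to diverge from filesystems.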
I'm personally not opposed to making an abstraction layer between an OSTree repo and the underlying storage, but I think it's going to be an uphill battle.

The way an OSTree repo currently works is different than something like kopia. In kopia, the local filesystem is the source and you're trying to mirror parts of it somewhere else. In ostree, the repo is the source and you're trying to mirror parts of it to a local filesystem. On a server where you can enforce a single application, the repo is purely for storage, and dealing with any fallout might be fine. On the client side, where there can be multiple applications accessing the repo and you're working with checkouts of live objects, it's a different story.

To get an idea of what's needed, just look at the main OstreeRepo struct. Most of the fields there are closely tied to details of POSIX (really Linux) filesystems. Having had to deal with these details many times, I would consider inserting a backend abstraction layer in there a massive task.

To think of it another way, consider why git doesn't allow you to host a live repo on S3. The way ostree actively uses a repo is very similar. You can take a bare git repo and put it in S3 just fine. But you certainly can't have checkouts where your

I think you'd be better off using one of the FUSE S3 filesystems and running the ostree test suite against it to find out what the shortcomings are. Or just use kopia or rclone or something like it to sync a local filesystem repo to S3. Or you could work on a libostree API that natively syncs a repo to a remote location.
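The point above about checkouts of live objects can be made concrete: ostree checkouts can share storage with the repo via hardlinks, and object stores have no equivalent of an inode or a link. A small local-filesystem sketch of that mechanism:

```python
import os
import tempfile

# A hardlink checkout in miniature: the "checked out" file is the same
# inode as the repo object, so it costs no extra space and must never be
# modified in place. S3 has no inode or link concept, which is one reason
# a live repo can't sit directly in an object store.
repo = tempfile.mkdtemp()
obj = os.path.join(repo, "object.file")
with open(obj, "wb") as f:
    f.write(b"payload")

checkout = os.path.join(repo, "checkout.file")
os.link(obj, checkout)  # hardlink, not a copy
```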
Yeah, agree with dbnicholson. That said... you may find all the effort we've been putting into the "ostree native container" bits useful related to this:

(This is particularly useful now that ostreedev/ostree-rs-ext#123 merged.)

Basically, given an ostree commit, you can wrap it in a container image, and there are tons of tools that store those in all sorts of backends (often, object stores). And then you can losslessly "un-encapsulate" it back into an ostree repo, where you can set up static deltas, etc. (Or, it can be pulled directly by a client system.)
Thanks @cgwalters and @dbnicholson for the quick feedback.
It's actually the other way around in kopia as well - the remote repository has a "superset" of all the local files. Multiple machines are able to sync their data into the common repository and locally deleted files of one machine still exist in the remote repository. So in some ways, each time a backup is made, a "commit" happens into the remote and the local can check that version out again.
That's actually the part we are trying to achieve. The goal would be to do this syncing/upload to the remote repo in a robust way. We are currently using ideas from https://github.com/ostreedev/ostree-releng-scripts/blob/master/rsync-repos#L52-L63, and it might be beneficial to have this more tightly integrated. For the part that pulls from the remote repo, the existing support would already be sufficient (since you can access S3 through HTTP).
I don't think it is currently possible to pull from or push to repositories that are in AWS S3 or similar storage. Is this something you have thought about supporting?

I'm thinking mostly about the use case of supporting a build system, which would pull down files from S3, build and make changes, commit them to the local repository, and then at the end push these changes back to a remote repository in S3.

Another possible use case would be a "pull-local" that copies commits from one repository in S3 to another (without requiring pulling down both repositories onto a local machine).