Add tests for push-to-create #46

Draft: wants to merge 9 commits into base: git-annex

@matrss matrss commented Nov 10, 2023

This adds tests for creating a repository using the push-to-create feature. The tests are the equivalent of doing

git remote add origin <url>
git annex sync --content

in a local git-annex repository.

The steps it does are:

  1. create a local git-annex repository
  2. add a non-existing repository as a remote
  3. sync using git annex sync --content

It then checks that:

  • the repository was indeed created
  • the default branch matches the HEAD of the local repository
  • all annexed files exist on the remote and have the correct content

This is done both for a repository in a user's namespace and for a repository in an organization's namespace.

These tests fail at the moment because the default branch does not match. The created repository has a default branch of synced/master while it should be master, which is the current HEAD when pushing. I'll have to investigate how to fix that bug.

Fixes #14.

kousu and others added 9 commits November 10, 2023 13:11
[git-annex](https://git-annex.branchable.com/) is a more complicated cousin to
git-lfs, storing large files in an optional-download side content.  Unlike lfs,
it allows mixing and matching storage remotes, so the content remote(s) doesn't
need to be on the same server as the git remote, making it feasible to scatter
a collection across cloud storage, old hard drives, or anywhere else storage can
be scavenged.  Since this can get complicated, fast, it has a content-tracking
database (`git annex whereis`) to help find everything later.

The use-case we imagine for including it in Gitea is just the simple case, where
we're primarily emulating git-lfs: each repo has its large content at the same URL.

Our motivation is so we can self-host https://www.datalad.org/ datasets, which
currently are only hostable by fragilely scrounging together cloud storage --
and having to manage all the credentials associated with all the pieces -- or at
https://openneuro.org which is fragile in its own ways.

Supporting git-annex also allows multiple Gitea instances to be annex remotes for
each other, mirroring the content or otherwise collaborating to split up the
hosting costs.

Enabling
--------

TODO

HTTP
----

TODO

Permission Checking
-------------------

This tweaks the API in routers/private/serv.go to expose the calling user's
computed permission, instead of just returning HTTP 403.

This doesn't fit in super well. It's the opposite from how the git-lfs support is
done, where there's a complete list of possible subcommands and their matching
permission levels, and then the API compares the requested with the actual level
and returns HTTP 403 if the check fails.

But it's necessary. The main git-annex verbs, 'git-annex-shell configlist' and
'git-annex-shell p2pstdio', are both either read-only or read-write operations,
depending on the state on disk on either end of the connection and what the user
asked it to ask for, with no way to know before git-annex examines the situation.
So tell it the level via GIT_ANNEX_SHELL_READONLY and trust it to handle itself.
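
A minimal sketch of that idea, not the code from this commit: the `AccessMode` type and the `annexShellEnv` helper are hypothetical names, and the sketch assumes the variable in question is `GIT_ANNEX_SHELL_READONLY`, the name git-annex-shell documents for disallowing modifications.

```go
package main

// AccessMode is a hypothetical stand-in for the permission level that the
// private API would report for the calling user.
type AccessMode int

const (
	AccessNone AccessMode = iota
	AccessRead
	AccessWrite
)

// annexShellEnv returns the extra environment variables to set before
// running git-annex-shell, and whether the call should be allowed at all.
// Instead of rejecting a command up front, read-only users are let through
// with the read-only flag set, so git-annex-shell enforces the limit itself.
func annexShellEnv(mode AccessMode) (env []string, allowed bool) {
	switch {
	case mode >= AccessWrite:
		return nil, true // full access, nothing to restrict
	case mode == AccessRead:
		return []string{"GIT_ANNEX_SHELL_READONLY=1"}, true
	default:
		return nil, false // no access: the caller would answer with HTTP 403
	}
}
```

The returned slice would be appended to the environment of the spawned `git-annex-shell` process, so the same code path serves both `configlist` and `p2pstdio`.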

In the older Gogs version, the permission was directly read in cmd/serv.go:

```
mode, err = db.UserAccessMode(user.ID, repo)
```
- https://github.com/G-Node/gogs/blob/966e925cf320beff768b192276774d9265706df5/internal/cmd/serv.go#L334

but in Gitea permission enforcement has been centralized in the API layer.
(perhaps so the cmd layer can avoid making direct DB connections?)

Deletion
--------

git-annex has this "lockdown" feature where it tries
really quite very hard to prevent you deleting its
data, to the point that even an rm -rf won't do it:
each file in annex/objects/ is nested inside a
folder with read-only permissions.

The recommended workaround is to run chmod -R +w when
you're sure you actually want to delete a repo. See
https://git-annex.branchable.com/internals/lockdown

So we edit util.RemoveAll() to do just that, so now
it's `chmod -R +w && rm -rf` instead of just `rm -rf`.
Fixes neuropoly#11

Tests:

* `git annex init`
* `git annex copy --from origin`
* `git annex copy --to origin`

over:

* ssh

for:

* the owner
* a collaborator
* a read-only collaborator
* a stranger

in a

* public repo
* private repo

And then confirms:

* Deletion of the remote repo (to ensure lockdown isn't messing with us: https://git-annex.branchable.com/internals/lockdown/#comment-0cc5225dc5abe8eddeb843bfd2fdc382)

------

To support all this:

* Add util.FileCmp()
* Patch withKeyFile() so it can be nested in other copies of itself

-------

Many thanks to Mathieu for giving style tips and catching several bugs,
including a subtle one in util.FileCmp() which neutered it.

Co-authored-by: Mathieu Guay-Paquet <[email protected]>
This makes HTTP symmetric with SSH clone URLs.

This gives us the fancy feature of _anonymous_ downloads,
so people can access datasets without having to set up an
account or manage ssh keys.

Previously, to access "open access" data shared this way,
users would need to:

  1. Create an account on gitea.example.com
  2. Create ssh keys
  3. Upload ssh keys (and make sure to find and upload the correct file)
  4. `git clone [email protected]:user/dataset.git`
  5. `cd dataset`
  6. `git annex get`

This cuts that down to just the last three steps:

  1. `git clone https://gitea.example.com/user/dataset.git`
  2. `cd dataset`
  3. `git annex get`

This is significantly simpler for downstream users, especially for those
unfamiliar with the command line.

Unfortunately there's no uploading. While git-annex supports uploading
over HTTP to S3 and some other special remotes, it seems to fail on a
_plain_ HTTP remote. See neuropoly#7
and https://git-annex.branchable.com/forum/HTTP_uploads/#comment-ce28adc128fdefe4c4c49628174d9b92.

This is not a major loss since no one wants uploading to be anonymous anyway.

To support private repos, I had to hunt down and patch a secret extra security
corner that Gitea only applies to HTTP for some reason (services/auth/basic.go).

This was guided by https://git-annex.branchable.com/tips/setup_a_public_repository_on_a_web_site/

Fixes neuropoly#3

Co-authored-by: Mathieu Guay-Paquet <[email protected]>
This moves the `annexObjectPath()` helper out of the tests and into a
dedicated sub-package as `annex.ContentLocation()`, and expands it with
`.Pointer()` (which validates using `git annex examinekey`),
`.IsAnnexed()` and `.Content()` to make it a more useful module.

The tests retain their own wrapper version of `ContentLocation()`
because I tried to follow close to the API modules/lfs uses, which is in
terms of abstract `git.Blob` and `git.TreeEntry` objects, not in terms
of `repoPath string`s which are more convenient for the tests.
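
To make the two ways an annexed file can show up in a tree concrete, here is a rough sketch of extracting the key from each. It is not the `annex` sub-package from this commit: the function names are hypothetical, and it assumes that a pointer file starts with `/annex/objects/` followed by the key, and that an annex symlink points at a path under `.git/annex/objects/` whose last component is the key. Unlike `.Pointer()`, it does not validate the key with `git annex examinekey`.

```go
package main

import (
	"path"
	"strings"
)

// keyFromPointer extracts the key from the content of a pointer file.
// Only the first line is considered.
func keyFromPointer(content string) (string, bool) {
	const prefix = "/annex/objects/"
	firstLine, _, _ := strings.Cut(content, "\n")
	if !strings.HasPrefix(firstLine, prefix) {
		return "", false
	}
	key := strings.TrimPrefix(firstLine, prefix)
	return key, key != ""
}

// keyFromSymlink extracts the key from the target of an annex symlink,
// for example "../.git/annex/objects/Xx/Yy/KEY/KEY".
func keyFromSymlink(target string) (string, bool) {
	if !strings.Contains(target, ".git/annex/objects/") {
		return "", false
	}
	return path.Base(target), true
}
```
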
Previously, Gitea's LFS support allowed direct-downloads of LFS content,
via http://$HOSTNAME:$PORT/$USER/$REPO/media/branch/$BRANCH/$FILE
Expand that grace to git-annex too. Now /media should provide the
relevant *content* from the .git/annex/objects/ folder.
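
As an illustration of the URL shape above, a small sketch that assembles such a link. `mediaURL` is a hypothetical helper, not part of Gitea, and the per-segment escaping is an assumption about how branch and file names with unusual characters should be handled.

```go
package main

import (
	"net/url"
	"strings"
)

// mediaURL builds $BASE/$USER/$REPO/media/branch/$BRANCH/$FILE.
// Branch and file are split on "/" so every path segment is escaped
// separately and the slashes themselves survive.
func mediaURL(base, owner, repo, branch, file string) string {
	segments := []string{owner, repo, "media", "branch"}
	segments = append(segments, strings.Split(branch, "/")...)
	segments = append(segments, strings.Split(file, "/")...)
	for i, s := range segments {
		segments[i] = url.PathEscape(s)
	}
	return strings.TrimRight(base, "/") + "/" + strings.Join(segments, "/")
}
```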

This adds tests too. And expands the tests to try symlink-based annexing,
since /media implicitly supports both that and pointer-file-based annexing.
This updates the repo index/file view endpoints so annex files match the way
LFS files are rendered, making annexed files accessible via the web instead of
being black boxes only accessible by git clone.

This mostly just duplicates the existing LFS logic. It doesn't try to combine itself
with the existing logic, to make merging with upstream easier. If upstream ever
decides to accept, I would like to try to merge the redundant logic.

The one bit that doesn't directly copy LFS is my choice to hide annex-symlinks.
LFS files are always _pointer files_ and therefore always render with the "file"
icon and no special label, but annex files come in two flavours: symlinks or
pointer files. I've conflated both kinds to try to give a consistent experience.

The tests in here ensure the correct download link (/media, from the last PR)
renders in both the toolbar and, if a binary file (like most annexed files will be),
in the main pane, but it also adds quite a bit of code to make sure text files
that happen to be annexed are dug out and rendered inline like LFS files are.
Upstream can handle the full test suite; to avoid tedious waiting,
we only test the code added in this fork.
This adds tests for creating a repository using the push-to-create
feature. The tests are the equivalent of doing
```
git remote add origin <url>
git annex sync --content
```
in a local git-annex repository.

The steps it does are:
1. create a local git-annex repository
2. add a non-existing repository as a remote
3. sync using `git annex sync --content`

It then checks that:
- the repository was indeed created
- the default branch matches the HEAD of the local repository
- all annexed files exist on the remote and have the correct content

This is done both for a repository in a user's namespace and for a
repository in an organization's namespace.

kousu (Member) commented Nov 10, 2023

You're on such a roll! omg. I still need to merge the original PR 😵
This is great, really awesome. Please keep them coming.

kousu (Member) commented Nov 10, 2023

By the way I noticed you customised your templates in your fork directly; you could take a look at https://docs.gitea.com/administration/customizing-gitea. We use gitea embedded extract during install to get files out then like, sed to patch them with our logos.

matrss (Author) commented Nov 10, 2023

> This is great, really awesome. Please keep them coming.

Slowly chipping away at some of the low-hanging fruit 😃 Forking is something I would like to look at too at some point, although that seems to be more complicated.

> By the way I noticed you customised your templates in your fork directly; you could take a look at https://docs.gitea.com/administration/customizing-gitea. We use gitea embedded extract during install to get files out then like, sed to patch them with our logos.

Thanks for the pointer, I'll take a look at that. I spent quite a bit of time yesterday bringing those templates up-to-date and your approach would indeed be easier. The changes aren't that big anyway. I remember I had some issues getting the templates out of gitea embedded though...

> These tests fail at the moment because the default branch does not match. The created repository has a default branch of synced/master while it should be master, which is the current HEAD when pushing. I'll have to investigate how to fix that bug.

The `setting.Repository.DefaultBranch` is set to `master` in the tests. But I think `git annex sync --content` pushes `synced/master` first, and because `master` does not exist yet at that point, Gitea will then default to the first branch it sees instead. I am not sure how that would or should be fixed... Maybe recommending `git push && git annex sync --content` would be better, and if someone runs into this they can still change the default branch afterwards. I am not sure yet.
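
A toy model of that hypothesis, to make the ordering problem concrete. This is not Gitea's push-to-create code: the `repo` type and `receivePush` method are hypothetical, and the rule "keep the configured default if it was pushed, otherwise take the first branch seen" is the guess described above, not confirmed behaviour.

```go
package main

// repo is a hypothetical stand-in for a repository created on first push.
type repo struct {
	DefaultBranch string
	Branches      []string
}

// receivePush records the pushed branches. The default branch is decided
// once, on the first push that brings any branch: the configured default
// wins if it is among them, otherwise the first pushed branch is used.
func (r *repo) receivePush(configured string, pushed ...string) {
	r.Branches = append(r.Branches, pushed...)
	if r.DefaultBranch != "" {
		return // already decided by an earlier push
	}
	for _, b := range pushed {
		if b == configured {
			r.DefaultBranch = b
			return
		}
	}
	if len(pushed) > 0 {
		r.DefaultBranch = pushed[0]
	}
}
```

Under this model, pushing `synced/master` before `master` leaves the default at `synced/master`, while a plain `git push` of `master` first gives the expected default, which is why running `git push` before the sync would sidestep the problem.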
