need to "shrink" or annex .dandi/assets.json #244

Open
yarikoptic opened this issue Aug 1, 2022 · 6 comments

@yarikoptic
Member

As initially reported in #230 (comment), git push fails for 000026 with:

(base) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ git push github draft
Enumerating objects: 34616, done.
Counting objects: 100% (34616/34616), done.
Delta compression using up to 4 threads
Compressing objects: 100% (21288/21288), done.
Writing objects: 100% (34613/34613), 7.80 MiB | 1.11 MiB/s, done.
Total 34613 (delta 14508), reused 33351 (delta 13291), pack-reused 0
remote: Resolving deltas: 100% (14508/14508), completed with 1 local object.
remote: warning: File .dandi/assets.json is 66.19 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: error: Trace: 56fd59e743acd49c2770b9086f66648be36813b2bbf7b48913ebc4ab5f9ce6cf
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File .dandi/assets.json is 131.89 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
To github.com:dandisets/000026.git
 ! [remote rejected]       draft -> draft (pre-receive hook declined)
error: failed to push some refs to 'github.com:dandisets/000026.git'

So GitHub/Microsoft is pushing us toward their LFS. I think we should instead use git-annex's unlocked-file functionality (so we do not need to deal with lock/unlock etc.) and keep the file under git-annex for this and all the other dandisets (for uniformity). Or do you see some other way, @jwodder?
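
The rejection above reports the same path at two sizes (66.19 MB and 131.89 MB), which suggests the limit is applied to every blob in the pushed history and not only to the current file. A minimal sketch for spotting such blobs locally before pushing. The function name `blobs_over_limit`, the default limit (an assumed reading of "100 MB" as 100 MiB), and the `github/draft` ref in the usage comment are illustrative assumptions, not part of backups2datalad:

```bash
#!/bin/bash
# Hypothetical helper: read lines of the form "<type> <sha> <size> <path>"
# on stdin and print "<size> <path>" for every blob above a byte limit.
blobs_over_limit() {
    # Assumption: GitHub's "100 MB" limit is taken as 100 MiB here.
    local limit=${1:-$((100 * 1024 * 1024))}
    awk -v limit="$limit" '
        $1 == "blob" && $3 + 0 > limit + 0 {
            size = $3
            $1 = $2 = $3 = ""      # drop type, sha, size; keep the path
            sub(/^ +/, "")
            print size, $0
        }'
}

# Possible usage (assumes a remote-tracking ref github/draft exists):
# git rev-list --objects github/draft..draft |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     blobs_over_limit
```

Reading the listing from stdin keeps the size filter separate from the git plumbing, so the same function can be pointed at any commit range.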

@jwodder
Member

jwodder commented Aug 1, 2022

@yarikoptic So annex .dandi/assets.json whenever it's larger than 100 MB? Exactly what commands do I need to run to store a pre-existing file under git-annex, and how would we make it available to other users cloning the repository?

@jwodder
Member

jwodder commented Aug 1, 2022

@yarikoptic Note that, in this particular case, you can get rid of the large .dandi/assets.json file by squashing the "garbage collection" commit into the previous commit.
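
Folding the garbage-collection commit into the one before it leaves a single commit whose tree contains only the shrunken file, so the oversized revision is no longer part of the history. A minimal sketch, assuming the garbage-collection commit is `HEAD` and neither commit has been pushed yet; the function name is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: fold HEAD into its parent, keeping HEAD's tree
# and the parent's commit message.
squash_head_into_parent() {
    # Move the branch back one commit; the index still holds HEAD's tree.
    git reset --soft HEAD~1 || return
    # Rewrite the parent commit so that it records that tree.
    git commit --amend --no-edit
}
```

This keeps the parent's message unchanged; drop `--no-edit` to write a combined message by hand.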

@yarikoptic
Member Author

So annex .dandi/assets.json whenever it's larger than 100 MB? Exactly what commands do I need to run to store a pre-existing file under git-annex

Following https://git-annex.branchable.com/todo/annex.addunlocked_in_gitattributes/
and RTFM, apparently we need to use git annex config (not just .gitattributes) for annex.addunlocked, so for consistency we could have set both there:

git annex config --set annex.addunlocked 'include=size*'
git annex config --set annex.largefiles 'largerthan=3b and include=size*'

NB: there is also a --get analog to check whether they are already set.
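
The --get/--set pair can be combined so that a setting is written only when it differs from what is already recorded. A minimal sketch; the function name `ensure_annex_config` is hypothetical, and it assumes that `git annex config --get` prints nothing for an unset key (as in the output later in this thread):

```bash
#!/bin/bash
# Hypothetical helper: set a git-annex config key only if its current
# value differs from the wanted one.
ensure_annex_config() {
    local key=$1 want=$2 have
    # Treat a failing --get the same as an unset key (assumption).
    have=$(git annex config --get "$key" 2>/dev/null) || have=""
    if [ "$have" != "$want" ]; then
        git annex config --set "$key" "$want"
    fi
}

# Possible usage:
# ensure_annex_config annex.addunlocked 'include=.dandi/*'
```

Skipping the redundant --set avoids recording a new entry on the git-annex branch each time a script runs.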

full script
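#!/bin/bash
# Demo: let `git annex config` decide per file between git and git-annex.

set -eu
# Work in a throwaway directory.
cd "$(mktemp -d "${TMPDIR:-/tmp}/dl-XXXXXXX")"

git init
git annex init

set -x

# config.log may not exist yet on the git-annex branch, so tolerate failure.
git show git-annex:config.log || :

# Commit empty files first so that the later commit shows content diffs.
touch size1 size5 large
git add *
git commit -m 'added empty files'

git show

# ">|" overwrites the target even if noclobber is set.
echo size5 >| size5    # 6 bytes: larger than 3b and matches size*
echo 1 >| size1        # 2 bytes: matches size* but is not larger than 3b
echo 1000000000000000000 >| large    # larger than 3b but does not match size*

# Files matching size* are added unlocked; of those, only files larger than
# 3 bytes are treated as large and go to git-annex.
git annex config --set annex.addunlocked 'include=size*'
git annex config --set annex.largefiles 'largerthan=3b and include=size*'

git show git-annex:config.log

git add *
git commit -m 'added populated size*'
git show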

which ended up doing the right thing: size1 went into git, the "large" size5 went into git-annex, and large (not matching the path pattern) went into git:

commit 80cd3354dcd494b9372512994bb0496417fdcc22 (HEAD -> master)
Author: Yaroslav Halchenko <[email protected]>
Date:   Mon Aug 1 12:11:01 2022 -0400

    added populated size*

diff --git a/large b/large
index e69de29..c9a4149 100644
--- a/large
+++ b/large
@@ -0,0 +1 @@
+1000000000000000000
diff --git a/size1 b/size1
index e69de29..d00491f 100644
--- a/size1
+++ b/size1
@@ -0,0 +1 @@
+1
diff --git a/size5 b/size5
index e69de29..c99f40a 100644
--- a/size5
+++ b/size5
@@ -0,0 +1 @@
+/annex/objects/SHA256E-s6--b389ab2d20c61de2680db3fbed2c4f9dac6b68c1b4125ef0abeee1cf0136b1a6

But we already rely on a generic * setting in .gitattributes for large files. I don't know how well that would play if we also set it based on a pattern via git annex config. So please set annex.largefiles directly in the top-level .gitattributes.
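
Keeping the setting in the top-level .gitattributes could be scripted so that repeated runs do not duplicate the line. A minimal sketch; the function name `ensure_line` is hypothetical, and the 1mb value in the usage comment simply mirrors what git check-attr reports in this thread:

```bash
#!/bin/bash
# Hypothetical helper: append a line to a file unless an identical
# line is already present.
ensure_line() {
    local file=$1 line=$2
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Possible usage:
# ensure_line .gitattributes '* annex.largefiles=(largerthan=1mb)'
```

The helper assumes the file, if it exists, already ends with a newline.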

On 000026 I did call annex config to set addunlocked, but then I can't seem to manage to add that damn .dandi/assets.json to annex. annex keeps saying it is not a large file:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ echo '' >> .dandi/assets.json
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ git annex add .dandi/assets.json
add .dandi/assets.json (non-large file; adding content to git repository) ok
(recording state in git...)

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ git check-attr --all .dandi/assets.json
.dandi/assets.json: annex.backend: SHA256E
.dandi/assets.json: annex.largefiles: (largerthan=1mb)
.dandi/assets.json: filter: annex

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ du -scm .dandi/assets.json
67      .dandi/assets.json
67      total

Maybe you can see what I am doing wrong, or it might be some bug/intricacy of git-annex?

... how would we make it available to other users cloning the repository?

dunno yet, but do we need to make it available to others? isn't it only for internal use by our scripts ATM?

@yarikoptic
Member Author

@yarikoptic Note that, in this particular case, you can get rid of the large .dandi/assets.json file by squashing the "garbage collection" commit into the previous commit.

Ha, I missed that there is a warning at the 50MB limit and a hard error only above 100MB. Will squash now.

dandibot pushed a commit to dandisets/000026 that referenced this issue Aug 1, 2022
This is manually squashed commit to overcome
dandi/dandisets#244

2nd commit after "Added" was:

[backups2datalad] 39814 assets garbage-collected from .dandi/assets.json
@yarikoptic
Member Author

Squashed and pushed. Got a warning but no error. So the immediate problem is mitigated, but let's still do that limiting by 1MB... actually, let's boost it to 5MB (a 10th of the warning size). Here are the current sizes:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ du -scm */.dandi/assets.json | sort -n  | tail
3       000165/.dandi/assets.json
3       000212/.dandi/assets.json
3       000217/.dandi/assets.json
4       000008/.dandi/assets.json
8       000108/.dandi/assets.json
8       000231/.dandi/assets.json
13      000020/.dandi/assets.json
23      000045/.dandi/assets.json
67      000026/.dandi/assets.json
154     total

so most are small and will go under git.
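
The listing above can be filtered against the proposed threshold mechanically. A minimal sketch; the function name `over_threshold` is hypothetical, and since `du -m` rounds sizes up to whole mebibytes, files close to the threshold may be misclassified:

```bash
#!/bin/bash
# Hypothetical helper: from `du -scm` style lines ("<MiB> <path>"), print
# the paths whose size exceeds a threshold, skipping the "total" row.
over_threshold() {
    local limit_mb=${1:-5}
    awk -v limit="$limit_mb" '$2 != "total" && $1 + 0 > limit + 0 { print $2 }'
}

# Possible usage:
# du -scm */.dandi/assets.json | over_threshold 5
```

Applied to the nine rows shown above with a 5 MB threshold, this selects 000108, 000231, 000020, 000045 and 000026.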

@yarikoptic
Member Author

Just a reminder that apparently we never made those large ones locked: https://github.com/dandisets/000026/blob/967c5f0d35d2c59d0ec958c6458902003e3a170e/.dandi/assets.json has it in full. It seems that I did not add the annex.largefiles setting via git annex config, if that matters:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ ls -l .dandi/assets.json
-rw-r--r-- 1 dandi dandi 94855506 Feb  7 01:55 .dandi/assets.json
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ git check-attr --all .dandi/assets.json
.dandi/assets.json: annex.backend: SHA256E
.dandi/assets.json: annex.largefiles: (largerthan=1mb)
.dandi/assets.json: filter: annex
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ git annex config --get annex.addunlocked
include=.dandi/*
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ git annex config --get annex.largefiles
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets/000026$ 
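
Whether a committed file is annexed can be checked from the blob that git stores for it: in the size5 diff earlier in this thread, the annexed file's git content is a pointer starting with /annex/objects/. A minimal sketch; the function name `is_annex_pointer` is hypothetical, and the check looks only at the first line of its input:

```bash
#!/bin/bash
# Hypothetical helper: succeed if stdin looks like a git-annex pointer,
# i.e. its first line starts with "/annex/objects/".
is_annex_pointer() {
    local first=""
    # Keep a final line even if it lacks a trailing newline.
    IFS= read -r first || true
    case $first in
        /annex/objects/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage:
# git cat-file -p HEAD:.dandi/assets.json | is_annex_pointer && echo annexed
```

A file stored in git in full, as in the blob linked above, would start with its JSON content and fail this check.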
