zarr backup might need "optimization" #363

Closed
yarikoptic opened this issue Nov 1, 2023 · 7 comments


@yarikoptic
Member

yarikoptic commented Nov 1, 2023

I found a 4-day-old process still running for a non-000108 dandiset. The process tree:

dandi      27781 50.7  1.5 3886708 1025188 ?     Rl   Oct27 3546:00                 python -m tools.backups2datalad -l WARNING --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 5 -e 000108$
dandi      90653  0.0  0.0  10820  2868 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=MD5E
dandi      90655  0.0  0.0 1074053100 11264 ?    Sl   Oct27   4:49                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex examinekey --batch --migrate-to-backend=MD5E
dandi      91009  0.0  0.0  10820  2856 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex whereis --batch-keys --json --json-error-messages
dandi      91011  6.1  0.8 1074060012 545916 ?   Sl   Oct27 426:27                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex whereis --batch-keys --json --json-error-messages
dandi      91021  0.0  0.0  14788  5308 ?        S    Oct27   4:55                       git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi      91432  0.0  0.0  10820  3008 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex fromkey --force --batch --json --json-error-messages
dandi      91434  0.1  0.0 1074053256 33024 ?    Sl   Oct27  13:16                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex fromkey --force --batch --json --json-error-messages
dandi      91443  0.0  0.0  14748  2236 ?        S    Oct27   0:00                       git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi      91499  0.0  0.0  11736  3136 ?        S    Oct27   2:54                       git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --stdin-paths --no-filters
dandi      91566  0.0  0.0  10820  2944 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi      91569  2.1  0.1 1074126968 71816 ?    Sl   Oct27 149:30                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi      91598  0.1  0.0  14852  4868 ?        S    Oct27   9:32                       git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.alwayscompact=false cat-file --batch
dandi      91599  0.0  0.0   6952  2472 ?        S    Oct27   3:13                       /bin/bash /usr/bin/git-annex-remote-rclone
dandi      27782  0.0  0.0   6384  2084 ?        S    Oct27   0:00                 grep -v nothing to save, working tree clean
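As a rough cross-check of the "4 days" figure: in this `ps` format, %CPU is cumulative CPU time divided by the time the process has been running, and TIME is shown as minutes:seconds. A minimal sketch of that arithmetic, with hypothetical helper names and the values copied from the line for PID 27781:

```python
def ps_time_to_minutes(time_field: str) -> float:
    """Convert a ps TIME field like '3546:00' (minutes:seconds) into minutes."""
    minutes, seconds = time_field.split(":")
    return int(minutes) + int(seconds) / 60


def elapsed_days(time_field: str, pcpu: float) -> float:
    """Estimate wall-clock age in days from cumulative CPU time and %CPU."""
    cpu_minutes = ps_time_to_minutes(time_field)
    # %CPU = cpu_time / elapsed_time, so elapsed_time = cpu_time / (%CPU / 100).
    return cpu_minutes / (pcpu / 100) / (60 * 24)


# TIME and %CPU copied from the ps line of PID 27781.
print(f"{elapsed_days('3546:00', 50.7):.1f} days")  # prints "4.9 days"
```

That is consistent with a start on Oct 27 and the listing being taken on Nov 1.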

and looking at that zarr

dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls -l /proc/91011/cwd
lrwxrwxrwx 1 dandi dandi 0 Nov  1 12:32 /proc/91011/cwd -> /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61
0  1  2  3  4
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ find /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/* | nl | tail
494148  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/29
494149  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/3
494150  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/30
494151  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/31
494152  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/2
494153  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/2/.zarray
494154  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/3
494155  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/3/.zarray
494156  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4
494157  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4/.zarray

so it is a "hefty" zarr -- half a million files. I wonder if we could make that process any faster. There was some splitindex etc.

FWIW -- above count is with folders. Without folders:

dandi@drogon:/mnt/backup/dandi/dandisets/000108$ find /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/* \! -type d | nl | tail -n 1
487185  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4/.zarray

and that particular zarr is almost done so I will keep it going for now

❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61/' -H 'accept: application/json' | jq .file_count
526320
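To keep an eye on how far the local backup is from the archive's file_count, the two counts above could be compared with a small script. This is only a sketch: `count_local_files`, `fetch_remote_file_count` and `report` are hypothetical helpers, not part of backups2datalad. Unlike the `find .../*` call it also counts top-level dotfiles (everything except `.git`), so the numbers may differ slightly.

```python
import json
import os
import urllib.request

# API endpoint taken from the curl call above.
API_URL = "https://api.dandiarchive.org/api/zarr/{zarr_id}/"


def count_local_files(root: str) -> int:
    """Count non-directory entries under root, skipping any .git directory."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk never descends into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        total += len(filenames)
    return total


def fetch_remote_file_count(zarr_id: str) -> int:
    """Ask the DANDI API for the file_count of a zarr (needs network access)."""
    with urllib.request.urlopen(API_URL.format(zarr_id=zarr_id)) as resp:
        return json.load(resp)["file_count"]


def report(zarr_id: str, zarrs_root: str) -> None:
    """Print local vs. remote file counts and how many are still missing."""
    local = count_local_files(os.path.join(zarrs_root, zarr_id))
    remote = fetch_remote_file_count(zarr_id)
    print(f"local={local} remote={remote} missing={remote - local}")
```

Calling `report("5c37c233-222f-4e60-96e7-a7536e08ef61", "/mnt/backup/dandi/dandizarrs")` would print both counts; with the numbers above the difference is 526320 - 487185 = 39135 files.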

edits:

  • it seems to have a lot of unpacked git objects. I am getting stats using ncdu ATM... Since we are running with receive.autogc=0 and gc.auto=0, should we trigger gc "manually"? But wouldn't it then interfere with the running batched processes? We might need to stop and redo. Might be worth simulating all of that with some dedicated script to time it. Also might be worth moving all the dandizarrs to some faster / dedicated medium (SSDs?)
  • it is that top-level python process (27781) which is relatively CPU busy -- 60-100% CPU, so looking at what it is doing might be relevant. Doing some py-spy top sampling gives the top of
Total Samples 3277
GIL: 1.00%, Active: 17.00%, Threads: 15

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                                           
 11.00%  11.00%   27.13s    27.13s   _worker (concurrent/futures/thread.py)
  5.00%   5.00%   15.57s    15.57s   _do_waitpid (asyncio/unix_events.py)
  0.00%   1.00%   0.080s     1.03s   _run_once (asyncio/base_events.py)
  0.00%   0.00%   0.070s    0.090s   _execute_child (subprocess.py)
  0.00%   0.00%   0.070s    0.070s   _add_callback (asyncio/base_events.py)
  0.00%   1.00%   0.050s    0.880s   _run (asyncio/events.py)
  0.00%   0.00%   0.040s    0.040s   register (selectors.py)
  0.00%   0.00%   0.030s    0.030s   raw_decode (json/decoder.py)

so is it just jumping between different async items or really doing some useful work???

edit: some stats from ncdu. A LOT of files during the backup, then just a few

at some point there were over 900,000 files in .git/annex/journal!

--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git/annex ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                /..
    3.5 GiB [##########] 904.4k /journal                                                                                                                                                                                                                           
    2.0 MiB [          ]         index
    1.5 MiB [          ]      1 /keysdb
   12.0 KiB [          ]      3 /fsck
    4.0 KiB [          ]         index.lck

and separate objects (no packing performed) for each tiny file

--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    3.5 GiB [##########] 904.4k /annex
.   1.7 GiB [####      ] 451.7k /objects    

which then all get handled eventually and .git/objects packed too:

--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  254.8 MiB [##########]     16 /objects                                                                                                                                                                                                                           
  241.3 MiB [######### ]     18 /annex
   38.9 MiB [#         ]         index
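
One way to see how many loose (not yet packed) objects a zarr repository has accumulated, without waiting for ncdu, would be `git count-objects -v`. A minimal sketch, assuming git is on PATH; the helper names are made up for illustration, and the 6700 threshold is just git's default `gc.auto` value reused here as an arbitrary cut-off, since the backup runs with `gc.auto=0`.

```python
import subprocess


def parse_count_objects(output: str) -> dict[str, int]:
    """Parse the 'key: value' lines printed by `git count-objects -v`."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def loose_object_stats(repo: str) -> dict[str, int]:
    """Run `git count-objects -v` in repo and return its numbers as a dict."""
    result = subprocess.run(
        ["git", "-C", repo, "count-objects", "-v"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_count_objects(result.stdout)


def needs_manual_gc(stats: dict[str, int], threshold: int = 6700) -> bool:
    """True if the number of loose objects ('count') exceeds the threshold."""
    return stats.get("count", 0) > threshold
```

In the output, `count` is the number of loose objects and `in-pack` the number of packed ones, so a state like the 451.7k entries under `.git/objects` above should show up as a very large `count`.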
@satra
Member

satra commented Nov 1, 2023

which asset is this? i want to check that the shape/compression characteristics did not change in the process and this is indeed a hefty zarr (i.e. could be one of the 4mm slices).

also i'm going to start rolling out storing stitched data rather than the rawest data.

@yarikoptic
Member Author

❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61/' -H 'accept: application/json' | jq .
{
  "name": "sub-I58/ses-Hip-CT/micr/sub-I58_sample-01_chunk-01_hipCT.ome.zarr",
  "dandiset": "000026",
  "zarr_id": "5c37c233-222f-4e60-96e7-a7536e08ef61",
  "status": "Complete",
  "checksum": "4cb549b2e2346bb1a30f493b50fb6a2e-526320--1023396474554",
  "file_count": 526320,
  "size": 1023396474554
}

@satra
Member

satra commented Nov 1, 2023

this is dandiset 26, not 108. it's probably the TB one. it's an entire hemisphere and more at 15um resolution.

@satra
Member

satra commented Nov 1, 2023

i didn't read "non" 000108 dandiset - i thought it was in 108. but this one is beautiful. yael posted the neuroglancer rendering in the bids spec addition of HiPCT.

@yarikoptic
Member Author

yael posted the neuroglancer rendering in the bids spec addition of HiPCT.

is there a link?

@satra
Member

satra commented Nov 3, 2023

@yarikoptic
Member Author
