Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FOI: backups2datalad-update-cron --mode verify got stuck #48

Open
yarikoptic opened this issue Jun 6, 2024 · 1 comment
Open

FOI: backups2datalad-update-cron --mode verify got stuck #48

yarikoptic opened this issue Jun 6, 2024 · 1 comment
Labels
bug:stall report Report of a backup2datalad run getting stuck or stalling

Comments

@yarikoptic
Copy link
Member

Today is June 7 and I spotted that last update to dandisets was in May... I see that there is still process running

root     3911865  0.0  0.0   8804  1548 ?        Ss   Mar26   0:32   /usr/sbin/cron -f
root     4052378  0.0  0.0   9740  1856 ?        S    May25   0:00     /usr/sbin/CRON -f
dandi    4052380  0.0  0.0   2484   408 ?        Ss   May25   0:00       /bin/sh -c chronic flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c '/mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify'
dandi    4052383  0.0  0.0  18640  5980 ?        S    May25   4:25         /usr/bin/perl /usr/bin/chronic flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify
dandi    4052385  0.0  0.0   5548   472 ?        S    May25   0:00           /usr/bin/flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify 
dandi    4052386  0.0  0.0   6952  2060 ?        S    May25   0:00             /bin/bash /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify 
dandi    4052695  0.0  0.0   6952   748 ?        S    May25   0:00               /bin/bash /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify
dandi    4059070  0.0  0.6 1499228 448460 ?      Dl   May25   3:29                 /home/dandi/miniconda3/envs/dandisets-2/bin/python /home/dandi/miniconda3/envs/dandisets-2/bin/backups2datalad -l WARNING --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 5 -e 000(026|108|243|876)$ --mode verify
dandi    4059071  0.0  0.0   6384  2248 ?        S    May25   0:00                 grep -v nothing to save, working tree clean

and looking at the log /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2024.05.25.05.53.54Z.log I see lots of

2024-06-06T05:21:13-0400 [DEBUG   ] backups2datalad: Finished [rc=1]: git -c receive.autogc=0 -c gc.auto=0 config --file .datalad/config --get dandi.dandiset.embargo-status [cwd=/mnt/backup/dandi/dandisets/001022]
2024-06-06T05:21:13-0400 [INFO    ] backups2datalad: Dandiset 001022: Updating metadata file
2024-06-06T05:22:13-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 1.001618 seconds as it raised PoolTimeout:
2024-06-06T05:23:14-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 1.903866 seconds as it raised PoolTimeout:
2024-06-06T05:24:16-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 4.186317 seconds as it raised PoolTimeout:
2024-06-06T05:25:21-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 7.632318 seconds as it raised PoolTimeout:
2024-06-06T05:26:18-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000989/versions/draft/ in 487.782007 seconds as it raised PoolTimeout:
2024-06-06T05:26:28-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 15.465852 seconds as it raised PoolTimeout:
2024-06-06T05:27:44-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 30.563104 seconds as it raised PoolTimeout:
2024-06-06T05:29:14-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 65.885731 seconds as it raised PoolTimeout:
2024-06-06T05:29:50-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000988/versions/draft/ in 989.482749 seconds as it raised PoolTimeout:
2024-06-06T05:31:20-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 129.670206 seconds as it raised PoolTimeout:
2024-06-06T05:34:30-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 265.165215 seconds as it raised PoolTimeout:
2024-06-06T05:35:26-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000989/versions/draft/ in 1043.047437 seconds as it raised PoolTimeout:
2024-06-06T05:39:56-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 494.584960 seconds as it raised PoolTimeout:
2024-06-06T05:47:20-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000988/versions/draft/ in 2034.338609 seconds as it raised PoolTimeout:
2024-06-06T05:49:10-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 1017.563094 seconds as it raised PoolTimeout:
2024-06-06T05:53:49-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000989/versions/draft/ in 2058.971743 seconds as it raised PoolTimeout:
2024-06-06T06:07:08-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/001022/versions/draft/ in 2065.306574 seconds as it raised PoolTimeout:
2024-06-06T06:22:14-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000988/versions/draft/ in 4080.329725 seconds as it raised PoolTimeout:
2024-06-06T06:29:08-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000989/versions/draft/ in 4039.886215 seconds as it raised PoolTimeout:
2024-06-06T06:39:52-0400 [ERROR   ] backups2datalad: Job failed on input <Dandiset 000986/draft>:
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/site-packages/anyio/_core/_tasks.py", line 115, in fail_after
    yield cancel_scope
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/site-packages/httpcore/_synchronization.py", line 123, in wait
    await self._anyio_event.wait()
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1621, in wait
    await self._event.wait()
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f1ccdb91cf0

which if ran on command line passes just fine and quick (curl -X 'GET' \ 'https://api.dandiarchive.org/api/dandisets/000988/versions/draft/' \ -H 'accept: application/json')

I wonder now if that is related to that network switch changeover / MTU fiasco, but I can't fathom why/how it would affect this process specifically. I will interrupt it now and do update without verify "manually" and then will do with verify (unless cron beats me to those)... drogon overall seems a bit overwhelmed ATM as well :-/

@yarikoptic
Copy link
Member Author

I do not think it relates to MTUs since I do have 1500 ATM for the outside interface

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 129.170.233.10  netmask 255.255.255.192  broadcast 129.170.233.63
        inet6 fe80::ec4:7aff:fe1d:c450  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:1d:c4:50  txqueuelen 1000  (Ethernet)
        RX packets 213498991317  bytes 315123181593716 (286.6 TiB)
        RX errors 0  dropped 426903  overruns 0  frame 0
        TX packets 8821365339  bytes 554271074390 (516.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.6.149  netmask 255.255.255.0  broadcast 192.168.6.255
        inet6 fe80::ec4:7aff:fe1d:c451  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:1d:c4:51  txqueuelen 1000  (Ethernet)
        RX packets 1147  bytes 219672 (214.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 56  bytes 10843 (10.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

and we have a similar stuck, and curl working just fine. I think it is somewhere within python stack used here. Will need to cause upgrades and see if resolves.

@yarikoptic yarikoptic reopened this Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug:stall report Report of a backup2datalad run getting stuck or stalling
Projects
None yet
Development

No branches or pull requests

2 participants