failed at task download-video #353
How long do you think the archive was running when this happened? Could you provide logs from the API container when the workflows failed, if possible?
The task terminated just over 3h31m after it started:
I'm not sure how to go back in the API logs. Are they logged to a file? If so, where? I couldn't find them in /logs.
I have some changes available on the
If you're not pushing the logs to a central location, you'll be limited to the history of
Thanks, have deployed that, will let you know.
I had this happen again today with the changes deployed, this time after about 1h40m of the stream (so it doesn't seem to be length-related). Here was the notification:
API log:
The temporal processes have quit:
Can you supply the full API logs from when this error occurred?
API log here
Temporal log here
Error occurred around 2024-01-19 11:33
The API logs appear to only have worker-related logs. Can you enable debug logging?
I'll turn debug on and let you know if/when it happens again. The API logs I provided were collected with
Sorry, I forgot how few logs are printed when debug isn't enabled.
@c-hri-s I deleted your message as your log contains your Twitch token.
When the error notification was fired, does the video download still continue? If you view the failed workflow for the video download in the Temporal UI, are you able to find the exit reason? I believe something is causing the streamlink download process to error out, which causes the workflow to return a workflow error, which basically stops the rest of the workflows. I can't confirm for sure unless you're able to find the exit reason in the workflow.
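For context, the tasks run as sequential Temporal activities inside the archive workflow, so one failed activity ends the run before the later tasks are ever scheduled. Roughly like this simplified Go sketch (hypothetical activity names, not the actual Ganymede code):

```go
package archive

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Simplified sketch: if the download activity returns an error, the workflow
// returns immediately and the convert/move activities below are never run.
func ArchiveVideoWorkflow(ctx workflow.Context, videoID string) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: 12 * time.Hour}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// Hypothetical activity names, for illustration only.
	if err := workflow.ExecuteActivity(ctx, "DownloadVideo", videoID).Get(ctx, nil); err != nil {
		return err // download failed -> the rest of the workflow stops here
	}
	if err := workflow.ExecuteActivity(ctx, "ConvertVideo", videoID).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "MoveVideo", videoID).Get(ctx, nil)
}
```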
The video download continues to the end; the file stays in /tmp once it completes. I checked the segment of the video where the error occurred and it seemed okay. I tidied things up last night, so I have lost the temporal logs. I'll check in the UI next time it happens and let you know.
I've had another failure and can access the Temporal UI. Not sure if these are any help (pulled from the failed tasks):
Something is causing the heartbeat to die, which in turn causes an error in the live video archive. I don't kill the streamlink process when an error is detected...I probably should. If you still have the API container logs from when this occurred (with debug enabled), can you post them? Also, do you see a log similar to the following? If so, anything of interest near it?
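For reference, this is roughly how the heartbeat and the streamlink process interact in a Temporal Go activity. A minimal sketch under assumptions, not the actual Ganymede implementation; the explicit kill on cancellation is the part that's currently missing:

```go
package archive

import (
	"context"
	"os/exec"
	"time"

	"go.temporal.io/sdk/activity"
)

// Minimal sketch of a download activity that heartbeats while streamlink runs.
// If heartbeats stop, Temporal fails the activity with a heartbeat timeout and
// the activity context is cancelled, but the child process keeps running
// unless it is explicitly killed when the context is done.
func DownloadVideo(ctx context.Context, url, dest string) error {
	cmd := exec.Command("streamlink", url, "best", "-o", dest)
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			activity.RecordHeartbeat(ctx, "downloading") // keeps the activity alive
		case <-ctx.Done():
			_ = cmd.Process.Kill() // activity cancelled or timed out: stop streamlink too
			return ctx.Err()
		case err := <-done:
			return err // streamlink exited (success or failure)
		}
	}
}
```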
ganymede-api.zip
Logs are attached; I couldn't see the message above in either of them. On the workflow page, the first failed task started at 2024/02/01 21:47:25 and terminated at 2024/02/02 01:05:31, to give you a timeframe to look at.
The temporal container logs make it seem like it's an issue with the sqlite database given the "context deadline exceeded" error. This error can mean a number of things but I'm guessing it's related to IO issues or the sqlite database itself. I've seen this error before in #339 on Synology devices. I'm using the "not-so-production dev CLI" to run the temporal server as it's simply easier and the more production version adds some complexity. I would like you to try the more production version though, to see if this helps at all. If you're willing, please follow the below instructions to temporarily swap over.
version: "3.5"
services:
  postgresql:
    container_name: temporal-postgresql
    environment:
      POSTGRES_PASSWORD: temporal
      POSTGRES_USER: temporal
    image: postgres:13
    volumes:
      - ./temporal_postgres:/var/lib/postgresql/data
  temporal:
    container_name: temporal
    depends_on:
      - postgresql
    environment:
      - DB=postgresql
      - DB_PORT=5432
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
      - POSTGRES_SEEDS=postgresql
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development-sql.yaml
    image: temporalio/auto-setup:latest
    ports:
      - 7333:7233
    volumes:
      - ./dynamicconfig:/etc/temporal/config/dynamicconfig
  temporal-ui:
    container_name: temporal-ui
    depends_on:
      - temporal
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TEMPORAL_CORS_ORIGINS=http://localhost:3000
    image: temporalio/ui:latest
    ports:
      - 8028:8080
Then create dynamicconfig/development-sql.yaml (mounted into the temporal container above) with:
limit.maxIDLength:
  - value: 255
    constraints: {}
system.forceSearchAttributesCacheRefreshOnRead:
  - value: true # Dev setup only. Please don't turn this on in production.
    constraints: {}
Now you need to update Ganymede to use this new Temporal instance. When you have no running archives, perform the following.
Let me know how it goes.
Thanks - all done (some of the ports you've listed in the instructions above weren't quite right, but I knew what you meant). In common with the other issue you mentioned, I am using a Synology (DS1821+). It is configured with a 512GB read-write SSD cache, so writes should be quick (as should reads of commonly-used data). Although you don't have any control over what it caches, it reports a 95-99% hit rate.
That's the default image that Twitch provides when it doesn't yet have a screenshot of the stream. I mitigated this before by re-fetching the thumbnails after 15 minutes. I don't believe I got that set up again after migrating to the workflows; I'll look into doing that. In the meantime, you can copy the
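The idea is just a delayed re-fetch once Twitch has generated a real thumbnail. A rough Go sketch of what that could look like as a Temporal workflow (hypothetical names, not the current code):

```go
package archive

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Illustrative only: wait 15 minutes of workflow time, then fetch the
// thumbnails again so the placeholder image Twitch serves at stream start
// gets replaced. The activity name is hypothetical.
func RefreshThumbnailWorkflow(ctx workflow.Context, videoID string) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: 5 * time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	if err := workflow.Sleep(ctx, 15*time.Minute); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "FetchThumbnails", videoID).Get(ctx, nil)
}
```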
No further issues since moving to the temporal/postgresql setup. Not sure what you want to do with this issue; do you want me to close it? I guess staying on the alternative temporal setup is okay for now, but not so much if it diverges from whatever you're doing with the core release.
I'll figure out how to switch from the not-so-production-ready Temporal container that I'm currently deploying to the real version that has been working fine for you. It may require an extra step or two for new users to set up, but that's worth it if it works for everyone.
v2.1.0 has been released with steps to migrate to the new Temporal container. I was able to simplify some things and reuse the existing DB, so I recommend performing the steps outlined on the release page.
Running :latest, I have seen errors on long live streams which hang the processing of the video.
Today I received notification of this error:
⚠️ Error: Queue ID 95e00ab2-9e94-451c-9aab-97bb9efe36d5 for DJCquence failed at task download-video.
Despite this the video seemed to continue downloading, just the tasks were killed.
Here's the video log:
The queue shows it as processing, but the 'video download' and 'chat download' tasks are still executing. It didn't move to the convert tasks.
The video file is in /tmp and never got moved to the regular archive folder.
In the workflows I see four failed tasks and a terminated task:
I assume the issue is the temporary download error which kills the tasks.
Is there a way to be more tolerant of transient failures, or somehow add the ability to resume the conversion?
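For example, as far as I understand Temporal, activities can be given a retry policy so a transient error is retried instead of failing the whole workflow; something like this minimal sketch (my assumption, not how Ganymede is configured today):

```go
package archive

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// Illustrative sketch of the kind of tolerance I mean: retry the download
// activity a few times with backoff before giving up on the workflow.
func downloadWithRetries(ctx workflow.Context, videoID string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 12 * time.Hour,
		HeartbeatTimeout:    time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    10 * time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	// Hypothetical activity name, for illustration only.
	return workflow.ExecuteActivity(ctx, "DownloadVideo", videoID).Get(ctx, nil)
}
```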
Thanks