
Investigate project backup export from biosoundscape project #2037

Closed
koonchaya opened this issue Jun 11, 2024 · 17 comments

@koonchaya

Original report: https://rfcx.slack.com/archives/C03FD1WD02J/p1718028385095719
@carlybatist

A user reported that the number of recordings and templates exported from the project didn't match the data in the project.
Project https://arbimon.org/p/biosoundscape/overview

"When I made a backup of our BioSoundSCape project, the zip did not include the recordings.csv that has the AWS links to recordings. This file is invaluable. Also, it doesn't appear that we got a full list of templates in templates.csv. Maybe this is because there is a lot of data in BioSoundSCape?"
This project - https://arbimon.org/p/biosoundscape/overview

Additional information
I exported the files from the project and found that the number of recordings and templates didn't match.
Export file https://drive.google.com/file/d/1_l2LmgDrn_CIMDx23Bq5JWIlVhmF15A2/view?usp=sharing

@koonchaya koonchaya added this to the User support milestone Jun 11, 2024
@grindarius
Contributor

grindarius commented Jun 14, 2024

I dug through the logs and found that the error is at the SQL level. Most statements failed to run, likely because the server was at maximum load. See the error screenshots below.

[Four screenshots of the SQL errors from the logs]

The error comes from legacy, but because our backup system is designed to be fault-tolerant, the export still completes even if every query fails. Once a query fails, however, the remaining batches are not queried, so if the first batch fails you get nothing.

The problem is that the query took too long to run. I think there are a couple of places where we can improve it.

  • Use an index-based pointer on tables that have a sortable index.
  • Reduce the database query chunk size to something like 50k rows, but still save files at 200k rows as normal.
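
For illustration only, the two ideas above could be sketched as follows. This is a minimal sketch, not the project's actual export code: it uses an in-memory-friendly `sqlite3` connection, and the function names `fetch_chunks` and `group_into_files`, the column list, and the assumption of a positive integer primary key in the first column are all hypothetical.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple

QUERY_CHUNK = 50_000     # rows fetched per query (size suggested above)
ROWS_PER_FILE = 200_000  # rows written per CSV file (unchanged)


def fetch_chunks(conn: sqlite3.Connection, table: str, columns: Sequence[str],
                 chunk_size: int = QUERY_CHUNK) -> Iterator[List[Tuple]]:
    """Yield rows in primary-key order using an index-based pointer.

    Assumes columns[0] is a positive integer primary key (hypothetical schema).
    Each query seeks past the last id seen instead of using OFFSET.
    """
    pk = columns[0]
    sql = (f"SELECT {', '.join(columns)} FROM {table} "
           f"WHERE {pk} > ? ORDER BY {pk} LIMIT ?")
    last_id = 0
    while True:
        rows = conn.execute(sql, (last_id, chunk_size)).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # pointer for the next query


def group_into_files(chunks: Iterator[List[Tuple]],
                     rows_per_file: int = ROWS_PER_FILE) -> Iterator[List[Tuple]]:
    """Regroup small query chunks into file-sized batches."""
    buffer: List[Tuple] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= rows_per_file:
            yield buffer[:rows_per_file]
            buffer = buffer[rows_per_file:]
    if buffer:
        yield buffer  # last, possibly shorter, file
```

Keeping the query size and the file size as separate settings means the database only ever serves short queries while the files in the zip keep their current size.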

@carlybatist

carlybatist commented Jun 14, 2024

@grindarius ok how long do you anticipate it will take to implement these fixes so that we can check if they work in fixing the issue?
We need the backup to work for large projects. And if an error means that not all rows will be exported, the user needs to see an error message saying so. The user only realized this was a problem when they double-checked the CSVs against the project data.
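
One way such a check could work is to compare the row counts the export wrote against the counts expected from the project, and report any difference. The sketch below is only an illustration under that assumption; `find_count_mismatches` and the dictionary inputs are hypothetical, and how the expected counts are obtained is left open.

```python
from typing import Dict, List


def find_count_mismatches(expected: Dict[str, int],
                          exported: Dict[str, int]) -> List[str]:
    """Return one message per export file whose row count is off.

    `expected` maps a file name to the row count in the project;
    `exported` maps a file name to the rows actually written.
    A file missing from `exported` counts as 0 rows.
    """
    problems = []
    for name, want in sorted(expected.items()):
        got = exported.get(name, 0)
        if got != want:
            problems.append(f"{name}: expected {want} rows, exported {got}")
    return problems
```

An empty list means the export matches; a non-empty list could be turned into the error message shown to the user.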

@antonyharfield
Member

"Reduce database chunk query to something like 50k"

That sounds reasonable.

This query is very light because the only ordering is by PK, so it shouldn't need to do much to read this data. It more likely failed because there was a lot of db activity at the same time. We could try some exponential backoff: if a query fails, retry after 10 sec, then 20 sec, then 40 sec, then 80 sec, and after that fail completely.
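
The backoff schedule described here could look roughly like this. It is a minimal sketch: `run_with_backoff` is a hypothetical helper, and catching every `Exception` is a simplification (real code would probably retry only on database errors).

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_backoff(query: Callable[[], T],
                     delays: Sequence[float] = (10, 20, 40, 80),
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run query(); after each failure wait 10 s, 20 s, 40 s, then 80 s.

    If the attempt after the last wait also fails, its exception
    propagates so the caller can fail the job.
    """
    for delay in delays:
        try:
            return query()
        except Exception:
            sleep(delay)  # wait before the next attempt
    return query()  # final attempt; an exception here is not caught
```

The `sleep` parameter is only there so the waits can be replaced in tests.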

If one of the queries fails then I think the whole job should fail -- we don't want to continue and send the user incomplete data.
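
As a rough sketch of that all-or-nothing behavior: write every CSV into a temporary directory and build the zip only if all of them succeed. The names `export_all_or_nothing` and `ExportFailed`, and the idea of passing one row-producing callable per file, are assumptions made for this example.

```python
import csv
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Sequence


class ExportFailed(Exception):
    """Raised when any part of the export fails."""


def export_all_or_nothing(exporters: Dict[str, Callable[[], Iterable[Sequence]]],
                          zip_base: str) -> str:
    """Write one CSV per exporter, then zip them; abort on the first error.

    `exporters` maps a file stem (e.g. "sites") to a callable returning rows.
    Returns the path of the zip file. No zip is created if any exporter fails.
    """
    workdir = Path(tempfile.mkdtemp(prefix="export-"))
    try:
        for name, fetch_rows in exporters.items():
            with open(workdir / f"{name}.csv", "w", newline="") as f:
                csv.writer(f).writerows(fetch_rows())
        # Only reached when every exporter succeeded.
        return shutil.make_archive(zip_base, "zip", root_dir=workdir)
    except Exception as err:
        raise ExportFailed(f"export aborted: {err}") from err
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # never leave partial files
```

The caller can catch `ExportFailed` to trigger a failure notification instead of sending the user incomplete data.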

@carlybatist We are going to need this week to work on some improvements.

@koonchaya
Author

@antonyharfield @grindarius We need to find a solution for the case where the job fails.

@koonchaya
Author

Email the failure status to the user and to [email protected]/Slack.

@koonchaya
Author

Draft email to notify the user of a failed export:

Subject: Arbimon project export failed

Hello,

Thanks so much for using Arbimon! We encountered an issue while backing up your project '...'. Our apologies for the inconvenience. Please contact our support team at [[email protected]] for assistance.

@antonyharfield @carlybatist Can you check whether this message needs any changes?

@carlybatist

@koonchaya
Tech team would be getting this error notification too right? They should then immediately start looking into it as a support ticket. So I would think the email to the user should be informing them that there was an error and that our team is looking into it and will update them. Noon and I should be auto-cc'd on these emails to users too. So it would be --

Hello,

There was an issue with your project backup of '...'. Our engineering team is looking into this and will update you when we have resolved it. We apologize for the inconvenience and thank you for your patience!

All the best,
Arbimon team

@koonchaya
Author

Ideally, @carlybatist and I would get the email forwarded from [email protected]. I am not sure whether the eng team will get an alert elsewhere.

@carlybatist

@koonchaya @grindarius what do you expect the timeline for fixing the underlying issue will be?

@koonchaya
Author

I tested the backup export from the project https://staging.arbimon.org/p/bci-panama-2018/overview
@grindarius here is some feedback:

playlists.csv

pattern_matchings.csv

  • pattern_matching_rois.csv is empty, but there should be some ROI data

pattern_matching_rois.csv

recordings.csv

  • recording_validations.csv is empty, but there are some validations in the project

recording_validations.csv


  • rfm_models.csv is empty

rfm_models.csv

  • sites.csv is empty

sites.csv

  • soundscapes.csv is empty

soundscapes.csv

  • species.csv is empty

species.csv

  • templates.csv is empty

templates.csv

grindarius added a commit that referenced this issue Jun 24, 2024
@koonchaya
Author

koonchaya commented Jun 24, 2024

@grindarius

pattern_matchings.csv

  • I believe pattern_matching_rois.csv contains the same data as above, which comes from PM jobs that are not listed in the project.

pattern_matching_rois.csv

  • playlists.csv is good, but it contains 2 playlists that I could not find in the project:
    • playlist id: 8802 (Test cluster1)
    • playlist id: 8878 (Test playlist not filter)

playlists.csv

  • rfm_models.csv contains jobs that are not in the project (rows that are not highlighted in the file were not found in the project)

rfm_models.csv

  • rfm_classifications.csv seems to have more data in the export than in the project

rfm_classifications_001.csv
rfm_classifications_002.csv
rfm_classifications_003.csv
rfm_classifications_004.csv
rfm_classifications_005.csv
rfm_classifications_006.csv
rfm_classifications_007.csv
rfm_classifications_008.csv

  • templates.csv contains fewer species than the project

templates.csv

@grindarius
Contributor

For the pattern matchings export, we read data directly from the pattern_matchings table. What you see in the UI are jobs that are joined with the jobs table. Reading pattern_matchings directly returns every job in that table, even jobs that have no related row in the jobs table. The same goes for pattern_matching_rois.

For playlists export I did find both playlists inside the export file so it's all good.

For RFM models, deleted models are being exported into the file.

The same goes for RFM classifications: we had no condition to filter out deleted classifications.
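
To make the difference concrete, here is a small self-contained sketch using `sqlite3`. The table names `pattern_matchings` and `jobs` come from the explanation above; every column name (`job_id`, `project_id`, `deleted`) and the sample rows are assumptions for illustration, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, project_id INTEGER);
CREATE TABLE pattern_matchings (
    pattern_matching_id INTEGER PRIMARY KEY,
    project_id INTEGER,
    job_id INTEGER,
    deleted INTEGER DEFAULT 0
);
INSERT INTO jobs VALUES (1, 7), (2, 7);
INSERT INTO pattern_matchings VALUES
    (10, 7, 1, 0),     -- has a job row, not deleted
    (11, 7, 2, 1),     -- has a job row, but marked deleted
    (12, 7, NULL, 0);  -- no related job row
""")

# Direct read: returns every row for the project.
DIRECT = """
SELECT pattern_matching_id FROM pattern_matchings
WHERE project_id = ? ORDER BY pattern_matching_id
"""

# Joined and filtered read: only rows with a job row that are not deleted.
JOINED = """
SELECT pm.pattern_matching_id
FROM pattern_matchings pm
JOIN jobs j ON j.job_id = pm.job_id
WHERE pm.project_id = ? AND pm.deleted = 0
ORDER BY pm.pattern_matching_id
"""

direct_ids = [row[0] for row in conn.execute(DIRECT, (7,))]
joined_ids = [row[0] for row in conn.execute(JOINED, (7,))]
print(direct_ids)  # [10, 11, 12]
print(joined_ids)  # [10]
```

With this toy data the direct read returns three rows while the joined, filtered read returns one, which is the same kind of gap as the export showing more jobs than the project does.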

@koonchaya
Author

@grindarius

  • pattern_matchings.csv
    • in project = 6 jobs
    • in export = 50
  • rfm_classifications.csv
    • only 3 jobs in the project but there are a lot more models in the export
      The rest of them are correct

@RatreeOchn
Contributor

Released in v1.4.2

grindarius added a commit that referenced this issue Jun 27, 2024
grindarius added a commit that referenced this issue Jun 27, 2024
I refactored most parts into functions for easier abstraction and flexibility. I also reworked the recordings export to query recordings by site, because querying recordings by project_id seems to take a long time.
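
A per-site query of this kind might look like the sketch below. It is only an illustration: the `sites` and `recordings` tables and their columns (`site_id`, `project_id`, `recording_id`, `uri`) are a hypothetical schema, and `recordings_by_site` is not the function from the commit.

```python
import sqlite3
from typing import Iterator, Tuple


def recordings_by_site(conn: sqlite3.Connection,
                       project_id: int) -> Iterator[Tuple[int, int, str]]:
    """Yield a project's recordings one site at a time.

    Looks up the project's site ids first, then runs one smaller
    query per site instead of a single query over the whole project.
    """
    site_ids = [row[0] for row in conn.execute(
        "SELECT site_id FROM sites WHERE project_id = ? ORDER BY site_id",
        (project_id,),
    )]
    for site_id in site_ids:
        yield from conn.execute(
            "SELECT recording_id, site_id, uri FROM recordings "
            "WHERE site_id = ? ORDER BY recording_id",
            (site_id,),
        )
```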
grindarius added a commit that referenced this issue Jun 27, 2024
@carlybatist

@koonchaya I see this is closed. Can I tell the user to try the backup again? Or is there still more work to be done on the backup performance?

@grindarius
Contributor

"@koonchaya I see this is closed - can I tell the user to try the backup again? Or is there still more work to be done for the backup performance"

@carlybatist We are testing the perf upgrades on staging right now. I would expect the user to be able to export all the files tomorrow. I have yet to get all of the performance changes onto production because I wanted to make sure we did not miss anything.

@carlybatist

Ok no worries thanks for the update! Just wasn't sure of the status since the issue had been 'closed'.
