
Ingest: Multiple (repeated) ingests of tabular data files #6510

Closed
landreev opened this issue Jan 13, 2020 · 6 comments · Fixed by #6663

@landreev
Contributor

Back in 2018 we enabled ingest of tab-delimited text files. (This was done as part of the CSV ingest improvement, by reusing the same parsing code with TAB as the delimiter character.)
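For illustration only, here is a minimal sketch of that general approach, assuming Apache Commons CSV as the parsing library (this is not the project's actual ingest plugin code): the same parsing routine serves both formats, and only the delimiter character changes.

```java
import java.io.Reader;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class DelimitedIngestSketch {

    // One parsing routine for both CSV and tab-delimited input;
    // only the delimiter character differs.
    static void parse(Reader in, char delimiter) throws Exception {
        CSVFormat format = CSVFormat.DEFAULT.withDelimiter(delimiter);
        try (CSVParser parser = new CSVParser(in, format)) {
            for (CSVRecord record : parser) {
                System.out.println("row with " + record.size() + " columns");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        parse(new StringReader("a,b,c\n1,2,3\n"), ',');      // CSV
        parse(new StringReader("a\tb\tc\n1\t2\t3\n"), '\t'); // tab-delimited
    }
}
```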

Apparently, we now have a condition where a successfully ingested file can get picked up for ingest AGAIN, because its content type is tab-delimited and tab-delimited files are now ingestable. This should of course never happen, because the file already has a datatable object associated with it; but it apparently does on occasion, and the repeat ingest corrupts both the generated tab file and the saved original.
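A guard along these lines is what should make a repeat ingest impossible; the class and field names below are hypothetical, a sketch of the idea rather than Dataverse's actual code.

```java
import java.util.Set;

// A minimal, hypothetical sketch (not Dataverse's actual classes): a file that
// already carries a datatable must never be queued for ingest again.
class DataTable { }

class DataFile {
    DataTable dataTable;   // set by a successful ingest
    String contentType;
}

public class IngestGuardSketch {
    static final Set<String> INGESTABLE_TYPES = Set.of("text/csv", "text/tsv");

    static boolean shouldIngest(DataFile f) {
        if (f.dataTable != null) {
            // Already ingested once; re-running ingest would overwrite the
            // generated tab file and the saved original.
            return false;
        }
        return INGESTABLE_TYPES.contains(f.contentType);
    }

    public static void main(String[] args) {
        DataFile fresh = new DataFile();
        fresh.contentType = "text/tsv";

        DataFile ingested = new DataFile();
        ingested.contentType = "text/tab-separated-values";
        ingested.dataTable = new DataTable();

        System.out.println(shouldIngest(fresh));    // true
        System.out.println(shouldIngest(ingested)); // false
    }
}
```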

I have a list of the affected files. There are relatively few of these cases, but this is still very annoying and I believe we should treat it as urgent.

@djbrooke
Contributor

@landreev I moved this into the ready column and we'll take it on in a sprint soon. Just to clarify, this would be a code change and some cleanup?

@landreev
Contributor Author

> @landreev I moved this into the ready column and we'll take it on in a sprint soon. Just to clarify, this would be a code change and some cleanup?

Correct.

@landreev
Contributor Author

This doesn't seem to have affected any older files - meaning, any tabular files ingested prior to the introduction of ingest for text/tab-separated-values. So my guess is that it only happens when the initial ingest session goes wrong and starts an extra iteration.
Meaning, my initial "relatively few of these cases" statement appears to still be true.

@landreev landreev changed the title Ingest: Multiple ingests of tabular data files Ingest: Multiple (repeated) ingests of tabular data files Jan 15, 2020
@landreev
Contributor Author

The remaining tasks are mostly cleanup.
In the emergency PR that was merged yesterday (#6517), extra code was added to prevent an ingest from being performed again, should a datafile with an existing datatable be placed in the ingest queue once more.
The original plan was to maintain 2 distinct mime types: "text/tab-separated-values" for ingested files, and "text/tsv" for uningested text files that just happen to be tab-delimited, with only the latter treated as potentially ingestable. Unfortunately this wasn't being applied consistently, but the PR above addresses that as well.
It also fixed an unrelated (but annoying) issue where some uningested tab-delimited files were mistakenly showing "explore" buttons. (I thought there was already an open issue for that, but I couldn't find one.)
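To illustrate the two-type scheme described above (the constants and method name below are assumptions, not the project's code): only the "raw" tab-delimited type is treated as a candidate for ingest, while the type assigned to already-ingested files never is.

```java
// Hypothetical sketch of the two-mime-type scheme described above
// (constant and method names are assumptions, not the project's code).
public final class TabularTypeSketch {
    // Assigned to files that have been successfully ingested.
    static final String TAB_INGESTED   = "text/tab-separated-values";
    // Assigned to plain uploads that merely happen to be tab-delimited.
    static final String TSV_UNINGESTED = "text/tsv";

    // Only the un-ingested type is a candidate for ingest; the ingested type
    // must never be picked up for processing again.
    static boolean ingestable(String contentType) {
        return TSV_UNINGESTED.equals(contentType);
    }

    public static void main(String[] args) {
        System.out.println(ingestable(TSV_UNINGESTED)); // true
        System.out.println(ingestable(TAB_INGESTED));   // false
    }
}
```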

@landreev
Contributor Author

All that said, I still have no idea how it was possible for an ingested file to be added to the processing queue again. Our system of status flags - separate flags for "ingest scheduled" and "ingest in progress" - was supposed to prevent that. But there must be some way for things to go wrong that makes it possible: a page operating on a stale copy of the dataset, allowing it to bypass an existing ingest lock? a system crash leaving the flags and the queue in an inconsistent state? something specific to how the ingest queue is activated when files are uploaded via the API? Whatever it is, it is apparently possible. But if it does happen again, the added checks, and the fact that the ingested mime type is no longer considered ingestable, should prevent any repeated processing from being done.
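Purely as an illustration of how such flags can reject a duplicate scheduling attempt when checked against up-to-date state (all names below are assumptions, not the actual implementation); an atomic check-and-set like this only helps if every caller sees the current flag, so a page working off a stale copy of the dataset would sidestep it, which matches one of the guesses above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the "ingest scheduled" / "ingest in progress" flags;
// none of these names come from the actual codebase.
public class IngestFlagsSketch {
    enum IngestStatus { NONE, SCHEDULED, IN_PROGRESS }

    private final AtomicReference<IngestStatus> status =
            new AtomicReference<>(IngestStatus.NONE);

    // Only the caller that wins the NONE -> SCHEDULED transition may queue the
    // file, so the same file cannot be scheduled twice from a fresh state.
    boolean trySchedule() {
        return status.compareAndSet(IngestStatus.NONE, IngestStatus.SCHEDULED);
    }

    boolean tryStart() {
        return status.compareAndSet(IngestStatus.SCHEDULED, IngestStatus.IN_PROGRESS);
    }

    void finish() {
        status.set(IngestStatus.NONE);
    }

    public static void main(String[] args) {
        IngestFlagsSketch flags = new IngestFlagsSketch();
        System.out.println(flags.trySchedule()); // true  - first scheduling wins
        System.out.println(flags.trySchedule()); // false - duplicate attempt is rejected
    }
}
```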

@djbrooke
Contributor

  • We should add some more code to allow tsv and text/tab-delimited files to be ingested when uploaded via the browser (carefully!). The content type can be specified explicitly when using the API, but with a browser upload we currently let the browser supply it
  • We need to contact the authors of the datasets where these issues exist

@landreev landreev self-assigned this Jan 29, 2020
@djbrooke djbrooke self-assigned this Jan 30, 2020
landreev added a commit that referenced this issue Feb 20, 2020
(this is a *combined* script for BOTH #6510 and #6522!)
landreev added a commit that referenced this issue Feb 20, 2020
This (and the proper release note) SUPERSEDES what was in PR #6522!
i.e. we are sending out only ONE note, not TWO, there's only one
script to run, etc.
(ref. #6510)
landreev added a commit that referenced this issue Feb 21, 2020