csv batch search and replace #10

Closed
brendanheywood opened this issue Sep 18, 2024 · 0 comments

The use case here is replacing a large number of similar URLs where the IDs need to be explicitly remapped, e.g. migrating from service A to service B:

```
search,replace
https://www.youtube.com/watch?v=abcd,https://video.com/watch?123456
https://www.youtube.com/watch?v=bcde,https://video.com/watch?234567
https://www.youtube.com/watch?v=cdef,https://video.com/watch?345678
```

Because most of the URLs will have a similar shape, we can do things like pre-scanning with the shared prefix to filter out records we don't need to touch, which in theory makes this much faster than doing a simple replace_all once for each mapping.
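
As a minimal sketch of that pre-scan, the shared prefix can be computed from the search keys themselves (the helper function and mapping data below are just illustrative, not an existing API):

```php
<?php
// Longest common prefix of all search keys; illustrative helper, not an
// existing API. The result can be used as a single LIKE '%prefix%' filter
// so we only fetch records that could possibly contain a match.
function common_prefix(array $keys): string {
    $prefix = array_shift($keys);
    foreach ($keys as $key) {
        // Trim the candidate prefix until the current key starts with it.
        while ($prefix !== '' && strncmp($key, $prefix, strlen($prefix)) !== 0) {
            $prefix = substr($prefix, 0, -1);
        }
    }
    return $prefix;
}

$mappings = [
    'https://www.youtube.com/watch?v=abcd' => 'https://video.com/watch?123456',
    'https://www.youtube.com/watch?v=bcde' => 'https://video.com/watch?234567',
    'https://www.youtube.com/watch?v=cdef' => 'https://video.com/watch?345678',
];
echo common_prefix(array_keys($mappings));
// https://www.youtube.com/watch?v=
```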

We also need to make sure that in any given chunk of content each key is mapped to exactly one value, and that replacements don't get applied multiple times. In general we wouldn't expect this in most use cases anyway, i.e. in the example we would never have a YouTube URL on the right-hand side which could then be matched again by the left-hand side on a second pass.
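
If the replace ends up happening in PHP, it's worth noting that strtr() with an array of pairs already gives exactly this guarantee: it walks the content once, prefers the longest matching key, and never re-scans text it has just substituted. A small sketch using the example mappings:

```php
<?php
$mappings = [
    'https://www.youtube.com/watch?v=abcd' => 'https://video.com/watch?123456',
    'https://www.youtube.com/watch?v=bcde' => 'https://video.com/watch?234567',
];

$content = 'See https://www.youtube.com/watch?v=abcd and https://www.youtube.com/watch?v=bcde';

// Single pass over $content; replaced text is never matched again, so even
// if a replacement value contained another search key it would be left alone.
echo strtr($content, $mappings);
// See https://video.com/watch?123456 and https://video.com/watch?234567
```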

It is assumed that we can't do this easily in SQL, so we need to do some performance testing of doing this in PHP as a batch replace with all the IO, versus doing it in SQL repeatedly. It may also be a hybrid, i.e. we search in SQL for the records which contain the shared prefix, then use that subset of record ids to do a faster pure SQL replace repeatedly for each mapping. It all comes down to speed testing. It is assumed that in the general case the number of mappings could be huge, like 100,000s, so the gut feel is that doing it in PHP is going to end up better overall.
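
A rough sketch of what the hybrid could look like, assuming Moodle's $DB API and hypothetical $table / $field / $prefix / $mappings variables from the earlier sketches (one possible shape, not a settled design): pre-filter in SQL on the shared prefix, then do the multi-key replace in PHP and write back only the rows that actually changed.

```php
<?php
// Hypothetical hybrid: SQL narrows the candidate set, PHP does the replace.
$like = $DB->sql_like($field, ':needle');
$params = ['needle' => '%' . $DB->sql_like_escape($prefix) . '%'];

$rs = $DB->get_recordset_select($table, $like, $params, '', "id, $field");
foreach ($rs as $record) {
    $new = strtr($record->$field, $mappings);
    // Only write back real changes; re-running after a crash naturally
    // skips rows that were already migrated, which keeps this idempotent.
    if ($new !== $record->$field) {
        $DB->set_field($table, $field, $new, ['id' => $record->id]);
    }
}
$rs->close();
```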

This whole process should be idempotent by design, so if it breaks we just run it again, it picks up from where it left off and keeps going, and it's no big deal.

Also unclear is whether we want to load the data into a temporary db table or just into memory. This probably depends on whether we end up with a pure SQL replace or a PHP replace.
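
For scale: even hundreds of thousands of short URL pairs should fit in memory on a typical server, though PHP array overhead is worth measuring; a temp table mainly makes sense if the pure SQL route wins. A minimal sketch of the in-memory load, assuming a header row as in the example and a hypothetical mappings.csv path:

```php
<?php
// Minimal in-memory load of the mapping CSV; the filename is hypothetical.
$mappings = [];
$fh = fopen('mappings.csv', 'r');
fgetcsv($fh); // Skip the search,replace header row.
while (($row = fgetcsv($fh)) !== false) {
    if (count($row) >= 2) {
        $mappings[$row[0]] = $row[1];
    }
}
fclose($fh);
```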

One thing currently unclear is whether we want to leave behind any audit trail of changes. This would be quite hard in most cases, but if we know we are replacing data in, say, the course table, then we could emit a Moodle event so there is evidence of the edit. The status quo here is no.
