Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: parallelize #3

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Feat: parallelize #3

wants to merge 3 commits into from

Conversation

Murray2015
Copy link
Collaborator

Hey Evangelos. I've had a play with your code and got parallelization working. This PR does the following:

  1. Adds a requirements.txt so the repo can be installed more easily
  2. Removes all hard-coded paths from the codebase and moves them up several levels to main (because otherwise, only your laptop can run this code).
  3. Adds a command line interface in main, so that main can be called by other users. This is optional and defaults to all your hardcoded paths, which now live in main. However, if you want to call it via the command line, it can be called like so: python main.py --geojson_dir geojson_files --log_dir logs --output_path output/. This is different to the makefile approach, which doesn't play nicely with the command line for passing file paths.
  4. Parallelizes the code. This is done at the download and processing level rather than at the country level. i.e., it will sequentially run each country, but all the downloaders for that country will then run in parallel. This will make it faster to run for all countries, but it will also make it faster to run if just a single country is being run.
  5. Note that I haven't tested this extensively. I spot lots of warning and error messages in the output. Lots were pre-existing before I touched the code, but there may be bugs I've introduced hidden in there now. You'll probably have more context than I do to be able to debug them if so. For context, removing the paths, and refactoring to only read the geojson once were the key steps to getting parallelization working - the first errors I got were around resource contention as multiple downloaders tried to read the geojson file at the same time. However, this might be a source of bugs if any of the downloaders mutate the data frame (I don't think any do, but I haven't read every line of code in the codebase)
  6. I'm now off on holiday for a week, so will pick up any replies to this when I'm back!
  7. p.s. I didn't actually ask if it was ok to start messing around in your code, which is very bad etiquette - hopefully it's ok, and don't feel any pressure to merge the code in if you hate my approach!

Murray Hoggett added 3 commits May 3, 2024 20:21
…eers can work on the codebase. Kept defaults in main so it should keep working for Evangelos locally without changes
…ore I added this code. But I spot some complaining about mutating data in a dataframe copy, so possible one of the downloaders is mutating the geodataframe. Might be fine or might not be, needs investigation.
@Murray2015 Murray2015 requested a review from ediakatos May 4, 2024 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant