-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up conflict event #108
Conversation
On the NER p-codes, the framework (by way of the Adminlevel class) should attempt to match NER to NE |
Pull Request Test Coverage Report for Build 9307755362Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great find about the duplicate checking being the bottleneck! As I mentioned below, try making the list into a set - I think searching is O(1) with sets so that should also offer a speedup (although maybe with your change this will be minimal).
I say go for it with adding all of the dates - it should be faster on github as you say.
@@ -46,6 +47,7 @@ def populate(self): | |||
values = admin_results["values"] | |||
|
|||
for admin_code in admin_codes: | |||
admin_rows = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try making this a set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I can't add a dict to a set! Unhashable, so I think I'm stuck with list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's still needed, but you can use a named tuple instead of a dict and then you could add that to a set. Ex:
from collections import namedtuple
# defined named tuple and its fields
MyTuple = namedtuple('MyTuple', ['resource_hdx_id', 'events'])
item = MyTuple(resource_hdx_id='abcd-efgh', events=5)
a_set = set()
a_set.add(item1)
# If you need the item as a dict
item1._asdict() # {'resource_hdx_id': 'abcd-efgh', 'events': 5}
Not sure how big admin_rows
gets. But if it does, having a set can make a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and it didn't make any difference when I ran the full set of data (all countries and all years). Next PR I can switch it to a set and see if it's faster on github. admin_rows will have a max length of around 100 for most cases, going up to several hundred for a few countries.
_BATCH_SIZE = 1000 | ||
|
||
|
||
def batch_populate(rows: List[Dict], session, DBTable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Fixed the issue in NER, just needed to add in the ISO code! I also removed the date filter. |
Update operational presence for AFG
I think the main issue was the checking for duplicates against a very long list. I've rearranged some of the logic to check against a smaller list and added batch committing in a utility function that operational presence uses as well.
For now I limited the data to 2023 and 2024 instead of the full date range. It took ~30 minutes to download and populate the conflict table this way, and ~50 minutes when all years are included. It could definitely be faster on GitHub Actions, do you think I should remove the filter?
There's an issue with the p-codes in NER - they start with "NER" in the conflict data and "NE" in the CODs. I can't remember if there's a way to address this in the yaml.