
Speed up conflict event #108

Merged: 12 commits merged into main from bugfix/conflict-speed on May 30, 2024
Conversation

b-j-mills (Collaborator):

I think the main issue was checking for duplicates against a very long list. I've rearranged some of the logic to check against a smaller list and added batch committing in a utility function that operational presence uses as well.
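
To illustrate the idea, here is a minimal sketch (hypothetical data and names, not the PR's actual code) of scoping the duplicate check to a short per-admin-code list instead of one very long global list:

# Hypothetical sketch: duplicate-check against a short per-admin-code list.
rows = [
    {"admin_code": "AF01", "events": 5},
    {"admin_code": "AF01", "events": 5},  # duplicate
    {"admin_code": "AF02", "events": 2},
]

all_rows = []
for admin_code in {row["admin_code"] for row in rows}:
    admin_rows = []  # short list, so each membership check stays cheap
    for row in rows:
        if row["admin_code"] != admin_code:
            continue
        if row not in admin_rows:  # scans admin_rows, not all_rows
            admin_rows.append(row)
    all_rows.extend(admin_rows)

print(all_rows)  # the AF01 duplicate has been dropped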

For now I limited the data to 2023 and 2024 instead of the full date range. It took ~30 minutes to download and populate the conflict table this way, and ~50 minutes when all years are included. It could well be faster on GitHub Actions; do you think I should remove the filter?

There's an issue with the p-codes in NER: they start with "NER" in the conflict data and "NE" in the CODs. I can't remember if there's a way to address this in the YAML.

b-j-mills requested review from mcarans and turnerm on May 30, 2024 02:46
mcarans (Contributor) commented May 30, 2024

On the NER p-codes, the framework (by way of the AdminLevel class) should attempt to match NER to NE.

github-actions bot commented May 30, 2024

Test Results

6 tests  ±0   6 ✅ ±0   11m 48s ⏱️ +58s
1 suites ±0   0 💤 ±0 
1 files   ±0   0 ❌ ±0 

Results for commit bfcdb3e. ± Comparison against base commit d17f513.

♻️ This comment has been updated with latest results.

coveralls commented May 30, 2024

Pull Request Test Coverage Report for Build 9307755362

Details

  • 30 of 34 (88.24%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.005%) to 93.31%

Changes missing coverage (covered lines / changed or added lines / %):
  src/hapi/pipelines/database/conflict_event.py          11 / 13   84.62%
  src/hapi/pipelines/database/operational_presence.py     3 /  5   60.0%

Totals (change from base Build 9286633165: -0.005%):
  Covered Lines: 1339
  Relevant Lines: 1435

💛 - Coveralls

turnerm (Member) left a comment:

Really great find that the duplicate checking was the bottleneck! As I mentioned below, try making the list into a set - membership testing is O(1) on average with sets, so that should also offer a speedup (although with your change the gain may be minimal).

I say go for it with adding all of the dates - it should be faster on GitHub, as you say.
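
As an aside, here is a minimal self-contained sketch (hypothetical data, not the pipeline's actual rows) of why set membership beats list membership as the collection grows:

import timeit

rows_list = [("AFG", i) for i in range(100_000)]
rows_set = set(rows_list)
probe = ("AFG", 99_999)

# List membership scans every element: O(n) per lookup.
print(timeit.timeit(lambda: probe in rows_list, number=1_000))
# Set membership hashes the key: O(1) per lookup on average.
print(timeit.timeit(lambda: probe in rows_set, number=1_000))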

@@ -46,6 +47,7 @@ def populate(self):
values = admin_results["values"]

for admin_code in admin_codes:
admin_rows = []
turnerm (Member):

Try making this a set.

b-j-mills (Collaborator, Author):

Unfortunately I can't add a dict to a set! Dicts are unhashable, so I think I'm stuck with a list.

turnerm (Member):

Not sure if it's still needed, but you can use a named tuple instead of a dict and then you could add that to a set. Ex:

from collections import namedtuple

# Define the named tuple and its fields
MyTuple = namedtuple('MyTuple', ['resource_hdx_id', 'events'])

item = MyTuple(resource_hdx_id='abcd-efgh', events=5)
a_set = set()
a_set.add(item)  # named tuples are hashable, so this works

# If you need the item back as a dict
item._asdict()  # {'resource_hdx_id': 'abcd-efgh', 'events': 5}

Not sure how big admin_rows gets, but if it gets large, having a set can make a difference.

b-j-mills (Collaborator, Author):

I checked, and it didn't make any difference when I ran the full set of data (all countries and all years). In the next PR I can switch it to a set and see if it's faster on GitHub. admin_rows will have a max length of around 100 in most cases, going up to several hundred for a few countries.

_BATCH_SIZE = 1000


def batch_populate(rows: List[Dict], session, DBTable):
turnerm (Member):

nice
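
For context, a minimal sketch of what a batch-commit helper along these lines could look like, assuming a SQLAlchemy session (illustrative only, not necessarily the PR's actual implementation):

from typing import Dict, List

_BATCH_SIZE = 1000


def batch_populate(rows: List[Dict], session, DBTable):
    # Commit every _BATCH_SIZE rows instead of once per row,
    # which cuts round trips to the database dramatically.
    batch = []
    for row in rows:
        batch.append(DBTable(**row))
        if len(batch) >= _BATCH_SIZE:
            session.add_all(batch)
            session.commit()
            batch = []
    if batch:  # flush the final partial batch
        session.add_all(batch)
        session.commit()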

b-j-mills (Collaborator, Author):

Fixed the issue in NER; I just needed to add in the ISO code! I also removed the date filter.

Update operational presence for AFG
b-j-mills merged commit 2db832d into main May 30, 2024
3 checks passed
turnerm deleted the bugfix/conflict-speed branch August 27, 2024 10:44