
Speed up conflict event #108

Merged: 12 commits merged into main from bugfix/conflict-speed on May 30, 2024
Conversation

b-j-mills (Collaborator):

I think the main issue was checking for duplicates against a very long list. I've rearranged some of the logic to check against a smaller list and added batch committing in a utility function that operational presence uses as well.
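
To illustrate the idea, here is a minimal sketch (hypothetical data and names, not the PR's actual code) of scoping the duplicate check to a short per-admin-code list instead of one very long global list:

# Hypothetical sketch: duplicate-check against a short per-admin-code list.
rows = [
    {"admin_code": "AF01", "events": 5},
    {"admin_code": "AF01", "events": 5},  # duplicate
    {"admin_code": "AF02", "events": 2},
]

all_rows = []
for admin_code in {row["admin_code"] for row in rows}:
    admin_rows = []  # short list, so each membership check stays cheap
    for row in rows:
        if row["admin_code"] != admin_code:
            continue
        if row not in admin_rows:  # scans admin_rows, not all_rows
            admin_rows.append(row)
    all_rows.extend(admin_rows)

print(all_rows)  # the AF01 duplicate has been dropped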

For now I limited the data to 2023 and 2024 instead of the full date range. It took ~30 minutes to download and populate the conflict table this way, and ~50 minutes when all years are included. It could well be faster on GitHub Actions; do you think I should remove the filter?

There's an issue with the p-codes in NER: they start with "NER" in the conflict data and "NE" in the CODs. I can't remember if there's a way to address this in the YAML.

b-j-mills requested review from mcarans and turnerm on May 30, 2024 02:46
mcarans (Contributor) commented May 30, 2024

On the NER p-codes, the framework (by way of the AdminLevel class) should attempt to match NER to NE.

github-actions bot commented May 30, 2024

Test Results

6 tests  ±0   6 ✅ ±0   11m 48s ⏱️ +58s
1 suites ±0   0 💤 ±0 
1 files   ±0   0 ❌ ±0 

Results for commit bfcdb3e. ± Comparison against base commit d17f513.

♻️ This comment has been updated with latest results.

coveralls commented May 30, 2024

Pull Request Test Coverage Report for Build 9307755362

Details

  • 30 of 34 (88.24%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.005%) to 93.31%

Changes missing coverage (covered lines / changed or added lines / %):
  src/hapi/pipelines/database/conflict_event.py          11 / 13   84.62%
  src/hapi/pipelines/database/operational_presence.py     3 /  5   60.0%

Totals (change from base Build 9286633165: -0.005%):
  Covered Lines: 1339
  Relevant Lines: 1435

💛 - Coveralls

turnerm (Member) left a comment:

Really great find that the duplicate checking was the bottleneck! As I mentioned below, try making the list into a set - membership testing is O(1) on average with sets, so that should also offer a speedup (although with your change the gain may be minimal).

I say go for it with adding all of the dates - it should be faster on GitHub, as you say.
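
As an aside, here is a minimal self-contained sketch (hypothetical data, not the pipeline's actual rows) of why set membership beats list membership as the collection grows:

import timeit

rows_list = [("AFG", i) for i in range(100_000)]
rows_set = set(rows_list)
probe = ("AFG", 99_999)

# List membership scans every element: O(n) per lookup.
print(timeit.timeit(lambda: probe in rows_list, number=1_000))
# Set membership hashes the key: O(1) per lookup on average.
print(timeit.timeit(lambda: probe in rows_set, number=1_000))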

@@ -46,6 +47,7 @@ def populate(self):
values = admin_results["values"]

for admin_code in admin_codes:
admin_rows = []
turnerm (Member):

Try making this a set.

b-j-mills (Collaborator, Author):

Unfortunately I can't add a dict to a set! Dicts are unhashable, so I think I'm stuck with a list.

turnerm (Member):

Not sure if it's still needed, but you can use a named tuple instead of a dict and then you could add that to a set. Ex:

from collections import namedtuple

# Define the named tuple and its fields
MyTuple = namedtuple('MyTuple', ['resource_hdx_id', 'events'])

item = MyTuple(resource_hdx_id='abcd-efgh', events=5)
a_set = set()
a_set.add(item)  # named tuples are hashable, so this works

# If you need the item back as a dict
item._asdict()  # {'resource_hdx_id': 'abcd-efgh', 'events': 5}

Not sure how big admin_rows gets, but if it gets large, having a set can make a difference.

b-j-mills (Collaborator, Author):

I checked, and it didn't make any difference when I ran the full set of data (all countries and all years). In the next PR I can switch it to a set and see if it's faster on GitHub. admin_rows will have a max length of around 100 in most cases, going up to several hundred for a few countries.

_BATCH_SIZE = 1000


def batch_populate(rows: List[Dict], session, DBTable):
turnerm (Member):

nice
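
For context, a minimal sketch of what a batch-commit helper along these lines could look like, assuming a SQLAlchemy session (illustrative only, not necessarily the PR's actual implementation):

from typing import Dict, List

_BATCH_SIZE = 1000


def batch_populate(rows: List[Dict], session, DBTable):
    # Commit every _BATCH_SIZE rows instead of once per row,
    # which cuts round trips to the database dramatically.
    batch = []
    for row in rows:
        batch.append(DBTable(**row))
        if len(batch) >= _BATCH_SIZE:
            session.add_all(batch)
            session.commit()
            batch = []
    if batch:  # flush the final partial batch
        session.add_all(batch)
        session.commit()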

b-j-mills (Collaborator, Author):

Fixed the issue in NER; I just needed to add in the ISO code! I also removed the date filter.

Update operational presence for AFG
b-j-mills merged commit 2db832d into main May 30, 2024
3 checks passed
turnerm deleted the bugfix/conflict-speed branch August 27, 2024 10:44