Heartbeat duplicate detection potentially yielding false positives #454

Closed
IgnisDa opened this issue Jan 14, 2023 · 17 comments

Contributor

IgnisDa commented Jan 14, 2023

Describe the bug

I imported my data from WakaTime, but not all of it was imported. About 100hrs are missing from each of my major projects. I could not find any existing issue like this. Is this a known bug?

System information

Output of uname -ar

Linux main 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Hosted using dokku.

Owner

muety commented Jan 14, 2023

Did you get any error messages in the server's console? For how long did you wait? If the amount of data is very large, it might well take a couple of hours, since it's currently downloaded in small batches (see #323).

Contributor Author

IgnisDa commented Jan 14, 2023

> For how long did you wait?

About 10 minutes after which I got an email saying the import has completed successfully.

Here are the entire logs since server startup: http://sprunge.us/mbNHSN

Edit:

I do not see any failures in the logs. Could it be that WakaTime is not returning all of the data?
Update: likely not. I downloaded their JSON dump, and it looks complete to me.

Owner

muety commented Jan 14, 2023

It's probably related to #334 (comment). As already explained there, Wakapi and WakaTime calculate coding duration differently, so there will naturally be a discrepancy. Also, a lot of duplicates seem to have been filtered out during your import (only 373086 of 389859 downloaded heartbeats were actually persisted).

Could you maybe check the WakaTime CSV dump if you can spot any duplicates or other irregularities? Might well be that there is a bug in Wakapi, that causes too many heartbeats to be filtered out. Would love to get your support on investigating this!

Owner

muety commented Jan 14, 2023

As mentioned on that other issue, we hash every heartbeat object to check for duplicates. I just briefly reviewed the implementation of that again and there is a chance that hashing for a heartbeat's time attribute might not be working properly, due to mitchellh/hashstructure#38. I'll have to investigate deeper, but if that turns out true, then the "duplicate detection" might yield false positives.

Could you potentially send me a subset of your CSV export (pseudonymized, if you prefer that) that includes a portion of records where all relevant attributes (entity, type, project, language, ...) are identical, except for the timestamp? But don't worry, I can also just handcraft a bit of fake data for that!

Will keep you posted.
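
To illustrate the concern above, here is a minimal Python sketch of hash-based duplicate detection. It is not Wakapi's implementation (Wakapi is written in Go and uses mitchellh/hashstructure); the field list and the `heartbeat_hash` and `deduplicate` helpers are hypothetical. It only shows what would happen if the time attribute did not contribute to the hash: heartbeats that differ only in their timestamp would collapse into one.

```python
import hashlib
import json

# Hypothetical heartbeat fields, chosen for illustration only.
FIELDS = ("entity", "type", "project", "language", "time")


def heartbeat_hash(heartbeat, fields=FIELDS):
    """Hash the selected fields of a heartbeat into a hex digest."""
    payload = json.dumps([heartbeat.get(f) for f in fields])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(heartbeats, fields=FIELDS):
    """Keep the first heartbeat seen for each distinct hash."""
    seen = set()
    kept = []
    for hb in heartbeats:
        digest = heartbeat_hash(hb, fields)
        if digest not in seen:
            seen.add(digest)
            kept.append(hb)
    return kept


# Made-up sample: the second and third records are exact duplicates.
heartbeats = [
    {"entity": "main.go", "type": "file", "project": "demo", "language": "Go", "time": 1673700000.0},
    {"entity": "main.go", "type": "file", "project": "demo", "language": "Go", "time": 1673700030.0},
    {"entity": "main.go", "type": "file", "project": "demo", "language": "Go", "time": 1673700030.0},
]

# With the time field hashed, only the true duplicate is dropped.
with_time = deduplicate(heartbeats)
# Simulating the suspected bug: time is ignored, so distinct heartbeats collide.
without_time = deduplicate(heartbeats, fields=FIELDS[:-1])
print(len(with_time), len(without_time))
```

In this sample, ignoring the time field discards a heartbeat that is not a duplicate, which is the kind of false positive described above.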

muety self-assigned this on Jan 14, 2023
muety added the bug, prio a, and effort:3 labels on Jan 14, 2023
muety changed the title from "All data from wakatime was not imported" to "heartbeat duplicate detection potentially yielding false positives" on Jan 14, 2023
muety changed the title from "heartbeat duplicate detection potentially yielding false positives" to "Heartbeat duplicate detection potentially yielding false positives" on Jan 14, 2023
Contributor Author

IgnisDa commented Jan 14, 2023

@muety I am not sure what you want. Would the complete export work? WakaTime provides a download link, so you can download it from there. It is about 250 MB.

Owner

muety commented Jan 14, 2023

Complete export would help as well, but it obviously contains quite a lot of potentially personally identifiable information. So feel free to only take a portion of it and / or replace project names or so. Send it to [email protected].

But no worries if that's too much to ask for! I can probably also just write a quick script to generate test data for debugging the above!

Contributor Author

IgnisDa commented Jan 14, 2023

Looks like you will have to generate the data yourself, the files are just too large to upload. Sorry :(

Owner

muety commented Jan 14, 2023

The CSVs can probably be compressed quite tremendously. But no worries if not! Thanks for your help.

Contributor Author

IgnisDa commented Jan 14, 2023

I'm not in front of my system anymore. Will try to convert the json to CSV tomorrow and update here.

Contributor Author

IgnisDa commented Jan 15, 2023

@muety I was able to compress the entire data to 23MB. Sent you the json.

I used https://github.com/ouch-org/ouch to compress it. You can decompress it with that.

Owner

muety commented Jan 15, 2023

> @muety I was able to compress the entire data to 23MB. Sent you the json.

Where did you send it? Didn't receive an e-mail, yet.

Btw., I did some testing and the hashing seems to be working fine. The cause of this problem has to be somewhere else. Looking at your data will hopefully reveal something in that regard.

Contributor Author

IgnisDa commented Jan 16, 2023

I sent it to you on the email you wrote above.

If you still did not receive it, perhaps you can share your discord username?

Owner

muety commented Jan 17, 2023

What is your overall, total coding time shown in WakaTime and what is it in Wakapi?

Owner

muety commented Jan 17, 2023

I checked the data you sent. The discrepancy between the number of heartbeats downloaded from WakaTime and the number imported into Wakapi seems to be due solely to duplicate timestamps. I wrote a small script to analyze your dump, and it reports that around 4.5 % of WakaTime heartbeats have non-unique timestamps, which is something that Wakapi cannot handle.

import json

with open('wakatime-dump.json', 'r') as f:
    data = json.load(f)

timestamps = [heartbeat['time'] for day in data['days'] for heartbeat in day['heartbeats']]
timestamps_unique = frozenset(timestamps)

print(f'got {len(timestamps_unique)} / {len(timestamps)}')  # got 373095 / 389886

To be honest, I tend to think that there's nothing wrong with Wakapi and the difference is just due to the different methodology of interpolating between heartbeats.
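
As a possible follow-up to the script above, the sketch below separates two cases: timestamps shared by records that are otherwise identical, and timestamps shared by records that differ in other attributes. It assumes the same dump layout (`days`, `heartbeats`, `time`) and additionally assumes that each heartbeat carries `entity` and `project` keys, which this thread does not confirm. The `classify_shared_timestamps` helper and the sample data are made up for illustration.

```python
from collections import defaultdict


def classify_shared_timestamps(data, keys=("entity", "project")):
    """Count timestamps that repeat identically vs. with differing attributes.

    Returns (exact_repeats, distinct_collisions), both counted per timestamp.
    """
    variants = defaultdict(set)  # time -> distinct attribute tuples seen
    counts = defaultdict(int)    # time -> number of heartbeats seen
    for day in data["days"]:
        for hb in day["heartbeats"]:
            variants[hb["time"]].add(tuple(hb.get(k) for k in keys))
            counts[hb["time"]] += 1
    exact = sum(1 for t, n in counts.items() if n > 1 and len(variants[t]) == 1)
    distinct = sum(1 for t in counts if len(variants[t]) > 1)
    return exact, distinct


# Tiny made-up sample in the assumed dump layout.
sample = {"days": [{"heartbeats": [
    {"time": 1.0, "entity": "a.py", "project": "x"},
    {"time": 1.0, "entity": "a.py", "project": "x"},  # exact repeat
    {"time": 2.0, "entity": "a.py", "project": "x"},
    {"time": 2.0, "entity": "b.py", "project": "x"},  # same time, other file
    {"time": 3.0, "entity": "c.py", "project": "y"},
]}]}

exact, distinct = classify_shared_timestamps(sample)
print(exact, distinct)
```

A high count in the second category would suggest that distinct heartbeats are being dropped, while a high count in the first would point to genuine duplicates in the export.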

Contributor Author

IgnisDa commented Jan 18, 2023

> What is your overall, total coding time shown in WakaTime and what is it in Wakapi?

I am not sure about this since I do not have a WakaTime pro account. However, one of my projects (incento-server) shows a total time of 700hrs on WakaTime, while it showed about 500hrs on Wakapi.

> I checked the data you sent. The discrepancy between how many heartbeats were downloaded from WakaTime and how many were imported into Wakapi actually solely seems to be due to duplicate timestamps. I wrote a small script to analyze your dump and it outputs that around 4.5 % of WakaTime heartbeats have non-unique timestamps, which is something that Wakapi can not handle.
>
> import json
>
> with open('wakatime-dump.json', 'r') as f:
>     data = json.load(f)
>
> timestamps = [heartbeat['time'] for day in data['days'] for heartbeat in day['heartbeats']]
> timestamps_unique = frozenset(timestamps)
>
> print(f'got {len(timestamps_unique)} / {len(timestamps)}')  # got 373095 / 389886
>
> To be honest, I tend to think that there's nothing wrong with Wakapi and the difference is just due to the different methodology of interpolating between heartbeats.

If that is the case then i think this issue is solved?

Owner

muety commented Jan 18, 2023

> If that is the case then i think this issue is solved?

Frankly, yes. I don't see anything we could do on Wakapi's side at this point, sorry. Once we have #156, you'll be able to tweak the interpolation methodology to your needs in the future. Please stay tuned until then :-).
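
To make the point about methodology more concrete, here is a rough sketch of one possible way to derive coding time from heartbeats: sum the gaps between consecutive heartbeats, but count at most a fixed timeout per gap. This is not claimed to be the algorithm used by either Wakapi or WakaTime; the `total_duration` helper, the timeout values, and the sample times are hypothetical. It only shows that the same heartbeats can produce different totals under different settings.

```python
def total_duration(timestamps, timeout):
    """Sum gaps between consecutive heartbeats, counting at most `timeout` seconds per gap."""
    ordered = sorted(timestamps)
    total = 0.0
    for prev, cur in zip(ordered, ordered[1:]):
        total += min(cur - prev, timeout)
    return total


# Made-up heartbeat times in seconds; note the long pause before 1000.
times = [0, 60, 120, 1000, 1060]

short_total = total_duration(times, timeout=120)
long_total = total_duration(times, timeout=900)
print(short_total, long_total)
```

With the shorter timeout, most of the long pause is not counted as coding time, so the total is noticeably smaller even though the input data is identical.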

@muety muety closed this as completed Jan 18, 2023
Contributor Author

IgnisDa commented Jan 18, 2023

@muety Thank you for looking into this!
