Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional fields to existing models #76

Merged
merged 17 commits into from
Sep 26, 2024

Conversation

mighabana
Copy link
Contributor

@mighabana mighabana commented Sep 20, 2024

Hi, I was reviewing the parser and comparing its output with my own Google Takeout data and I noticed some interesting fields that are not being handled by the parser. I went ahead and made some changes to include them, but if you think they are not important to track I'd be happy to discuss! 😄

Summary of Changes

  1. PlayStoreAppInstall model
    • Added deviceName, deviceCarrier, deviceManufacturer and firstInstallationDt fields
    • Changed dt field from the firstInstallationTime to the lastUpdateTime since this may be a more accurate representation of the installation event. I understand this may have some backward compatibility issues so it might be better to create a new installationTime field instead but then the name dt can be misleading. I'd love to hear your thoughts on this.
  2. Location model
    • Added deviceTag and source fields
    • This addition introduced a failing test in test_location_new() caused by how orjson handles large integers. I've already raised the issue in the orjson repository here.
    • Added a new test test_location_2024 that tests based on the current version of the location data.
  3. ChromeHistory model
    • Added the pageTransition field
  4. Added "Chrome/History.json" to the en locale HandlerMapping. In my version of the Takeout files it seems to have been renamed from BrowserHistory.json to History.json.
  5. Added a new test for the Semantic Location data called _parse_semantic_location_history_2024 with the current version of the data mocked. I was going to try to include the additional activitySegment field, but it seems a bit complex. I'd be happy to discuss how to parse it in a separate issue.

@purarue
Copy link
Owner

purarue commented Sep 20, 2024

Yeah, these are welcome improvements, always willing to parse additional useful data.

Haven't looked at the code but generally as a guide for any new additions, if looking up a key in a dictionary should be Optional[str] and use a .get() on the dictionary as we want to maintain backwards compatibility and not break parsing old exports.

Will take a look later today

@purarue
Copy link
Owner

purarue commented Sep 21, 2024

Changed dt field from the firstInstallationTime to the lastUpdateTime since this may be a more accurate representation of the installation event

yeah, I think this is a small enough breaking change that its worth it. I will bump the takeout parser version which should force a full recompute for caches after merging this.

if you could just put some comments into the code itself describing how the two fields are different and that you've noticed that one is more accurate that the other, that would be great.

I understand this may have some backward compatibility issues so it might be better to create a new installationTime field instead but then the name dt can be misleading. I'd love to hear your thoughts on this.

yeah, if we're including both times, I think we can do that

you can use a dt property wrapper and then maybe in a bit I can mark it as deprecated? not sure what the best way to do that with a @property is

I think since we're adding the wrapper you can just use the keys that google does

so maybe something like

class ...:
    lastUpdateTime: datetime
    installTime: datetime
    
    @property
    def dt(self) -> datetime:
        return self.lastUpdateTime

and then just update the key function to use lastUpdatedTime instead of dt so there isn't a function call every time, its just there to prevent code that was using it from breaking

activitySegment

yeah, Im tracking that here if you want to leave any thoughts, I tried a few months ago and didn't feel my solution was great, it felt prone to breaking...

I think cachew now supports serializing dicts in its latest versions, so maybe it would be useful to just include the whole dict and pass it through unparsed so that the user can use it if they want? Dont feel you have to do that in this PR, you can open another one for that.

orjson

weird bug, wonder if this has something to do with JSON standard for integers/floating point...

If there isn't a response there for a while I'll think about a flag to optionally enable/disable orjson in lib code

feel free to ping me here again if this PR becomes a bit stale, I am pretty busy with other stuff right now so it might slip my mind

@mighabana
Copy link
Contributor Author

Thanks for taking a look!

if you could just put some comments into the code itself describing how the two fields are different and that you've noticed that one is more accurate that the other, that would be great.

Sure I'll add some comments in my next commit to clarify these changes.

so maybe something like...

Yeah this makes a lot of sense to me. Thanks for the suggestion I would not have thought of this as a solution without it. I'll make this change as well in a while.

feel free to ping me here again if this PR becomes a bit stale, I am pretty busy with other stuff right now so it might slip my mind

No worries, I can work off of my fork for other downstream stuff in the meantime but if there's anything else I need to clean up on this PR just let me know as well. 😄

@karlicoss
Copy link
Contributor

weird bug, wonder if this has something to do with JSON standard for integers/floating point...

Yeah in the example @mighabana provided, "deviceTag": -80241446968629135069, -- this doesn't fit in 64 bits, which is supported in orjson. I imagine this would be a won't fix as orjson is implemented in C, and doesn't looks like orjson has some sort of decoder customizations (e.g. to parse it as string -- in principle it doesn't matter since no one is going to do math on device tags anyway 😛 )

Actually I checked my location history, and looks like all device tags are not as big in my case. It is possible that @seanbreckenridge just randomly typed something here as mock data? :) cb07912

In that case we just need to change it in test as it wouldn't occur in reality.

@purarue
Copy link
Owner

purarue commented Sep 24, 2024

Heh, I don't think I just typed a random number for the device tag, but will compare with my data later.

@mighabana
Copy link
Contributor Author

mighabana commented Sep 24, 2024

Hey @karlicoss thanks for the additional review! I'll resolve the issues you raised soon.

Actually I checked my location history, and looks like all device tags are not as big in my case.

Same for me actually my device tags were not as big, but I just didn't want to assume and thought that the large integers could be a small edge case. 🤔

Also is there's any reason only LocationHistory is parsed with orjson? I just checked and saw that the other parsers are using json.loads() instead of the _read_json_data() function. Maybe we can switch to use json to fix the issue since the bug doesn't seem to be occurring with it? I'd also be happy to make another code change to switch them all to _read_json_data() if that would be more appropriate.

@purarue
Copy link
Owner

purarue commented Sep 24, 2024

ok, checked back on my device tag and it looks to be around something like

-1798586493

so that probably just was me anonymizing the data and not considering the integer size since python has no bit-specific integers

you can update the test to something more reasonable @mighabana

Copy link
Owner

@purarue purarue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which seemed like an elegant way to get it to return None if it doesn't exist without breaking the .get() call if one of the parent fields didn't exist

yeah, it sort of looks nicer semantically in code... ? but when things actually break, we now don't know which key is missing. Its sort of like the javascript something?.chained?.operator in that it just becomes None and you lose any info about what actually happened.

google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@mighabana mighabana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted back the dictionary access changes for the required fields and also updated the failing test for the deviceTag. I left one comment about a code change I was unsure of.

@@ -186,15 +193,15 @@ def _parse_semantic_location_history(p: Path) -> Iterator[Res[PlaceVisit]]:
timelineObjects = json_data.get("timelineObjects", [])
for timelineObject in timelineObjects:
if "placeVisit" not in timelineObject:
# yield RuntimeError(f"PlaceVisit: no 'placeVisit' key in '{p}'")
yield RuntimeError(f"PlaceVisit: no 'placeVisit' key in '{p}'")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uncommented this to raise the RuntimeError instead of failing silently. Let me know if this should be left commented out.

Copy link
Owner

@purarue purarue Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one can be reverted, it just leads to too much spam:

[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'

Its not required for every json blob in the timeline object, and we are only interested in places that have a placevisit (maybe that changes in the future, its just how I decided to parse it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it! I revised it in the latest commit.

Copy link
Owner

@purarue purarue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated a few other points, I thought mypy would catch these errors, maybe it isnt strict enough in the config here

google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
google_takeout_parser/parse_json.py Outdated Show resolved Hide resolved
@mighabana
Copy link
Contributor Author

Reverted back the remaining issues with the dictionary access. Thanks again for the review!

@purarue
Copy link
Owner

purarue commented Sep 25, 2024

Alright great, will take a look at it again later today and merge

@purarue purarue merged commit b930293 into purarue:master Sep 26, 2024
7 checks passed
@karlicoss
Copy link
Contributor

Thanks! @seanbreckenridge could you cut off a release please when you got a minute? :)

@purarue
Copy link
Owner

purarue commented Sep 26, 2024

Yeah sure, will do one in a couple hours

purarue added a commit that referenced this pull request Sep 26, 2024
* support both "Location History" and "Location History (Timeline)" directories by @karlicoss in #70
* _parse_semantic_location_history: handle missing placeId in otherCandidateLocations by @karlicoss in #71
* path_dispatch: speedup dispatch map about 80% by @karlicoss in #72
* fix for youtube comments csv + old Chrome/MyActivity.json by @karlicoss in #73
* Add missing type annotations for bs4 and pytz in testing environment by @mighabana in #75
* get_paths_for_function: fix failure on Windows by @karlicoss in #78
* Add additional fields to existing models by @mighabana in #76
@purarue
Copy link
Owner

purarue commented Sep 26, 2024

@karlicoss published v0.1.12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants