-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add additional fields to existing models #76
Conversation
Yeah, these are welcome improvements, always willing to parse additional useful data. Haven't looked at the code but generally as a guide for any new additions, if looking up a key in a dictionary should be Optional[str] and use a .get() on the dictionary as we want to maintain backwards compatibility and not break parsing old exports. Will take a look later today |
yeah, I think this is a small enough breaking change that its worth it. I will bump the takeout parser version which should force a full recompute for caches after merging this. if you could just put some comments into the code itself describing how the two fields are different and that you've noticed that one is more accurate that the other, that would be great.
yeah, if we're including both times, I think we can do that you can use a I think since we're adding the wrapper you can just use the keys that google does so maybe something like
and then just update the
yeah, Im tracking that here if you want to leave any thoughts, I tried a few months ago and didn't feel my solution was great, it felt prone to breaking... I think
weird bug, wonder if this has something to do with JSON standard for integers/floating point... If there isn't a response there for a while I'll think about a flag to optionally enable/disable orjson in lib code feel free to ping me here again if this PR becomes a bit stale, I am pretty busy with other stuff right now so it might slip my mind |
Thanks for taking a look!
Sure I'll add some comments in my next commit to clarify these changes.
Yeah this makes a lot of sense to me. Thanks for the suggestion I would not have thought of this as a solution without it. I'll make this change as well in a while.
No worries, I can work off of my fork for other downstream stuff in the meantime but if there's anything else I need to clean up on this PR just let me know as well. 😄 |
Yeah in the example @mighabana provided, Actually I checked my location history, and looks like all device tags are not as big in my case. It is possible that @seanbreckenridge just randomly typed something here as mock data? :) cb07912 In that case we just need to change it in test as it wouldn't occur in reality. |
Heh, I don't think I just typed a random number for the device tag, but will compare with my data later. |
Hey @karlicoss thanks for the additional review! I'll resolve the issues you raised soon.
Same for me actually my device tags were not as big, but I just didn't want to assume and thought that the large integers could be a small edge case. 🤔 Also is there's any reason only LocationHistory is parsed with orjson? I just checked and saw that the other parsers are using |
ok, checked back on my device tag and it looks to be around something like -1798586493 so that probably just was me anonymizing the data and not considering the integer size since python has no bit-specific integers you can update the test to something more reasonable @mighabana |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which seemed like an elegant way to get it to return None if it doesn't exist without breaking the .get() call if one of the parent fields didn't exist
yeah, it sort of looks nicer semantically in code... ? but when things actually break, we now don't know which key is missing. Its sort of like the javascript something?.chained?.operator
in that it just becomes None
and you lose any info about what actually happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted back the dictionary access changes for the required fields and also updated the failing test for the deviceTag
. I left one comment about a code change I was unsure of.
google_takeout_parser/parse_json.py
Outdated
@@ -186,15 +193,15 @@ def _parse_semantic_location_history(p: Path) -> Iterator[Res[PlaceVisit]]: | |||
timelineObjects = json_data.get("timelineObjects", []) | |||
for timelineObject in timelineObjects: | |||
if "placeVisit" not in timelineObject: | |||
# yield RuntimeError(f"PlaceVisit: no 'placeVisit' key in '{p}'") | |||
yield RuntimeError(f"PlaceVisit: no 'placeVisit' key in '{p}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uncommented this to raise the RuntimeError instead of failing silently. Let me know if this should be left commented out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one can be reverted, it just leads to too much spam:
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
[W 240925 08:57:25 path_dispatch:385] PlaceVisit: no 'placeVisit' key in '/home/sean/.cache/gt/Takeout-New/Location History/Semantic Location History/2023/2023_SEPTEMBER.json'
Its not required for every json blob in the timeline object, and we are only interested in places that have a placevisit (maybe that changes in the future, its just how I decided to parse it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it! I revised it in the latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated a few other points, I thought mypy would catch these errors, maybe it isnt strict enough in the config here
Reverted back the remaining issues with the dictionary access. Thanks again for the review! |
Alright great, will take a look at it again later today and merge |
Thanks! @seanbreckenridge could you cut off a release please when you got a minute? :) |
Yeah sure, will do one in a couple hours |
* support both "Location History" and "Location History (Timeline)" directories by @karlicoss in #70 * _parse_semantic_location_history: handle missing placeId in otherCandidateLocations by @karlicoss in #71 * path_dispatch: speedup dispatch map about 80% by @karlicoss in #72 * fix for youtube comments csv + old Chrome/MyActivity.json by @karlicoss in #73 * Add missing type annotations for bs4 and pytz in testing environment by @mighabana in #75 * get_paths_for_function: fix failure on Windows by @karlicoss in #78 * Add additional fields to existing models by @mighabana in #76
@karlicoss published |
Hi, I was reviewing the parser and comparing its output with my own Google Takeout data and I noticed some interesting fields that are not being handled by the parser. I went ahead and made some changes to include them, but if you think they are not important to track I'd be happy to discuss! 😄
Summary of Changes
PlayStoreAppInstall
modeldeviceName
,deviceCarrier
,deviceManufacturer
andfirstInstallationDt
fieldsdt
field from thefirstInstallationTime
to thelastUpdateTime
since this may be a more accurate representation of the installation event. I understand this may have some backward compatibility issues so it might be better to create a newinstallationTime
field instead but then the namedt
can be misleading. I'd love to hear your thoughts on this.Location
modeldeviceTag
andsource
fieldstest_location_new()
caused by how orjson handles large integers. I've already raised the issue in the orjson repository here.test_location_2024
that tests based on the current version of the location data.ChromeHistory
modelpageTransition
fielden
locale HandlerMapping. In my version of the Takeout files it seems to have been renamed from BrowserHistory.json to History.json._parse_semantic_location_history_2024
with the current version of the data mocked. I was going to try to include the additionalactivitySegment
field, but it seems a bit complex. I'd be happy to discuss how to parse it in a separate issue.