Crawl ID -> Browser ID, and matching data structures #701
cc @vringar
I agree we should rename. I use the term "crawl" to refer to a single parquet directory or sqlite database. This is currently how we bucket data on S3, and how we shared datasets back in OpenWPM's Princeton days: https://webtransparency.cs.princeton.edu/webcensus/data-release/. If you don't think of a single directory as a "crawl", what name would you give it? If we agree on that notion of "crawl", then I know you have a use case that you'd like
I'm fine with removing it if it's not needed. Or, if it's only needed internally and not for analysis, we can consider dropping it from the datasets. One use I can think of off the top of my head is a stateful crawl, where you want to know that two sites were visited sequentially by the same browser instance, and thus with the same profile. But that's not perfect, because profiles can be saved and loaded.
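That sequential-visit check could, in principle, be run directly against the visit records. Below is a minimal sketch, assuming a hypothetical `site_visits` table keyed by the per-browser `crawl_id`; table and column names are illustrative, not the exact OpenWPM schema:

```python
import sqlite3

# Hypothetical, simplified visit log: each row records which browser
# instance (crawl_id, under the current naming) visited which site.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_visits (visit_id INTEGER, crawl_id INTEGER, site_url TEXT)"
)
conn.executemany(
    "INSERT INTO site_visits VALUES (?, ?, ?)",
    [(1, 10, "https://a.example"),
     (2, 10, "https://b.example"),
     (3, 11, "https://c.example")],
)

def same_browser(conn, url_a, url_b):
    """True if both URLs were visited by the same browser instance."""
    row = conn.execute(
        """SELECT COUNT(*) FROM site_visits v1
           JOIN site_visits v2 ON v1.crawl_id = v2.crawl_id
           WHERE v1.site_url = ? AND v2.site_url = ?""",
        (url_a, url_b),
    ).fetchone()
    return row[0] > 0

print(same_browser(conn, "https://a.example", "https://b.example"))  # True
print(same_browser(conn, "https://a.example", "https://c.example"))  # False
```

As noted above, this only tells you the two visits shared a browser instance, not that the profile was continuous, since profiles can be saved and loaded across instances.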
I already wrote down my thoughts on this in #605 and mostly agree. As I understand it a crawl consists of multiple tasks because we have multiple instances of a crawler running in parallel for cloud scale runs.
The only thing that this wouldn't capture is what a crawl is, but I also think that's very hard to capture, since single-instance and multi-instance crawls are so different.
@englehardt thanks for the thoughts. And I think I agree with you (and not myself). @vringar I'm not sure I agree with you.
- yes
- yes
- no

Hopefully my use case can illustrate why I agree with @englehardt but not @vringar. I want to run a crawl (gather one large dataset) with the majority of browser APIs instrumented. This can't be done in a single pass. I need to hit the same site over and over again with different sets of APIs instrumented. These different sets of APIs would be
Right. In my case, I'll actually be varying browser_params per "task". I'm not sure the separation between manager_params and browser_params is that meaningful. Perhaps they should all be
It's not particularly intuitive, but it's fine.
We discussed this in person and, if I remember right, agreed the following:
Also see notes in #704.
We may want a follow-on issue to look at combining params together.
Following up from discussion on chat.
TIL that Crawl ID is actually meant to uniquely identify browser instances, rather than to give a "crawl" a global ID.
In the LocalAggregator, the concept exists of a Task ID, which identifies a set of browsers and visits as all belonging to one "task".
The "task" is what we would normally think of as a "crawl", in the sense of "I'm going to run an Alexa 10k crawl".
The crawl id is actually identifying browser instances.
Looking at the LocalAggregator SQL schema https://github.com/mozilla/OpenWPM/blob/1655dedc954733b9d67fa02e2547beb3654ad5cc/automation/DataAggregator/schema.sql#L6-L18 you can also see that manager params are stored against a task id, and browser params against a crawl id.
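For readers without the link handy, here is a hedged sketch of that layout in plain sqlite; table and column definitions are simplified, and the real ones are in the linked schema.sql:

```python
import sqlite3

# Simplified sketch (not the exact OpenWPM schema): manager_params hang
# off a task_id, browser_params off a crawl_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (
    task_id INTEGER PRIMARY KEY AUTOINCREMENT,
    manager_params TEXT    -- serialized manager configuration
);
CREATE TABLE crawl (
    crawl_id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id INTEGER REFERENCES task(task_id),
    browser_params TEXT    -- serialized per-browser configuration
);
""")

# One task (a "crawl" in the everyday sense) spawns several browser
# instances, each with its own crawl_id but a shared task_id.
cur = conn.execute("INSERT INTO task (manager_params) VALUES ('{}')")
task_id = cur.lastrowid
for _ in range(3):
    conn.execute(
        "INSERT INTO crawl (task_id, browser_params) VALUES (?, '{}')",
        (task_id,),
    )

n_browsers = conn.execute(
    "SELECT COUNT(*) FROM crawl WHERE task_id = ?", (task_id,)
).fetchone()[0]
print(n_browsers)  # 3
```

The one-task-to-many-crawls shape is exactly why "crawl id" reads as a misnomer: each `crawl_id` row is really one browser instance.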
This issue proposes renaming:

- `crawl_id` to `browser_id`
- `task_id` to `crawl_id`
@englehardt asked on chat
Moving forward in this discussion, I am going to use the proposed nomenclature. That is, `browser_id` refers to what is currently `crawl_id` in the code, and `crawl_id` refers to `task_id`.

@englehardt also noted that `browser_id` was used in the S3 aggregator to infer that a browser instance has moved to a new site, by observing that data with the same `browser_id` has moved from one `visit_id` to another. But Stefan recently removed that, with good reason.
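That (now removed) inference could be sketched as follows; the record layout here is a hypothetical simplification, not the S3 aggregator's actual data model:

```python
# Hypothetical sketch of the removed S3 aggregator inference: a browser
# instance has moved to a new site when records carrying the same
# browser_id start arriving with a different visit_id.
def detect_moves(records):
    """records: iterable of (browser_id, visit_id) in arrival order.
    Returns a list of (browser_id, old_visit_id, new_visit_id) transitions."""
    current = {}  # browser_id -> last seen visit_id
    moves = []
    for browser_id, visit_id in records:
        prev = current.get(browser_id)
        if prev is not None and prev != visit_id:
            moves.append((browser_id, prev, visit_id))
        current[browser_id] = visit_id
    return moves

print(detect_moves([(1, 100), (1, 100), (1, 101), (2, 200)]))
# [(1, 100, 101)]
```

If nothing downstream relies on this kind of grouping any more, that strengthens the case below that `browser_id` may be removable entirely.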
So to take a stab at @englehardt's question: `browser_id` and `visit_id` both go to the extension and are used for reporting data back, but `visit_id` is highly unique. If we are not using `browser_id` for anything any more, then it seems that we could get rid of it.

Alternatively, we could have a transition period where `crawl_id` is renamed to `browser_id`, so the data structure stays the same and the extension stays the same (save for some renaming), and then at a future date we can remove it as obsolete. This may make the transition easier, and may make it easier to identify whether there are use cases for both a `crawl_id` and a `browser_id`.

Regardless, I think the renaming will be clarifying, and S3Aggregator data should match LocalAggregator.
(Aside: The discovery of this mismatch is additional support for splitting S3Aggregator up, having one (SQLAlchemy) base for all structured data that can then be saved to multiple backends: sqlite, postgres, BigQuery, parquet, etc.)