Crawl ID -> Browser ID, and matching data structures #701

Closed
birdsarah opened this issue Jun 26, 2020 · 6 comments · Fixed by #730



birdsarah commented Jun 26, 2020

Following up from discussion on chat.

TIL that Crawl ID is actually meant to uniquely identify browser instances, rather than to give a "crawl" a global ID.

In the LocalAggregator, the concept exists of a Task ID, which identifies a set of browsers and visits as all belonging to one "task".

The "task" is what we would normally think of as a "crawl", in the sense of "I'm going to run an Alexa 10k crawl".

The crawl id actually identifies browser instances.

Looking at the LocalAggregator SQL schema https://github.com/mozilla/OpenWPM/blob/1655dedc954733b9d67fa02e2547beb3654ad5cc/automation/DataAggregator/schema.sql#L6-L18 you can also see that manager params are stored against a task id and browser params against a crawl id.
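As a rough sketch of that layout (simplified and hypothetical column set; the real schema.sql has more fields), the task table is keyed by task_id and holds the manager params, while the crawl table is keyed by crawl_id and holds the browser params:

```python
import sqlite3

# Simplified sketch of the LocalAggregator layout described above
# (illustrative only; the real schema.sql has more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (
    task_id INTEGER PRIMARY KEY AUTOINCREMENT,
    manager_params TEXT NOT NULL          -- serialized manager params
);
CREATE TABLE crawl (
    crawl_id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id INTEGER NOT NULL,             -- each "crawl" (browser) belongs to a task
    browser_params TEXT NOT NULL,         -- serialized browser params
    FOREIGN KEY (task_id) REFERENCES task (task_id)
);
""")
conn.execute("INSERT INTO task (manager_params) VALUES ('{}')")
conn.execute("INSERT INTO crawl (task_id, browser_params) VALUES (1, '{}')")
row = conn.execute("SELECT task_id, crawl_id FROM crawl").fetchone()
print(row)  # one browser instance (crawl_id) linked to one task
```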

This issue proposes:

  • Rename crawl_id to browser_id
  • Rename task_id to crawl_id
  • Ensure that the S3 aggregator also saves a task table
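On an existing sqlite database, the proposed renames could look roughly like the following (a hypothetical migration sketch against stand-in tables, not the real schema; ALTER TABLE ... RENAME COLUMN needs SQLite >= 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in tables mirroring the current names (illustrative only).
conn.executescript("""
CREATE TABLE task (task_id INTEGER PRIMARY KEY, manager_params TEXT);
CREATE TABLE crawl (crawl_id INTEGER PRIMARY KEY, task_id INTEGER, browser_params TEXT);
""")
# Proposed renames; the order matters so the two ids never collide
# inside the crawl table.
conn.executescript("""
ALTER TABLE crawl RENAME COLUMN crawl_id TO browser_id;
ALTER TABLE crawl RENAME COLUMN task_id TO crawl_id;
ALTER TABLE task  RENAME COLUMN task_id  TO crawl_id;
""")
cols = [r[1] for r in conn.execute("PRAGMA table_info(crawl)")]
print(cols)  # ['browser_id', 'crawl_id', 'browser_params']
```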

@englehardt asked on chat

Makes sense. But I still think task_id -> crawl_id, crawl_id -> browser_id makes sense

I'm not sure why you'd want both (I have to head out now, but we can continue tomorrow / in an issue, sorry about that).

Moving forward in this discussion, I am going to use the proposed nomenclature. That is, browser_id refers to what is currently crawl_id in the code, and crawl_id refers to task_id.

@englehardt also noted that browser_id was used in the S3 aggregator to infer that a browser instance has moved on to a new site, by observing that data with the same browser_id has moved from one visit_id to another. But Stefan recently removed that with good reason.

So to take a stab at @englehardt's question: browser_id and visit_id both go to the extension, are used for the reporting back of data. But visit_id is highly unique. If we are not using browser_id for anything any more then it seems that we could get rid of it.
Alternatively, we could have a transition period where crawl_id is renamed to browser_id so the data structure stays the same, and the extension stays the same (save for some renaming), and then at a future date we can remove it as obsolete. This may make the transition easier and may make it easier to identify if there's use cases for both a crawl_id and a browser_id.

Regardless, I think the renaming will be clarifying and S3Aggregator data should match LocalAggregator.

(Aside: The discovery of this mismatch is additional support for splitting S3Aggregator up, having one (sqlalchemy) base for all structured data that can then be saved to multiple backends - sqlite, postgres, bigquery, parquet etc.)

@birdsarah

cc @vringar


englehardt commented Jun 26, 2020

I agree we should rename crawl_id to browser_id, but I'm not sure about task_id to crawl_id.

I use the term "crawl" to refer to a single parquet directory or sqlite database. This is currently how we bucket data on S3 and how we shared datasets back in OpenWPM's Princeton days: https://webtransparency.cs.princeton.edu/webcensus/data-release/. If you don't think of a single directory as a "crawl", what name would you give it?

If we agree on that notion of "crawl", then task_id as currently defined is more narrow in scope than "crawl". There can be multiple tasks within a crawl. I'm not saying that "task" is the best term for this. My original intention in saving the manager params inside the task table was that maybe you'd want to switch up some of the manager config params and re-visit some sites under different settings within the same crawl. Looking at the current manager params, I'm not sure one would ever really want to do that. It seems like all of those should be constant for a specific directory. The only time I've actually found task_id useful was in understanding whether OpenWPM was re-started mid-crawl, or whether a crawl took place in sections.

I know you have a use case that you'd like task_id for, but I don't think I've really grokked it. Would you mind including more detail on that?

So to take a stab at @englehardt's question: browser_id and visit_id both go to the extension, are used for the reporting back of data. But visit_id is highly unique. If we are not using browser_id for anything any more then it seems that we could get rid of it.
Alternatively, we could have a transition period where crawl_id is renamed to browser_id so the data structure stays the same, and the extension stays the same (save for some renaming), and then at a future date we can remove it as obsolete. This may make the transition easier and may make it easier to identify if there's use cases for both a crawl_id and a browser_id.

I'm fine with removing it if it's not needed. Or if it's only needed internally and not for analysis we can consider dropping it from the datasets.

One use I can think of off the top of my head is a stateful crawl, where you want to know that two sites were visited sequentially by the same browser instance and thus with the same profile. But that's not perfect, because profiles can be saved and loaded.
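That stateful-crawl use can be illustrated with a small query (table and column names are simplified stand-ins, already using the proposed browser_id name): with a per-browser id you can recover the order in which one browser instance, and hence one profile, visited sites:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in site_visits table (illustrative; the real schema has more columns).
conn.execute(
    "CREATE TABLE site_visits (visit_id INTEGER, browser_id INTEGER, site_url TEXT)"
)
conn.executemany(
    "INSERT INTO site_visits VALUES (?, ?, ?)",
    [(1, 10, "a.example"), (2, 11, "b.example"), (3, 10, "c.example")],
)
# Sequential visits by the same browser instance imply the same profile
# state -- modulo saved/loaded profiles, as noted above.
rows = conn.execute(
    "SELECT site_url FROM site_visits WHERE browser_id = 10 ORDER BY visit_id"
).fetchall()
print([r[0] for r in rows])  # ['a.example', 'c.example']
```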


vringar commented Jun 29, 2020

I already wrote down my thoughts on this in #605 and mostly agree.

As I understand it a crawl consists of multiple tasks because we have multiple instances of a crawler running in parallel for cloud scale runs.
My suggestions would be:

  • Rename the crawl table to browser_configs
  • Rename the task table to manager_configs
  • task_id becomes manager_id
  • crawl_id becomes browser_id

The only thing that this wouldn't capture would be what a crawl is, but I also think that's very hard to capture since single instance and multi instance crawls are so different.

@birdsarah

@englehardt thanks for the thoughts. And I think I agree with you (and not myself). @vringar I'm not sure I agree with you.

@englehardt

I use the term "crawl" to refer a single parquet directory or sqlite database.

yes

@englehardt

If we agree on that notion of "crawl", then task_id as currently defined is more narrow in scope scope than "crawl". There can be multiple tasks within a crawl.

yes

@vringar

As I understand it a crawl consists of multiple tasks because we have multiple instances of a crawler running in parallel for cloud scale runs.

no

Hopefully my use case can illustrate why I agree with @englehardt but not @vringar. I want to run a crawl (gather one large dataset) with the majority of browser APIs instrumented. This can't be done in a single pass: I need to hit the same site over and over again with different sets of APIs instrumented. These different sets of APIs would be tasks. I would be doing this in the cloud over lots of sites, so each "task" would be parallelized across lots of instances. I see this parallelization as an implementation detail; it's not something I need to track ids for. I do, however, need the id of my task, because I need to be able to rejoin my data to see which APIs I tried to instrument and which APIs I got data back from.
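A sketch of that rejoin (table and column names are hypothetical, chosen only to illustrate the use case): each task records which API set was requested, per-visit data carries the task_id, and a join recovers requested-vs-observed APIs per pass over a site:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables for the multi-pass use case.
CREATE TABLE task (task_id INTEGER PRIMARY KEY, instrumented_apis TEXT);
CREATE TABLE api_calls (task_id INTEGER, site_url TEXT, api TEXT);
""")
conn.executemany("INSERT INTO task VALUES (?, ?)",
                 [(1, "canvas,webgl"), (2, "audio,battery")])
conn.executemany("INSERT INTO api_calls VALUES (?, ?, ?)",
                 [(1, "a.example", "canvas"), (2, "a.example", "battery")])
# Rejoin per pass: which API set was requested vs. what actually came back.
rows = conn.execute("""
    SELECT t.task_id, t.instrumented_apis, c.api
    FROM api_calls c JOIN task t USING (task_id)
    WHERE c.site_url = 'a.example'
    ORDER BY t.task_id
""").fetchall()
print(rows)
```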

@englehardt

My original intention in saving the manager params inside the task table was that maybe you'd want to switch up some of the manager config params and re-visit some sites under different settings within the same crawl. Looking at the current manager params I'm not sure one would ever really want to do that.

Right. In my case, I'll actually be varying browser_params per "task". I'm not sure the separation between manager_params and browser_params is that meaningful. Perhaps they should all be task_params.

@englehardt

I'm not saying that "task" is the best term for this.

It's not particularly intuitive, but it's fine.

@vringar

task_id becomes manager_id

  • no - for the reasons outlined above, I think the concept of task_id is valid and should not be used to capture parallelization.
  • but - if you want to keep track of the cloud parallelization, there may be a separate case for a manager_id or something like that.

@birdsarah

We discussed this in person and, if I remember right, agreed the following:

  • crawl_id -> browser_id
  • task_id stays the same
  • no need to track additional things (like instances) for now

Also see notes in #704.

@birdsarah birdsarah changed the title Task ID -> Crawl ID, Crawl ID -> Browser ID, and matching data structures Crawl ID -> Browser ID, and matching data structures Jul 9, 2020
@birdsarah

We may want a follow on issue to look at combining params together.
