Crawl ID -> Browser ID, and matching data structures #701

Closed
birdsarah opened this issue Jun 26, 2020 · 6 comments · Fixed by #730



birdsarah commented Jun 26, 2020

Following up from discussion on chat.

TIL that Crawl ID is actually meant to uniquely identify browser instances, rather than to give a "crawl" a global ID.

In the LocalAggregator, the concept exists of a Task ID, which identifies a set of browsers and visits as all belonging to one "task".

The "task" is what we would normally think of as a "crawl", in the sense of "I'm going to run an Alexa 10k crawl".

The crawl id actually identifies browser instances.

Looking at the LocalAggregator SQL schema https://github.com/mozilla/OpenWPM/blob/1655dedc954733b9d67fa02e2547beb3654ad5cc/automation/DataAggregator/schema.sql#L6-L18 you can also see that manager params are stored against a task id and browser params against a crawl id.
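As a rough sketch of that layout (simplified and hypothetical column set; the real schema.sql has more fields), the task table is keyed by task_id and holds the manager params, while the crawl table is keyed by crawl_id and holds the browser params:

```python
import sqlite3

# Simplified sketch of the LocalAggregator layout described above
# (illustrative only; the real schema.sql has more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (
    task_id INTEGER PRIMARY KEY AUTOINCREMENT,
    manager_params TEXT NOT NULL          -- serialized manager params
);
CREATE TABLE crawl (
    crawl_id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id INTEGER NOT NULL,             -- each "crawl" (browser) belongs to a task
    browser_params TEXT NOT NULL,         -- serialized browser params
    FOREIGN KEY (task_id) REFERENCES task (task_id)
);
""")
conn.execute("INSERT INTO task (manager_params) VALUES ('{}')")
conn.execute("INSERT INTO crawl (task_id, browser_params) VALUES (1, '{}')")
row = conn.execute("SELECT task_id, crawl_id FROM crawl").fetchone()
print(row)  # one browser instance (crawl_id) linked to one task
```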

This issue proposes:

  • Rename crawl_id to browser_id
  • Rename task_id to crawl_id
  • Ensure that the S3 aggregator also saves a task table
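On an existing sqlite database, the proposed renames could look roughly like the following (a hypothetical migration sketch against stand-in tables, not the real schema; ALTER TABLE ... RENAME COLUMN needs SQLite >= 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in tables mirroring the current names (illustrative only).
conn.executescript("""
CREATE TABLE task (task_id INTEGER PRIMARY KEY, manager_params TEXT);
CREATE TABLE crawl (crawl_id INTEGER PRIMARY KEY, task_id INTEGER, browser_params TEXT);
""")
# Proposed renames; the order matters so the two ids never collide
# inside the crawl table.
conn.executescript("""
ALTER TABLE crawl RENAME COLUMN crawl_id TO browser_id;
ALTER TABLE crawl RENAME COLUMN task_id TO crawl_id;
ALTER TABLE task  RENAME COLUMN task_id  TO crawl_id;
""")
cols = [r[1] for r in conn.execute("PRAGMA table_info(crawl)")]
print(cols)  # ['browser_id', 'crawl_id', 'browser_params']
```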

@englehardt asked on chat

Makes sense. But I still think task_id -> crawl_id, crawl_id -> browser_id makes sense

I'm not sure why you'd want both (I have to head out now, but we can continue tomorrow / in an issue, sorry about that).

Moving forward in this discussion, I am going to use the proposed nomenclature. That is, browser_id refers to what is currently crawl_id in the code, and crawl_id refers to task_id.

@englehardt also noted that browser_id was used in the S3 aggregator to infer that a browser instance has moved on to a new site, by observing that data with the same browser_id has moved from one visit_id to another. But Stefan recently removed that with good reason.

So to take a stab at @englehardt's question: browser_id and visit_id both go to the extension, are used for the reporting back of data. But visit_id is highly unique. If we are not using browser_id for anything any more then it seems that we could get rid of it.
Alternatively, we could have a transition period where crawl_id is renamed to browser_id so the data structure stays the same, and the extension stays the same (save for some renaming), and then at a future date we can remove it as obsolete. This may make the transition easier and may make it easier to identify if there's use cases for both a crawl_id and a browser_id.

Regardless, I think the renaming will be clarifying and S3Aggregator data should match LocalAggregator.

(Aside: The discovery of this mismatch is additional support for splitting S3Aggregator up, having one (sqlalchemy) base for all structured data that can then be saved to multiple backends - sqlite, postgres, bigquery, parquet etc.)

@birdsarah

cc @vringar


englehardt commented Jun 26, 2020

I agree we should rename crawl_id to browser_id, but I'm not sure about task_id to crawl_id.

I use the term "crawl" to refer to a single parquet directory or sqlite database. This is currently how we bucket data on S3 and how we shared datasets back in OpenWPM's Princeton days: https://webtransparency.cs.princeton.edu/webcensus/data-release/. If you don't think of a single directory as a "crawl", what name would you give it?

If we agree on that notion of "crawl", then task_id as currently defined is more narrow in scope than "crawl". There can be multiple tasks within a crawl. I'm not saying that "task" is the best term for this. My original intention in saving the manager params inside the task table was that maybe you'd want to switch up some of the manager config params and re-visit some sites under different settings within the same crawl. Looking at the current manager params, I'm not sure one would ever really want to do that. It seems like all of those should be constant for a specific directory. The only time I've actually found task_id useful was in understanding whether OpenWPM was re-started mid-crawl, or whether a crawl took place in sections.

I know you have a use case that you'd like task_id for, but I don't think I've really grokked it. Would you mind including more detail on that?

So to take a stab at @englehardt's question: browser_id and visit_id both go to the extension, are used for the reporting back of data. But visit_id is highly unique. If we are not using browser_id for anything any more then it seems that we could get rid of it.
Alternatively, we could have a transition period where crawl_id is renamed to browser_id so the data structure stays the same, and the extension stays the same (save for some renaming), and then at a future date we can remove it as obsolete. This may make the transition easier and may make it easier to identify if there's use cases for both a crawl_id and a browser_id.

I'm fine with removing it if it's not needed. Or if it's only needed internally and not for analysis we can consider dropping it from the datasets.

One use I can think of off the top of my head is a stateful crawl, where you want to know that two sites were visited sequentially by the same browser instance and thus with the same profile. But that's not perfect, because profiles can be saved and loaded.
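That stateful-crawl use can be illustrated with a small query (table and column names are simplified stand-ins, already using the proposed browser_id name): with a per-browser id you can recover the order in which one browser instance, and hence one profile, visited sites:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in site_visits table (illustrative; the real schema has more columns).
conn.execute(
    "CREATE TABLE site_visits (visit_id INTEGER, browser_id INTEGER, site_url TEXT)"
)
conn.executemany(
    "INSERT INTO site_visits VALUES (?, ?, ?)",
    [(1, 10, "a.example"), (2, 11, "b.example"), (3, 10, "c.example")],
)
# Sequential visits by the same browser instance imply the same profile
# state -- modulo saved/loaded profiles, as noted above.
rows = conn.execute(
    "SELECT site_url FROM site_visits WHERE browser_id = 10 ORDER BY visit_id"
).fetchall()
print([r[0] for r in rows])  # ['a.example', 'c.example']
```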


vringar commented Jun 29, 2020

I already wrote down my thoughts on this in #605 and mostly agree.

As I understand it a crawl consists of multiple tasks because we have multiple instances of a crawler running in parallel for cloud scale runs.
My suggestions would be:

  • Rename the crawl table to browser_configs
  • Rename the task table to manager_configs
  • task_id becomes manager_id
  • crawl_id becomes browser_id

The only thing that this wouldn't capture would be what a crawl is, but I also think that's very hard to capture since single instance and multi instance crawls are so different.

@birdsarah

@englehardt thanks for the thoughts. And I think I agree with you (and not myself). @vringar I'm not sure I agree with you.

@englehardt

I use the term "crawl" to refer a single parquet directory or sqlite database.

yes

@englehardt

If we agree on that notion of "crawl", then task_id as currently defined is more narrow in scope scope than "crawl". There can be multiple tasks within a crawl.

yes

@vringar

As I understand it a crawl consists of multiple tasks because we have multiple instances of a crawler running in parallel for cloud scale runs.

no

Hopefully my use case can illustrate why I agree with @englehardt but not @vringar. I want to run a crawl (gather one large dataset) with the majority of browser APIs instrumented. This can't be done in a single pass: I need to hit the same site over and over again with different sets of APIs instrumented. These different sets of APIs would be tasks. I would be doing this in the cloud over lots of sites, so each "task" would be parallelized across lots of instances. I see this parallelization as an implementation detail; it's not something I need to track ids for. I do, however, need the id of my task, because I need to be able to rejoin my data to see which APIs I tried to instrument and which APIs I got data back from.
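A sketch of that rejoin (table and column names are hypothetical, chosen only to illustrate the use case): each task records which API set was requested, per-visit data carries the task_id, and a join recovers requested-vs-observed APIs per pass over a site:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables for the multi-pass use case.
CREATE TABLE task (task_id INTEGER PRIMARY KEY, instrumented_apis TEXT);
CREATE TABLE api_calls (task_id INTEGER, site_url TEXT, api TEXT);
""")
conn.executemany("INSERT INTO task VALUES (?, ?)",
                 [(1, "canvas,webgl"), (2, "audio,battery")])
conn.executemany("INSERT INTO api_calls VALUES (?, ?, ?)",
                 [(1, "a.example", "canvas"), (2, "a.example", "battery")])
# Rejoin per pass: which API set was requested vs. what actually came back.
rows = conn.execute("""
    SELECT t.task_id, t.instrumented_apis, c.api
    FROM api_calls c JOIN task t USING (task_id)
    WHERE c.site_url = 'a.example'
    ORDER BY t.task_id
""").fetchall()
print(rows)
```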

@englehardt

My original intention in saving the manager params inside the task table was that maybe you'd want to switch up some of the manager config params and re-visit some sites under different settings within the same crawl. Looking at the current manager params I'm not sure one would ever really want to do that.

Right. In my case, I'll actually be varying browser_params per "task". I'm not sure the separation between manager_params and browser_params is that meaningful. Perhaps they should all be task_params.

@englehardt

I'm not saying that "task" is the best term for this.

It's not particularly intuitive, but it's fine.

@vringar

task_id becomes manager_id

  • no - for the reasons outlined above, I think the concept of task_id is valid and should not be used to capture parallelization.
  • but - if you want to keep track of the cloud parallelization, there may be a separate case for a manager_id or something like that.

@birdsarah

We discussed this in person and, if I remember right, agreed the following:

  • crawl_id -> browser_id
  • task_id stays the same
  • no need to track additional things (like instances) for now

Also see notes in #704.

@birdsarah birdsarah changed the title Task ID -> Crawl ID, Crawl ID -> Browser ID, and matching data structures Crawl ID -> Browser ID, and matching data structures Jul 9, 2020
@birdsarah

We may want a follow on issue to look at combining params together.
