Data Aggregator Rewrite #753

Merged on Feb 22, 2021 (152 commits)
Changes from 72 commits
Commits
f252410
First steps in the rewrite
Sep 29, 2020
0ba7de7
Fixed import paths
Sep 29, 2020
d65af1e
One giant refactor
Sep 29, 2020
0a18a19
Merge branch 'master' into DataAggregatorRefactor
Sep 30, 2020
3fb70b8
Fixing tests
Sep 30, 2020
53b55cd
Adding mypy
Sep 30, 2020
ec4c7df
Removed mypy from pre-commit workflow
Sep 30, 2020
7288783
First draft on DataAggregator
Oct 1, 2020
f694254
Wrote a DataAggregator that starts and shuts down
Oct 1, 2020
b7e9b1d
Created tests and added more empty types
Oct 2, 2020
8eaf7ce
Got demo.py working
Oct 2, 2020
d633f95
Created sql_provider
Oct 2, 2020
71ae8b0
Cleaned up imports in TaskManager
Oct 9, 2020
6e11e1b
Added async
Oct 9, 2020
07db4ab
Fixed minor bugs
Oct 9, 2020
f3857ac
First steps at porting arrow
Oct 9, 2020
ee5d5b6
Introduced TableName and different Task handling
Oct 9, 2020
9e06a5e
Added more failing tests
Oct 9, 2020
2737ec0
First first completes others don't
Oct 13, 2020
c63ae82
It works
Oct 13, 2020
8177ed4
Started working on arrow_provider
Oct 14, 2020
50e1539
Implemented ArrowProvider
Oct 16, 2020
9e1d67e
Added logger fixture
Oct 16, 2020
fe2e977
Fixed test_storage_controller
Oct 16, 2020
157ee24
Fixing OpenWPMTest.visit()
Oct 16, 2020
52c4ff6
Moved test/storage_providers to test/storage
Oct 16, 2020
1098727
Fixing up tests
Oct 19, 2020
d7e7268
Moved automation to openwpm
Nov 16, 2020
86fa88c
Merge branch 'master' into DataAggregatorRefactor
Nov 16, 2020
ce5e901
Readded datadir to .gitignore
Nov 16, 2020
0631848
Ran repin.sh
Nov 16, 2020
3d2d720
Fixed formatting
Nov 16, 2020
d167846
Let's see if this works
Nov 16, 2020
7f1597f
Fixed imports
Nov 16, 2020
5de0822
Got arrow_memory_provider working
Nov 16, 2020
ae718dc
Merge branch 'master' into DataAggregatorRefactor
Nov 25, 2020
12a60a0
Starting to rewrite tests
Nov 25, 2020
84bff66
Setting up fixtures
Nov 25, 2020
4eb5c23
Attempting to fix all the tests
Nov 25, 2020
9b03e30
Still fixing tests
Nov 25, 2020
95bfcd5
Broken content saving
Nov 25, 2020
1b2f162
Added node
Nov 25, 2020
f01756b
Fixed screenshot tests
Nov 26, 2020
c5dfcd6
Fixing more tests
Nov 27, 2020
9d635d3
Fixed tests
Nov 27, 2020
ceb1d98
Implemented local_storage.py
Nov 27, 2020
11fb99f
Cleaned up flush_cache
Nov 27, 2020
17835b4
Fixing more tests
Nov 27, 2020
cc9ed52
Wrote test for LocalArrowProvider
Nov 27, 2020
0098181
Introduced tests for local_storage_provider.py
Nov 27, 2020
bf4f92c
Asserting test dir is empty
Nov 30, 2020
5c0a1e1
Creating subfolder for different aggregators
Dec 4, 2020
5981463
New dependencies and init()
Dec 4, 2020
ba56b34
Everything is terribly broken
Dec 4, 2020
74ae07c
Figured out finalize_visit_id
Dec 7, 2020
6068c69
Running two event loops kinda works???
Dec 7, 2020
17a22d3
Rearming the event
Dec 8, 2020
3389d00
Introduced mypy
Dec 8, 2020
7343c88
Downgraded black in pre-commit
Dec 8, 2020
babd962
Modifying the database directly
Dec 8, 2020
6f9a06d
Merge branch 'master' into DataAggregatorRefactor
Dec 8, 2020
b3d28a0
Fixed formatting
Dec 11, 2020
791d865
Made mypy a lil stricter
Dec 11, 2020
66e8caa
Fixing docs and config printing
Dec 11, 2020
70963bd
Realising I've been using the wrong with
Dec 11, 2020
9862bf7
Trying to figure arrow_storage
Dec 11, 2020
4a036fa
Moving lock initialization in in_memory_storage
Dec 11, 2020
57d8ba9
Fixing tests
Dec 11, 2020
67d3070
Fixing up tests and adding more typechecking
Dec 11, 2020
de00f94
Fixed num_browsers in test_cache_hits_recorded
Dec 11, 2020
4291ddb
Parametrized unstructured
Dec 11, 2020
fa1c52f
String fix
Dec 11, 2020
9aed882
Added failing test
Dec 15, 2020
ef0ba1e
New test
Dec 23, 2020
1b14cbd
Review changes with Steven
Dec 23, 2020
8eb6ef0
Fixed repin.sh and test_arrow_cache
Jan 8, 2021
51d510f
Merge branch 'master' into DataAggregatorRefactor
Jan 15, 2021
24fc5d2
Minor change
Jan 15, 2021
0096007
Fixed prune-environment.py
Jan 15, 2021
962af53
Removing references to DataAggregator
Jan 15, 2021
902e4ed
Fixed test_seed_persistance
Jan 15, 2021
25cd9cf
More paths
Jan 15, 2021
dcb9a6a
Fixed test display shutdown
Jan 18, 2021
e91aba7
Made cache test more robust
Jan 18, 2021
e4c9bb8
Update crawler.py
Jan 18, 2021
247a69c
Slimming down ManagerParams
Jan 18, 2021
41e59ad
Fixing more tests
Jan 18, 2021
c9e52ee
Merge remote-tracking branch 'origin/DataAggregatorRefactor' into Dat…
Jan 18, 2021
7acb624
Update test/storage/test_storage_controller.py
Jan 19, 2021
db0d27f
Purging references to DataAggregator
Jan 22, 2021
abe4a01
Reverted changes to .travis.yml
Jan 22, 2021
2223d0c
Merge remote-tracking branch 'origin/DataAggregatorRefactor' into Dat…
Jan 22, 2021
d7400d2
Demo.py saves locally again
Jan 22, 2021
645240b
Readjusting test paths
Jan 22, 2021
ecb87f0
Expanded comment on initialize to reference #846
Jan 22, 2021
8629538
Made token optional in finalize_visit_id
Jan 22, 2021
9983362
Simplified test parametrization
Jan 22, 2021
105c73b
Fixed callback semantics change
Jan 22, 2021
f5a0abd
Removed test_parse_http_stack_trace_str
Jan 22, 2021
e6175db
Added DataSocket
Jan 22, 2021
173de3a
WIP need to fix path encoding
Jan 22, 2021
501dc5c
Fixed path encoding
Jan 25, 2021
6bd5575
Added task and crawl to schema
Jan 25, 2021
e5395d4
Merge branch 'master' into DataAggregatorRefactor
Jan 29, 2021
eeceaa3
Fixed paths in GitHub actions
Jan 29, 2021
22a822b
Merge branch 'master' into DataAggregatorRefactor
Jan 29, 2021
d7db8ca
Refactored completion handling
Jan 29, 2021
d5733db
Fix tests
Jan 29, 2021
6ee9972
Trying to fix tests on CI
Feb 1, 2021
89635c2
Removed redundant setting of tag
Feb 1, 2021
d4a391d
Removing references to S3
Feb 1, 2021
ffbb346
Purging more DataAggregator references
Feb 1, 2021
379af2d
Cranking up logging to figure out test failure
Feb 1, 2021
e5c897b
Moved test_values into a fixture
Feb 1, 2021
8520be6
Fixing GcpUnstructuredProvider
Feb 1, 2021
6527179
Fixed paths for future crawls
Feb 1, 2021
1ca2739
Renamed sqllite to official sqlite
Feb 3, 2021
7cf48a1
Restored demo.py
Feb 3, 2021
29d3e27
Update openwpm/commands/profile_commands.py
Feb 3, 2021
5b5f229
Restored previous behaviour of DumpProfileCommand
Feb 3, 2021
73f0850
Removed leftovers
Feb 3, 2021
a4a75ff
Cleaned up comments
Feb 3, 2021
41f6656
Expanded lock check
Feb 3, 2021
9046d0d
Fixed more stuff
Feb 3, 2021
0ec3353
More comment updates
Feb 3, 2021
c1a6038
Update openwpm/socket_interface.py
Feb 3, 2021
ae25bfa
Removed outdated comment
Feb 4, 2021
4e89806
Using config_encoder
Feb 4, 2021
669a40f
Merge remote-tracking branch 'origin/DataAggregatorRefactor' into Dat…
Feb 4, 2021
85a4c2d
Renamed tar_location to tar_path
Feb 4, 2021
4c03174
Removed references to database_name in docs
Feb 5, 2021
f565507
Cleanup
Feb 5, 2021
2553a09
Moved screenshot_path and source_dump_path to ManagerParamsInternal
Feb 5, 2021
fceaee0
Fixed imports
Feb 12, 2021
1846d25
Fixing up comments
Feb 12, 2021
dfdc34d
Fixing up comments
Feb 12, 2021
b922774
More docs
Feb 15, 2021
a7bcbb8
Merge branch 'master' into DataAggregatorRefactor
Feb 15, 2021
55f6cdb
updated dependencies
Feb 16, 2021
65c6eda
Fixed test_task_manager
Feb 17, 2021
9dbeb7e
Merge branch 'master' into DataAggregatorRefactor
Feb 17, 2021
048546d
Reupgraded to python 3.9.1
Feb 17, 2021
59484de
Restoring crawl_reference in mp_logger
Feb 22, 2021
937c8fe
Removed unused imports
Feb 22, 2021
1c819f7
Apply suggestions from code review
Feb 22, 2021
2cbb801
Cleaned up socket handling
Feb 22, 2021
2dd339c
Fixed TaskManager.__exit__
Feb 22, 2021
74762cb
Merge remote-tracking branch 'origin/DataAggregatorRefactor' into Dat…
Feb 22, 2021
a555c14
Moved validation code into config.py
Feb 22, 2021
4f6aed1
Removed comment
Feb 22, 2021
eb08c13
Removed comment
Feb 22, 2021
4820b2c
Removed comment
Feb 22, 2021
2 changes: 2 additions & 0 deletions .gitignore
@@ -94,3 +94,5 @@ openwpm/Extension/firefox/dist
openwpm/Extension/firefox/openwpm.xpi
openwpm/Extension/firefox/src/content.js
openwpm/Extension/firefox/src/feature.js

datadir
13 changes: 11 additions & 2 deletions .pre-commit-config.yaml
@@ -4,7 +4,16 @@ repos:
hooks:
- id: isort
- repo: https://github.com/psf/black
rev: 20.8b1 # Replace by any tag/version: https://github.com/psf/black/tags
rev: 19.10b0
hooks:
- id: black
language_version: python3 # Should be a command that runs python3.6+
language_version: python3
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.790
hooks:
- id: mypy
additional_dependencies: [pytest]
# We may need to add more and more dependencies here, as pre-commit
# runs in an environment without our dependencies


14 changes: 5 additions & 9 deletions .travis.yml
@@ -6,11 +6,11 @@ env:
# Once we add and remove tests, this distribution may become unbalanced.
# Feel free to move tests around to make the running time of the jobs
# as close as possible.
- TESTS=test_[a-e]*
- TESTS=test_[f-h]*
- TESTS=test_[i-r,t-z]*
- TESTS=test/test_[a-e]*
- TESTS=test/test_[f-h]*
- TESTS=test/test_[i-r,t-z]*
# test_simple_commands.py is slow due to parametrization.
- TESTS=test_[s]*
- TESTS=test/test_[s]*
- TESTS=webextension
git:
depth: 3
@@ -36,15 +36,11 @@ after_success:

jobs:
include:
- language:
python:
- language: "python"
env:
- TESTS="Docker"
services:
- docker
before_install:
before_script:
install:
script:
- docker build -f Dockerfile -t openwpm .
- ./scripts/deploy-to-dockerhub.sh
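
The `TESTS` values above split the test files across CI jobs by the first letter after `test_`. As a minimal sketch of how those shell-style character classes partition file names, the following uses `fnmatch` from the Python standard library as a stand-in for shell globbing. The sample file paths and the `matching_patterns` helper are assumptions made for illustration; how the CI script actually consumes `TESTS` is not shown in this diff.

```python
from fnmatch import fnmatchcase
from typing import List

# The four glob patterns from the updated .travis.yml env matrix.
PATTERNS = [
    "test/test_[a-e]*",
    "test/test_[f-h]*",
    "test/test_[i-r,t-z]*",
    "test/test_[s]*",
]

# Hypothetical file paths, chosen only to exercise each bucket once.
SAMPLE_FILES = [
    "test/test_arrow_cache.py",
    "test/test_http_instrument.py",
    "test/test_simple_commands.py",
    "test/test_task_manager.py",
]


def matching_patterns(path: str) -> List[str]:
    """Return every pattern in PATTERNS that matches the given path."""
    return [pattern for pattern in PATTERNS if fnmatchcase(path, pattern)]


if __name__ == "__main__":
    for path in SAMPLE_FILES:
        print(path, "->", matching_patterns(path))
```

Note that the comma inside `[i-r,t-z]` is a literal member of the character class rather than a separator, so that class matches `i` to `r`, a comma, and `t` to `z`. This is harmless as long as no test file name has a comma right after `test_`.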
85 changes: 53 additions & 32 deletions crawler.py
@@ -4,42 +4,53 @@
import signal
import sys
import time
from pathlib import Path
from threading import Lock
from typing import Any, Callable, List

import boto3
import sentry_sdk

from openwpm import mp_logger
from openwpm.command_sequence import CommandSequence
from openwpm.config import BrowserParams, ManagerParams
from openwpm.mp_logger import parse_config_from_env
from openwpm.task_manager import TaskManager, load_default_params
from openwpm.storage.cloud_storage.gcp_storage import (
GcsStructuredProvider,
GcsUnstructuredProvider,
)
from openwpm.task_manager import TaskManager
from openwpm.utilities import rediswq
from test.utilities import LocalS3Session, local_s3_bucket

# Configuration via environment variables
# Crawler specific config
REDIS_HOST = os.getenv("REDIS_HOST", "redis-box")
REDIS_QUEUE_NAME = os.getenv("REDIS_QUEUE_NAME", "crawl-queue")
MAX_JOB_RETRIES = int(os.getenv("MAX_JOB_RETRIES", "2"))
DWELL_TIME = int(os.getenv("DWELL_TIME", "10"))
TIMEOUT = int(os.getenv("TIMEOUT", "60"))

# Storage Provider Params
CRAWL_DIRECTORY = os.getenv("CRAWL_DIRECTORY", "crawl-data")
S3_BUCKET = os.getenv("S3_BUCKET", "openwpm-crawls")
GCS_BUCKET = os.getenv("GCS_BUCKET", "openwpm-crawls")
GCP_PROJECT = os.getenv("GCP_PROJECT", "senglehardt-openwpm-test-1")
AUTH_TOKEN = os.getenv("GCP_AUTH_TOKEN", "cloud")

# Browser Params
DISPLAY_MODE = os.getenv("DISPLAY_MODE", "headless")
HTTP_INSTRUMENT = os.getenv("HTTP_INSTRUMENT", "1") == "1"
COOKIE_INSTRUMENT = os.getenv("COOKIE_INSTRUMENT", "1") == "1"
NAVIGATION_INSTRUMENT = os.getenv("NAVIGATION_INSTRUMENT", "1") == "1"
JS_INSTRUMENT = os.getenv("JS_INSTRUMENT", "1") == "1"
CALLSTACK_INSTRUMENT = os.getenv("CALLSTACK_INSTRUMENT", "1") == "1"
JS_INSTRUMENT_SETTINGS = os.getenv(
"JS_INSTRUMENT_SETTINGS", '["collection_fingerprinting"]'
JS_INSTRUMENT_SETTINGS = json.loads(
os.getenv("JS_INSTRUMENT_SETTINGS", '["collection_fingerprinting"]')
)

SAVE_CONTENT = os.getenv("SAVE_CONTENT", "")
PREFS = os.getenv("PREFS", None)
DWELL_TIME = int(os.getenv("DWELL_TIME", "10"))
TIMEOUT = int(os.getenv("TIMEOUT", "60"))
SENTRY_DSN = os.getenv("SENTRY_DSN", None)
LOGGER_SETTINGS = parse_config_from_env()
MAX_JOB_RETRIES = int(os.getenv("MAX_JOB_RETRIES", "2"))

JS_INSTRUMENT_SETTINGS = json.loads(JS_INSTRUMENT_SETTINGS)

SENTRY_DSN = os.getenv("SENTRY_DSN", None)
LOGGER_SETTINGS = mp_logger.parse_config_from_env()

if CALLSTACK_INSTRUMENT is True:
# Must have JS_INSTRUMENT True for CALLSTACK_INSTRUMENT to work
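
The hunk above reads every setting from environment variables using three conventions: boolean flags are true only when the value is exactly `"1"`, integers go through `int()`, and `JS_INSTRUMENT_SETTINGS` is decoded with `json.loads` at the point where it is read. The sketch below isolates those three conventions. The helper names (`env_flag`, `env_int`, `env_json_list`) are hypothetical and are not part of crawler.py; they take a mapping so they can be tried without touching the real environment.

```python
import json
import os
from typing import Any, List, Mapping


def env_flag(env: Mapping[str, str], name: str, default: str = "1") -> bool:
    """True only when the value is exactly "1", mirroring `os.getenv(...) == "1"`."""
    return env.get(name, default) == "1"


def env_int(env: Mapping[str, str], name: str, default: str) -> int:
    """Parse an integer setting; raises ValueError on non-numeric input."""
    return int(env.get(name, default))


def env_json_list(env: Mapping[str, str], name: str, default: str) -> List[Any]:
    """Decode a JSON-encoded list setting as soon as it is read."""
    value = json.loads(env.get(name, default))
    if not isinstance(value, list):
        raise ValueError(f"{name} must be a JSON list, got {type(value).__name__}")
    return value


if __name__ == "__main__":
    env = os.environ
    print(env_flag(env, "HTTP_INSTRUMENT"))
    print(env_int(env, "TIMEOUT", "60"))
    print(env_json_list(env, "JS_INSTRUMENT_SETTINGS", '["collection_fingerprinting"]'))
```

One consequence of the `== "1"` convention is that values such as `"true"` or `"yes"` silently disable a flag, so deployments need to set these variables to `1` or `0`.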
@@ -74,29 +85,41 @@
browser_params[i].prefs = json.loads(PREFS)

# Manager configuration
manager_params.data_directory = "~/Desktop/%s/" % CRAWL_DIRECTORY
manager_params.log_directory = "~/Desktop/%s/" % CRAWL_DIRECTORY
manager_params.data_directory = Path("~/Desktop/") / CRAWL_DIRECTORY
manager_params.log_directory = Path("~/Desktop/") / CRAWL_DIRECTORY
manager_params.output_format = "s3"
manager_params.s3_bucket = S3_BUCKET
manager_params.s3_bucket = GCS_BUCKET
manager_params.s3_directory = CRAWL_DIRECTORY

# Allow the use of localstack's mock s3 service
S3_ENDPOINT = os.getenv("S3_ENDPOINT")
if S3_ENDPOINT:
boto3.DEFAULT_SESSION = LocalS3Session(endpoint_url=S3_ENDPOINT)
manager_params.s3_bucket = local_s3_bucket(boto3.resource("s3"), name=S3_BUCKET)

structured = GcsStructuredProvider(
project=GCP_PROJECT,
bucket_name=GCS_BUCKET,
base_path=CRAWL_DIRECTORY,
token=AUTH_TOKEN,
)
unstructured = GcsUnstructuredProvider(
project=GCP_PROJECT,
bucket_name=GCS_BUCKET,
base_path=CRAWL_DIRECTORY,
token=AUTH_TOKEN,
)
# Instantiates the measurement platform
# Commands time out by default after 60 seconds
manager = TaskManager(manager_params, browser_params, logger_kwargs=LOGGER_SETTINGS)
manager = TaskManager(
manager_params,
browser_params,
structured,
unstructured,
logger_kwargs=LOGGER_SETTINGS,
)

# At this point, Sentry should be initiated
if SENTRY_DSN:
# Add crawler.py-specific context
with sentry_sdk.configure_scope() as scope:
# tags generate breakdown charts and search filters
scope.set_tag("CRAWL_DIRECTORY", CRAWL_DIRECTORY)
scope.set_tag("S3_BUCKET", S3_BUCKET)
scope.set_tag("S3_BUCKET", GCS_BUCKET)
scope.set_tag("DISPLAY_MODE", DISPLAY_MODE)
scope.set_tag("HTTP_INSTRUMENT", HTTP_INSTRUMENT)
scope.set_tag("COOKIE_INSTRUMENT", COOKIE_INSTRUMENT)
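
The hunk above switches `data_directory` and `log_directory` from `%`-formatted strings to `pathlib.Path` objects. One detail worth keeping in mind is that `Path` does not expand `~` on its own; the tilde stays a literal path component until `expanduser()` is called. Whether OpenWPM expands the path later is not visible in this diff, so the sketch below only demonstrates the standard-library behaviour. The `build_data_directory` helper is hypothetical.

```python
from pathlib import Path


def build_data_directory(crawl_directory: str) -> Path:
    """Join the directory the same way the diff does, without expanding '~'."""
    return Path("~/Desktop/") / crawl_directory


if __name__ == "__main__":
    raw = build_data_directory("crawl-data")
    # The tilde is still a literal component at this point.
    print(raw.as_posix())
    # expanduser() replaces the leading '~' with the current user's home directory.
    print(raw.expanduser())
```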
@@ -108,14 +131,12 @@
scope.set_tag("DWELL_TIME", DWELL_TIME)
scope.set_tag("TIMEOUT", TIMEOUT)
scope.set_tag("MAX_JOB_RETRIES", MAX_JOB_RETRIES)
scope.set_tag("CRAWL_REFERENCE", "%s/%s" % (S3_BUCKET, CRAWL_DIRECTORY))
scope.set_tag("CRAWL_REFERENCE", "%s/%s" % (GCS_BUCKET, CRAWL_DIRECTORY))
# context adds addition information that may be of interest
scope.set_context("PREFS", PREFS)
if PREFS:
scope.set_context("PREFS", json.loads(PREFS))
scope.set_context(
"crawl_config",
{
"REDIS_QUEUE_NAME": REDIS_QUEUE_NAME,
},
"crawl_config", {"REDIS_QUEUE_NAME": REDIS_QUEUE_NAME,},
)
# Send a sentry error message (temporarily - to easily be able
# to compare error frequencies to crawl worker instance count)
@@ -159,9 +180,9 @@ def get_job_completion_callback(
job_queue: rediswq.RedisWQ,
job: bytes,
) -> Callable[[bool], None]:
def callback(sucess: bool) -> None:
def callback(success: bool) -> None:
with unsaved_jobs_lock:
if sucess:
if success:
logger.info("Job %r is done", job)
job_queue.complete(job)
else:
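
The last crawler.py hunk shows a callback factory: `get_job_completion_callback` returns a closure that takes a lock and, on success, marks the job as complete in the queue. The sketch below illustrates that pattern in isolation. `InMemoryQueue` is a hypothetical stand-in for `rediswq.RedisWQ`, and the failure branch here simply records the job in a list; what the real `else` branch does is cut off in the hunk and is not reproduced.

```python
from threading import Lock
from typing import Callable, List


class InMemoryQueue:
    """Hypothetical stand-in for the Redis-backed work queue."""

    def __init__(self) -> None:
        self.completed: List[bytes] = []

    def complete(self, job: bytes) -> None:
        self.completed.append(job)


def make_completion_callback(
    queue: InMemoryQueue, job: bytes, failed: List[bytes], lock: Lock
) -> Callable[[bool], None]:
    """Build a per-job callback; the lock serialises access to shared state."""

    def callback(success: bool) -> None:
        with lock:
            if success:
                queue.complete(job)
            else:
                # Assumption for this sketch only: remember the failed job.
                failed.append(job)

    return callback


if __name__ == "__main__":
    queue, failed, lock = InMemoryQueue(), [], Lock()
    make_completion_callback(queue, b"job-1", failed, lock)(True)
    make_completion_callback(queue, b"job-2", failed, lock)(False)
    print(queue.completed, failed)
```

Because each callback closes over its own `job` argument, the callbacks can be invoked from another thread in any order without mixing up which job they refer to.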
33 changes: 26 additions & 7 deletions demo.py
@@ -1,5 +1,15 @@
import logging
import os
from pathlib import Path

from openwpm.command_sequence import CommandSequence
from openwpm.config import BrowserParams, ManagerParams
from openwpm.storage.cloud_storage.gcp_storage import (
GcsStructuredProvider,
GcsUnstructuredProvider,
)
from openwpm.storage.local_storage import LocalArrowProvider
from openwpm.storage.sql_provider import SqlLiteStorageProvider
from openwpm.task_manager import TaskManager

# The list of sites that we wish to crawl
@@ -35,26 +45,35 @@
browser_params[i].dns_instrument = True

# Update TaskManager configuration (use this for crawl-wide settings)
manager_params.data_directory = "~/Desktop/"
manager_params.log_directory = "~/Desktop/"
manager_params.data_directory = Path("./datadir/")
manager_params.log_directory = Path("./datadir/")

logging_params = {"log_level_console": logging.DEBUG}
# memory_watchdog and process_watchdog are useful for large scale cloud crawls.
# Please refer to docs/Configuration.md#platform-configuration-options for more information
# manager_params.memory_watchdog = True
# manager_params.process_watchdog = True

# Instantiates the measurement platform
project = "senglehardt-openwpm-test-1"
bucket_name = "openwpm-test-bucket"
# Commands time out by default after 60 seconds
manager = TaskManager(manager_params, browser_params)
manager = TaskManager(
manager_params,
browser_params,
GcsStructuredProvider(project, bucket_name, base_path="test3"),
None,
)

# Visits the sites
for site in sites:
for index, site in enumerate(sites):

def callback(success: bool, val: str = site) -> None:
print("CommandSequence {} done".format(val))

# Parallelize sites over all number of browsers set above.
command_sequence = CommandSequence(
site,
reset=True,
callback=lambda success, val=site: print("CommandSequence {} done".format(val)),
site, site_rank=index, reset=True, callback=callback,
)

# Start by visiting the page
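
Both the removed lambda and the new `callback` function in demo.py bind the current site through a default argument (`val=site`). The reason is Python's late binding of closures: a function defined in a loop looks up the loop variable when it is called, not when it is defined. The following self-contained sketch contrasts the two behaviours; the site URLs are placeholders.

```python
from typing import Callable, List

sites = ["http://a.example", "http://b.example"]

late: List[Callable[[], str]] = []
bound: List[Callable[[], str]] = []

for site in sites:
    # Looks up `site` only when called, so every closure sees the final value.
    late.append(lambda: site)
    # The default argument is evaluated now, so each closure keeps its own site.
    bound.append(lambda val=site: val)

late_results = [f() for f in late]
bound_results = [f() for f in bound]

if __name__ == "__main__":
    print(late_results)   # both entries are the last site
    print(bound_results)  # one entry per site, in order
```

Since callbacks in a crawl typically run after the loop has moved on, the default-argument form is what keeps each "CommandSequence ... done" message tied to the right site.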
18 changes: 7 additions & 11 deletions environment.yaml
@@ -7,37 +7,33 @@ dependencies:
- click=7.1.2
- codecov=2.1.10
- dill=0.3.3
- gcsfs=0.7.1
- geckodriver=0.28.0
- ipython=7.19.0
- leveldb=1.22
- localstack=0.11.1.1
- multiprocess=0.70.11.1
- mypy=0.790
- nodejs=14.15.1
- pandas=1.1.4
- pandas=1.1.5
- pillow=8.0.1
- pip=20.2.4
- pre-commit=2.9.2
- pip=20.3.1
- pre-commit=2.9.3
- psutil=5.7.3
- pyarrow=2.0.0
- pytest-cov=2.10.1
- pytest=6.1.2
- python=3.8.6
- pyvirtualdisplay=0.2.5
- redis-py=3.5.3
- s3fs=0.4.0
- s3fs=0.5.1
- selenium=3.141.0
- sentry-sdk=0.19.4
- sentry-sdk=0.19.5
- tabulate=0.8.7
- tblib=1.6.0
- wget=1.20.1
- pip:
- amazon-kclpy==2.0.1
- crontab==0.22.9
- dataclasses-json==0.5.2
- domain-utils==0.7.1
- flask-cors==3.0.9
- jsonschema==3.2.0
- moto-ext==1.3.15.15
- plyvel==1.3.0
- subprocess32==3.5.4
name: openwpm