This is a high-level overview of the folder structure of the project.
.
├── Cargo.toml
├── config/ (contains .yaml files that provide configuration and the various numeric values required to run the project)
├── docker/ (contains Dockerfiles and docker compose configs)
├── grafana/ (contains dashboard configs and the Prometheus datasource config)
│ ├── dashboards/
│ └── datasources/
├── prometheus.yaml (defines Prometheus metric-collection configs)
├── scripts/ (various scripts that help populate tables, schema, and test data)
│ └── sql/
├── src
│ ├── app/ (main application where we start poller and archival tasks)
│ ├── archival/ (handles network requests to archive URLs, checking archival status, and cleaning up completed entries)
│ │ └── tests/ (contains unit tests for archival service)
│ ├── cli/ (CLI options are defined here, along with related utils)
│ ├── configuration/ (parsing logic for .yaml configs belongs here)
│ ├── lib.rs (treats the app as a library)
│ ├── main.rs (entry point to the app)
│ ├── metrics/ (contains the app's metrics and metrics-collection methods)
│ ├── poller/ (polling logic resides here)
│ │ └── tests/ (unit tests for poller module)
│ └── structs/
└── tests (contains integration tests)
├── archival/
├── fixtures/
├── main.rs
└── poller/
We want to get URLs from the `edit_data` and `edit_note` tables, and archive them in the Internet Archive's history. The app provides multiple command-line functionalities to archive URLs from the `edit_data` and `edit_note` tables.
We create an `external_url_archiver` schema, under which we create the required tables, functions, and triggers to make the service work.
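A minimal sketch of what this startup step could look like, assuming sqlx is used for database access; the `init_schema` function name and the column set shown here are illustrative assumptions, not the project's actual definitions:

```rust
// Hypothetical startup migration, assuming sqlx; columns are illustrative only.
use sqlx::PgPool;

pub async fn init_schema(pool: &PgPool) -> Result<(), sqlx::Error> {
    // Create the dedicated schema if it does not already exist.
    sqlx::query("CREATE SCHEMA IF NOT EXISTS external_url_archiver")
        .execute(pool)
        .await?;

    // Table holding URLs waiting to be archived (column set is an assumption).
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS external_url_archiver.internet_archive_urls (
             id         SERIAL PRIMARY KEY,
             url        TEXT NOT NULL,
             status     TEXT NOT NULL DEFAULT 'NotStarted',
             created_at TIMESTAMPTZ NOT NULL DEFAULT now()
         )",
    )
    .execute(pool)
    .await?;

    Ok(())
}
```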
The following are the long-running tasks:
poller task

- Create a `Poller` implementation (a rough sketch follows this list) which:
  - Gets the latest `edit_note` id and `edit_data` edit from the `internet_archive_urls` table. We start polling `edit_note` and `edit_data` from these ids.
  - Polls the `edit_note` and `edit_data` tables for URLs.
  - Transforms the URLs into the required format.
  - Saves the output to the `internet_archive_urls` table.
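As a rough illustration of one poll iteration, here is a hypothetical sketch using sqlx; the queries, column names, and the `extract_urls` helper are assumptions for this sketch, not the project's real code (see `src/poller` for the actual implementation):

```rust
// A minimal sketch of one poll iteration over edit_note; column names and the
// extract_urls helper are illustrative assumptions.
use sqlx::{PgPool, Row};

pub struct Poller {
    pool: PgPool,
    edit_note_start_id: i32,
}

impl Poller {
    pub async fn poll_once(&mut self) -> Result<(), sqlx::Error> {
        // Fetch edit notes past the last id we have already processed.
        let rows =
            sqlx::query("SELECT id, text FROM edit_note WHERE id > $1 ORDER BY id LIMIT 100")
                .bind(self.edit_note_start_id)
                .fetch_all(&self.pool)
                .await?;

        for row in rows {
            let id: i32 = row.get("id");
            let text: String = row.get("text");
            // Extract URLs from the note text and queue them for archival.
            for url in extract_urls(&text) {
                sqlx::query(
                    "INSERT INTO external_url_archiver.internet_archive_urls (url) VALUES ($1)",
                )
                .bind(&url)
                .execute(&self.pool)
                .await?;
            }
            self.edit_note_start_id = id;
        }
        Ok(())
    }
}

// Hypothetical helper: pull http(s) URLs out of free text.
fn extract_urls(text: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|w| w.starts_with("http://") || w.starts_with("https://"))
        .map(str::to_owned)
        .collect()
}
```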
archival task

- Has 2 parts:
  - `notifier`
    - Creates a `Notifier` implementation which:
      - Fetches the last unarchived URL row from the `internet_archive_urls` table, and starts notifying from this row id.
      - Initialises a postgres function `notify_archive_urls`, which takes the `url_id` integer value and sends the corresponding `internet_archive_urls` row through a channel called `archive_urls`.
    - This runs periodically in order to archive URLs from `internet_archive_urls`.
  - `listener`
    - Listens to the `archive_urls` channel and makes the necessary Wayback Machine API request (the API calls are still to be made).
    - The listener task is currently delayed by 5 seconds, so that no matter how many URLs are passed to the channel, it only receives 1 URL per 5 seconds, in order to stay within IA rate limits. A rough sketch of the listener side follows this list.
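The notifier/listener pair maps naturally onto Postgres LISTEN/NOTIFY. Below is a rough sketch of the listener side using sqlx's `PgListener`, with the 5-second throttle described above; the payload format and the `archive_with_wayback_machine` helper are assumptions, not the project's real API:

```rust
// Rough sketch of the listener: subscribe to the archive_urls channel and
// process at most one URL every 5 seconds.
use std::time::Duration;
use sqlx::postgres::PgListener;

pub async fn listen_for_urls(database_url: &str) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect(database_url).await?;
    // Subscribe to the channel that the notify_archive_urls function publishes on.
    listener.listen("archive_urls").await?;

    loop {
        let notification = listener.recv().await?;
        // Payload is assumed here to be the internet_archive_urls row, serialized.
        let payload = notification.payload().to_owned();
        archive_with_wayback_machine(&payload).await;
        // Throttle to roughly one URL every 5 seconds to respect IA rate limits.
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}

// Hypothetical stand-in for the Wayback Machine API request,
// which is still to be implemented in the project.
async fn archive_with_wayback_machine(row_payload: &str) {
    println!("would archive: {row_payload}");
}
```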
retry/cleanup task

- Runs every 24 hours, and does the following (see the sketch after this list):
  - If the `status` of the URL archival is `success` and the URL has been present in the table for more than 24 hours, or if its status is `failed`, the task cleans it up.
  - In case the URL's status is `error`, it resends the URL to the `archive_urls` channel through the `notify_archive_urls` function, so that it can be re-archived.
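A sketch of what one retry/cleanup pass could look like; the SQL is illustrative (statuses stored as plain text, a `created_at` timestamp column) and not the project's actual schema:

```rust
// Hypothetical single cleanup/retry pass, assuming sqlx and the illustrative
// table definition sketched earlier.
use sqlx::PgPool;

pub async fn cleanup_once(pool: &PgPool) -> Result<(), sqlx::Error> {
    // Remove rows that succeeded more than 24 hours ago, or that failed permanently.
    sqlx::query(
        "DELETE FROM external_url_archiver.internet_archive_urls
         WHERE (status = 'Success' AND created_at < now() - INTERVAL '24 hours')
            OR status = 'Failed'",
    )
    .execute(pool)
    .await?;

    // Re-send rows that hit a transient error back to the archive_urls channel
    // via the notify_archive_urls postgres function, so they can be re-archived.
    sqlx::query(
        "SELECT notify_archive_urls(id)
         FROM external_url_archiver.internet_archive_urls
         WHERE status = 'StatusError'",
    )
    .execute(pool)
    .await?;

    Ok(())
}
```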
- The status of each row in `internet_archive_urls` can have 5 values:
  - NotStarted
  - Processing
  - Success
  - StatusError
  - Failed
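These values would typically map onto a small Rust enum, for example (a sketch; the actual type in `src/structs` may differ in name and database mapping):

```rust
// Sketch of the five archival statuses as a Rust enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArchivalStatus {
    NotStarted,
    Processing,
    Success,
    StatusError,
    Failed,
}
```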