MarcoBuster/railway-opendata

RailScrape (railway-opendata)

Italy publishes no official open data on the performance (delays, cancellations, ...) of its public rail transport. This project provides a tool that lets anyone collect such data and run statistics and visualizations on it.

Architecture

flowchart TB

S[Scraper] --> |Downloads data| D("ViaggiaTreno and Trenord APIs")
S -->|Produces| P[(Daily .pickle dumps)]
E[Extractor] -->|Reads| P
E[Extractor] -->|Produces| C[(Daily .CSV dumps)]
A2["(BYOD Analyzer)"] -.->|Reads| C
A[Analyzer] -->|Reads| C
A[Analyzer] -->|Produces| K(Stats, visualizations, etc...)

The application is composed of multiple modules, accessible via CLI:

  • scraper: unattended script that incrementally downloads and preserves the current status of the Italian railway network. If run regularly (e.g. roughly every hour using cron), all trains will be captured and saved in data/%Y-%m-%d/trains.pickle;
  • train-extractor and station-extractor: convert raw scraped data to usable .csv files;
  • analyze: shows reproducible stats and visualizations.
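The dump layout described above can be sketched in a few lines. This is only an illustration built on the data/%Y-%m-%d/trains.pickle pattern; the helper names daily_dump_path and missing_days are hypothetical and are not part of the project.

```python
from datetime import date
from pathlib import Path


def daily_dump_path(day: date, root: Path = Path("data")) -> Path:
    """Return the expected location of the scraper dump for a given day."""
    # Layout taken from the module description: data/%Y-%m-%d/trains.pickle
    return root / day.strftime("%Y-%m-%d") / "trains.pickle"


def missing_days(days: list[date], root: Path = Path("data")) -> list[date]:
    """Return the days for which no dump file exists under root."""
    return [day for day in days if not daily_dump_path(day, root).exists()]
```

A check like missing_days could be used to spot gaps in a collection before running the extractors.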

Running

The project is written in Python and uses modern typing annotations, so Python >= 3.11 is required.

Using Docker (easy)

A Dockerfile is available to avoid installing the dependencies manually. You can use the automatically updated ghcr.io/marcobuster/railway-opendata:latest Docker image if you want the latest version available on the master branch.

For instance, the following command will start the scraper on your machine.

$ docker run -v ./data:/app/data ghcr.io/marcobuster/railway-opendata:latest scraper

Using virtual envs

⚠️ WARNING: this project currently uses the builtin hash(...) function to quickly index objects. To ensure reproducibility between runs, you need to disable Python's hash seed randomization by setting the PYTHONHASHSEED=0 environment variable. If you fail to do so, the software will refuse to start.

$ export PYTHONHASHSEED=0
$ virtualenv venv
$ source ./venv/bin/activate
$ pip install -r requirements.txt
$ python main.py ...
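For readers wondering what such a startup check might look like, here is a minimal sketch. It is an assumption about the general approach, not the project's actual code; the function names are hypothetical.

```python
import os
from collections.abc import Mapping


def hash_seed_is_fixed(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when hash randomization is disabled via PYTHONHASHSEED=0."""
    return environ.get("PYTHONHASHSEED") == "0"


def require_fixed_hash_seed(environ: Mapping[str, str] = os.environ) -> None:
    """Abort with an error message unless PYTHONHASHSEED=0 is set."""
    if not hash_seed_is_fixed(environ):
        raise SystemExit("PYTHONHASHSEED=0 must be set before starting Python")
```

Note that the variable is read when the interpreter starts: setting os.environ["PYTHONHASHSEED"] from inside an already running script does not change the behaviour of hash(...) in that process, which is why the variable is exported in the shell first.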

Example usages

  • Start the scraper. For continuous data collection, it should be run roughly every hour.

    $ python main.py scraper

  • Extract train data from a pickle file and save it in CSV.

    $ python main.py train-extractor -o data/2023-04-29/trains.csv data/2023-04-29/trains.pickle

  • Extract station data from a pickle file and save it in GeoJSON.

    $ python main.py station-extractor -f geojson data/stations.pickle

  • Describe a dataset and filter observations by date.

    $ python main.py analyze --start-date 2023-05-01 --end-date today data/stations.csv data/2023-05-*/trains.csv --stat describe

  • Show delay stats of the last stop.

    $ python main.py analyze --group-by train_hash --agg-func last [..]/stations.csv [..]/trains.csv --stat delay_box_plot

  • Show daily train count grouped by railway companies.

    $ python main.py analyze --group-by client_code [..]/stations.csv [..]/trains.csv --stat day_train_count

  • Display an interactive map and open it in the web browser.

    $ python main.py analyze [..]/stations.csv [..]/trains.csv --stat trajectories_map

  • Display a timetable graph.

    $ python main.py analyze [..]/stations.csv [..]/trains.csv --stat timetable --timetable-collapse

Fields

Stations CSV

| Column | Data type | Description | Notes |
|---|---|---|---|
| code | String | Station code | This field is not actually unique. One station can have multiple codes |
| region | Integer | Region code | If zero, unknown. Used in API calls |
| long_name | String | Station long name | |
| short_name | String | Station short name | Can be empty |
| latitude | Float | Station latitude | Can be empty |
| longitude | Float | Station longitude | Can be empty |

Trains CSV

In the extracted trains CSV, each row is a train stop (not a station or a train). Many fields are therefore duplicated across the rows of the same train.

| Column | Data type | Description | Notes |
|---|---|---|---|
| train_hash | MD5 hash | Unique identifier for a particular train | |
| number | Integer | Train number | Can't be used to uniquely identify a train [1] |
| day | Date | Train departure date | |
| origin | Station (code) | Train absolute origin | |
| category | String | Train category | See table [2] |
| destination | Station (code) | Train final destination | |
| client_code | Integer | Railway company | See table [3] |
| phantom | Boolean | True if the train was only partially fetched | Trains with this flag can be safely ignored |
| trenord_phantom | Boolean | True if the train was only partially fetched using Trenord APIs | Trains with this flag can be safely ignored [4] |
| cancelled | Boolean | True if the train is marked as cancelled | Not all cancelled trains are marked as cancelled: for more accuracy, you should always check stop_type |
| stop_number | Integer | Stop progressive number (starting at 0) | |
| stop_station_code | Station (code) | Stop station code | |
| stop_type | Char | Stop type | P if first, F if intermediate, A if last, C if cancelled |
| platform | String | Stop platform | Can be empty |
| arrival_expected | ISO 8601 | Stop expected arrival time | Can be empty |
| arrival_actual | ISO 8601 | Stop actual arrival time | Can be empty |
| arrival_delay | Integer | Stop arrival delay in minutes | Empty if arrival_expected or arrival_actual is empty |
| departure_expected | ISO 8601 | Stop expected departure time | Can be empty |
| departure_actual | ISO 8601 | Stop actual departure time | Can be empty |
| departure_delay | Integer | Stop departure delay in minutes | Empty if departure_expected or departure_actual is empty |
| crowding | Integer | Train crowding in percentage | Reported by Trenord |
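Since each row is a stop, most analyses start by regrouping rows into trains. The sketch below groups rows by train_hash and applies the advice from the cancelled column by inspecting stop_type. It assumes a header row with the column names above; stops_by_train and has_cancelled_stop are hypothetical helpers.

```python
import csv
import io
from collections import defaultdict


def stops_by_train(text: str) -> dict[str, list[dict[str, str]]]:
    """Group stop rows by train_hash, ordered by stop_number."""
    trains: dict[str, list[dict[str, str]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        trains[row["train_hash"]].append(row)
    for stops in trains.values():
        stops.sort(key=lambda stop: int(stop["stop_number"]))
    return dict(trains)


def has_cancelled_stop(stops: list[dict[str, str]]) -> bool:
    """True if at least one stop of the train has stop_type C (cancelled)."""
    return any(stop["stop_type"] == "C" for stop in stops)
```

Checking stop_type rather than the cancelled flag alone catches trains that lost only some of their stops.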

Contributing

See CONTRIBUTING.md.

Notes and caveats

Data completeness and correctness

The ViaggiaTreno APIs are known to be buggy and unreliable. As stated before, many fields (like departure_expected and arrival_expected) are not guaranteed to be present, and some concepts are counter-intuitive (a train number is not a unique identifier, and neither is a station code).

ViaggiaTreno is the main source of truth for many end-user applications (like Trenìt! or Orario Treni) and is itself linked on the official Trenitalia website. For instance, if the API does not return information for a train stop, no other application will display it: the data simply does not exist online. The scraper always tries to save as much data as possible ("best effort"), even when it is probably incomplete; in those cases, proper flags (like phantom and trenord_phantom) are activated so the developer can choose for themselves.

Licensing

Copyright (c) 2023 Marco Aceti. Some rights reserved (see LICENSE).

The terms and conditions of the ViaggiaTreno web portal state that copying is prohibited (except for personal use), as all rights to the content are reserved to the original owner (Trenitalia or Gruppo FS). In July 2019 Trenitalia sued Trenìt! for using train data in its app, but partially lost. I think data about the performance of public transport should be open as well, but I'm not a lawyer and I'm not willing to risk lawsuits by redistributing data; if someone wants to, the tool is now available.

BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Footnotes

  1. In Italy, two different trains can share the same number. A train is only uniquely identified by the triple (number, origin, day).

  2. Known categories are listed below.

    | Category | Description |
    |---|---|
    | REG | Regional trains |
    | MET | Metropolitan trains |
    | FR | Frecciarossa (red arrow) |
    | IC | Intercity |
    | ICN | Intercity Night |
    | EC | Eurocity |
    | FB | Frecciabianca (white arrow) |
    | FA | Frecciargento (silver arrow) |
    | EN | EuroNight |
    | EC ER | Eurocity |

  3. Known client codes are listed below.

    | Client code | Railway company |
    |---|---|
    | 1 | TRENITALIA_AV |
    | 2 | TRENITALIA_REG |
    | 4 | TRENITALIA_IC |
    | 18 | TPER |
    | 63 | TRENORD |
    | 64 | OBB |

  4. This flag is activated when a train is seen on ViaggiaTreno APIs and marked as Trenord's but it can't be fetched on Trenord's APIs.
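Footnotes 1 and 3 can be sketched in code as follows. The mapping is copied from the client code table; train_key and company_name are hypothetical helpers, and the UNKNOWN_ fallback label is an invention for illustration only.

```python
# Known client codes, as listed in footnote 3.
CLIENT_CODES = {
    1: "TRENITALIA_AV",
    2: "TRENITALIA_REG",
    4: "TRENITALIA_IC",
    18: "TPER",
    63: "TRENORD",
    64: "OBB",
}


def train_key(row: dict[str, str]) -> tuple[int, str, str]:
    """Identify a train by the triple (number, origin, day) from footnote 1."""
    return (int(row["number"]), row["origin"], row["day"])


def company_name(client_code: int) -> str:
    """Map a client code to its company, with a placeholder for unknown codes."""
    return CLIENT_CODES.get(client_code, f"UNKNOWN_{client_code}")
```

Two rows with the same number but a different origin or day yield different keys, which is exactly the situation footnote 1 describes.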