MuttData/soam

SoaM

SoaM is a Prefect-based library created by Mutt. Its goal is to provide a forecasting framework, built on the experience gathered in previous projects. Hence the name: Son of a Mutt = SoaM.

SoaM pipeline

graph LR
    id0[(Database I)]-->id2[/SoaM Time Series Extractor/]
    id1[(Database II)]-->id2
    id2-->id3[/SoaM Transformer/]
    id3-->id4[/SoaM Forecaster/]
    id5{{Forecasting Model}}-->id4
    id4-->id6[(SoaM Predictions)]
    id6-->id7[/SoaM Forecaster Plotter/]
    id6-->id8[/SoaM Reporting/]
    id7-->id8

This library pipeline supports any data source. The process is structured in different stages:

  • Extraction: manages the granularity and aggregation of the input data.
  • Preprocessing: offers out-of-the-box tools for standard tasks such as normalization or filling NaN values.
  • Forecasting: fits a model and predicts results.
  • Postprocessing: modifies the results based on business or real-world information, or builds analyses on the predicted values, such as anomaly detection.

Overview of the Steps Run in SoaM

Extraction

This stage extracts data from the required sources to build the condensed dataset for the next steps; this part tends to be project dependent. It then converts the full dataset to the desired time granularity and aggregates it by one or more categorical attributes.
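
As a rough illustration of the granularity and aggregation idea, the sketch below coarsens timestamped rows to daily granularity and sums them per category. It uses only the standard library and is not SoaM's extractor API; the `aggregate_daily` helper and the `(timestamp, category, value)` row layout are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(rows):
    """Sum `value` per (day, category) from (timestamp, category, value) rows.

    Hypothetical helper for illustration only; not part of SoaM.
    """
    totals = defaultdict(float)
    for timestamp, category, value in rows:
        day = timestamp.date()  # coarsen the timestamp to daily granularity
        totals[(day, category)] += value
    # Sort by day, then category, so the result is a deterministic series.
    return sorted((day, cat, total) for (day, cat), total in totals.items())


rows = [
    (datetime(2021, 1, 1, 9, 30), "web", 10.0),
    (datetime(2021, 1, 1, 17, 0), "web", 5.0),
    (datetime(2021, 1, 1, 12, 0), "app", 2.0),
    (datetime(2021, 1, 2, 8, 15), "web", 7.0),
]
daily = aggregate_daily(rows)
```

Here the two "web" rows from January 1st collapse into a single daily total, while each other (day, category) pair keeps its own row.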

Preprocessing

This step implements functions to further clean up and prepare the data for the following steps, such as:

  • Add feature/transformation
  • Fill nan values
  • Apply value normalizations
  • Shift values
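
The tasks listed above can be sketched in plain Python as follows. These functions are hypothetical stand-ins written for this example, not SoaM's built-in transformers, and forward filling and min-max scaling are just one possible choice for each task.

```python
import math


def fill_nan(values, fill=0.0):
    """Forward-fill NaN entries; leading NaNs become `fill`."""
    filled, last = [], fill
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled


def min_max_normalize(values):
    """Scale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant series: avoid dividing by zero
    return [(v - lo) / span for v in values]


def shift(values, periods=1, fill=float("nan")):
    """Shift the series forward by `periods` steps, padding the start with `fill`."""
    periods = max(0, min(periods, len(values)))
    return [fill] * periods + list(values[: len(values) - periods])


raw = [2.0, float("nan"), 4.0, 6.0]
filled = fill_nan(raw)
normalized = min_max_normalize(filled)
lagged = shift(normalized, periods=1)  # a lag-1 feature built from the clean series
```

Chaining the steps in this order matters: normalizing before filling would propagate NaN through `min` and `max`.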

Forecasting

This stage receives the clean data, performs the forecast and stores the predicted values in the configured storage. Currently there are implementations to store them in CSV files and SQL databases. A variety of models are currently supported to fit and predict data, and they can be extended to create custom ones.
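
To make the fit, predict and store flow concrete, here is a minimal sketch using a toy model and an in-memory CSV. The `NaiveMeanForecaster` class, the `predictions_to_csv` helper and the `step`/`yhat` column names are assumptions for this example; they are not SoaM classes or its storage format.

```python
import csv
import io


class NaiveMeanForecaster:
    """Hypothetical model: predicts the mean of the last `window` training values."""

    def __init__(self, window=3):
        self.window = window
        self.level_ = None

    def fit(self, y):
        tail = y[-self.window:]
        self.level_ = sum(tail) / len(tail)
        return self  # returning self allows chaining fit(...).predict(...)

    def predict(self, horizon):
        if self.level_ is None:
            raise RuntimeError("call fit() before predict()")
        return [self.level_] * horizon


def predictions_to_csv(predictions, start_step=1):
    """Serialize predictions as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["step", "yhat"])
    for step, yhat in enumerate(predictions, start=start_step):
        writer.writerow([step, yhat])
    return buffer.getvalue()


model = NaiveMeanForecaster(window=3).fit([10.0, 12.0, 11.0, 13.0])
forecast = model.predict(horizon=2)
csv_text = predictions_to_csv(forecast)
```

A custom model would follow the same shape: learn state in `fit`, then produce one value per step of the horizon in `predict`.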

Backtesting

Window policies

To do backtesting, the data is split into train and validation sets. There are two splitting methods:

  • Sliding: creates a fixed-size window for the training data that ends at the beginning of the validation data.
  • Expanding: creates the training data from all the data between the start of the series and the beginning of the validation data.

For more information review this document: backtesting at scale
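
The difference between the two window policies can be sketched with index ranges. The `backtest_splits` function and its parameters are hypothetical names chosen for this example, and stepping the validation window forward by its own size is an assumption, not necessarily how SoaM advances it.

```python
def backtest_splits(n_obs, train_size, test_size, policy="sliding"):
    """Yield (train_range, validation_range) index pairs for backtesting.

    Hypothetical helper for illustration only; not part of SoaM.
    """
    if policy not in ("sliding", "expanding"):
        raise ValueError(f"unknown policy: {policy!r}")
    test_start = train_size
    while test_start + test_size <= n_obs:
        if policy == "sliding":
            # Fixed-size window ending where the validation data begins.
            train_start = test_start - train_size
        else:
            # Expanding: always train from the start of the series.
            train_start = 0
        yield range(train_start, test_start), range(test_start, test_start + test_size)
        test_start += test_size


sliding = list(backtest_splits(10, train_size=4, test_size=2, policy="sliding"))
expanding = list(backtest_splits(10, train_size=4, test_size=2, policy="expanding"))
```

Both policies validate on the same ranges; only the start of the training range differs, staying at 0 for expanding and moving forward for sliding.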

Postprocessing

This last stage is prepared to work on the forecasts generated by the pipeline. For example:

  • Clip/Cleanup the predictions.
  • Perform further analyses (such as anomaly detection).
  • Export reports.
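
Two of these postprocessing tasks can be sketched in a few lines. Both helpers are hypothetical, and flagging an observation as anomalous when it falls outside the forecast interval is an assumed rule for this example rather than SoaM's detection method.

```python
def clip_predictions(predictions, lower=0.0, upper=None):
    """Clamp predictions to [lower, upper]; `upper=None` means no upper bound."""
    clipped = []
    for p in predictions:
        p = max(p, lower)
        if upper is not None:
            p = min(p, upper)
        clipped.append(p)
    return clipped


def flag_anomalies(actuals, lower_bounds, upper_bounds):
    """Mark an observation as anomalous when it falls outside its forecast interval."""
    return [
        not (lo <= actual <= hi)
        for actual, lo, hi in zip(actuals, lower_bounds, upper_bounds)
    ]


preds = clip_predictions([-3.0, 12.5, 140.0], lower=0.0, upper=100.0)
flags = flag_anomalies(
    actuals=[5.0, 30.0, 95.0],
    lower_bounds=[0.0, 10.0, 90.0],
    upper_bounds=[10.0, 20.0, 110.0],
)
```

Clipping is useful when the business quantity has known limits (for example, counts that cannot be negative), which a model may not respect on its own.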

Installation

Install the base library from PyPI by executing:

pip install soam

Or clone this repository:

git clone [soam-repo]

And then run:

pip install .

or, for an editable install:

pip install -e .

Install extras

The project contains some extra dependencies that are not included in the default installation to make it lightweight. If you want to install extensions use:

pip install -e ".[slack]"
pip install -e ".[prophet]"
pip install -e ".[pdf_report]"
pip install -e ".[gsheets_report]"
pip install -e ".[report]" # slack and *_report extras
pip install -e ".[all]" # all previous

Note: The pdf_report extra might need to run the following command before installation (More info)

$ apt-get install texlive-xetex texlive-fonts-recommended libpoppler-cpp-dev

Quick start

Here is an example for a quick start into SoaM. In it, a time series with AAPL stock prices is loaded, processed and forecasted. There is also another example with the same steps that exploits the power of flows.

Usage

For further information, check our end-to-end example, where we explain how SoaM interacts with Airflow and Cookiecutter in a generic project.

Database management

For database storage there are complementary tools:

  • Decouple, to keep the database information in a separate file: a settings.ini file stores the database credentials and general configuration. When modifying it, don't change the key names.
  • Alembic, to create the database migrations. A brief description is below.
  • SQLAlchemy as an ORM; the schemas of the tables are defined in data_models.
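
As a sketch of how credentials kept in a settings.ini file can be turned into a connection string, the example below uses the standard library's configparser as a stand-in for Decouple. The `[settings]` section and the `DB_*` key names are assumptions for illustration; use the keys already present in the project's settings.ini.

```python
from configparser import ConfigParser

# Hypothetical file contents; the real settings.ini keys may differ.
SETTINGS_INI = """
[settings]
DB_USER = soam
DB_PASSWORD = soam
DB_HOST = localhost
DB_PORT = 5432
DB_NAME = soam
"""


def build_connection_string(ini_text, section="settings"):
    """Build a PostgreSQL URL from the credentials stored in INI-formatted text."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    cfg = parser[section]
    return "postgresql://{user}:{password}@{host}:{port}/{name}".format(
        user=cfg["DB_USER"],
        password=cfg["DB_PASSWORD"],
        host=cfg["DB_HOST"],
        port=cfg["DB_PORT"],
        name=cfg["DB_NAME"],
    )


conn_str = build_connection_string(SETTINGS_INI)
```

Because the code looks values up by key, renaming a key in the file would break the lookup, which is why the key names should stay unchanged.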

Alembic

This package uses alembic and expects you to use it!

Alembic is a database migration tool for use with SQLAlchemy. After the schemas are defined with SQLAlchemy, Alembic keeps track of database modifications such as adding new columns, modifying a schema or adding new tables.

Alembic is set up to use the credentials from the settings.ini file and get the defined models from data_models. Be aware that alembic needs this package installed to run!

When you change the data models, the changes must also be applied to the database. To do this, run:

alembic revision --autogenerate
alembic upgrade head

The first command checks the latest version of the database and autogenerates a Python file with the necessary changes. It is always necessary to manually review and correct the candidate migrations that autogenerate produces.

The second command uses this file to apply the changes to the database.

For more Alembic commands, visit the documentation.

Developers guide

If you are going to develop SoaM, you should check out the documentation directory before adding code. You can start with the project structure document.

Testing

To run the default test suite:

pytest

To run the tests with nox:

nox --session tests

Testing data extraction

The tests for the extractor currently depend on having a local Postgres database and the variable TEST_DB_CONNSTR set with its connection string.

The easiest way to do this is as follows:

docker run --network=host \
    -e "POSTGRES_USER=soam" \
    -e "POSTGRES_PASSWORD=soam" \
    -e "POSTGRES_DB=soam" \
    --rm postgres

TEST_DB_CONNSTR="postgresql://soam:soam@localhost/soam" pytest

To run a specific test file:

TEST_DB_CONNSTR="postgresql://soam:soam@localhost/soam" pytest -v tests/test_file.py

Note that even though the example connection string includes a database name, a new database is created and dropped during the tests to ensure that no state is maintained between runs.
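
One way to picture the throwaway database is to derive its URL from TEST_DB_CONNSTR by swapping the database name. This is only a sketch of the idea; the `temporary_db_url` helper and the `test_` naming scheme are assumptions and may not match what the test suite does.

```python
import os
import uuid
from urllib.parse import urlsplit, urlunsplit


def temporary_db_url(base_connstr, db_name=None):
    """Return `base_connstr` with its database replaced by a throwaway name.

    Hypothetical helper for illustration only; not part of SoaM.
    """
    db_name = db_name or "test_" + uuid.uuid4().hex
    parts = urlsplit(base_connstr)
    # The database name is the URL path; user, password and host stay untouched.
    return urlunsplit(parts._replace(path="/" + db_name))


# Fall back to the example connection string when the variable is not set.
base = os.environ.get("TEST_DB_CONNSTR", "postgresql://soam:soam@localhost/soam")
temp_url = temporary_db_url(base, db_name="test_example")
```

With the default random name, every run points at a different database, so leftovers from an interrupted run cannot leak into the next one.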

Testing plots

To generate images for testing we use pytest-mpl as follows:

pytest --mpl-generate-path=tests/plotting/baseline

To run the image based tests:

pytest --mpl

Contributing

We appreciate you considering helping to maintain this project. If you'd like to contribute, please read our contributing guidelines.

CI

To run the CI jobs locally, use nox. The noxfile.py file in the project root directory defines all the jobs; they are executed by the CI, and you can also call them locally.

You can run all the jobs with the nox command from the project root directory, or run a single job with, for example, nox --session tests.

The .gitlab-ci.yml file configures the GitLab CI to run nox. Nox lets us execute tests and checks before making a commit. We are using:

  • Linting job:
    • isort to reorder imports
    • pylint to be pep8 compliant
    • black to format for code conventions
    • mypy for static type checking
  • bandit for security checks
  • pytest to run all the tests in the test folder.
  • pyreverse to create diagrams of the project

This runs on a GitLab machine after every commit.

We are caching the environments for each job on each branch. On the first commit of a branch, and whenever you add dependencies or a new package to the project, you will have to change the cache policy. GitLab cache policies:

  • pull: pull the cached files from the cloud.
  • push: push the created files to the cloud.
  • pull-push: pull the cached files and push the newly created files.

Rules of Thumb

This section contains some recommendations when working with SoaM to avoid common mistakes:

  • When possible, reuse objects to preserve their configuration, e.g. transformations, forecasters, etc.
  • Use the same train-test windows for backtesting, for training the model to deploy, and for later usage.

Credits

Alejandro Rusi
Diego Leguizamón
Diego Lizondo
Eugenio Scafati
Fabian Wolfmann
Federico Font
Francisco Lopez Destain
Guido Trucco
Hugo Daniel Viotti
Juan Martin Pampliega
Pablo Andres Lorenzatto
Wenceslao Villegas

License

soam is licensed under the Apache License 2.0.