This repository contains a stock market analysis demo of the ngods data stack. The demo performs the following steps:
- Download data for selected stock symbols from the Yahoo Finance API.
- Store the stock data in the ngods data warehouse (using the Iceberg format).
- Transform the data (e.g. normalize stock prices) using dbt.
- Expose the analytics data model using cube.dev.
- Visualize data as reports and dashboards using Metabase.
- Predict stock prices using ARIMA in Apache Spark.
The demo is packaged as a docker-compose script that downloads, installs, and runs all components of the data stack.
- 2023-02-03:
  - Upgraded to Apache Iceberg 1.1.0
  - Upgraded to Trino 406
  - Migrated to the new JDBC catalog (removed the heavyweight Hive Metastore)
ngods stands for New Generation Opensource Data Stack. It includes the following components:
- Apache Spark for data transformation
- Apache Iceberg as a data storage format
- Trino for federated data query
- dbt for ELT
- Dagster for data orchestration
- cube.dev for data analysis and semantic data model
- Metabase for self-service data visualization (dashboards)
- Minio for local S3 storage
ngods is open-sourced under a BSD license and it is distributed as a docker-compose script that supports Intel and ARM architectures.
ngods requires a machine with at least 16 GB of RAM and an Intel or ARM64 CPU running Docker. It also requires docker-compose.
- Clone the ngods repo
git clone https://github.com/zsvoboda/ngods-stocks.git
- Start the data stack with the docker-compose up command
cd ngods-stocks
docker-compose up -d
NOTE: This can take a long time, depending on your network speed.
- Stop the data stack via the docker-compose down command
docker-compose down
- Execute the data pipeline from the Dagster console at http://localhost:3070/ with this yaml config file.
Copy and paste the content of the e2e.yaml file to this Dagster UI console page and start the data pipeline by clicking the Launch Run button.
NOTE: You can customize the list of stock symbols that will be downloaded.
- Review and customize the cube.dev metrics and dimensions. Test these metrics in the cube.dev playground.
See the cube.dev documentation for more information.
- Check out the Metabase data visualizations that are connected to the cube.dev analytical model. You can run SQL queries on top of the cube.dev schema.
Use username [email protected] and password metabase1.
You can create your own data visualizations and dashboards. See the Metabase documentation for more information.
- Predict the stock close price. Run the ARIMA time-series prediction model notebook that is trained on 29 months of the Apple:AAPL stock data and predicts the next month.
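To make the prediction step more concrete, here is a minimal pure-Python sketch of the ARIMA idea: difference the series once, fit a first-order autoregression to the differences, and roll the forecast forward. It is not the notebook's code. The model order (1,1,0), the least-squares fit without a constant term, the function names, and the sample prices are all illustrative assumptions; a real run would more likely rely on a statistics library.

```python
def fit_ar1(diffs):
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + noise (no constant)."""
    num = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den if den else 0.0


def forecast_arima_110(prices, steps):
    """Forecast `steps` future values with an ARIMA(1,1,0) model (assumed order)."""
    if len(prices) < 3:
        raise ValueError("need at least three prices to fit the model")
    # I(1): model the first differences instead of the raw prices.
    diffs = [b - a for a, b in zip(prices, prices[1:])]
    phi = fit_ar1(diffs)
    last_price, last_diff = prices[-1], diffs[-1]
    forecast = []
    for _ in range(steps):
        last_diff = phi * last_diff          # AR(1) step on the differenced series
        last_price = last_price + last_diff  # integrate back to a price level
        forecast.append(last_price)
    return forecast


# Made-up closing prices for illustration, not real AAPL data.
closes = [100.0, 102.0, 103.0, 103.5, 103.75]
print(forecast_arima_110(closes, steps=2))  # prints [103.875, 103.9375]
```

With these sample values the differences halve at every step, so the fitted coefficient is 0.5 and each forecast adds half of the previous change.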
- Download the DBeaver SQL tool.
- Connect to the Postgres database that contains the gold stage data. Use the jdbc:postgresql://localhost:5432/ngods JDBC URL with username ngods and password ngods.
- Connect to the Trino database that has access to all data stages (bronze, silver, and gold schemas of the warehouse database). Use the jdbc:trino://localhost:8060 JDBC URL with username trino and password trino.
- Connect to the Spark database that is used for data transformations. Use the jdbc:hive2://localhost:10009 JDBC URL with no username and password.
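The same connections can also be opened from Python instead of DBeaver. The sketch below turns the JDBC URLs above into plain connection parameters. The helper name `jdbc_to_params` is hypothetical, and the use of psycopg2 for the Postgres connection is an assumption: it is not part of the demo and would have to be installed separately.

```python
from urllib.parse import urlsplit


def jdbc_to_params(jdbc_url, user=None, password=None):
    """Split a jdbc:<driver>://host:port/db URL into plain connection parameters."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    params = {"driver": parts.scheme, "host": parts.hostname, "port": parts.port}
    database = parts.path.lstrip("/")
    if database:
        params["database"] = database
    if user is not None:
        params["user"] = user
    if password is not None:
        params["password"] = password
    return params


def connect_gold():
    """Open a connection to the gold-stage Postgres database (assumes psycopg2)."""
    import psycopg2  # imported lazily so jdbc_to_params works without it

    p = jdbc_to_params("jdbc:postgresql://localhost:5432/ngods", "ngods", "ngods")
    return psycopg2.connect(host=p["host"], port=p["port"], dbname=p["database"],
                            user=p["user"], password=p["password"])


print(jdbc_to_params("jdbc:postgresql://localhost:5432/ngods", "ngods", "ngods"))
```

`connect_gold()` only works while the data stack is running; `jdbc_to_params` has no such dependency and can be used to feed any other driver.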
This chapter contains useful information for customizing the demo.
Here are a few of the distribution's directories that you may need to customize:
- conf - configuration of all data stack components
  - cube - cube.dev schema (semantic model definition)
- data - main data directory
  - minio - root data directory (contains buckets and file data)
  - spark - Jupyter notebooks
  - stage - file stage data. Spark can access this directory via the /var/lib/ngods/stage path.
- projects - dbt, Dagster, and DataHub projects
  - dagster - Dagster orchestration project
  - dbt - dbt transformations (one project per medallion stage: bronze, silver, and gold)
The data stack has the following endpoints:
- Spark
  - http://localhost:8888 - Jupyter notebooks
  - jdbc:hive2://localhost:10009 - JDBC URL (no username / password)
  - localhost:7077 - Spark API endpoint
  - http://localhost:8061 - Spark master node monitoring page
  - http://localhost:8062 - Spark slave node monitoring page
  - http://localhost:18080 - Spark history server page
- Trino
  - jdbc:trino://localhost:8060 - JDBC URL (username trino / no password)
- Postgres
  - jdbc:postgresql://localhost:5432/ngods - JDBC URL (username ngods / password ngods)
- Cube.dev
  - http://localhost:4000 - cube.dev development UI
  - jdbc:postgresql://localhost:3245/cube - JDBC URL (username cube / password cube)
- Metabase
  - http://localhost:3030 - Metabase UI (username [email protected] / password metabase1)
- Dagster
  - http://localhost:3070 - Dagster orchestration UI
- Minio
  - http://localhost:9001 - Minio UI (username minio / password minio123)
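After starting the stack it can be handy to check which of these endpoints are already accepting connections. The following is a minimal sketch that only tests whether each TCP port is open; it does not verify that the service behind it is healthy. The port numbers come from the endpoint list above, while the function and dictionary names are illustrative.

```python
import socket

# Ports taken from the endpoint list above; the labels are informal.
ENDPOINTS = {
    "Jupyter": 8888,
    "Spark JDBC": 10009,
    "Spark API": 7077,
    "Trino": 8060,
    "Postgres": 5432,
    "cube.dev UI": 4000,
    "Metabase": 3030,
    "Dagster": 3070,
    "Minio UI": 9001,
}


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints=ENDPOINTS, host="localhost"):
    """Map each endpoint label to True (port open) or False (not reachable)."""
    return {name: is_listening(port, host) for name, port in endpoints.items()}


if __name__ == "__main__":
    for name, up in check_endpoints().items():
        print(f"{name:12} {'up' if up else 'down'}")
```

A port that reports "down" shortly after startup may simply still be initializing, so it is worth retrying after a minute before investigating further.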
The ngods stack includes three database engines: Spark, Trino, and Postgres. Both Spark and Trino have access to Iceberg tables in the warehouse.bronze and warehouse.silver schemas. The Trino engine can also access the analytics.gold schema in Postgres, so Trino can federate queries between the Postgres and Iceberg tables.
The Spark engine is configured for ELT and pyspark data transformations.
The Trino engine is configured for data federation between the Iceberg and Postgres tables. Additional catalogs can be configured as needed.
The Postgres database has access only to the analytics.gold schema and is used for executing analytical queries over the gold data.
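As an illustration of federation, the sketch below builds a single Trino query that joins an Iceberg table in warehouse.silver with a Postgres table in analytics.gold. Several things here are assumptions rather than facts about the demo: the table names `stock_prices` and `stock_summary` and the `symbol` column are hypothetical placeholders, `warehouse` and `analytics` are assumed to be the Trino catalog names, and the `trino` Python client is assumed to be installed separately.

```python
def federated_sql(iceberg_table, postgres_table, key):
    """Build a join across the Iceberg silver schema and the Postgres gold schema."""
    return (
        f"SELECT s.{key}, count(*) AS matched_rows "
        f"FROM warehouse.silver.{iceberg_table} AS s "
        f"JOIN analytics.gold.{postgres_table} AS g ON s.{key} = g.{key} "
        f"GROUP BY s.{key}"
    )


def run_on_trino(sql):
    """Execute a query on the local Trino endpoint (assumes the trino client package)."""
    from trino.dbapi import connect  # imported lazily; not needed to build the SQL

    conn = connect(host="localhost", port=8060, user="trino")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


# Hypothetical table and column names; replace them with real ones from your schemas.
print(federated_sql("stock_prices", "stock_summary", "symbol"))
```

Calling `run_on_trino(...)` requires the stack to be running and the placeholder names to be replaced with tables that actually exist.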
The demo data pipeline uses the medallion architecture with bronze, silver, and gold data stages and consists of the following phases:
- Data are downloaded from the Yahoo Finance REST API to the local Minio bucket (./data/stage) using this Dagster operation.
- The downloaded CSV file is loaded to the bronze stage Iceberg tables (warehouse.bronze Spark schema) using dbt models that are executed in Spark (./projects/dbt/bronze).
- Silver stage Iceberg tables (warehouse.silver Spark schema) are created using dbt models that are executed in Spark (./projects/dbt/silver).
- Gold stage Postgres tables (analytics.gold Trino schema) are created using dbt models that are executed in Trino (./projects/dbt/gold).
All data pipeline phases are orchestrated by the Dagster framework. Dagster operations, resources, and jobs are defined in the Dagster project.
The pipeline is executed by running the e2e job from the Dagster console at http://localhost:3070/ using this yaml config file.
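The ordering of the dbt phases can be summarized in a small Python sketch. It only builds the dbt command lines for the three project directories named above and does not execute anything; in the demo, Dagster does the orchestration. That each stage could be run with a plain `dbt run --project-dir ...` from the repository root is an assumption and may not match how the demo's containers invoke dbt.

```python
# (stage, engine that executes the dbt models, dbt project directory)
STAGES = [
    ("bronze", "Spark", "./projects/dbt/bronze"),
    ("silver", "Spark", "./projects/dbt/silver"),
    ("gold", "Trino", "./projects/dbt/gold"),
]


def dbt_commands(stages=STAGES):
    """Return one dbt command (as an argument list) per stage, in pipeline order."""
    return [["dbt", "run", "--project-dir", path] for _, _, path in stages]


for (stage, engine, _), cmd in zip(STAGES, dbt_commands()):
    print(f"{stage:6} ({engine}): {' '.join(cmd)}")
```

The order matters because each stage reads the tables produced by the previous one: silver models build on bronze, and gold models build on silver.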
ngods includes cube.dev for the semantic data model and Metabase for self-service analytics (dashboards, reports, and visualizations).
The analytical (semantic) model is defined in cube.dev and is used for executing analytical queries over the gold data.
Metabase is connected to cube.dev via its SQL API. End users can use it for self-service creation of dashboards, reports, and data visualizations. Metabase is also directly connected to the gold schema in the Postgres database.
Jupyter Notebooks with Scala, Java and Python backends can be used for machine learning.
Create a GitHub issue if you have any questions.