The task is to build a machine learning model that predicts the number of taxi trips in each of Chicago's community areas for the next hour. Such a model could help distribute taxi driver workload more evenly across locations. This is a time series forecasting task.
The datasets and their descriptions are available at the following links:
https://data.cityofchicago.org/Transportation/Taxi-Trips-2022/npd7-ywjz
https://data.cityofchicago.org/Transportation/Taxi-Trips-2023/e55j-2ewb
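To make the prediction target concrete, here is a minimal sketch of how raw trip records could be turned into hourly counts per community area, with the next hour's count as the target. It is an illustration, not the code used in the notebook. The column names `Trip Start Timestamp` and `Pickup Community Area` are assumptions about the dataset schema, and `hourly_trip_counts` and `next_hour_trips` are hypothetical names.

```python
import pandas as pd


def hourly_trip_counts(
    trips: pd.DataFrame,
    time_col: str = "Trip Start Timestamp",   # assumed column name
    area_col: str = "Pickup Community Area",  # assumed column name
) -> pd.DataFrame:
    """Count trips per community area and hour and attach the next-hour target."""
    df = trips[[time_col, area_col]].dropna().copy()
    # Truncate each trip start time to the beginning of its hour.
    df["hour"] = pd.to_datetime(df[time_col]).dt.floor("h")

    # Wide table: one row per hour, one column per community area.
    wide = df.groupby(["hour", area_col]).size().unstack(area_col, fill_value=0)

    # Reindex to a complete hourly range so that hours without trips count as 0.
    full_range = pd.date_range(wide.index.min(), wide.index.max(), freq="h")
    wide = wide.reindex(full_range, fill_value=0)
    wide.index.name = "hour"

    # Back to long format: one row per (hour, area) pair.
    counts = wide.stack().rename("trips").reset_index()
    counts = counts.sort_values([area_col, "hour"]).reset_index(drop=True)

    # Target: the trip count of the following hour within the same area.
    # The last hour of each area has no known target and stays NaN.
    counts["next_hour_trips"] = counts.groupby(area_col)["trips"].shift(-1)
    return counts
```

Rows whose `next_hour_trips` is missing would be dropped before training, since the target for the last observed hour is unknown.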
This repository contains:
- A notebook with a machine learning model that predicts the count of taxi trips.
- Supporting code that generates time series features from a pandas DataFrame.
- A shell script that creates a cluster of Docker containers to run the PySpark code.
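As an illustration of what time series feature generation from a pandas DataFrame can look like, the sketch below adds lag, rolling-mean, and calendar features to a table of hourly counts. It is not the repository's implementation. The function name `add_time_series_features`, the column names `area`, `hour`, and `trips`, and the chosen lags are all hypothetical, and the sketch assumes one row per area and hour with no missing hours.

```python
import pandas as pd


def add_time_series_features(
    counts: pd.DataFrame,
    group_col: str = "area",   # hypothetical column: community area id
    time_col: str = "hour",    # hypothetical column: hourly timestamp
    value_col: str = "trips",  # hypothetical column: trip count in that hour
    lags=(1, 2, 24),
) -> pd.DataFrame:
    """Add lag, rolling-mean, and calendar features, computed per group."""
    out = counts.sort_values([group_col, time_col]).copy()
    grouped = out.groupby(group_col)[value_col]

    # Lag features: the value observed `lag` rows (hours) earlier in the same area.
    for lag in lags:
        out[f"{value_col}_lag_{lag}"] = grouped.shift(lag)

    # Mean of the three preceding hours; shift(1) excludes the current hour.
    out[f"{value_col}_roll_mean_3"] = grouped.transform(
        lambda s: s.shift(1).rolling(3).mean()
    )

    # Calendar features that capture daily and weekly seasonality.
    out["hour_of_day"] = out[time_col].dt.hour
    out["day_of_week"] = out[time_col].dt.dayofweek  # Monday is 0
    return out
```

Computing the features per group keeps values from one community area from leaking into another. Rows at the start of each series have missing lag values, which gradient boosting libraries such as LightGBM and CatBoost can handle natively.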
To run the notebook, you need Docker installed. Once it is installed, start the local cluster:
$ sh start_local_cluster.sh
Open the local cluster at the address printed in the terminal window. Then set the SPARK_MASTER_IP variable in cell 8 of the .ipynb file to the value printed in the terminal.
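For orientation, a notebook cell that uses SPARK_MASTER_IP might look roughly like the sketch below. This is an assumption rather than the actual content of cell 8: the `spark://` scheme with port 7077 is the default for a Spark standalone master, but the cluster started by the script may be configured differently. The helper names and the application name are hypothetical.

```python
def spark_master_url(master_ip: str, port: int = 7077) -> str:
    """Build a standalone-mode master URL (7077 is Spark's default master port)."""
    return f"spark://{master_ip}:{port}"


def create_spark_session(master_ip: str, app_name: str = "taxi-trips"):
    """Connect to the local cluster; app_name is a hypothetical label."""
    # Imported here so the URL helper can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(spark_master_url(master_ip))
        .appName(app_name)
        .getOrCreate()
    )


# Usage, with the IP address printed by start_local_cluster.sh:
# SPARK_MASTER_IP = "<value from the terminal>"
# spark = create_spark_session(SPARK_MASTER_IP)
```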
Technologies used:
- Docker
- PySpark
- Pandas
- LightGBM
- CatBoost