Insight Data Engineering Project
The goal is to recommend to door-to-door salespeople which potential customer to visit next. Given a tight time limit, a salesperson wants to know which customer is fastest to reach. The app suggests the next location to visit based on the mode of transportation and real-time road conditions.
Data Flow
Data Source
The list of licensed businesses comes from Data.gov.
Approach
- Process data with PySpark (clean and normalize)
- Store data in Postgres
- Use PostGIS to index spatial data (location)
- Dash UI to interact with data
- Data is saved in S3 as 51 CSV files
- Removed rows without location coordinates, address, or industry description (NAICS code)
- If the number of employees is provided, grouped it into bins (0-10, 10-100, 100-500, ...)
- If the sales value is provided, grouped it into bins (0-1000, 1000-10000, 100000-500000, ...)
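The row filtering and binning steps above can be sketched in plain Python. This is only an illustration, not the project's PySpark code: the field names (`latitude`, `longitude`, `address`, `naics`, `employees`) are hypothetical, the bins are assumed to be left-inclusive, and the open-ended label for values past the last listed edge is an assumption.

```python
from bisect import bisect_right

# Hypothetical field names; the real CSV headers are not given in this README.
REQUIRED_FIELDS = ("latitude", "longitude", "address", "naics")

# Edges taken from the employee ranges listed above; anything past 500 is unspecified.
EMPLOYEE_EDGES = [0, 10, 100, 500]


def has_required_fields(row):
    """Return True when every required field is present and non-empty."""
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True


def bin_label(value, edges):
    """Map a number to a label such as '10-100', or None if it is missing.

    Assumes left-inclusive bins, so 10 falls in '10-100' rather than '0-10'.
    Values at or above the last edge get an open-ended label such as '500+'.
    """
    if value is None or value < edges[0]:
        return None
    idx = bisect_right(edges, value)
    if idx == len(edges):
        return f"{edges[-1]}+"
    return f"{edges[idx - 1]}-{edges[idx]}"


def clean_rows(rows):
    """Drop incomplete rows and attach an employee bin to the rest."""
    return [
        dict(row, employees_bin=bin_label(row.get("employees"), EMPLOYEE_EDGES))
        for row in rows
        if has_required_fields(row)
    ]
```

The same `bin_label` helper could be reused for sales values by passing a different list of edges.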
- Here API
- State options
- Business type options
- Transportation mode
- Time radius
- Starting location
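One way the options above might combine into a suggestion is sketched below. It is not the app's actual logic. It assumes travel times (in minutes) have already been obtained for the chosen transportation mode and starting location, for example from a routing service, and passes them in as a plain dictionary. The function name `suggest_next` and the keys `id`, `state`, and `business_type` are hypothetical.

```python
def suggest_next(businesses, travel_times, state, business_type, time_radius_min):
    """Return the matching business that is fastest to reach, or None.

    businesses: list of dicts with hypothetical keys 'id', 'state', 'business_type'.
    travel_times: dict mapping business id to travel time in minutes, assumed to
        already reflect the selected transportation mode and starting location.
    """
    reachable = [
        b for b in businesses
        if b["state"] == state
        and b["business_type"] == business_type
        and b["id"] in travel_times
        and travel_times[b["id"]] <= time_radius_min
    ]
    if not reachable:
        return None
    # The fastest-to-reach candidate is the one with the smallest travel time.
    return min(reachable, key=lambda b: travel_times[b["id"]])


# Made-up sample data, for illustration only.
businesses = [
    {"id": 1, "state": "NY", "business_type": "Restaurant"},
    {"id": 2, "state": "NY", "business_type": "Restaurant"},
    {"id": 3, "state": "NJ", "business_type": "Restaurant"},
]
travel_times = {1: 14.0, 2: 9.5, 3: 4.0}

best = suggest_next(businesses, travel_times, "NY", "Restaurant", time_radius_min=15)
```

Filtering by state and business type before comparing travel times keeps the faster but non-matching business (id 3 here) out of the result.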
IMPORTANT: Before doing anything, set up config.py.
boto3, dash, GeoAlchemy2, gunicorn, pandas, psycopg2-binary, requests, SQLAlchemy
./
├── LICENSE
├── README.md
├── config.py
├── data
│   └── 6-digit_2017_Codes.csv
└── src
    ├── APIs
    │   ├── HereAPI.py
    │   └── YelpAPI.py
    ├── AirFlow
    │   └── UpdateDataSchedule.py
    ├── SQL
    │   ├── AssociationTables.py
    │   ├── BusinessTable.py
    │   ├── CategoryTable.py
    │   ├── LocationTable.py
    │   ├── MainTable.py
    │   ├── __init__.py
    │   └── base.py
    ├── SQLScripts
    │   └── create_index.sql
    ├── assets
    │   ├── base.css
    │   └── style.css
    ├── config.py
    ├── help_functions_app.py
    ├── main_app.py
    └── pyspark_clean_data.py