# Vision Zero Google Sheets API

This folder contains scripts that read the Vision Zero Google spreadsheets and load them into the appropriate Postgres tables using the Google Sheets API. The process is automated with Airflow so that it runs daily.

## Table of Contents

- [1. Data Source](#1-data-source)
- [2. The Automated Data Pipeline](#2-the-automated-data-pipeline)
- [3. Sheets Credentials](#3-sheets-credentials)
- [4. Adding a new year](#4-adding-a-new-year)
  - [4.1 Create a New PostgreSQL Table](#41-create-a-new-postgresql-table)
  - [4.2 Request sharing permission to the new sheet](#42-request-sharing-permission-to-the-new-sheet)
  - [4.3 Add the New Google Sheet to Airflow](#43-add-the-new-google-sheet-to-airflow)
  - [4.4 Check the Airflow logs and the data in the database](#44-check-the-airflow-logs-and-the-data-in-the-database)
  - [4.5 Wait overnight for the data to appear on the Vision Zero map](#45-wait-overnight-for-the-data-to-appear-on-the-vision-zero-map)
- [5. Table generated](#5-table-generated)
- [6. Pulling data with the command-line interface](#6-pulling-data-with-the-command-line-interface)
  - [6.1 Database Configuration File](#61-database-configuration-file)
  - [6.2 Local Google API key](#62-local-google-api-key)

> **Notes:**
> - An introduction to the Google Sheets API can be found at [Intro](https://developers.google.com/sheets/api/guides/concepts).

## 1. Data Source

The School Safety Zone data are loaded from individual Google Sheets for every year since 2018. Those Google Sheets are maintained by [Vision Zero](mailto:VisionZeroTO@toronto.ca). The data are pulled daily by an Airflow pipeline (DAG) and can also be pulled manually by running the script `gis.school_safety_zones.schools.py` with the appropriate arguments. The data are stored in a partitioned table structure under `vz_safety_programs_staging.school_safety_zone_raw_parent` and then transformed via downstream views.

The following sections describe the two approaches in more detail.

## 2. The Automated Data Pipeline

The DAG consists of two main tasks, as shown in the figure below:

![vz_google_sheets DAG structure](dag_structure.png)

## 3. Sheets Credentials

A credential file (named `key.json` in the script) is required to connect to the Google Sheets and pull data. The contents of this file can be downloaded from [the Google console](https://console.cloud.google.com/iam-admin/serviceaccounts/details/) if you're logged in to the right Google account. These credentials are currently stored in an encrypted Airflow connection: `vz_api_google`.

## 4. Adding a new year

Follow these steps to read in another spreadsheet for year `yyyy`.

### 4.1 Create a New PostgreSQL Table

Create an empty table `vz_safety_programs_staging.school_safety_zone_yyyy_raw`, where `yyyy` is the year to be stored, as a child of the parent table `vz_safety_programs_staging.school_safety_zone_raw_parent`. Follow the format of the existing child tables (e.g. `vz_safety_programs_staging.school_safety_zone_2018_raw`) and declare the inheritance:

```SQL
CREATE TABLE vz_safety_programs_staging.school_safety_zone_yyyy_raw (
    like vz_safety_programs_staging.school_safety_zone_2018_raw
    including all
) INHERITS (vz_safety_programs_staging.school_safety_zone_raw_parent);
```

### 4.2 Request sharing permission to the new sheet

The sheet must be shared with `vz-sheets@quickstart-1568664221624.iam.gserviceaccount.com`, with View-only access. This email is saved in the Airflow credentials as `wys_cred.service_account_email`.

### 4.3 Add the New Google Sheet to Airflow

Add the new year's details to the Airflow variable `ssz_spreadsheets` as described [above](#2-the-automated-data-pipeline) so that the DAG starts pulling its data.

### 4.4 Check the Airflow logs and the data in the database

The logs produce a WARNING when a row is skipped because it is missing its end-of-line marker.
Also check the downstream views to see whether dates (`dt`) and geometries are properly transformed:
* `vz_safety_programs.polygons_school_safety_zones`: for school zone polygons
* `vz_safety_programs.points_wyss`: for the Watch Your Speed Sign locations

⚠ If there are any problems, look at [5. Table generated](#5-table-generated) below to see how the contents of the sheet are mapped to the tables.

### 4.5 Wait overnight for the data to appear on the Vision Zero map

The Geographic Competency Centre (GCC) has a process that pulls the map data nightly from our database and exposes it via an ESRI API.

## 5. Table generated

The script reads information from columns A, B, E, F, Y, Z, AA, and AB of the Google Sheets, shown below:

|SCHOOL NAME|ADDRESS|FLASHING BEACON W/O|WYSS W/O|School Coordinate (X,Y)|Final Sign Installation Date|FB Locations (X,Y)|WYS Locations (X,Y)|
|-----------|-------|-------------------|--------|-----------------------|----------------------------|------------------|-------------------|
|AGINCOURT JUNIOR PUBLIC SCHOOL|29 Lockie Ave|9239020|9239021|43.788456, -79.281118|January 9, 2019|43.786566, -79.279023|43.787530, -79.279456|

and puts it into Postgres tables with the following fields (all of data type text):

|school_name|address|work_order_fb|work_order_wyss|locations_zone|final_sign_installation|locations_fb|locations_wyss|
|-----------|-------|-------------|---------------|--------------|-----------------------|------------|--------------|
|AGINCOURT JUNIOR PUBLIC SCHOOL|29 Lockie Ave|9239020|9239021|43.788456, -79.281118|January 9, 2019|43.786566, -79.279023|43.787530, -79.279456|

**Notes:**
* The Google Sheets API does not return any row with empty cells at the beginning or end of the row, or a row that is entirely empty. The script logs an error when that happens.
* The script reads up to row 180 even though the actual data contain fewer rows; this is to accommodate extra schools that might be added to the sheets in the future.
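The column-to-field mapping above can be sketched in Python. This is a hypothetical illustration, not the actual `schools.py` code; it assumes each raw sheet row arrives as a list of cell strings, with column letters mapped to zero-based indices:

```python
# Hypothetical sketch of the column-to-field mapping described above
# (not the actual schools.py code). Column A is index 0, B is 1, ...,
# Y is 24, Z is 25, AA is 26, AB is 27.
COLUMNS = {
    "school_name": 0,               # A:  SCHOOL NAME
    "address": 1,                   # B:  ADDRESS
    "work_order_fb": 4,             # E:  FLASHING BEACON W/O
    "work_order_wyss": 5,           # F:  WYSS W/O
    "locations_zone": 24,           # Y:  School Coordinate (X,Y)
    "final_sign_installation": 25,  # Z:  Final Sign Installation Date
    "locations_fb": 26,             # AA: FB Locations (X,Y)
    "locations_wyss": 27,           # AB: WYS Locations (X,Y)
}

def map_row(row):
    """Map one raw sheet row to the text fields of the *_raw tables.

    Returns None for rows too short to reach column AB, mirroring the
    skip-and-warn behaviour described in section 4.4.
    """
    if len(row) <= max(COLUMNS.values()):
        return None  # incomplete row: would be logged and skipped
    return {field: row[idx] for field, idx in COLUMNS.items()}
```

All fields stay as text here, matching the staging tables; any casting of dates and coordinates is left to the downstream views.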
## 6. Pulling data with the command-line interface

The data can be loaded into the database from the appropriate Google Sheet(s) using the Linux command-line interface (CLI). The script `gis.school_safety_zones.schools.py` requires some mandatory and optional arguments to load these data. The table below describes the script's arguments. For more details, run `./gis/school_safety_zones/schools.py --help`.

| Argument | Description | Example |
|----------|-------------|---------|
| schema | The PostgreSQL schema to load the data into. | `vz_safety_programs_staging` |
| table | The PostgreSQL table to load the data into. | `school_safety_zone_{year}_raw` |

### 6.1 Database Configuration File

To run the data-puller script from the CLI, you need to save the database parameters in a file in the following format:

```
username=USERNAME
password=PASSWORD
```

### 6.2 Local Google API key

First, the Google Sheets API was enabled on the Google account. Then, a service account was created so that we are not prompted to sign in every single time we run the script. Instructions on how to do that can be found at [Creating a service account](https://github.com/googleapis/google-api-python-client/blob/master/docs/oauth-server.md#creating-a-service-account). Go to the `Service accounts` page from there, select the `Quickstart` project, and click on the `Search for APIs and Services` bar to generate credentials. Copy the credentials and paste them into a `key.json` file located in the same directory as the script. The `key.json` file should look something like this:

```json
"type": "service_account",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url":
```