#1067 #1066 Update instructions for adding a new spreadsheet #1069

131 changes: 79 additions & 52 deletions gis/school_safety_zones/README.md
@@ -1,25 +1,32 @@
# Vision Zero Google Sheets API <!-- omit in toc -->
This folder contains scripts to read Vision Zero google spreadsheets and put them into two postgres tables using Google Sheets API. This process is then automated using Airflow for it to run daily.

This folder contains scripts that read the Vision Zero Google spreadsheets and load them into the appropriate Postgres tables using the Google Sheets API. The process is automated with Airflow and runs daily.

## Table of Contents <!-- omit in toc -->

- [1. Data Source](#1-data-source)
- [2. The Automated Data Pipeline](#2-the-automated-data-pipeline)
- [3. Data pulling from the CLI](#3-data-pulling-from-the-cli)
- [3.1 Database Configuration File](#31-database-configuration-file)
- [4. Google Credentials](#4-google-credentials)
- [5. Adding a new year](#5-adding-a-new-year)
- [5.1 Create a New PostgreSQL Table](#51-create-a-new-postgresql-table)
- [5.2 Add the New Google Sheet to Airflow](#52-add-the-new-google-sheet-to-airflow)
- [6. Table generated](#6-table-generated)
- [3. Sheets Credentials](#3-sheets-credentials)
- [4. Adding a new year](#4-adding-a-new-year)
- [4.1 Create a New PostgreSQL Table](#41-create-a-new-postgresql-table)
- [4.2 Request sharing permission to the new sheet](#42-request-sharing-permission-to-the-new-sheet)
- [4.3 Add the New Google Sheet to Airflow](#43-add-the-new-google-sheet-to-airflow)
- [4.4 Check the airflow logs and the data in the database](#44-check-the-airflow-logs-and-the-data-in-the-database)
- [4.5 Wait overnight for the data to appear on the vision zero map](#45-wait-overnight-for-the-data-to-appear-on-the-vision-zero-map)
- [5. Table generated](#5-table-generated)
- [6. Pulling data with the command-line interface](#6-pulling-data-with-the-command-line-interface)
- [6.1 Database Configuration File](#61-database-configuration-file)
- [6.2 Local Google API key](#62-local-google-api-key)

> **Notes:**
> - Introduction to Google Sheets API can be found at [Intro](https://developers.google.com/sheets/api/guides/concepts).
> - A guide on how to get started can be found at [Quickstart](https://developers.google.com/sheets/api/quickstart/python).

## 1. Data Source

The School Safety Zone data are loaded from individual Google Sheets for every year since 2018. Those Google sheets are maintained by [Vision Zero](mailto:[email protected]). The data is pulled daily by an Airflow pipeline (DAG) and can be also pulled manually by running the script `gis.school_safety_zones.schools.py` with the appropriate arguments. The following two sections describe the two approaches in more details.
The School Safety Zone data are loaded from individual Google Sheets, one for every year since 2018. Those Google Sheets are maintained by [Vision Zero](mailto:[email protected]). The data are pulled daily by an Airflow pipeline (DAG) and can also be pulled manually by running the script `gis.school_safety_zones.schools.py` with the appropriate arguments. The data are stored in a partitioned table structure under the parent table `vz_safety_programs_staging.school_safety_zone_raw_parent` and then transformed via downstream views.

The following sections describe the two approaches in more detail.

## 2. The Automated Data Pipeline

@@ -38,7 +45,66 @@ The DAG consists of two main tasks as shown in the below figure:

![vz_google_sheets DAG structure](dag_structure.png)

## 3. Data pulling from the CLI
## 3. Sheets Credentials

A credential file (named `key.json` in the script) is required to connect to the Google Sheets and pull data. The contents of this file can be downloaded from [the Google console](https://console.cloud.google.com/iam-admin/serviceaccounts/details/) if you're logged in to the right Google account. For the automated pipeline, the credentials are currently stored in an encrypted Airflow connection: `vz_api_google`.

## 4. Adding a new year

Follow these steps to read in another spreadsheet for year `yyyy`.

### 4.1 Create a New PostgreSQL Table

Create an empty table `vz_safety_programs_staging.school_safety_zone_yyyy_raw`, where `yyyy` is the year to be stored, as a child of parent table `vz_safety_programs_staging.school_safety_zone_raw_parent`. Follow the format of the existing child tables (e.g. `vz_safety_programs_staging.school_safety_zone_2018_raw`) and declare the inheritance:

```SQL
CREATE TABLE vz_safety_programs_staging.school_safety_zone_yyyy_raw (
like vz_safety_programs_staging.school_safety_zone_2018_raw
including all
) INHERITS (vz_safety_programs_staging.school_safety_zone_raw_parent);
```
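To avoid hand-editing the year in the statement above, the same SQL can be generated from a small helper. This is only a sketch: the function name `create_table_sql` is hypothetical and not part of `schools.py`, and it assumes the 2018 table remains the template for new child tables.

```python
import re

# Table names taken from the SQL statement above.
PARENT = "vz_safety_programs_staging.school_safety_zone_raw_parent"
TEMPLATE = "vz_safety_programs_staging.school_safety_zone_2018_raw"


def create_table_sql(year: int) -> str:
    """Return the CREATE TABLE statement for one year's child table.

    Hypothetical helper: it only builds the SQL text, it does not run it.
    """
    if not re.fullmatch(r"\d{4}", str(year)):
        raise ValueError(f"expected a four-digit year, got {year!r}")
    child = f"vz_safety_programs_staging.school_safety_zone_{year}_raw"
    return (
        f"CREATE TABLE {child} (\n"
        f"    LIKE {TEMPLATE}\n"
        f"    INCLUDING ALL\n"
        f") INHERITS ({PARENT});"
    )


if __name__ == "__main__":
    print(create_table_sql(2024))
```

The year is validated before it is interpolated because it ends up inside an identifier, which cannot be passed as a normal query parameter.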

### 4.2 Request sharing permission to the new sheet

The sheet must be shared with `[email protected]`. View-only access is sufficient. This email is saved in the Airflow credentials as `wys_cred.service_account_email`.

### 4.3 Add the New Google Sheet to Airflow

Add the new year's details to the Airflow variable `ssz_spreadsheets` as described [above](#2-the-automated-data-pipeline) so that the DAG starts pulling its data.

### 4.4 Check the airflow logs and the data in the database

The logs contain a WARNING whenever a line is skipped because it is missing the end-of-line marker.

Also check the downstream views to see whether dates (`dt`) and geometries are properly transformed:
* `vz_safety_programs.polygons_school_safety_zones`: for school zone polygons
* `vz_safety_programs.points_wyss`: for the Watch Your Speed Signs locations

⚠ If there are any problems, see [5. Table generated](#5-table-generated) below for how the contents of the sheet are mapped to the tables.
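When a date or geometry looks wrong in the views, it can help to test the raw text values directly. The sketch below is not the logic used by the views. It assumes the text formats shown in the sample row of [5. Table generated](#5-table-generated): coordinates as two comma-separated decimal numbers and dates written like `January 9, 2019`. Both function names are hypothetical.

```python
from datetime import date, datetime


def parse_coordinates(text: str) -> tuple[float, float]:
    """Parse a coordinate pair such as '43.788456, -79.281118'.

    The two numbers are returned in the order they appear in the sheet.
    Raises ValueError if there are not exactly two numeric parts.
    """
    first, second = (float(part) for part in text.split(","))
    return first, second


def parse_installation_date(text: str) -> date:
    """Parse a date written like 'January 9, 2019'.

    Assumes English month names, so the result depends on the process locale.
    """
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


if __name__ == "__main__":
    print(parse_coordinates("43.788456, -79.281118"))
    print(parse_installation_date("January 9, 2019"))
```

A value that raises `ValueError` here is a likely candidate for the row that failed to transform downstream.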

### 4.5 Wait overnight for the data to appear on the vision zero map

The Geographic Competency Centre (GCC) has a process that pulls the map data from our database nightly and exposes it through an ESRI API.

## 5. Table generated

The script reads information from columns A, B, E, F, Y, Z, AA, and AB, which are shown below

|SCHOOL NAME|ADDRESS|FLASHING BEACON W/O|WYSS W/O|School Coordinate (X,Y)|Final Sign Installation Date|FB Locations (X,Y)|WYS Locations (X,Y)|
|-----------|-------|-------------|---------------|--------------|-----------------------|------------|--------------|
|AGINCOURT JUNIOR PUBLIC SCHOOL|29 Lockie Ave|9239020|9239021|43.788456, -79.281118|January 9, 2019|43.786566, -79.279023|43.787530, -79.279456|

from the Google Sheets and puts them into Postgres tables with the following fields (all of data type `text`):

|school_name|address|work_order_fb|work_order_wyss|locations_zone|final_sign_installation|locations_fb|locations_wyss|
|-----------|-------|-------------|---------------|--------------|-----------------------|------------|--------------|
|AGINCOURT JUNIOR PUBLIC SCHOOL|29 Lockie Ave|9239020|9239021|43.788456, -79.281118|January 9, 2019|43.786566, -79.279023|43.787530, -79.279456|
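The mapping between the two tables above can be sketched in Python as follows. The column letters and field names come from this section. The helper names `column_index` and `row_to_record` are hypothetical, and the sketch assumes each row arrives as a full-width list of cell strings starting at column A, which may not be how `schools.py` reads the sheet.

```python
def column_index(letters: str) -> int:
    """Convert a spreadsheet column label (A, B, ..., AA, AB) to a zero-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


# Sheet column -> Postgres field, as listed in the two tables above.
COLUMN_TO_FIELD = {
    "A": "school_name",
    "B": "address",
    "E": "work_order_fb",
    "F": "work_order_wyss",
    "Y": "locations_zone",
    "Z": "final_sign_installation",
    "AA": "locations_fb",
    "AB": "locations_wyss",
}


def row_to_record(row: list[str]) -> dict[str, str]:
    """Pick the eight columns of interest out of one full-width sheet row."""
    return {field: row[column_index(col)] for col, field in COLUMN_TO_FIELD.items()}


if __name__ == "__main__":
    # Build a 28-cell row (A..AB) with only the first two cells filled in.
    sample = [""] * 28
    sample[0] = "AGINCOURT JUNIOR PUBLIC SCHOOL"
    sample[1] = "29 Lockie Ave"
    print(row_to_record(sample))
```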

**Notes:**
* The Google Sheets API does not read any row with empty cells at the beginning or end of the row, or a row that consists entirely of empty cells. It will log an error when that happens.
* The script reads up to line 180, although the actual data occupy fewer lines than that. This leaves room for extra schools that might be added to the sheets in the future.
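To see which rows would be affected by the first note, the fetched rows can be split into usable and skipped ones. This is a sketch under assumptions: rows are lists of cell strings covering columns A through AB (28 cells), a row with trailing empty cells arrives shorter than that, and the function name `split_rows` is hypothetical rather than taken from `schools.py`.

```python
EXPECTED_WIDTH = 28  # columns A through AB


def split_rows(rows: list[list[str]]) -> tuple[list[list[str]], list[int]]:
    """Separate usable rows from rows that are empty, short, or start with an empty cell.

    Returns the usable rows and the 1-based positions (within `rows`) of the
    rows that were skipped.
    """
    good, skipped = [], []
    for number, row in enumerate(rows, start=1):
        if len(row) < EXPECTED_WIDTH or not row[0].strip():
            skipped.append(number)
        else:
            good.append(row)
    return good, skipped


if __name__ == "__main__":
    rows = [["x"] * 28, [], ["x"] * 5]
    good, skipped = split_rows(rows)
    print(len(good), skipped)
```

The skipped positions are relative to the first row fetched, so the offset of the header rows has to be added to find the row in the sheet itself.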

## 6. Pulling data with the command-line interface

The data can be loaded into the database from the appropriate Google Sheet(s) using the Linux command-line interface (CLI). The script `gis.school_safety_zones.schools.py` takes both mandatory and optional arguments, which are described in the table below. For more details, run `./gis/school_safety_zones/schools.py --help`.

@@ -51,7 +117,7 @@ The data can be loaded into the database from the appropriate Google sheet(s) us
| schema | The PostgreSQL schema to load the data into. | `vz_safety_programs_staging` |
| table | The PostgreSQL table to load the data into. | `school_safety_zone_{year}_raw` |

### 3.1 Database Configuration File
### 6.1 Database Configuration File

To be able to run the data puller script from the CLI, you need to save the database parameters in a file in the following format:

@@ -63,9 +129,9 @@ username=USERNAME
password=PASSWORD
```
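A file in this key=value style can be read with Python's standard `configparser` module. The sketch below is illustrative only. The section name `DBSETTINGS` and the `host` and `database` keys are assumptions, because that part of the file format is not visible here. Only `username` and `password` come from the format above.

```python
import configparser

# Hypothetical file contents: the section name and the host/database keys
# are assumptions; username and password follow the documented format.
EXAMPLE = """
[DBSETTINGS]
host=HOST
database=DATABASE
username=USERNAME
password=PASSWORD
"""


def load_db_settings(text: str, section: str = "DBSETTINGS") -> dict[str, str]:
    """Parse the configuration text and return one section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


if __name__ == "__main__":
    settings = load_db_settings(EXAMPLE)
    # Print the key names only, never the password itself.
    print(sorted(settings))
```

For a real file, `parser.read(path)` would replace `read_string`. The file should stay out of version control because it holds a password.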

## 4. Google Credentials
### 6.2 Local Google API key

Initially, a credential file (named `key.json` in the script) was required to connect to the Google Sheets to pull data. The google account used to read the Sheets is `[email protected]`. First, Google Sheets API was enabled on the google account. Then, a service account was created so that we are not prompted to sign in every single time we run the script. Instructions on how to do that can be found at [Creating a service account](https://github.com/googleapis/google-api-python-client/blob/master/docs/oauth-server.md#creating-a-service-account). Go to the `Service accounts` page from there, select the `Quickstart` project and click on the `Search for APIs and Services` bar to generate credentials. Copy the credentials and paste it on a `key.json` file located in the same directory as the script. The `key.json` file should look something like this:
First, the Google Sheets API was enabled on the Google account. Then, a service account was created so that we are not prompted to sign in every time we run the script. Instructions on how to do that can be found at [Creating a service account](https://github.com/googleapis/google-api-python-client/blob/master/docs/oauth-server.md#creating-a-service-account). Go to the `Service accounts` page from there, select the `Quickstart` project and click on the `Search for APIs and Services` bar to generate credentials. Copy the credentials and paste them into a `key.json` file located in the same directory as the script. The `key.json` file should look something like this:

```json
"type": "service_account",
@@ -79,42 +145,3 @@ Initially, a credential file (named `key.json` in the script) was required to co
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url":
```
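Before running the script, it can be worth checking that the pasted `key.json` is a complete JSON object. The sketch below uses only the standard library. The list of required fields is an assumption based on the usual layout of a Google service account key, of which only `type`, `auth_provider_x509_cert_url` and `client_x509_cert_url` appear in the excerpt above. The function name `missing_key_fields` is hypothetical.

```python
import json

# Assumed field list for a service account key file.
REQUIRED_FIELDS = {
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
    "auth_uri",
    "token_uri",
    "auth_provider_x509_cert_url",
    "client_x509_cert_url",
}


def missing_key_fields(raw: str) -> set[str]:
    """Return the expected fields that are absent from the key file contents.

    Raises json.JSONDecodeError if the text is not valid JSON, for example
    when the enclosing braces were lost while copying.
    """
    data = json.loads(raw)
    return REQUIRED_FIELDS - set(data)


if __name__ == "__main__":
    partial = '{"type": "service_account", "client_email": "someone@example.com"}'
    print(sorted(missing_key_fields(partial)))
```

An empty result means all the assumed fields are present. It does not confirm that the key is still valid.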

Currently, these credentials are stored in an encrypted Airflow connection: `vz_api_google`.

## 5. Adding a new year

Follow these steps to read in another spreadsheet for year `yyyy`.

### 5.1 Create a New PostgreSQL Table

Create an empty table `vz_safety_programs_staging.school_safety_zone_yyyy_raw`, where `yyyy` is the year to be stored, as a child of parent table `vz_safety_programs_staging.school_safety_zone_raw_parent`. Follow the format of the existing child tables (e.g. `vz_safety_programs_staging.school_safety_zone_2018_raw`) and declare the inheritance:

```SQL
CREATE TABLE vz_safety_programs_staging.school_safety_zone_yyyy_raw (
like vz_safety_programs_staging.school_safety_zone_2018_raw
including all
) INHERITS (vz_safety_programs_staging.school_safety_zone_raw_parent);
```

### 5.2 Add the New Google Sheet to Airflow

Add the new year details to the Airflow variable `ssz_spreadsheets` as described [above](#2-the-automated-data-pipeline) so that the DAG would start pulling its data.


## 6. Table generated
The script reads information from columns A, B, E, F, Y, Z, AA, AB which are as shown below

|SCHOOL NAME|ADDRESS|FLASHING BEACON W/O|WYSS W/O|School Coordinate (X,Y)|Final Sign Installation Date|FB Locations (X,Y)|WYS Locations (X,Y)|
|-----------|-------|-------------|---------------|--------------|-----------------------|------------|--------------|
|AGINCOURT JUNIOR PUBLIC SCHOOL|29 Lockie Ave|9239020|9239021|43.788456, -79.281118|January 9, 2019|43.786566, -79.279023|43.787530, -79.279456|

from the Google Sheets and put them into postgres tables with the following fields (all in data type text):

|school_name|address|work_order_fb|work_order_wyss|locations_zone|final_sign_installation|locations_fb|locations_wyss|
|-----------|-------|-------------|---------------|--------------|-----------------------|------------|--------------|
|AGINCOURT JUNIOR PUBLIC SCHOOL|29 Lockie Ave|9239020|9239021|43.788456, -79.281118|January 9, 2019|43.786566, -79.279023|43.787530, -79.279456|

**Notes:**
* The Google Sheets API do not read any row with empty cells at the beginning or end of the row or just an entire row of empty cells. It will log an error when that happens.
* The script being used reads up to line 180 although the actual data is less than that. This is to anticipate extra schools which might be added into the sheets in the future.