Revised bucket input parameters #16

Merged · 2 commits · Jul 16, 2024
15 changes: 9 additions & 6 deletions README.md
@@ -3,11 +3,10 @@
## Loading data into postgres in AWS

This repository contains code that is used as a runnable task in ECS. The
-entry point [task/load.sh](task/load.sh) expects environment variables
-are set for:
+entry point [task/load.sh](task/load.sh) expects an environment variable
+to be set for the S3 object to consume:

-    S3_BUCKET=some-bucket
-    S3_KEY=some-key
+    S3_OBJECT_ARN=arn:aws:s3:::some-bucket/some-key.sqlite3

These provide a bucket and key to load data from. At the moment the keys are assumed to be sqlite files produced by
the digital land collection process.
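As a quick illustration of the revised interface, a minimal sketch of invoking the entry point (the bucket and key values are illustrative):

    # Illustrative only: the task now reads one variable and derives bucket/key itself
    export S3_OBJECT_ARN=arn:aws:s3:::some-bucket/some-key.sqlite3
    ./task/load.sh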
@@ -22,6 +21,10 @@ To see how the values for bucket and key are extracted have a [look here](https:

## Running locally to load data into local postgres

+Running locally does not download the Digital Land Sqlite database from S3 directly but via a CDN, so it is
+necessary to ensure the $S3_OBJECT_ARN contains the correct file path. The bucket name portion of the ARN will
+be ignored and the file path will be appended to https://files.planning.data.gov.uk/.
+
**Prerequisites**

- A running postgres server (tested with PostgreSQL 14)
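To make the CDN mapping concrete, a minimal sketch of the substitution described above (bucket name illustrative; only the key survives):

    # Sketch: a local run discards the bucket and appends the key to the CDN host
    S3_OBJECT_ARN="arn:aws:s3:::some-bucket/entity-builder/dataset/entity.sqlite3"
    key="${S3_OBJECT_ARN#arn:aws:s3:::*/}"   # strip the ARN prefix and bucket portion
    echo "https://files.planning.data.gov.uk/${key}"
    # -> https://files.planning.data.gov.uk/entity-builder/dataset/entity.sqlite3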
@@ -39,7 +42,7 @@ application)

With a fresh checkout that file configures the scripts in this repo to load the digital-land database.

-To load the entity database change the S3_KEY to the correct key for the entity sqlite database (see below).
+To load the entity database ensure the $S3_OBJECT_ARN has the correct key for the entity sqlite database (see below).


2. **Create a virtualenv and install requirements**
@@ -56,7 +59,7 @@ Remember the .env file is already set to load the digital-land db. However in or

6. **Run the load script to load entity database**

-Update the S3_KEY in the .env file to S3_KEY=entity-builder/dataset/entity.sqlite3
+Update the $S3_OBJECT_ARN in the .env file to $S3_OBJECT_ARN=arn:aws:s3:::placeholder/entity-builder/dataset/entity.sqlite3

./load_local.sh

3 changes: 1 addition & 2 deletions task/.env.example
@@ -1,2 +1 @@
-export S3_BUCKET=digital-land-production-collection-dataset
-export S3_KEY=digital-land-builder/dataset/digital-land.sqlite3
+export S3_OBJECT_ARN=arn:aws:s3:::digital-land-production-collection-dataset/digital-land-builder/dataset/digital-land.sqlite3
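One way to pick these settings up for a local run is to source the file, since it already exports the variable (a sketch, assuming commands are run from task/, as the README's ./load_local.sh step suggests):

    # Sketch: load the example settings into the current shell, then run the local loader
    source .env.example   # the file itself exports S3_OBJECT_ARN
    ./load_local.sh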
13 changes: 12 additions & 1 deletion task/load.sh
@@ -1,5 +1,16 @@
#! /usr/bin/env bash
# need to use the files cdn instead of the bucket name when loading locally without logging into aws

+s3_object_arn_regex="^arn:aws:s3:::([0-9A-Za-z-]*/)(.*)$"
+
+if ! [[ "$S3_OBJECT_ARN" =~ $s3_object_arn_regex ]]; then
+    echo "Received invalid S3 object ARN: $S3_OBJECT_ARN, skipping"
+    exit 1
+fi
+
+S3_BUCKET=${BASH_REMATCH[1]%/*}
+S3_KEY=${BASH_REMATCH[2]}
+
DATABASE=${S3_KEY##*/}
export DATABASE_NAME=${DATABASE%.*}
echo "DATABASE NAME: $DATABASE_NAME"
@@ -76,4 +87,4 @@ echo "$EVENT_ID: loading data into postgres"
python3 -m pgload.load --source="$DATABASE_NAME" || \
(echo "$EVENT_ID: failed to load $DATABASE" && exit 1)

echo "$EVENT_ID: loading of $DATABASE_NAME completed successfully"
echo "$EVENT_ID: loading of $DATABASE_NAME completed successfully"
25 changes: 16 additions & 9 deletions task/load_local.sh
@@ -1,5 +1,21 @@
#! /usr/bin/env bash

+s3_object_arn_regex="^arn:aws:s3:::([0-9A-Za-z-]*/)(.*)$"
+
+if ! [[ "$S3_OBJECT_ARN" =~ $s3_object_arn_regex ]]; then
+    echo "Received invalid S3 object ARN: $S3_OBJECT_ARN, skipping"
+    exit 1
+fi
+
+S3_KEY=${BASH_REMATCH[2]}
+
+# need to use the files cdn instead of the bucket name when loading locally without logging into aws
+DATABASE=${S3_KEY##*/}
+
+export DATABASE_NAME=${DATABASE%.*}
+echo "DATABASE NAME: $DATABASE_NAME"
+echo "$EVENT_ID: running with settings: S3_KEY=$S3_KEY, DATABASE=$DATABASE, DATABASE_NAME=$DATABASE_NAME"
+
# download specification
export SOURCE_URL=https://raw.githubusercontent.com/digital-land/
mkdir -p specification/
@@ -19,15 +35,6 @@ curl -qfsL $SOURCE_URL/specification/main/specification/dataset-schema.csv > spe
curl -qfsL $SOURCE_URL/specification/main/specification/schema.csv > specification/schema.csv
curl -qfsL $SOURCE_URL/specification/main/specification/schema-field.csv > specification/schema-field.csv


-# need to use the files cdn instead of the bucket name when loading locally without logging into aws
-DATABASE=${S3_KEY##*/}
-export DATABASE_NAME=${DATABASE%.*}
-echo "DATABASE NAME: $DATABASE_NAME"
-echo "$EVENT_ID: running with settings: S3_KEY=$S3_KEY, DATABASE=$DATABASE, DATABASE_NAME=$DATABASE_NAME"



# if [[ $DATABASE_NAME != "entity" && $DATABASE_NAME != "digital-land" ]]; then
# echo "$EVENT_ID: wrong database, skipping"
# exit 1
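A quick way to exercise the new validation from a shell, with a deliberately bad value (hypothetical; the script prints the error and exits 1 before downloading anything):

    # Hypothetical smoke test of the ARN check
    S3_OBJECT_ARN="not-an-arn" ./load_local.sh
    # expected output: Received invalid S3 object ARN: not-an-arn, skipping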