Commit a9ed240
Revised input parameters into AWS S3 Object ARN instead of separate bucket and key; a necessary change since referencing via ARNs is the only reliable way the builder can be triggered via EventBridge on CloudTrail events.
cpcundill committed Jul 16, 2024
1 parent 44435a7 commit a9ed240
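
As context for the EventBridge rationale above, a minimal sketch of the kind of rule that could trigger the builder — the rule name and matched event names are illustrative assumptions, not taken from this commit:

    # Hypothetical rule; assumes CloudTrail is recording S3 data events for the bucket.
    aws events put-rule \
      --name sqlite-dataset-uploaded \
      --event-pattern '{
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
          "eventSource": ["s3.amazonaws.com"],
          "eventName": ["PutObject", "CompleteMultipartUpload"]
        }
      }'
    # A matched event carries the uploaded object's ARN in detail.resources,
    # which the rule's target can map onto the task's S3_OBJECT_ARN variable.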
Showing 4 changed files with 28 additions and 15 deletions.
14 changes: 8 additions & 6 deletions README.md
@@ -3,11 +3,10 @@
## Loading data into postgres in AWS

This repository contains code that is used as a runnable task in ECS. The
-entry point [task/load.sh](task/load.sh) expects environment variables
-are set for:
+entry point [task/load.sh](task/load.sh) expects an environment variable
+to be set for the S3 object to consume:

-    S3_BUCKET=some-bucket
-    S3_KEY=some-key
+    S3_OBJECT_ARN=arn:aws:s3:::some-bucket/some-key.sqlite3

These provide a bucket and key to load data from. At the moment the keys are assumed to be sqlite files produced by
the digital land collection process.
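
As a minimal sketch, the task entry point can be exercised directly with such an ARN (placeholder bucket and key, and assuming the database connection settings the task normally runs with are present):

    export S3_OBJECT_ARN=arn:aws:s3:::some-bucket/some-key.sqlite3
    ./task/load.sh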
@@ -22,6 +21,9 @@ To see how the values for bucket and key are extracted have a [look here](https:

## Running locally to load data into local postgres

+Since running locally does not download the Digital Land SQLite database from S3, it is necessary to set the
+$SQLITE_FILE_PATH environment variable rather than $S3_OBJECT_ARN.
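
A minimal illustration, reusing the example path from task/.env.example in this commit:

    export SQLITE_FILE_PATH=digital-land-builder/dataset/digital-land.sqlite3
    ./load_local.sh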

**Prerequisites**

- A running postgres server (tested with PostgreSQL 14)
@@ -39,7 +41,7 @@ application)

With a fresh checkout that file configures the scripts in this repo to load the digital-land database.

-To load the entity database change the S3_KEY to the correct key for the entity sqlite database (see below).
+To load the entity database change $SQLITE_FILE_PATH to the correct path for the entity SQLite database (see below).


2. **Create a virtualenv and install requirements**
@@ -56,7 +58,7 @@ Remember the .env file is already set to load the digital-land db. However in or

6. **Run the load script to load entity database**

-Update the S3_KEY in the .env file to S3_KEY=entity-builder/dataset/entity.sqlite3
+Update SQLITE_FILE_PATH in the .env file to SQLITE_FILE_PATH=entity-builder/dataset/entity.sqlite3

./load_local.sh

4 changes: 2 additions & 2 deletions task/.env.example
@@ -1,2 +1,2 @@
-export S3_BUCKET=digital-land-production-collection-dataset
-export S3_KEY=digital-land-builder/dataset/digital-land.sqlite3
+export S3_OBJECT_ARN=arn:aws:s3:::digital-land-production-collection-dataset/digital-land-builder/dataset/digital-land.sqlite3
+export SQLITE_FILE_PATH=digital-land-builder/dataset/digital-land.sqlite3
13 changes: 12 additions & 1 deletion task/load.sh
@@ -1,5 +1,16 @@
#! /usr/bin/env bash
# need to use the files cdn instead of the bucket name when loading locally without logging into aws

+s3_object_arn_regex="^arn:aws:s3:::([0-9A-Za-z-]*/)(.*)$"
+
+if ! [[ "$S3_OBJECT_ARN" =~ $s3_object_arn_regex ]]; then
+    echo "Received invalid S3 Object ARN: $S3_OBJECT_ARN, skipping"
+    exit 1
+fi
+
+S3_BUCKET=${BASH_REMATCH[1]%/*}
+S3_KEY=${BASH_REMATCH[2]}

DATABASE=${S3_KEY##*/}
export DATABASE_NAME=${DATABASE%.*}
echo "DATABASE NAME: $DATABASE_NAME"
@@ -76,4 +87,4 @@ echo "$EVENT_ID: loading data into postgres"
python3 -m pgload.load --source="$DATABASE_NAME" || \
(echo "$EVENT_ID: failed to load $DATABASE" && exit 1)

echo "$EVENT_ID: loading of $DATABASE_NAME completed successfully"
echo "$EVENT_ID: loading of $DATABASE_NAME completed successfully"
12 changes: 6 additions & 6 deletions task/load_local.sh
@@ -21,10 +21,10 @@ curl -qfsL $SOURCE_URL/specification/main/specification/schema-field.csv > speci


# need to use the files cdn instead of the bucket name when loading locally without logging into aws
-DATABASE=${S3_KEY##*/}
+DATABASE=${SQLITE_FILE_PATH##*/}
export DATABASE_NAME=${DATABASE%.*}
echo "DATABASE NAME: $DATABASE_NAME"
echo "$EVENT_ID: running with settings: S3_KEY=$S3_KEY, DATABASE=$DATABASE, DATABASE_NAME=$DATABASE_NAME"
echo "$EVENT_ID: running with settings: SQLITE_FILE_PATH=$SQLITE_FILE_PATH, DATABASE=$DATABASE, DATABASE_NAME=$DATABASE_NAME"



@@ -35,11 +35,11 @@ echo "$EVENT_ID: running with settings: S3_KEY=$S3_KEY, DATABASE=$DATABASE, DATA


if ! [ -f "$DATABASE_NAME.sqlite3" ]; then
echo "$EVENT_ID: attempting download from https://files.planning.data.gov.uk/$S3_KEY"
if curl --fail --show-error --location "https://files.planning.data.gov.uk/$S3_KEY" > "$DATABASE_NAME.sqlite3"; then
echo "$EVENT_ID: finished downloading from https://files.planning.data.gov.uk/$S3_KEY"
echo "$EVENT_ID: attempting download from https://files.planning.data.gov.uk/$SQLITE_FILE_PATH"
if curl --fail --show-error --location "https://files.planning.data.gov.uk/$SQLITE_FILE_PATH" > "$DATABASE_NAME.sqlite3"; then
echo "$EVENT_ID: finished downloading from https://files.planning.data.gov.uk/$SQLITE_FILE_PATH"
else
echo "$EVENT_ID: failed to download from https://files.planning.data.gov.uk/$S3_KEY"
echo "$EVENT_ID: failed to download from https://files.planning.data.gov.uk/$SQLITE_FILE_PATH"
rm "$DATABASE_NAME.sqlite3" # remove the file if it was created
exit 1
fi
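
With the example value from task/.env.example, the URL above resolves as follows (illustrative):

    SQLITE_FILE_PATH=digital-land-builder/dataset/digital-land.sqlite3
    # -> https://files.planning.data.gov.uk/digital-land-builder/dataset/digital-land.sqlite3
    # -> saved locally as digital-land.sqlite3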
