Commit a9ed240
Revised input parameters into AWS S3 Object ARN instead of separate bucket and key; a necessary change since referencing via ARNs is the only reliable way the builder can be triggered via EventBridge on CloudTrail events.
cpcundill committed Jul 16, 2024
1 parent 44435a7 commit a9ed240
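
As context for the EventBridge rationale above, a minimal sketch of the kind of rule that could trigger the builder — the rule name and matched event names are illustrative assumptions, not taken from this commit:

    # Hypothetical rule; assumes CloudTrail is recording S3 data events for the bucket.
    aws events put-rule \
      --name sqlite-dataset-uploaded \
      --event-pattern '{
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
          "eventSource": ["s3.amazonaws.com"],
          "eventName": ["PutObject", "CompleteMultipartUpload"]
        }
      }'
    # A matched event carries the uploaded object's ARN in detail.resources,
    # which the rule's target can map onto the task's S3_OBJECT_ARN variable.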
Showing 4 changed files with 28 additions and 15 deletions.
14 changes: 8 additions & 6 deletions README.md
@@ -3,11 +3,10 @@
## Loading data into postgres in AWS

This repository contains code that is used as a runnable task in ECS. The
-entry point [task/load.sh](task/load.sh) expects environment variables
-are set for:
+entry point [task/load.sh](task/load.sh) expects an environment variable
+to be set for the S3 object to consume:

-    S3_BUCKET=some-bucket
-    S3_KEY=some-key
+    S3_OBJECT_ARN=arn:aws:s3:::some-bucket/some-key.sqlite3

These provide a bucket and key to load data from. At the moment the keys are assumed to be sqlite files produced by
the digital land collection process.
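
As a minimal sketch, the task entry point can be exercised directly with such an ARN (placeholder bucket and key, and assuming the database connection settings the task normally runs with are present):

    export S3_OBJECT_ARN=arn:aws:s3:::some-bucket/some-key.sqlite3
    ./task/load.sh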
@@ -22,6 +21,9 @@ To see how the values for bucket and key are extracted have a [look here](https:

## Running locally to load data into local postgres

+Since running locally does not download the Digital Land SQLite database from S3, it is necessary to set the
+$SQLITE_FILE_PATH environment variable rather than $S3_OBJECT_ARN.
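
A minimal illustration, reusing the example path from task/.env.example in this commit:

    export SQLITE_FILE_PATH=digital-land-builder/dataset/digital-land.sqlite3
    ./load_local.sh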

**Prerequisites**

- A running postgres server (tested with PostgreSQL 14)
@@ -39,7 +41,7 @@ application)

With a fresh checkout that file configures the scripts in this repo to load the digital-land database.

-To load the entity database change the S3_KEY to the correct key for the entity sqlite database (see below).
+To load the entity database change $SQLITE_FILE_PATH to the correct path for the entity SQLite database (see below).


2. **Create a virtualenv and install requirements**
@@ -56,7 +58,7 @@ Remember the .env file is already set to load the digital-land db. However in or

6. **Run the load script to load entity database**

-Update the S3_KEY in the .env file to S3_KEY=entity-builder/dataset/entity.sqlite3
+Update SQLITE_FILE_PATH in the .env file to SQLITE_FILE_PATH=entity-builder/dataset/entity.sqlite3

./load_local.sh

4 changes: 2 additions & 2 deletions task/.env.example
@@ -1,2 +1,2 @@
-export S3_BUCKET=digital-land-production-collection-dataset
-export S3_KEY=digital-land-builder/dataset/digital-land.sqlite3
+export S3_OBJECT_ARN=arn:aws:s3:::digital-land-production-collection-dataset/digital-land-builder/dataset/digital-land.sqlite3
+export SQLITE_FILE_PATH=digital-land-builder/dataset/digital-land.sqlite3
13 changes: 12 additions & 1 deletion task/load.sh
@@ -1,5 +1,16 @@
#! /usr/bin/env bash
# need to use the files cdn instead of the bucket name when loading locally without logging into aws

+s3_object_arn_regex="^arn:aws:s3:::([0-9A-Za-z-]*/)(.*)$"
+
+if ! [[ "$S3_OBJECT_ARN" =~ $s3_object_arn_regex ]]; then
+    echo "Received invalid S3 Object ARN: $S3_OBJECT_ARN, skipping"
+    exit 1
+fi
+
+S3_BUCKET=${BASH_REMATCH[1]%/*}
+S3_KEY=${BASH_REMATCH[2]}

DATABASE=${S3_KEY##*/}
export DATABASE_NAME=${DATABASE%.*}
echo "DATABASE NAME: $DATABASE_NAME"
@@ -76,4 +87,4 @@ echo "$EVENT_ID: loading data into postgres"
python3 -m pgload.load --source="$DATABASE_NAME" || \
(echo "$EVENT_ID: failed to load $DATABASE" && exit 1)

echo "$EVENT_ID: loading of $DATABASE_NAME completed successfully"
echo "$EVENT_ID: loading of $DATABASE_NAME completed successfully"
12 changes: 6 additions & 6 deletions task/load_local.sh
@@ -21,10 +21,10 @@ curl -qfsL $SOURCE_URL/specification/main/specification/schema-field.csv > speci


# need to use the files cdn instead of the bucket name when loading locally without logging into aws
-DATABASE=${S3_KEY##*/}
+DATABASE=${SQLITE_FILE_PATH##*/}
export DATABASE_NAME=${DATABASE%.*}
echo "DATABASE NAME: $DATABASE_NAME"
echo "$EVENT_ID: running with settings: S3_KEY=$S3_KEY, DATABASE=$DATABASE, DATABASE_NAME=$DATABASE_NAME"
echo "$EVENT_ID: running with settings: SQLITE_FILE_PATH=$SQLITE_FILE_PATH, DATABASE=$DATABASE, DATABASE_NAME=$DATABASE_NAME"



@@ -35,11 +35,11 @@ echo "$EVENT_ID: running with settings: S3_KEY=$S3_KEY, DATABASE=$DATABASE, DATA


if ! [ -f "$DATABASE_NAME.sqlite3" ]; then
echo "$EVENT_ID: attempting download from https://files.planning.data.gov.uk/$S3_KEY"
if curl --fail --show-error --location "https://files.planning.data.gov.uk/$S3_KEY" > "$DATABASE_NAME.sqlite3"; then
echo "$EVENT_ID: finished downloading from https://files.planning.data.gov.uk/$S3_KEY"
echo "$EVENT_ID: attempting download from https://files.planning.data.gov.uk/$SQLITE_FILE_PATH"
if curl --fail --show-error --location "https://files.planning.data.gov.uk/$SQLITE_FILE_PATH" > "$DATABASE_NAME.sqlite3"; then
echo "$EVENT_ID: finished downloading from https://files.planning.data.gov.uk/$SQLITE_FILE_PATH"
else
echo "$EVENT_ID: failed to download from https://files.planning.data.gov.uk/$S3_KEY"
echo "$EVENT_ID: failed to download from https://files.planning.data.gov.uk/$SQLITE_FILE_PATH"
rm "$DATABASE_NAME.sqlite3" # remove the file if it was created
exit 1
fi
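
With the example value from task/.env.example, the URL above resolves as follows (illustrative):

    SQLITE_FILE_PATH=digital-land-builder/dataset/digital-land.sqlite3
    # -> https://files.planning.data.gov.uk/digital-land-builder/dataset/digital-land.sqlite3
    # -> saved locally as digital-land.sqlite3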
