This is a toy example of a long-running job, demonstrating how to make your job compatible with Kelpie.
The job counts to 50, with a 5-second sleep between steps. Break out the supercomputers. Trivial as it is, it demonstrates how to structure a job to take advantage of Kelpie.
The job reads the environment variables set by Kelpie:
```python
import os
import time

input_dir = os.getenv('INPUT_DIR')
output_dir = os.getenv('OUTPUT_DIR')
checkpoint_dir = os.getenv('CHECKPOINT_DIR')
```
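For local testing without Kelpie, you can set these variables yourself. A sketch, where the directory paths and the script name `main.py` are arbitrary examples:

```bash
mkdir -p /tmp/job/input /tmp/job/output /tmp/job/checkpoint
INPUT_DIR=/tmp/job/input \
OUTPUT_DIR=/tmp/job/output \
CHECKPOINT_DIR=/tmp/job/checkpoint \
python main.py
```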
Before starting from the beginning, the job tries to resume from a checkpoint in the checkpoint directory:
```python
checkpoint_file = os.path.join(checkpoint_dir, 'count.txt')

try:
    with open(checkpoint_file, 'r') as f:
        current_count = int(f.read())
    print(f'Resuming from {current_count}', flush=True)
except FileNotFoundError:
    current_count = 0
```
The job then loops until the count reaches 50, saving its progress to the checkpoint directory after each chunk of "work":
```python
while current_count < 50:
    print(f'Processing {current_count}', flush=True)
    time.sleep(5)  # The "work"
    current_count += 1
    with open(checkpoint_file, 'w') as f:
        f.write(str(current_count))
```
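As a side note, if a step were interrupted mid-write, the checkpoint file could be left half-written. A common guard, not something Kelpie requires, is to write to a temporary file and atomically rename it into place. A minimal sketch:

```python
import os
import tempfile

def save_checkpoint(path, count):
    # Write to a temp file in the same directory, then atomically
    # replace the real checkpoint, so a crash never leaves a
    # half-written file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, 'w') as f:
        f.write(str(count))
    os.replace(tmp, path)
```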
When the count reaches 50, the job saves its output in the output directory:
```python
with open(os.path.join(output_dir, 'result.txt'), 'w') as f:
    print(f'Processed {current_count} records', flush=True)
    f.write(f'Processed {current_count} records')
```
The job exits normally, with the default exit code of 0.
This job is built from `Dockerfile.base`. Kelpie is then added on top of that base image in `Dockerfile.kelpie`.
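A sketch of that layering, not the actual `Dockerfile.kelpie` in this repo (which is authoritative); `my-base-image` stands in for the image built from `Dockerfile.base`, and the download URL is a placeholder:

```dockerfile
# Sketch only: "my-base-image" and the download URL are placeholders.
FROM my-base-image:latest
ADD https://example.com/releases/kelpie /usr/local/bin/kelpie
RUN chmod +x /usr/local/bin/kelpie
# kelpie runs as the container's main process and launches the job.
CMD ["kelpie"]
```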
Then, when running the container, you add some environment variables that are specific to Kelpie, and some for your S3-compatible storage. You can see an example in `docker-compose.yml`.
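A minimal sketch of what that configuration might look like; the `KELPIE_*` variable names here are illustrative placeholders, the `AWS_*` names are the standard AWS SDK variables, and the repo's `docker-compose.yml` is authoritative:

```yaml
# Illustrative only -- see docker-compose.yml in this repo for the real
# configuration. The KELPIE_* names are placeholders.
services:
  job:
    build:
      context: .
      dockerfile: Dockerfile.kelpie
    environment:
      KELPIE_API_URL: https://kelpie.example.com   # placeholder
      KELPIE_API_KEY: ${KELPIE_API_KEY}            # placeholder
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      AWS_ENDPOINT_URL: https://s3.example.com     # your S3-compatible endpoint
```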