Update data directory structure #22

Merged
merged 1 commit into from
Feb 27, 2024

jrbourbeau
Member

This PR simplifies the data directory structure a bit to something like this:

data
├── archive
│   ├── customer
│   ├── lineitem
│   └── ...
├── processed
│   ├── customer
│   ├── lineitem
│   └── ...
├── results
│   ├── customer
│   ├── lineitem
│   └── ...
└── staging
    ├── customer
    ├── lineitem
    └── ...
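
As a rough illustration of the layout above, here is a minimal sketch that creates these directories under a local root. The helper name `make_data_dirs` and the two constants are hypothetical and not taken from the PR. Only the two tables named in the tree are included, since the remaining entries are elided.

```python
from pathlib import Path

# Top-level stages taken from the tree above.
STAGES = ("archive", "processed", "results", "staging")
# Only the tables named in the tree; the elided ones ("...") are not guessed.
TABLES = ("customer", "lineitem")


def make_data_dirs(root):
    """Create <root>/<stage>/<table> for every stage and table.

    Hypothetical helper for illustration; returns the list of created paths.
    """
    root = Path(root)
    created = []
    for stage in STAGES:
        for table in TABLES:
            path = root / stage / table
            # parents=True builds the stage directory on first use;
            # exist_ok=True makes repeated calls harmless.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_data_dirs("./data")` would produce the four stage directories, each holding one subdirectory per table.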

jrbourbeau marked this pull request as ready for review February 26, 2024 21:23

jrbourbeau left a comment

A few comments since this diff has some unrelated changes in it

  )
- generate.fn.client.restart()
+ generate.fn.client.restart(wait_for_workers=False)
This is an unrelated bugfix for running locally (xref dask/distributed#8534)

Comment on lines -34 to -48

if str(path).startswith("s3://"):
    session = botocore.session.Session()
    creds = session.get_credentials()
    con.install_extension("httpfs")
    con.load_extension("httpfs")
    con.sql(
        f"""
        SET s3_region='{REGION}';
        SET s3_access_key_id='{creds.access_key}';
        SET s3_secret_access_key='{creds.secret_key}';
        SET s3_session_token='{creds.token}';
        """
    )

This is unrelated to the directory structure. We don't need this S3 credential configuration because we're not reading or writing with DuckDB.

Comment on lines +1 to +6
# Whether to run data-processing tasks locally
# or on the cloud with Coiled.
local: true
# Output location for data files. Can be a local directory
# or a remote path like "s3://path/to/bucket".
data-dir: ./data
Moving this configuration out into a standalone file (related to, but doesn't close, #2)
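
To show how the two settings in that file might be consumed, here is a minimal stdlib-only sketch. The names `parse_config`, `is_remote`, and `EXAMPLE` are hypothetical, and the parser handles only flat `key: value` lines like the ones shown; the project itself may well load the file with a YAML library instead. The `s3://` check mirrors the prefix test in the removed DuckDB snippet.

```python
# Contents of the config file as shown in the diff above.
EXAMPLE = """\
# Whether to run data-processing tasks locally
# or on the cloud with Coiled.
local: true
# Output location for data files. Can be a local directory
# or a remote path like "s3://path/to/bucket".
data-dir: ./data
"""


def parse_config(text):
    """Parse flat `key: value` lines, skipping comments and blank lines.

    Hypothetical helper: not a general YAML parser.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values such as "s3://..." survive.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value in ("true", "false"):
            config[key] = value == "true"
        else:
            config[key] = value.strip("\"'")
    return config


def is_remote(data_dir):
    """Return True when the data directory points at S3 rather than local disk."""
    return str(data_dir).startswith("s3://")


config = parse_config(EXAMPLE)
print(config["local"], config["data-dir"], is_remote(config["data-dir"]))
```

Keeping these settings in a standalone file means switching between a local run and a remote bucket is a configuration change rather than a code change.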

jrbourbeau merged commit 36bcd72 into main Feb 27, 2024
1 check passed
jrbourbeau deleted the rename-dirs branch February 27, 2024 16:34