extraction workflow specification #1

Open
adelavega opened this issue Jul 31, 2024 · 0 comments
adelavega commented Jul 31, 2024

neurostore-text-extraction
Meta-runner scripts handle data inputs from mega-ni-dataset.
Because the input data is split into distinct slices (each identified by a hash), a pipeline operates on specific slices and writes a separate output for each one, keyed by the slice's reference hash plus the hash of the pipeline arguments.

The meta-runner decides whether a pipeline needs to run on an {input_data_hash} by checking whether an output already exists for the given pipeline/arghhash combination.
Optionally, a pipeline may be force re-run, which generates a new timestamped output folder.

  • outputs/
    • {input_data_hash}
      • {pipeline_name}
        • {arghhash-timestamp}
          • features.csv
          • descriptions.csv
          • args.json
          • info.json
  • pipelines/
    • pipeline_name
      • run.py
  • scripts/
    • run_all.py
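
A minimal sketch of the run/skip decision described above, assuming the output layout in this tree. The function names (`hash_args`, `needs_run`, `new_output_dir`), the SHA-256 truncated to 12 characters, and the UTC timestamp format are all illustrative assumptions, not part of the spec.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def hash_args(args: dict, length: int = 12) -> str:
    """Short, stable hash of pipeline arguments (hypothetical scheme).

    Keys are sorted so that argument order does not change the hash.
    """
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def existing_outputs(outputs_root: Path, input_data_hash: str,
                     pipeline_name: str, arg_hash: str) -> list[Path]:
    """Return existing output folders for this pipeline/arg-hash combination."""
    pipeline_dir = outputs_root / input_data_hash / pipeline_name
    if not pipeline_dir.is_dir():
        return []
    prefix = arg_hash + "-"
    return sorted(p for p in pipeline_dir.iterdir()
                  if p.is_dir() and p.name.startswith(prefix))


def needs_run(outputs_root: Path, input_data_hash: str, pipeline_name: str,
              arg_hash: str, force: bool = False) -> bool:
    """Run if forced, or if no output exists yet for this combination."""
    if force:
        return True
    return not existing_outputs(outputs_root, input_data_hash,
                                pipeline_name, arg_hash)


def new_output_dir(outputs_root: Path, input_data_hash: str,
                   pipeline_name: str, args: dict) -> Path:
    """Create a fresh {arg_hash}-{timestamp} folder and record args.json."""
    # Microseconds are included so that repeated forced runs do not collide.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    out = (outputs_root / input_data_hash / pipeline_name
           / f"{hash_args(args)}-{stamp}")
    out.mkdir(parents=True, exist_ok=False)
    (out / "args.json").write_text(json.dumps(args, indent=2, sort_keys=True))
    return out
```

In this sketch a forced re-run never overwrites earlier results: it only adds another timestamped folder next to them, so `existing_outputs` can later return all runs for the same argument hash.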

mega-ni-dataset
Organization of mega-ni-dataset (separate repo)

  • /input_data
    • searches from ACE, pubget and other data from neurostore dataset
  • /processed_data
    • previously called combined_data
    • each folder is a hash_id of the contents
    • when new data is acquired, a new hash_id folder is created from input_data
  • /combined_data
    • for human consumption; combines all the hashes of processed data into a single output
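
The issue does not say how a `hash_id` is derived from a folder's contents. One possible approach, shown here purely as an assumption, is to hash every file's relative path and bytes in sorted order; `hash_directory` and the 12-character truncation are hypothetical.

```python
import hashlib
from pathlib import Path


def hash_directory(root: Path, length: int = 12) -> str:
    """Content hash of a folder: same files and bytes give the same id."""
    digest = hashlib.sha256()
    # Sort by relative POSIX path so the result is independent of
    # filesystem listing order and of the platform's path separator.
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*") if p.is_file()
    )
    for rel, path in files:
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()[:length]
```

With a scheme like this, newly acquired data yields a different id and therefore a new folder under `/processed_data`, while re-processing identical inputs maps back to the existing folder.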