Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate Avro schema from AnVIL schema #6109

Closed
nadove-ucsc opened this issue Mar 28, 2024 · 6 comments
Closed

Generate Avro schema from AnVIL schema #6109

nadove-ucsc opened this issue Mar 28, 2024 · 6 comments
Assignees
Labels
- [priority] Medium demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team enh [type] New feature or request manifests [subject] Generation and contents of manifests orange [process] Done by the Azul team

Comments

@nadove-ucsc
Copy link
Contributor

nadove-ucsc commented Mar 28, 2024

Description: #6109 (comment)

@github-actions github-actions bot added the orange [process] Done by the Azul team label Mar 28, 2024
@dsotirho-ucsc dsotirho-ucsc changed the title Consider generating avro schema from AnVIL schema Consider generating Avro schema from AnVIL schema Apr 1, 2024
@dsotirho-ucsc
Copy link
Contributor

Assignee to consider next steps.

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Apr 3, 2024

The current implementation generates the schema dynamically, based on the replica documents in the manifest currently being generated. This means that the schema could be different between manifests, depending on which files match the filter and which parts of the schema are expressed by the replicas connected to those files. This could lead to missing tables and missing columns in the workspace the manifest is exported to.

While that may be considered by some a cosmetic issue or even a feature, entity relations can only be reliably expressed in the AvroPFB schema when deriving it from the foreign-key relationships in the expressed AnVIL schema. IOW, this is blocking #6066.

@hannes-ucsc hannes-ucsc changed the title Consider generating Avro schema from AnVIL schema Generating Avro schema from AnVIL schema Apr 3, 2024
@hannes-ucsc hannes-ucsc added manifests [subject] Generation and contents of manifests - [priority] Medium enh [type] New feature or request labels Apr 3, 2024
@hannes-ucsc hannes-ucsc removed their assignment Apr 3, 2024
@hannes-ucsc hannes-ucsc changed the title Generating Avro schema from AnVIL schema Generate Avro schema from AnVIL schema Apr 4, 2024
nadove-ucsc added a commit that referenced this issue May 8, 2024
dsotirho-ucsc pushed a commit that referenced this issue May 10, 2024
@nadove-ucsc nadove-ucsc self-assigned this May 16, 2024
@hannes-ucsc
Copy link
Member

For demo, perform two small handovers (the fewer rows, the better) for two different datasets, into two distinct workspaces. Show that the table schema is identical in both workspaces.

@hannes-ucsc hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Jun 7, 2024
@nadove-ucsc
Copy link
Contributor Author

@hannes-ucsc: "This was successfully demoed, albeit with different instructions. We had to look at the PFB schema and contents because Terra appears to drop empty tables and empty columns. An empty column is a scalar nullable column that is null for all rows. Assignee to contact the Broad to inquire as to whether this is intentional because it more or less counteracts our intent behind this issue."

@nadove-ucsc nadove-ucsc added the demoed [process] Successfully demonstrated to team label Jun 18, 2024
@nadove-ucsc
Copy link
Contributor Author

Screencaps of two Terra workspaces showing inconsistent tables and columns:

Screenshot from 2024-06-18 16-05-26
Screenshot from 2024-06-18 16-06-07

Manifests reproducible in catalog anvil6 using the filters
{"biosamples.anatomical_site": {"is": ["Umbilical Cord"]}}
and
{"biosamples.anatomical_site": {"is": ["Not Listed"]}}
respectively.

Creating manifests with the same filters via the Swagger UI and inspecting the resulting PFB files on the command line confirmed that their schemas were identical.

The tables missing from the first workspace (anvil_activity and anvil_diagnosis) are defined in that manifest's schema:

$ pfb show -i umbilical-cord.avro schema | jq '.[] | select(.name == "anvil_activity")'
{
  "type": "record",
  "name": "anvil_activity",
  "fields": [
    {
      "namespace": "anvil_activity",
      "name": "activity_id",
      "type": "string"
    },
    {
...

And the missing columns from the biosample table are present in both the schema and the biosample entities:

$ pfb show -i umbilical-cord.avro schema | jq '.[] | select(.name == "anvil_biosample") | .fields[].name'
"anatomical_site"
"apriori_cell_type"
"biosample_id"
"biosample_type"
"datarepo_row_id"
"disease"
"donor_age_at_collection_lower_bound"
"donor_age_at_collection_unit"
"donor_age_at_collection_upper_bound"
"donor_id"
"part_of_dataset_id"
"source_datarepo_row_ids"
$ pfb show -i umbilical-cord.avro | jq 'select(.name == "anvil_biosample") | .object | keys'
[
  "anatomical_site",
  "apriori_cell_type",
  "biosample_id",
  "biosample_type",
  "datarepo_row_id",
  "disease",
  "donor_age_at_collection_lower_bound",
  "donor_age_at_collection_unit",
  "donor_age_at_collection_upper_bound",
  "donor_id",
  "part_of_dataset_id",
  "source_datarepo_row_ids"
]
[
...

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Jul 2, 2024

I've alerted the BI about this on Slack. From the Azul perspective, this feature is complete and the issue can be closed, after reassigning the implementor and moving it to Stable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
- [priority] Medium demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team enh [type] New feature or request manifests [subject] Generation and contents of manifests orange [process] Done by the Azul team
Projects
None yet
Development

No branches or pull requests

3 participants