
Index orphaned replicas (#6626) #6627

Merged

Conversation

nadove-ucsc
Contributor

@nadove-ucsc nadove-ucsc commented Oct 10, 2024

Connected issues: #6626

Checklist

Author

  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • On ZenHub, PR is connected to all issues it (partially) resolves
  • PR description links to connected issues
  • PR title matches¹ that of a connected issue or comment in PR explains why they're different
  • PR title references all connected issues
  • For each connected issue, there is at least one commit whose title references that issue

¹ When the issue title describes a problem, the corresponding PR
title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all connected issues
  • This PR partially resolves each of the connected issues or does not have the partial label

Author (chains)

  • This PR is blocked by previous PR in the chain or is not chained to another PR
  • The blocking PR is labeled base or this PR is not chained to another PR
  • This PR is labeled chained or is not chained to another PR

Author (reindex, API changes)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod
  • This PR and its connected issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any connected issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues connected to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed old fixups
  • Ran make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile and Dockerfile
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome

Peer reviewer (after approval)

  • PR is not a draft
  • Ticket is in Review requested column
  • PR is awaiting requested review from system administrator
  • PR is assigned to only the system administrator

System administrator (after approval)

  • Actually approved the PR
  • Labeled connected issues as demo or no demo
  • Commented on connected issues about demo expectations or all connected issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Moved connected issues to Approved column
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all connected issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub
  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator or this PR is not labeled deploy:gitlab

System administrator

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise cause unreferenced indices in dev
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise cause unreferenced indices in anvildev
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Moved connected issues to Merged lower column in ZenHub
  • Moved blocked issues to Triage or no issues are blocked on the connected issues
  • Pushed merge commit to GitHub

Operator (chain shortening)

  • Changed the target branch of the blocked PR to develop or this PR is not labeled base
  • Removed the chained label from the blocked PR or this PR is not labeled base
  • Removed the blocking relationship from the blocked PR or this PR is not labeled base
  • Removed the base label from this PR or this PR is not labeled base

Operator (after pushing the merge commit)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@nadove-ucsc nadove-ucsc added chained [process] PR needs to be based off develop before merging reindex:anvildev [process] PR requires reindexing anvildev reindex:anvilprod [process] PR requires reindexing anvilprod labels Oct 10, 2024
@github-actions github-actions bot added the orange [process] Done by the Azul team label Oct 10, 2024
@nadove-ucsc nadove-ucsc changed the base branch from develop to issues/nadove-ucsc/6615-use-datasets-projects-as-hub-ids October 10, 2024 01:36

codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 89.62264% with 22 lines in your changes missing coverage. Please review.

Project coverage is 85.57%. Comparing base (cfe7acf) to head (9b6cf31).
Report is 11 commits behind head on develop.

Files with missing lines Patch % Lines
src/azul/plugins/repository/tdr_anvil/__init__.py 94.44% 7 Missing ⚠️
test/integration_test.py 0.00% 5 Missing ⚠️
src/azul/plugins/metadata/anvil/bundle.py 50.00% 3 Missing ⚠️
src/azul/plugins/repository/tdr_hca/__init__.py 50.00% 3 Missing ⚠️
src/azul/plugins/repository/canned/__init__.py 50.00% 2 Missing ⚠️
src/azul/plugins/__init__.py 50.00% 1 Missing ⚠️
src/azul/terra.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6627      +/-   ##
===========================================
+ Coverage    85.50%   85.57%   +0.07%     
===========================================
  Files          155      155              
  Lines        20758    20874     +116     
===========================================
+ Hits         17749    17863     +114     
- Misses        3009     3011       +2     


@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6615-use-datasets-projects-as-hub-ids branch 3 times, most recently from daf9bfa to 7e3e665 Compare October 11, 2024 18:50
@nadove-ucsc nadove-ucsc changed the base branch from issues/nadove-ucsc/6615-use-datasets-projects-as-hub-ids to develop October 11, 2024 21:53
@nadove-ucsc nadove-ucsc added base [process] Another PR needs to be rebased before merging this one and removed chained [process] PR needs to be based off develop before merging labels Oct 11, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6626-index-orphaned-replicas branch 3 times, most recently from d0edc4b to d207786 Compare October 12, 2024 00:57
@coveralls

coveralls commented Oct 12, 2024

Coverage Status

coverage: 85.593% (+0.07%) from 85.522%
when pulling 9b6cf31 on issues/nadove-ucsc/6626-index-orphaned-replicas
into cfe7acf on develop.

@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6626-index-orphaned-replicas branch 12 times, most recently from c8b0767 to a95de72 Compare October 17, 2024 01:33
@nadove-ucsc nadove-ucsc added the reindex:dev [process] PR requires reindexing dev label Oct 17, 2024
@nadove-ucsc nadove-ucsc marked this pull request as ready for review November 7, 2024 21:53
hannes-ucsc
hannes-ucsc previously approved these changes Nov 8, 2024
Member

@hannes-ucsc hannes-ucsc left a comment


No showstoppers, approved.

For #6691:

Index: test/integration_test.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/integration_test.py b/test/integration_test.py
--- a/test/integration_test.py	(revision 251c79e7791982fef83293ee40f83be8694466ea)
+++ b/test/integration_test.py	(date 1731020545236)
@@ -1905,7 +1905,11 @@
         source = self._choose_source(catalog)
         # The plugin will raise an exception if the source lacks a prefix
         source = source.with_prefix(Prefix.of_everything)
-        bundle_fqids = self.repository_plugin(catalog).list_bundles(source, '')
+        # REVIEW: We had issues with this part of the test being surprisingly
+        #         slow. We should make sure that the removal of log statements
+        #         from list_bundles doesn't make it harder for us to diagnose
+        #         these types of issues. Maybe we should use the client here.
+        bundle_fqids = self.repository_plugin(catalog).list_bundles(source, prefix='')
         return self.random.choice(sorted(bundle_fqids))
 
     def _can_bundle(self,
Index: src/azul/plugins/repository/canned/__init__.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/repository/canned/__init__.py b/src/azul/plugins/repository/canned/__init__.py
--- a/src/azul/plugins/repository/canned/__init__.py	(revision 251c79e7791982fef83293ee40f83be8694466ea)
+++ b/src/azul/plugins/repository/canned/__init__.py	(date 1731019797504)
@@ -26,9 +26,6 @@
 from furl import (
     furl,
 )
-from more_itertools import (
-    ilen,
-)
 
 from azul import (
     CatalogName,
@@ -165,11 +162,11 @@
 
     def count_bundles(self, source: SOURCE_SPEC) -> int:
         staging_area = self.staging_area(source.spec.name)
-        return ilen(
-            links_id
-            for links_id in staging_area.links
-            if source.prefix is None or links_id.startswith(source.prefix.common)
-        )
+        if source.prefix is None:
+            return len(staging_area.links)
+        else:
+            prefix = source.prefix.common
+            return sum(1 for links_id in staging_area.links if links_id.startswith(prefix))
 
     def list_bundles(self,
                      source: CannedSourceRef,
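The refactored `count_bundles` above replaces the `ilen`-based generator with an explicit branch on `source.prefix`. A standalone sketch of that counting logic (names are hypothetical stand-ins for `staging_area.links` and `source.prefix.common`, simplified from the patch):

```python
def count_with_prefix(link_ids, common_prefix=None):
    # Hypothetical stand-in for count_bundles: `link_ids` plays the role of
    # staging_area.links, `common_prefix` that of source.prefix.common.
    if common_prefix is None:
        # No prefix filter: every link ID counts.
        return len(link_ids)
    else:
        # Count only the IDs that start with the common prefix.
        return sum(1 for link_id in link_ids
                   if link_id.startswith(common_prefix))
```

The `None` branch avoids iterating at all when no filter applies, which is the small optimization the suggested change makes explicit.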
Index: src/azul/plugins/metadata/anvil/bundle.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/anvil/bundle.py b/src/azul/plugins/metadata/anvil/bundle.py
--- a/src/azul/plugins/metadata/anvil/bundle.py	(revision 251c79e7791982fef83293ee40f83be8694466ea)
+++ b/src/azul/plugins/metadata/anvil/bundle.py	(date 1731025228121)
@@ -130,29 +130,27 @@
         pass
 
     def to_json(self) -> MutableJSON:
-        def serialize_entities(entities):
+        def to_json(entities):
             return {
                 str(entity_ref): entity
                 for entity_ref, entity in sorted(entities.items())
             }
 
         return {
-            'entities': serialize_entities(self.entities),
-            'orphans': serialize_entities(self.orphans),
+            'entities': to_json(self.entities),
+            'orphans': to_json(self.orphans),
             'links': [link.to_json() for link in sorted(self.links)]
         }
 
     @classmethod
-    def from_json(cls, fqid: BUNDLE_FQID, json_: JSON) -> Self:
-        def deserialize_entities(json_entities):
+    def from_json(cls, fqid: BUNDLE_FQID, bundle: JSON) -> Self:
+        def from_json(entities):
             return {
                 EntityReference.parse(entity_ref): entity
-                for entity_ref, entity in json_entities.items()
+                for entity_ref, entity in entities.items()
             }
 
-        return cls(
-            fqid=fqid,
-            entities=deserialize_entities(json_['entities']),
-            links=set(map(EntityLink.from_json, json_['links'])),
-            orphans=deserialize_entities(json_['orphans'])
-        )
+        return cls(fqid=fqid,
+                   entities=from_json(bundle['entities']),
+                   links=set(map(EntityLink.from_json, bundle['links'])),
+                   orphans=from_json(bundle['orphans']))
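The `to_json`/`from_json` pair above round-trips entity maps through string keys. A minimal, self-contained sketch of that pattern, where `EntityRef` is a hypothetical simplification of Azul's `EntityReference`:

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class EntityRef:
    # Hypothetical stand-in for EntityReference with a '<type>/<id>' string form.
    entity_type: str
    entity_id: str

    def __str__(self) -> str:
        return f'{self.entity_type}/{self.entity_id}'

    @classmethod
    def parse(cls, ref: str) -> 'EntityRef':
        entity_type, entity_id = ref.split('/')
        return cls(entity_type, entity_id)


def entities_to_json(entities):
    # Serialize with deterministic key ordering, as in the patch's helper.
    return {str(ref): entity for ref, entity in sorted(entities.items())}


def entities_from_json(json_entities):
    # Inverse: parse the string keys back into references.
    return {EntityRef.parse(ref): entity
            for ref, entity in json_entities.items()}
```

Sorting on serialization keeps the JSON form stable across runs, which matters when the serialized bundle is diffed or hashed.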
Index: src/azul/plugins/repository/tdr_anvil/__init__.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/repository/tdr_anvil/__init__.py b/src/azul/plugins/repository/tdr_anvil/__init__.py
--- a/src/azul/plugins/repository/tdr_anvil/__init__.py	(revision 251c79e7791982fef83293ee40f83be8694466ea)
+++ b/src/azul/plugins/repository/tdr_anvil/__init__.py	(date 1731043074073)
@@ -11,7 +11,6 @@
     AbstractSet,
     Callable,
     Iterable,
-    Self,
     cast,
 )
 import uuid
@@ -80,63 +79,83 @@
 
 class BundleType(Enum):
     """
-    AnVIL snapshots have no inherent notion of a "bundle". During indexing, we
-    dynamically construct bundles by querying each table in the snapshot. This
-    class enumerates the tables that require special strategies for listing and
-    fetching their bundles.
+    Unlike HCA, AnVIL has no inherent notion of a "bundle". Its data model is
+    strictly relational: each row in a table represents an entity, each entity
+    has a primary key, and entities reference each other via foreign keys.
+    During indexing, we dynamically construct bundles by querying each table in
+    the snapshot. This class enumerates the tables that require special
+    strategies for listing and fetching their bundles.
 
-    Primary bundles are defined by a biosample entity, termed the bundle entity.
-    Each primary bundle includes all of the bundle entity descendants and all of
-    those those entities' ancestors, which are discovered by iteratively
-    following foreign keys. Biosamples were chosen for this role based on a
-    desirable balance between the size and number of the resulting bundles as
-    well as the degree of overlap between them. The implementation of the graph
-    traversal is tightly coupled to this choice, and switching to a different
-    entity type would require re-implementing much of the Plugin code. Primary
-    bundles consist of at least one biosample (the bundle entity), exactly one
-    dataset, and zero or more other entities of assorted types. Primary bundles
+    Primary bundles are defined by a biosample entity, termed the *bundle
+    entity*. Each primary bundle includes all of the bundle entity's descendants
+    and all of those entities' ancestors. Descendants and ancestors are
+    discovered by iteratively following foreign keys. Biosamples were chosen to
+    act as the bundle entity for primary bundles based on a desirable balance
+    between the size and number of the resulting bundles as well as the degree
+    of overlap between them. The implementation of the graph traversal is
+    tightly coupled to this choice, and switching to a different entity type
+    would require re-implementing much of the Plugin code. Primary bundles
+    consist of at least one biosample (the bundle entity), exactly one dataset
+    entity, and zero or more other entities of assorted types. Primary bundles
     never contain orphans because they are bijective to rows in the biosample
     table.
 
     Supplementary bundles consist of batches of file entities, which may include
-    supplementary files, which lack any foreign keys that associate them with
-    any other entity. Non-supplementary files in the bundle are classified as
-    orphans. The bundle also includes a dataset entity linked to the
+    supplementary files. The latter lack any foreign keys that would associate
+    them with any other entity. Normal (non-supplementary) files in the bundle
+    are classified as orphans.
+
+    REVIEW: That (above) sounds surprising and may need more explanation.
+
+    Each supplementary bundle also includes the dataset entity linked to the
     supplementary files.
 
-    Duos bundles consist of a single dataset entity. This "entity" includes only
+    DUOS bundles consist of a single dataset entity. This "entity" includes only
     the dataset description retrieved from DUOS, while a copy of the BigQuery
     row for this dataset is also included as an orphan. We chose this design
     because there is only one dataset per snapshot, which is referenced in all
     primary and supplementary bundles. Therefore, only one request to DUOS per
-    *snapshot* is necessary, but if `description` is retrieved at the same time
-    as the other dataset fields, we will make one request per *bundle* instead,
-    potentially overloading the DUOS service. Our solution is to retrieve
-    `description` only in a dedicated bundle format, once per snapshot, and
-    merge it with the other dataset fields during aggregation.
+    *snapshot* is necessary. If the DUOS `description` were retrieved at the
+    same time as the other fields of the dataset entity, we would make one
+    request per *bundle* instead, potentially overloading the DUOS service. Our
+    solution is to retrieve `description` only in a bundle of this dedicated
+    DUOS type, once per snapshot, and merge it with the other dataset fields
+    during aggregation.
 
     All other bundles are replica bundles. Replica bundles consist of a batch of
     rows from an arbitrary BigQuery table, which may or may not be described by
     the AnVIL schema. Replica bundles only include orphans and have no links.
+
+    REVIEW: Confusingly worded. I think what we mean is that the replicas are
+            stored in the `orphans` attribute. We may need to find a new name
+            for that attribute.
     """
     primary = 'anvil_biosample'
     supplementary = 'anvil_file'
     duos = 'anvil_dataset'
 
-    def is_batched(self: Self | str) -> bool:
+    # REVIEW: I'm getting type errors and PyCharm warnings with the original approach
+
+    @classmethod
+    def is_batched(cls, table_name: str) -> bool:
         """
-        >>> BundleType.primary.is_batched()
+        True if bundles for the table of the given name represent batches of
+        rows, False if each bundle represents a single row.
+
+        >>> BundleType.primary.is_batched
         False
 
         >>> BundleType.is_batched('anvil_activity')
         True
         """
-        if isinstance(self, str):
-            try:
-                self = BundleType(self)
-            except ValueError:
-                return True
-        return self not in (BundleType.primary, BundleType.duos)
+        return table_name not in (BundleType.primary.value, BundleType.duos.value)
+
+
+# REVIEW: The change from method to attribute may require more changes at the
+#         usage sites
+
+for bundle_type in BundleType:
+    bundle_type.is_batched = BundleType.is_batched(bundle_type.value)
 
 
 class TDRAnvilBundleFQIDJSON(SourcedBundleFQIDJSON):
@@ -245,28 +264,29 @@
         self._assert_source(source)
         bundles = []
         spec = source.spec
+
         if config.duos_service_url is not None:
+            # We intentionally omit the WHERE clause for datasets in order to
+            # verify our assumption that each snapshot only contains rows for a
+            # single dataset. This verification is performed independently and
+            # concurrently for every partition, but only one partition actually
+            # emits the bundle.
             row = one(self._run_sql(f'''
                 SELECT datarepo_row_id
                 FROM {backtick(self._full_table_name(spec, BundleType.duos.value))}
             '''))
             dataset_row_id = row['datarepo_row_id']
-            # We intentionally omit the WHERE clause for datasets in order
-            # to verify our assumption that each snapshot only contains rows
-            # for a single dataset. This verification is performed
-            # independently and concurrently for every partition, but only
-            # one partition actually emits the bundle.
             if dataset_row_id.startswith(prefix):
                 bundle_uuid = change_version(dataset_row_id,
                                              self.datarepo_row_uuid_version,
                                              self.bundle_uuid_version)
-                bundles.append(TDRAnvilBundleFQID(
-                    uuid=bundle_uuid,
-                    version=self._version,
-                    source=source,
-                    table_name=BundleType.duos.value,
-                    batch_prefix=None,
-                ))
+                bundle_fqid = TDRAnvilBundleFQID(uuid=bundle_uuid,
+                                                 version=self._version,
+                                                 source=source,
+                                                 table_name=BundleType.duos.value,
+                                                 batch_prefix=None)
+                bundles.append(bundle_fqid)
+
         for row in self._run_sql(f'''
             SELECT datarepo_row_id
             FROM {backtick(self._full_table_name(spec, BundleType.primary.value))}
@@ -275,24 +295,26 @@
             bundle_uuid = change_version(row['datarepo_row_id'],
                                          self.datarepo_row_uuid_version,
                                          self.bundle_uuid_version)
-            bundles.append(TDRAnvilBundleFQID(
-                uuid=bundle_uuid,
-                version=self._version,
-                source=source,
-                table_name=BundleType.primary.value,
-                batch_prefix=None,
-            ))
+            bundle_fqid = TDRAnvilBundleFQID(uuid=bundle_uuid,
+                                             version=self._version,
+                                             source=source,
+                                             table_name=BundleType.primary.value,
+                                             batch_prefix=None)
+            bundles.append(bundle_fqid)
+
         prefix_lengths_by_table = self._batch_tables(source.spec, prefix)
         for table_name, (batch_prefix_length, _) in prefix_lengths_by_table.items():
             batch_prefixes = Prefix(common=prefix,
                                     partition=batch_prefix_length - len(prefix)).partition_prefixes()
             for batch_prefix in batch_prefixes:
                 bundle_uuid = self._batch_uuid(spec, table_name, batch_prefix)
-                bundles.append(TDRAnvilBundleFQID(uuid=bundle_uuid,
-                                                  version=self._version,
-                                                  source=source,
-                                                  table_name=table_name,
-                                                  batch_prefix=batch_prefix))
+                bundle_fqid = TDRAnvilBundleFQID(uuid=bundle_uuid,
+                                                 version=self._version,
+                                                 source=source,
+                                                 table_name=table_name,
+                                                 batch_prefix=batch_prefix)
+                bundles.append(bundle_fqid)
+
         return bundles
 
     def _emulate_bundle(self, bundle_fqid: TDRAnvilBundleFQID) -> TDRAnvilBundle:
@@ -346,6 +368,11 @@
         table_names = sorted(filter(BundleType.is_batched, self.tdr.list_tables(source)))
         log.info('Calculating batch prefix lengths for partition %r of %d tables '
                  'in source %s', prefix, len(table_names), source)
+
+        # REVIEW: This needs a FIXME. The respective issue should have a
+        #         reproduction, maybe in the form of a diff removing the
+        #         workaround, and the resulting unit test failure.
+
         # The extraneous outer 'SELECT *' works around a bug in BigQuery emulator
         query = ' UNION ALL '.join(f'''(
             SELECT * FROM (
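The suggested replacement for `is_batched` above pairs a classmethod keyed by table name with a per-member flag precomputed in a loop after the class body. A sketch of that pattern on a plain `Enum` (the method is renamed here to sidestep the member attribute shadowing the classmethod of the same name; the enum is a minimal hypothetical version of the one in the patch):

```python
from enum import Enum


class BundleType(Enum):
    # Minimal hypothetical version of the enum in the patch above.
    primary = 'anvil_biosample'
    supplementary = 'anvil_file'
    duos = 'anvil_dataset'

    @classmethod
    def table_is_batched(cls, table_name: str) -> bool:
        # Unknown tables (replica bundles) are batched, too.
        return table_name not in (cls.primary.value, cls.duos.value)


# Enum members are ordinary instances and accept attribute assignment, so each
# member can carry a precomputed flag, mirroring the loop in the patch.
for bundle_type in BundleType:
    bundle_type.is_batched = BundleType.table_is_batched(bundle_type.value)
```

This avoids the `Self | str` union on `self` that the original method used, which is what triggered the type errors and PyCharm warnings the review comment mentions.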
Index: src/azul/indexer/index_service.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/indexer/index_service.py b/src/azul/indexer/index_service.py
--- a/src/azul/indexer/index_service.py	(revision 251c79e7791982fef83293ee40f83be8694466ea)
+++ b/src/azul/indexer/index_service.py	(date 1731044069661)
@@ -212,6 +212,9 @@
         for contributions, replicas in transforms:
             tallies.update(self.contribute(catalog, contributions))
             self.replicate(catalog, replicas)
+
+        # REVIEW: The addition of this conditional seems like an optimization
+        #         that is unrelated to the other changes in that commit
         if tallies:
             self.aggregate(tallies)
 
@@ -237,6 +240,9 @@
             tallies.update(self.contribute(catalog, contributions))
         # FIXME: Replica index does not support deletions
         #        https://github.com/DataBiosphere/azul/issues/5846
+
+        # REVIEW: Should this also be conditional like above?
+
         self.aggregate(tallies)
 
     def deep_transform(self,
Index: src/azul/plugins/metadata/anvil/indexer/transform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/anvil/indexer/transform.py b/src/azul/plugins/metadata/anvil/indexer/transform.py
--- a/src/azul/plugins/metadata/anvil/indexer/transform.py	(revision 251c79e7791982fef83293ee40f83be8694466ea)
+++ b/src/azul/plugins/metadata/anvil/indexer/transform.py	(date 1731025001923)
@@ -169,6 +169,8 @@
             assert False, entity_type
 
     def estimate(self, partition: BundlePartition) -> int:
+        # REVIEW: I don't quite understand the part after "but". *All* orphans will be replicated by one partition?
+
         # Orphans are not considered when deciding whether to partition the
         # bundle, but if the bundle is partitioned then each orphan will be
         # replicated in a single partition
@@ -577,14 +579,16 @@
                   partition: BundlePartition
                   ) -> Iterable[Contribution | Replica]:
         yield from super().transform(partition)
+        # REVIEW: I think *to coalesce* is rarely used in the passive voice,
+        #         as in "The cells are coalesced"; it more commonly appears
+        #         in the active voice, as in "The cells coalesce".
         if config.enable_replicas:
-            # Replicas are only emitted by the file transformer for entities
-            # that are linked to at least one file. This excludes all orphans,
-            # and a small number of linked entities, usually from primary
-            # bundles don't include any files. Some of the replicas we emit here
-            # will be redundant with those emitted by the file transformer, but
-            # these will be coalesced by the index service before they are
-            # written to ElasticSearch.
+            # The file transformer only emits replicas for entities that are
+            # linked to at least one file. This excludes all orphans, and a
+            # small number of linked entities, usually from primary bundles that
+            # don't include any files. Some of the replicas we emit here will be
+            # redundant with those emitted by the file transformer, but these
+            # will be consolidated by the index service before they are written
+            # to ElasticSearch.
             dataset = self._only_dataset()
             for entity in chain(self.bundle.orphans, self.bundle.entities):
                 if partition.contains(UUID(entity.entity_id)):

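The coalescing behavior that the rewritten comment describes — redundant replicas emitted by more than one transformer being merged before the write to Elasticsearch — might be sketched as follows. The `Replica` tuple and the keep-first policy are assumptions for illustration, not Azul's actual implementation:

```python
from typing import NamedTuple


class Replica(NamedTuple):
    entity_id: str
    contents: dict


def coalesce(replicas):
    # Keep a single replica per entity_id; later duplicates of the same
    # entity (e.g. one emitted by the file transformer after one emitted
    # by the dataset transformer) are dropped.
    by_id: dict[str, Replica] = {}
    for replica in replicas:
        by_id.setdefault(replica.entity_id, replica)
    return list(by_id.values())
```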
@hannes-ucsc
Member

Security design review

  • Security design review completed; this PR does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with Terra
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below

@hannes-ucsc hannes-ucsc added 0 reviews [process] Lead didn't request any changes sandbox [process] Resolution is being verified in sandbox deployment labels Nov 8, 2024
@achave11-ucsc achave11-ucsc removed their assignment Nov 8, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6626-index-orphaned-replicas branch from 251c79e to 043298e on November 8, 2024 20:16
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6626-index-orphaned-replicas branch 3 times, most recently from 24a106a to 2c7de0b on November 8, 2024 20:59
@hannes-ucsc hannes-ucsc force-pushed the issues/nadove-ucsc/6626-index-orphaned-replicas branch from 2704c7b to 9b6cf31 on November 9, 2024 00:40
@achave11-ucsc achave11-ucsc merged commit 0b3b6b4 into develop Nov 9, 2024
11 checks passed
@achave11-ucsc achave11-ucsc removed the base [process] Another PR needs to be rebased before merging this one label Nov 9, 2024
@achave11-ucsc achave11-ucsc deleted the issues/nadove-ucsc/6626-index-orphaned-replicas branch November 12, 2024 17:06
Labels
0 reviews [process] Lead didn't request any changes orange [process] Done by the Azul team reindex:anvildev [process] PR requires reindexing anvildev reindex:anvilprod [process] PR requires reindexing anvilprod reindex:dev [process] PR requires reindexing dev sandbox [process] Resolution is being verified in sandbox deployment