Use shorter build_key #652

Merged
merged 12 commits into from
Nov 28, 2023

Conversation

nkaretnikov
Contributor

Fixes #611.

netlify bot commented Nov 4, 2023

Deploy Preview for kaleidoscopic-dango-0cf31d canceled.

Name Link
🔨 Latest commit 36da8f0
🔍 Latest deploy log https://app.netlify.com/sites/kaleidoscopic-dango-0cf31d/deploys/6565c0c56f5375000822bf08

@nkaretnikov

This comment was marked as resolved.

@nkaretnikov nkaretnikov force-pushed the shorter-build-key-611 branch 3 times, most recently from d5de58a to 00ee1ca Compare November 5, 2023 21:04
@nkaretnikov
Contributor Author

Seems fine. Still need to do the final review and test locally.

timestamp = int(self.scheduled_on.timestamp())
id = self.id
name = self.specification.name[:16]
return f"{hash}-{timestamp}-{id}-{name}"
Member

These lengths (4 and 16) are hardcoded and there's no explanation of why they were chosen. Instead, I'd rather have a fixed-length hash à la nix.

Contributor Author

@nkaretnikov nkaretnikov Nov 13, 2023

I like Nix hashes as well (specifically, I'm talking about nix-store hashes, since those also use the filesystem, see here), but it can happen later and in a separate PR. It requires quite a bit of work and it was pointed out to me during the last team meeting that we might want to allocate our time elsewhere for now.

Here are some questions to answer for Nix hashes:

  • To make this really fixed-size, we would need to design the v2 path scheme such that it also doesn't include the namespace name. It should be just: store_directory / hash / <env contents>, where the first two components are the prefix, which needs to be <= 255 characters to satisfy the conda prefix constraint.
  • To hash like this and avoid weird issues due to mismatched package versions, we cannot hash the specification as we do now; we need to hash the contents of the lock file. Specs don't have the versions pinned, whereas lock files are deterministic.
  • Since the hashes are shared between users (no namespace anymore), they need to be resistant to collisions. The size of the hash and its collision resistance are related. This can be estimated to find the best size for this use case.
  • Hashes are one-way (by design). But we also use build_keys to get the build_id. See parse_build_key and how it's used in get_docker_image_manifest. For that, we'd need to store an extra hash -> id mapping in the DB.
  • IIUC, all other parts of the build_key are just to help with debugging or search. You can verify which env on disk corresponds to which specification, see when it was built, and get a human-readable name of the env.
  • build_ids are unique because they are primary keys in the Build table. The only concern here is that they might potentially get reused by some DB backends.
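
The relationship between hash size and collision resistance mentioned in the list above can be estimated with the birthday bound. A minimal sketch, assuming hex-encoded hashes truncated to `hex_chars` characters and uniformly distributed values; the function name and the environment count are illustrative and not taken from conda-store:

```python
import math


def collision_probability(n_items: int, hex_chars: int) -> float:
    """Approximate probability that at least two of n_items share a
    truncated hash, using the birthday bound 1 - exp(-n(n-1) / (2N))."""
    space = 16 ** hex_chars  # number of distinct truncated hashes (N)
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


for hex_chars in (4, 8, 16):
    p = collision_probability(10_000, hex_chars)
    print(f"{hex_chars} hex chars, 10,000 environments: p ~ {p:.3g}")
```

Under these assumptions, a short hash is fine as a debugging aid next to a unique build_id, but a hash that is the only identifier shared between users would need to be considerably longer.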

How 4 and 16 were selected:

  • Because only the build_ids matter, everything else doesn't help with the uniqueness part, just helps with debugging
  • I selected 4 to still be able to verify that a spec matches the env by hashing it and have some confidence
  • I selected 16 because it contained enough information to be meaningful with some env names I've tried, but also because it's short enough.

Summary:

  • Since #653 (Check the size of build_path) adds error checking and documentation for this, it is easier for users to understand what's going on
  • I'm going to bump 4 to 8 since it's a more standard size for truncation (even though it's not necessary here). This makes it somewhat less arbitrary. The original proposal was a unix timestamp and a truncated hash.
  • I'm going to remove the truncation from names (since the linked PR has instructions for users to use shorter names if there are problems). It removes this arbitrary decision, which might lead to confusion later.
  • I'll try to add the v1/v2 build_path scheme as a parameter. I initially avoided it because it adds another degree of freedom, which leads to more bugs. But I also see value in having this setting for cases where build_path size is not a problem, just to avoid any unexpected v1->v2 migration problems.
  • Let's discuss the Nix-hash-like idea separately in a new issue.

Member

@jaimergp jaimergp Nov 13, 2023

> It requires quite a bit of work and it was pointed out to me during the last team meeting that we might want to allocate our time elsewhere for now.

My main concern here is that we are going to break the path interface already, so maybe there's a benefit in only doing it once and for good. At the same time, if we perfect the art of migrating paths, then it's ok :) I guess it can be discussed in the weekly tomorrow.

Contributor Author

There's no breakage. Everything is migrated transparently and you can switch back and forth. Nix-like hashes would be more different compared to what we have right now. I still see value in having just a truncated hash and timestamp (as in the v2 here). Nix-style hashes can be added later.

@nkaretnikov
Contributor Author

Manual tests I did (as of commit 738f650):

  • conda-store state created on main works with changes in the PR (links to the lock file and other files work in the UI)
  • having build_key_version=1 in the config still allows accessing v2 files in the UI
  • having build_key_version=2 in the config still allows accessing v1 files in the UI
  • having build_key_version=3 in the config raises an error.
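
The behaviour exercised by these manual tests (two supported versions, an error for anything else) can be sketched as follows. This is a simplified stand-in, not the conda-store implementation: the class body is an assumption, and only the names `BuildKey`, `set_current_version`, `current_version`, and `versions` appear in this PR.

```python
class BuildKey:
    """Simplified stand-in for a versioned build key setting."""

    _versions = (1, 2)    # supported formats
    _current_version = 2  # the short-hash format is the default

    @classmethod
    def versions(cls):
        return cls._versions

    @classmethod
    def current_version(cls):
        return cls._current_version

    @classmethod
    def set_current_version(cls, version: int) -> int:
        # Reject anything that is not a known format, e.g. 3.
        if version not in cls._versions:
            raise ValueError(
                f"invalid build key version: {version}, "
                f"expected one of {cls._versions}"
            )
        cls._current_version = version
        return version
```

Switching the version only changes how new keys are written; reading old keys has to work regardless of the current setting, which is why parsing cannot depend on it.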

@nkaretnikov nkaretnikov force-pushed the shorter-build-key-611 branch 3 times, most recently from b340a6e to 738f650 Compare November 18, 2023 13:36
@nkaretnikov

This comment was marked as resolved.

@nkaretnikov nkaretnikov changed the title WIP: Use shorter build_key Use shorter build_key Nov 23, 2023
@nkaretnikov nkaretnikov marked this pull request as ready for review November 23, 2023 10:24
@nkaretnikov
Contributor Author

@jaimergp Updated, PTAL. Added a command-line/config parameter, now users can switch between v1 and v2 if needed. By default the short hash version is used. With the new default, old environments are still accessible in the UI.

Member

@jaimergp jaimergp left a comment

Thanks @nkaretnikov. I'm sure this is functional now but I'm spotting some non-ideal practices in the code for things that should be well established now, namely:

  • Using global constants that get overridden at runtime for default configuration. This should be better guarded.
  • Parsing identifiers to retrieve info that we should already have. Probably worth an issue if there's not one already.

Additionally, this deserves some user-facing documentation covering things like:

  • Why they would care about this configuration option
  • What the difference between v1 and v2 is
  • Which one they should choose (v2 I guess, unless they need backwards compatibility)
  • A migration guide for those in v1 that want to move to v2. If that's trivial, mention it explicitly. If it's not, what kind of problems they might face and how to work around them.

conda-store-server/conda_store_server/app.py (outdated, resolved)
conda-store-server/conda_store_server/orm.py (outdated, resolved)
Comment on lines 273 to 282
 def parse_build_key(key):
     parts = key.split("-")
-    if len(parts) < 5:
-        return None
-    return int(parts[4])  # build_id
+    # Note: cannot rely on the number of dashes to differentiate between
+    # versions because name can contain dashes. Instead, this relies on the
+    # hash size to infer the format. The name is the last field, so indexing
+    # to find the id is okay.
+    if key[_BUILD_KEY_V2_HASH_SIZE] == "-":  # v2
+        return int(parts[2])  # build_id
+    else:  # v1
+        return int(parts[4])  # build_id
Member

This is also a bit hacky. We shouldn't rely on lossy info for this, but on a well-known id->info relationship in the database. I know you know this, and that it would be the ideal approach that we don't have time for, but I wonder what the rush is to ship this now instead of in two weeks (still within the STF window).

If this is not going to be addressed now, and if we don't have an issue for this limitation, we should create one.

Contributor Author

I want to ship this version of the build key regardless because:

  • it somewhat mitigates the length issues people might be having (see the test, the new version is 2x shorter)
  • it's close enough to what we have already so risk of potential issues is minimized
  • I consider Nix-like hashes a riskier change and would like to have this as an intermediate step, so people could fall back to this if needed.
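
For reference, the format inference discussed in this thread can be sketched in a self-contained form. The constant name, the example keys, and the hash size of 8 (the value proposed in the summary earlier in the conversation) are assumptions for illustration; v1 is assumed to be a full 64-character SHA-256 hex digest followed by a three-part timestamp, which is consistent with the build id sitting at index 4.

```python
HASH_SIZE_V2 = 8  # assumed truncation length for the v2 hash


def parse_build_key(key: str) -> int:
    """Return the build id encoded in a v1 or v2 build key."""
    parts = key.split("-")
    # The name may contain dashes, so the dash count is unreliable.
    # A dash right after the short hash marks the key as v2.
    if key[HASH_SIZE_V2] == "-":  # v2: hash-timestamp-id-name
        return int(parts[2])
    # v1: hash-date-time-microseconds-id-name
    return int(parts[4])


sha = "a" * 64  # placeholder for a SHA-256 hex digest
v1_key = f"{sha}-20231128-101500-123456-42-my-env"
v2_key = f"{sha[:HASH_SIZE_V2]}-1701166500-42-my-env"

print(parse_build_key(v1_key), len(v1_key))
print(parse_build_key(v2_key), len(v2_key))
```

The sketch also shows the limitation raised in the review: the build id is recovered from string positions, so any future format has to stay distinguishable at the character following the short hash.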

conda-store-server/conda_store_server/orm.py (outdated, resolved)
conda-store-server/conda_store_server/orm.py (outdated, resolved)
timestamp = int(build.scheduled_on.replace(tzinfo=tzinfo).timestamp())
id = build.id
name = build.specification.name
return f"{hash}-{timestamp}-{id}-{name}"
Contributor Author

Note: without tzinfo, this would vary between machines because timestamp() treats a naive datetime as local time, which might not be UTC.
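
The effect can be reproduced with the standard library alone. A minimal sketch; the sample datetime is arbitrary:

```python
import datetime

naive = datetime.datetime(2023, 11, 28, 10, 15, 0)  # no tzinfo attached

# A naive datetime is interpreted as local time, so this value
# depends on the timezone of the machine running the code.
local_ts = int(naive.timestamp())

# Attaching UTC makes the result identical on every machine.
utc_ts = int(naive.replace(tzinfo=datetime.timezone.utc).timestamp())

print(local_ts, utc_ts)
```

The two numbers match only on machines whose local timezone is UTC; everywhere else they differ by the local UTC offset.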

try:
    return BuildKey.set_current_version(proposal.value)
except Exception as e:
    raise TraitError(f"c.CondaStore.build_key_version: {e}")
Contributor Author

The current version is now set right after build_key_version is processed, which is better than before, when it was done only when creating a DB session.

Member

Is it better to use raise X from e here?

Contributor Author

In general, yes. But it doesn't work with traitlets. For example:

            raise TraitError("c.CondaStore.build_key_version") from e

would print

[CondaStoreServer] CRITICAL | Bad config encountered during initialization: c.CondaStore.build_key_version

No traceback is included. No additional information is printed.
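
The trade-off can be illustrated without traitlets. In this sketch, `ConfigError` is a hypothetical stand-in for `TraitError`, and `report` mimics a caller that, like the log line above, prints only the exception message. With `raise ... from e` alone, the cause is attached as `__cause__` but is absent from the message, so embedding it in the message is what makes it visible.

```python
class ConfigError(Exception):
    """Hypothetical stand-in for traitlets' TraitError."""


def set_version(value):
    if value not in (1, 2):
        raise ValueError(f"invalid build key version: {value}")
    return value


def validate_chained(value):
    try:
        return set_version(value)
    except Exception as e:
        # The cause is kept on __cause__ only.
        raise ConfigError("build_key_version") from e


def validate_embedded(value):
    try:
        return set_version(value)
    except Exception as e:
        # The cause is copied into the message itself.
        raise ConfigError(f"build_key_version: {e}")


def report(func, value):
    """Mimic a caller that prints only the exception message."""
    try:
        func(value)
    except ConfigError as exc:
        return f"Bad config: {exc}"


print(report(validate_chained, 3))
print(report(validate_embedded, 3))
```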

# Uses local import to make sure current version is initialized
from conda_store_server import BuildKey

return BuildKey.current_version()
Contributor Author

All conda_store_server imports are local in this file to avoid a cyclic import problem mentioned in a BuildKey class comment.

return # invalid, nothing more to test
conda_store.build_key_version = build_key_version
assert BuildKey.current_version() == build_key_version
assert BuildKey.versions() == (1, 2)
Contributor Author

This now initializes build_key_version via conda_store as one would in real life (via the config)

@nkaretnikov
Contributor Author

@jaimergp Addressed your feedback, PTAL

It consists of:
1. a truncated SHA-256 hash of the environment specification
Member

By "environment specification", do you mean the input package requests or the solved environment?

Contributor Author

Updated. Added this:

(CondaSpecification, which represents a user-provided environment, is
converted to a dict and passed to datastructure_hash, which recursively sorts
it and calculates the SHA-256 hash)
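
The hashing step described here can be sketched with the standard library. This is an illustration under assumptions, not the conda-store code: `spec_hash` is a hypothetical name, it canonicalizes the dict with `json.dumps(sort_keys=True)` rather than a hand-written recursive sort, and the truncation to 8 characters follows the size proposed earlier in the conversation. List order is preserved in this sketch, so reordering dependencies changes the hash.

```python
import hashlib
import json


def spec_hash(spec: dict, size: int = 8) -> str:
    """Hash a specification dict independently of key order."""
    # sort_keys=True orders keys at every nesting level, so two dicts
    # with the same content serialize to the same string.
    blob = json.dumps(spec, sort_keys=True)
    digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    return digest[:size]


a = {"name": "my-env", "channels": ["conda-forge"], "dependencies": ["python"]}
b = {"dependencies": ["python"], "channels": ["conda-forge"], "name": "my-env"}

print(spec_hash(a), spec_hash(a) == spec_hash(b))
```

Because the input is the user-provided specification rather than the solved environment, two builds of the same spec share this hash even if the solver resolves different package versions.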

@nkaretnikov
Contributor Author

@jaimergp PTAL. Replied to all of your comments. Only made a change to the docs.

@nkaretnikov nkaretnikov merged commit 4ed688a into conda-incubator:main Nov 28, 2023
19 checks passed
Projects
Status: Done 💪🏾
Development

Successfully merging this pull request may close these issues.

ENH - Shorten hash used in environment path
3 participants