Use shorter `build_key` #652

nkaretnikov · 2023-11-04T22:18:49Z

Fixes #611.

netlify · 2023-11-04T22:18:54Z

✅ Deploy Preview for kaleidoscopic-dango-0cf31d canceled.

Name	Link
🔨 Latest commit	`36da8f0`
🔍 Latest deploy log	https://app.netlify.com/sites/kaleidoscopic-dango-0cf31d/deploys/6565c0c56f5375000822bf08

conda-store-server/conda_store_server/orm.py

nkaretnikov · 2023-11-05T21:24:12Z

Seems fine. Still need to do the final review and test locally.

jaimergp · 2023-11-10T18:18:52Z

conda-store-server/conda_store_server/orm.py

+            timestamp = int(self.scheduled_on.timestamp())
+            id = self.id
+            name = self.specification.name[:16]
+            return f"{hash}-{timestamp}-{id}-{name}"


These lengths (4 and 16) are hardcoded, and there's no explanation of why they were chosen. I'd rather have a fixed-length hash à la Nix.

I like Nix hashes as well (specifically, I'm talking about nix-store hashes, since those also use the filesystem, see here), but it can happen later and in a separate PR. It requires quite a bit of work and it was pointed out to me during the last team meeting that we might want to allocate our time elsewhere for now.

Here are some questions to answer for Nix hashes:

To make this really fixed-size, we would need to design the v2 path scheme so that it also doesn't include the namespace name. It should be just: store_directory / hash / <env contents>, where the first two components form the prefix, whose length needs to be <= 255 to satisfy the conda prefix constraint.

To hash like this and avoid weird issues due to mismatched package versions, we cannot hash the specification as we do now; we need to hash the contents of the lock file. Specs don't have the versions pinned, whereas lock files are deterministic.

Since the hashes are shared between users (no namespace anymore), they need to be resistant to collisions. The size of the hash and its collision resistance are related. This can be estimated to find the best size for this use case.

Hashes are one-way (by design). But we also use build_keys to get the build_id. See parse_build_key and how it's used in get_docker_image_manifest. For that, we'd need to store an extra hash -> id mapping in the DB.

IIUC, all other parts of the build_key are just to help with debugging or search. You can verify which env on disk corresponds to which specification, see when it was built, and get a human-readable name of the env.

build_ids are unique because they are primary keys in the Build table. The only concern here is that they might potentially get reused by some DB backends.
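
The relation between hash size and collision resistance raised in the questions above can be estimated with the standard birthday approximation. The following is a minimal sketch, assuming hash values are uniformly distributed; the function name and the environment count of 10,000 are illustrative and not taken from the PR.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate probability that at least two of `num_items` uniformly
    random hashes of `hash_bits` bits collide (birthday approximation)."""
    space = 2.0 ** hash_bits
    # p ~= 1 - exp(-n * (n - 1) / (2 * N)); expm1 keeps precision for tiny p
    return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))


# Illustrative only: 10,000 environments, hashes truncated to 4, 8, or 32
# hex characters (each hex character carries 4 bits).
for hex_chars in (4, 8, 32):
    bits = 4 * hex_chars
    p = collision_probability(10_000, bits)
    print(f"{hex_chars} hex chars ({bits} bits): p = {p:.3g}")
```

This only matters for a scheme where the hash alone identifies the environment; in the scheme of this PR, uniqueness comes from the build_id.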

How 4 and 16 were selected:

Only the build_ids matter for uniqueness; everything else just helps with debugging.

I selected 4 so that it's still possible to verify, with some confidence, that a spec matches the env by hashing it.

I selected 16 because it contained enough information to be meaningful with some env names I've tried, but also because it's short enough.

Summary:

Since #653 (Check the size of build_path) adds error checking and documentation for this, it is easier for users to understand what's going on.

I'm going to bump 4 to 8 since it's a more standard size for truncation (even though it's not necessary here). This makes it somewhat less arbitrary. The original proposal was a unix timestamp and a truncated hash.

I'm going to remove the truncation from names (since the linked PR has instructions for users to use shorter names if there are problems). It removes this arbitrary decision, which might lead to confusion later.

I'll try to add the v1/v2 build_path scheme as a parameter. I initially avoided it because it adds another degree of freedom, which leads to more bugs. But I also see value in this setting for cases where build_path size is not a problem, just to avoid any unexpected v1->v2 migration problems.

Let's discuss the Nix-hash-like idea separately in a new issue.

It requires quite a bit of work and it was pointed out to me during the last team meeting that we might want to allocate our time elsewhere for now.

My main concern here is that we are going to break the path interface already, so maybe there's a benefit in only doing it once and for good. At the same time, if we perfect the art of migrating paths, then it's ok :) I guess it can be discussed in the weekly tomorrow.

There's no breakage. Everything is migrated transparently and you can switch back and forth. Nix-like hashes would be more different compared to what we have right now. I still see value in having just a truncated hash and timestamp (as in the v2 here). Nix-style hashes can be added later.

nkaretnikov · 2023-11-18T07:54:12Z

Manual tests I did (as of commit 738f650):

conda-store state created on main works with changes in the PR (links to the lock file and other files work in the UI)
having build_key_version=1 in the config still allows accessing v2 files in the UI
having build_key_version=2 in the config still allows accessing v1 files in the UI
having build_key_version=3 in the config raises an error.

Fixes conda-incubator#611.

nkaretnikov · 2023-11-23T10:26:34Z

@jaimergp Updated, PTAL. Added a command-line/config parameter, so users can now switch between v1 and v2 if needed. By default, the short hash version is used. With the new default, old environments are still accessible in the UI.

jaimergp

Thanks @nkaretnikov. I'm sure this is functional now but I'm spotting some non-ideal practices in the code for things that should be well established now, namely:

Using global constants that get overridden at runtime for default configuration. This should be better guarded.
Parsing identifiers to retrieve info that we should already have. Probably worth an issue if there's not one already.

Additionally, this deserves some user-facing documentation covering things like:

Why would they care about this configuration option
What's the difference between v1 and v2
Which one should they choose (v2 I guess, unless they need backwards compatibility)
A migration guide for those in v1 that want to move to v2. If that's trivial, mention it explicitly. If it's not, what kind of problems they might face and how to work around them.

conda-store-server/conda_store_server/app.py

conda-store-server/conda_store_server/orm.py

jaimergp · 2023-11-23T10:36:14Z

conda-store-server/conda_store_server/orm.py

    def parse_build_key(key):
        parts = key.split("-")
-        if len(parts) < 5:
-            return None
-        return int(parts[4])  # build_id
+        # Note: cannot rely on the number of dashes to differentiate between
+        # versions because name can contain dashes. Instead, this relies on the
+        # hash size to infer the format. The name is the last field, so indexing
+        # to find the id is okay.
+        if key[_BUILD_KEY_V2_HASH_SIZE] == "-":  # v2
+            return int(parts[2])  # build_id
+        else:  # v1
+            return int(parts[4])  # build_id


This is also a bit hacky. We shouldn't rely on lossy info for this, but on a well-known id->info relationship in the database. I know you know this, that it would be the ideal, and that we don't have the time, but I wonder what the rush is to ship this now instead of in two weeks (still within the STF window).

If this is not going to be addressed now, and if we don't have an issue for this limitation, we should create one.

I want to ship this version of the build key regardless because:

it somewhat mitigates the length issues people might be having (see the test, the new version is 2x shorter)

it's close enough to what we have already so risk of potential issues is minimized

I consider Nix-like hashes a riskier change and would like to have this as an intermediate step, so people could fall back to this if needed.
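
The format detection discussed in this thread can be sketched as a standalone function. This is a minimal sketch, not the PR's code: the constant value 8 follows the hash size proposed earlier in the conversation, and the v1 layout (full hash, date, time, microseconds, id, name) is an assumption consistent with the id sitting at index 4.

```python
# Assumption: the v2 hash is truncated to 8 hex characters.
BUILD_KEY_V2_HASH_SIZE = 8


def parse_build_key(key: str) -> int:
    """Return the build_id embedded in a v1 or v2 build key."""
    parts = key.split("-")
    # The name is the last field and may itself contain dashes, so the
    # number of parts cannot distinguish the versions. The position of
    # the first dash can: in v2 it directly follows the short hash.
    if key[BUILD_KEY_V2_HASH_SIZE] == "-":
        # v2: <hash[:8]>-<unix timestamp>-<id>-<name>
        return int(parts[2])
    # v1 (assumed layout): <sha256>-<date>-<time>-<microseconds>-<id>-<name>
    return int(parts[4])


v1_key = "a" * 64 + "-20231105-150806-029490-7-my-env"
v2_key = "abcdef12-1699196886-12-my-env-name"
print(parse_build_key(v1_key), parse_build_key(v2_key))
```

Because the fields before the id never contain dashes, indexing from the left stays valid even when the name does.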

conda-store-server/conda_store_server/orm.py

nkaretnikov · 2023-11-25T15:24:54Z

conda-store-server/conda_store_server/__init__.py

+        timestamp = int(build.scheduled_on.replace(tzinfo=tzinfo).timestamp())
+        id = build.id
+        name = build.specification.name
+        return f"{hash}-{timestamp}-{id}-{name}"


Note: without tzinfo, this would vary between machines because timestamp() interprets a naive datetime in the system's local time zone, which might not be UTC.
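
A minimal sketch of the v2 key construction shown in the hunk above. It assumes the `tzinfo` variable in the diff is UTC and that the hash is truncated to 8 characters; `build_key_v2` and its parameters are illustrative names, not the PR's API.

```python
import datetime


def build_key_v2(
    spec_sha256: str,
    scheduled_on: datetime.datetime,
    build_id: int,
    name: str,
    hash_size: int = 8,  # assumed truncation length
) -> str:
    """Compose <truncated hash>-<unix timestamp>-<id>-<name>."""
    truncated = spec_sha256[:hash_size]
    # Attach UTC explicitly: a naive datetime would otherwise be read as
    # local time, so the same build would get different keys on machines
    # in different time zones.
    utc = scheduled_on.replace(tzinfo=datetime.timezone.utc)
    timestamp = int(utc.timestamp())
    return f"{truncated}-{timestamp}-{build_id}-{name}"


print(build_key_v2("ab" * 32, datetime.datetime(2023, 11, 4, 22, 18, 49), 5, "my-env"))
```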

conda-store-server/conda_store_server/__init__.py

nkaretnikov · 2023-11-25T15:44:30Z

conda-store-server/conda_store_server/app.py

+        try:
+            return BuildKey.set_current_version(proposal.value)
+        except Exception as e:
+            raise TraitError(f"c.CondaStore.build_key_version: {e}")


The current version is now set right after build_key_version is processed, which is better than before, when it was done only when creating a DB session.

Is it better to use raise X from e here?

In general, yes. But it doesn't work with traitlets. For example:

raise TraitError("c.CondaStore.build_key_version") from e

would print

[CondaStoreServer] CRITICAL | Bad config encountered during initialization: c.CondaStore.build_key_version

No traceback is included. No additional information is printed.
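
The difference can be reproduced in plain Python. In this sketch `ConfigError` is a hypothetical stand-in for traitlets' `TraitError`, and `set_version` is an illustrative helper; the point is only that a logger that prints `str(error)` never sees a chained cause, while a message with the cause embedded survives.

```python
class ConfigError(Exception):
    """Hypothetical stand-in for traitlets' TraitError."""


def set_version(value: int) -> int:
    if value not in (1, 2):
        raise ValueError(f"invalid build key version: {value}")
    return value


def validate_chained(value: int) -> int:
    try:
        return set_version(value)
    except Exception as e:
        # The cause is kept on __cause__, but not in the message text.
        raise ConfigError("c.CondaStore.build_key_version") from e


def validate_embedded(value: int) -> int:
    try:
        return set_version(value)
    except Exception as e:
        # The cause's text becomes part of the message itself.
        raise ConfigError(f"c.CondaStore.build_key_version: {e}")


for validate in (validate_chained, validate_embedded):
    try:
        validate(3)
    except ConfigError as err:
        print(f"{validate.__name__}: {err} (cause: {err.__cause__!r})")
```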

nkaretnikov · 2023-11-25T15:47:38Z

conda-store-server/conda_store_server/orm.py

+        # Uses local import to make sure current version is initialized
+        from conda_store_server import BuildKey
+
+        return BuildKey.current_version()


All conda_store_server imports are local in this file to avoid a cyclic import problem mentioned in a BuildKey class comment.

nkaretnikov · 2023-11-25T15:51:37Z

conda-store-server/tests/test_actions.py

+        return  # invalid, nothing more to test
+    conda_store.build_key_version = build_key_version
+    assert BuildKey.current_version() == build_key_version
+    assert BuildKey.versions() == (1, 2)


This now initializes build_key_version via conda_store as one would in real life (via the config)
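
For readers without the diff at hand, the interface exercised by this test can be sketched as follows. Only the method names `versions`, `current_version`, and `set_current_version` come from the hunks in this conversation; the class body, the default of 2, and the error type are assumptions.

```python
class BuildKey:
    """Hypothetical sketch of a version registry for build keys."""

    _versions = (1, 2)
    _current_version = 2  # assumed default: the short-hash format

    @classmethod
    def versions(cls) -> tuple:
        return cls._versions

    @classmethod
    def current_version(cls) -> int:
        return cls._current_version

    @classmethod
    def set_current_version(cls, version: int) -> int:
        # Reject unknown versions before any state changes.
        if version not in cls._versions:
            raise ValueError(
                f"invalid build key version: {version}, "
                f"expected one of {cls._versions}"
            )
        cls._current_version = version
        return version
```

In this sketch, setting the version to 3 raises, which matches the manual test reported earlier in the conversation.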

docs/administration.md

nkaretnikov · 2023-11-25T18:41:55Z

@jaimergp Addressed your feedback, PTAL

jaimergp · 2023-11-26T20:31:44Z

docs/administration.md

+```
+
+It consists of:
+1. a truncated SHA-256 hash of the environment specification


By "environment specification" you mean the input package requests, or the solved environment?

Updated. Added this:

(CondaSpecification, which represents a user-provided environment, is
converted to a dict and passed to datastructure_hash, which recursively sorts
it and calculates the SHA-256 hash)
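
The idea behind this kind of hashing can be sketched as follows. This is not conda-store's implementation: it swaps the recursive sort for canonical JSON with sorted keys, which gives the same property that the key order of the input does not affect the digest. The sample specifications are made up.

```python
import hashlib
import json


def datastructure_hash(data) -> str:
    """SHA-256 of a canonical JSON encoding of `data`."""
    # sort_keys=True orders dict keys at every nesting level, so two
    # dicts with the same content always serialize identically.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {"name": "my-env", "channels": ["conda-forge"], "dependencies": ["python"]}
spec_b = {"dependencies": ["python"], "channels": ["conda-forge"], "name": "my-env"}
print(datastructure_hash(spec_a)[:8], datastructure_hash(spec_b)[:8])
```

Note that this hashes the user-provided specification, not the solved environment, so two builds of the same spec can share a hash while containing different package versions.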

conda-store-server/conda_store_server/__init__.py

nkaretnikov · 2023-11-28T09:06:45Z

@jaimergp PTAL. Replied to all of your comments. Only made a change to the docs.

nkaretnikov force-pushed the shorter-build-key-611 branch 3 times, most recently from d5de58a to 00ee1ca Compare November 5, 2023 21:04

nkaretnikov mentioned this pull request Nov 7, 2023

Check the size of build_path #653

Merged

jaimergp reviewed Nov 10, 2023

nkaretnikov mentioned this pull request Nov 12, 2023

ENH - Shorten hash used in environment path #611

Closed

trallard added status: in progress 🏗 project: challenges labels Nov 14, 2023

nkaretnikov force-pushed the shorter-build-key-611 branch from 09ae211 to 9b731f6 Compare November 17, 2023 14:10

nkaretnikov force-pushed the shorter-build-key-611 branch 3 times, most recently from b340a6e to 738f650 Compare November 18, 2023 13:36

nkaretnikov added the status: blocked ⛔️ label Nov 18, 2023

nkaretnikov force-pushed the shorter-build-key-611 branch from 8f60123 to 738f650 Compare November 19, 2023 23:52

Nikita Karetnikov added 5 commits November 22, 2023 20:15

Use shorter build_key

4c0082d

Fixes conda-incubator#611.

Truncate only the hash, add a constant

fe2e9c7

Fix syntax warnings

542d911

Allow to set build_key_version via the config

9b9bb65

Update migration

8f155ec

nkaretnikov force-pushed the shorter-build-key-611 branch from 738f650 to 8f155ec Compare November 22, 2023 19:30

nkaretnikov removed the status: blocked ⛔️ label Nov 22, 2023

nkaretnikov changed the title ~~WIP: Use shorter build_key~~ Use shorter build_key Nov 23, 2023

nkaretnikov marked this pull request as ready for review November 23, 2023 10:24

nkaretnikov requested a review from jaimergp November 23, 2023 10:28

jaimergp reviewed Nov 23, 2023

trallard assigned nkaretnikov Nov 23, 2023

nkaretnikov added 2 commits November 24, 2023 22:48

Add BuildKey class

998512f

Use the same "local import" comment

24be46a

nkaretnikov added 2 commits November 25, 2023 17:02

Add missing type hints to BuildKey

9013d1a

Document build key versions

0ae713b

Mention build_key_version in CondaStore options

b44be1b

nkaretnikov requested a review from jaimergp November 25, 2023 18:41

jaimergp reviewed Nov 26, 2023

Explain how hash is computed from CondaSpecification

aa4598d

nkaretnikov requested a review from jaimergp November 28, 2023 09:04

Add a docstring to BuildKey

36da8f0

jaimergp approved these changes Nov 28, 2023

nkaretnikov merged commit 4ed688a into conda-incubator:main Nov 28, 2023
19 checks passed

nkaretnikov mentioned this pull request Nov 28, 2023

[ENH] - Support hash-only build paths #678

Closed

nkaretnikov mentioned this pull request Dec 13, 2023

Figure out how to handle long paths on Windows #588

Closed
