-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use shorter build_key
#652
Use shorter build_key
#652
Conversation
✅ Deploy Preview for kaleidoscopic-dango-0cf31d canceled.
|
This comment was marked as resolved.
This comment was marked as resolved.
d5de58a
to
00ee1ca
Compare
Seems fine. Still need to do the final review and test locally. |
timestamp = int(self.scheduled_on.timestamp()) | ||
id = self.id | ||
name = self.specification.name[:16] | ||
return f"{hash}-{timestamp}-{id}-{name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lengths (4 and 16) are hardcoded and there's no explanation on why that was chosen. Instead I'd rather have a fixed-length hash à la nix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like Nix hashes as well (specifically, I'm talking about nix-store hashes, since those also use the filesystem, see here), but it can happen later and in a separate PR. It requires quite a bit of work and it was pointed out to me during the last team meeting that we might want to allocate our time elsewhere for now.
Here are some questions to answer for Nix hashes:
- To make this really fixed-size, we would need to design the v2 path scheme such that it also doesn't include the namespace name. It should be just:
store_directory / hash / <env contents>
where the first two components is the prefix that needs to be <= 255 to satisfy the conda prefix contraint. - To hash like this and avoid weird issues due to mismatched package versions, we cannot hash the specification as we do now, we need to hash the contents of the lock file. Specs don't have the versions pinned, lock files are deterministic.
- Since the hashes are shared between users (no namespace anymore), they need to be resistant to collisions. The size of the hash and its collision resistance are related. This can be estimated to find the best size for this use case.
- Hashes are one-way (by design). But we also use
build_key
s to get thebuild_id
. Seeparse_build_key
and how it's used inget_docker_image_manifest
. For that, we'd need to store an extrahash -> id
mapping in the DB. - IIUC, all other parts of the
build_key
are just to help with debugging or search. You can verify which env on disk corresponds to which specification, see when it was built, and get a human-readable name of the env. build_id
s are unique because they are primary keys in theBuild
table. The only concern here is that they might potentially get reused by some DB backends.
How 4 and 16 were selected:
- Because only the
build_id
s matter, everything else doesn't help with the uniqueness part, just helps with debugging - I selected 4 to still be able to verify that a spec matches the env by hashing it and have some confidence
- I selected 16 because it contained enough information to be meaningful with some env names I've tried, but also because it's short enough.
Summary:
- Since Check the size of
build_path
#653 adds error checking for this and documentation, it makes it easier for users to understand what's going on - I'm going to bump 4 to 8 since it's a more standard size for truncation (even though it's not necessary here). This makes it somewhat less arbitrary. The original proposal was a unix timestamp and a truncated hash.
- I'm going to remove the truncation from names (since the linked PR has instructions for users to use shorter names if there are problems). It removes this arbitrary decision, which might lead to confusion later.
- I'll try to add the v1/v2 build_path scheme as a parameter. I initially avoided it because it ends another degree of freedom, which leads to more bugs. But I also see value in this setting where
build_path
size is not a problem, just to avoid any unexpected v1->v2 migration problems. - Let's discuss the Nix-hash-like idea separately in a new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It requires quite a bit of work and it was pointed out to me during the last team meeting that we might want to allocate our time elsewhere for now.
My main concern here is that we are going to break the path interface already, so maybe there's a benefit in only doing it once and for good. At the same time, if we perfect the art of migrating paths, then it's ok :) I guess it can be discussed in the weekly tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no breakage. Everything is migrated transparently and you can switch back and forth. Nix-like hashes would be more different compared to what we have right now. I still see value in having just a truncated hash and timestamp (as in the v2 here). Nix-style hashes can be added later.
09ae211
to
9b731f6
Compare
Manual tests I did (as of commit 738f650):
|
b340a6e
to
738f650
Compare
This comment was marked as resolved.
This comment was marked as resolved.
8f60123
to
738f650
Compare
738f650
to
8f155ec
Compare
@jaimergp Updated, PTAL. Added a command-line/config parameter, now users can switch between v1 and v2 if needed. By default the short hash version is used. With the new default, old environments are still accessible in the UI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nkaretnikov. I'm sure this is functional now but I'm spotting some non-ideal practices in the code for things that should be well established now, namely:
- Using global constants that get overridden at runtime for default configuration. This should be better guarded.
- Parsing identifiers to retrieve info that should already have. Probably worth an issue if there's not one already.
Additionally, this deservers some user facing documentation covering things like:
- Why would they care about this configuration option
- What's the difference between v1 and v2
- Which one should the choose (v2 I guess, unless they need backwards compatibility)
- A migration guide for those in v1 that want to move to v2. If that's trivial, mention it explicitly. If it's not, what kind of problems they might face and how to work around them.
def parse_build_key(key): | ||
parts = key.split("-") | ||
if len(parts) < 5: | ||
return None | ||
return int(parts[4]) # build_id | ||
# Note: cannot rely on the number of dashes to differentiate between | ||
# versions because name can contain dashes. Instead, this relies on the | ||
# hash size to infer the format. The name is the last field, so indexing | ||
# to find the id is okay. | ||
if key[_BUILD_KEY_V2_HASH_SIZE] == "-": # v2 | ||
return int(parts[2]) # build_id | ||
else: # v1 | ||
return int(parts[4]) # build_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a bit hacky. We shouldn't rely on lossy info for this, but a well-known id->info relationship in the database. I know you know, and that that'd be considered ideal and we don't have the time but I wonder what's the rush to ship this now instead of in two weeks (still within the STF window).
If this is not going to be addressed now, and if we don't have an issue for this limitation, we should create one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to ship this version of the build key regardless because:
- it somewhat mitigates the length issues people might be having (see the test, the new version is 2x shorter)
- it's close enough to what we have already so risk of potential issues is minimized
- I consider Nix-like hashes a riskier change and would like to have this as an intermediate step, so people could fall back to this if needed.
timestamp = int(build.scheduled_on.replace(tzinfo=tzinfo).timestamp()) | ||
id = build.id | ||
name = build.specification.name | ||
return f"{hash}-{timestamp}-{id}-{name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: without tzinfo
, this would vary between machines because timestamp
uses the system clock by default, which might be not in UTC.
try: | ||
return BuildKey.set_current_version(proposal.value) | ||
except Exception as e: | ||
raise TraitError(f"c.CondaStore.build_key_version: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current version is now set right after build_key_version
is processed, which is better than before, when it was done only when creating a DB session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to use raise X from e
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, yes. But it doesn't work with traitlets. E.g.,:
raise TraitError("c.CondaStore.build_key_version") from e
would print
[CondaStoreServer] CRITICAL | Bad config encountered during initialization: c.CondaStore.build_key_version
No traceback is included. No additional information is printed.
# Uses local import to make sure current version is initialized | ||
from conda_store_server import BuildKey | ||
|
||
return BuildKey.current_version() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All conda_store_server
imports are local in this file to avoid a cyclic import problem mentioned in a BuildKey
class comment.
return # invalid, nothing more to test | ||
conda_store.build_key_version = build_key_version | ||
assert BuildKey.current_version() == build_key_version | ||
assert BuildKey.versions() == (1, 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now initializes build_key_version
via conda_store
as one would in real life (via the config)
@jaimergp Addressed your feedback, PTAL |
``` | ||
|
||
It consists of: | ||
1. a truncated SHA-256 hash of the environment specification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "environment specification" you mean the input package requests, or the solved environment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Added this:
(CondaSpecification
, which represents a user-provided environment, is
converted to a dict and passed to datastructure_hash
, which recursively sorts
it and calculates the SHA-256 hash)
@jaimergp PTAL. Replied to all of your comments. Only made a change to the docs. |
Fixes #611.