Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691

Eh2406 · 2018-07-06T16:17:15Z

This does a brute force search thru combinations of hyphen and underscores to allow queries of crates to pass the wrong one.

This is a small first step of fixing #2775

Where is best to add test?

rust-highfive · 2018-07-06T16:17:19Z

r? @matklad

(rust_highfive has picked a reviewer for you, use r? to override)

matklad · 2018-07-07T09:25:15Z

Exponential search is a bit troubling ...

It might be prudent to add support for canonicalization when reading the index, that is, try to search for all underscores version first. That way, we might be able to add canonicalization on writing the index in he future, because by that time then-old cargos will have gained the support for canonicalization.

I also worry that this might make project accidentally unbildable by older cargos. Let’s perhaps add a warning if made use of search? Several releases later we will be able to remove it.

Similarly, as an optimization, let’s start the sequence of -/_ as specified in Cargo.toml. That is, change the meaning of bit to flipped/non-flipped, rather then minus/underscore.

Finally, let’s make he search bounded just in case, by checking if there are fewer than, say, 5 bits.

I don’t know where the test should go :)

Eh2406 · 2018-07-07T15:41:45Z

Responding to every point but out of order:

let’s start the sequence of -/_ as specified in Cargo.toml. That is, change the meaning of bit to flipped/non-flipped, rather then minus/underscore.

I really like this!

Exponential search is a bit troubling ...

let’s make he search bounded just in case, by checking if there are fewer than, say, 5 bits.

This is currently a panic attempted to shift with overflow, so a hard cap is better. This works well with your previous suggestion we brute force only the first N items after that it is up to you to get it right.

It might be prudent to add support for canonicalization when reading the index, that is, try to search for all underscores version first.

Sadly, I don't think this is compatible with your other suggestions.

I also worry that this might make project accidentally unbildable by older cargos.

So far this has not added the ability for Cargo.toml to be wrong, so I don't think we have to worry yet. A warning on build and an error on publish is probably a good strategy. When we get there.

bors · 2018-07-08T15:45:20Z

☔ The latest upstream changes (presumably #5694) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2018-07-09T16:16:49Z

I'm a little worried about a patch like this in the sense that it can affect Cargo's startup performance (as I'm sure you're well aware by this point!). This needs to be executed every time Cargo starts up to re-resolve the lock file, and if Cargo's always trying tons of permutations of underscores and whatnot then Cargo may run the risk of spending a large amount of time reading files that don't exist before it finds the right one.

Now this is solved if the index itself is canonicalized, but that breaks older versions of Cargo (boo!). I'm not sure of a great way to solve this for now...

Perhaps though in the meantime we could strike a balance? Cargo could provide a better error message along the lines of "Failed to find crate foo-bar, but did you mean foo_bar?" That way we may be able to get most of the benefit without many of the perf consequences?

Eh2406 · 2018-07-09T17:07:55Z

I agree to the core of all your points, but let me respond to sum details.

Thanks to @matklad this now starts with the name passed in before continuing the search. So as long as we are not searching for mis-hyphenated names there will be no performance change. Currently, eavan with this pr, we do not allow mis-hyphenated names in ether Cargo.toml nor Cargo.lock. So I don't think this change on its own is a problem.

As to the design space for extensions to this pr.
For backcompat reasons we should not allow mis-hyphenated names in the index nor in Cargo.toml of published crates. Nor is there any reason to use them in Cargo.lock. So most of the dependency tree should be searching from the correct hyphenation. The exception is in Cargo.tomls of local projects and path dependencies. Where we should ether build with a warning or error with a grate error message.

Eh2406 · 2018-07-09T17:31:19Z

btw, an alternative implementation would be to use glob::Glob as it is already in our dependencies.

alexcrichton · 2018-07-09T17:49:56Z

@Eh2406 oh hm I may be misreading then! It looks like here you're allowed to check in (and publish) a crate with mis-hyphenated names, but if that's not the case then it sounds good to me!

And yeah I do think that we basically want to guide people towards correct-hyphenation in the sense that it'll make for the fastest lookups in the future as well

Eh2406 · 2018-07-09T18:19:13Z

You know the code a lot better than I, so let us make sure we have tests. Where should they go?

alexcrichton · 2018-07-09T18:44:26Z

Hm probably in registry.rs, and probably a few tests for "assert a good error message comes out"?

alexcrichton · 2018-07-09T22:40:05Z

Tests look good! So just to make sure I understand, something like this will be required to provide a better error message later on? (trying to catch up on the context of how we'll use this if we shift to primarily-better-error-messagse)

alexcrichton · 2018-07-09T22:40:49Z

src/cargo/sources/registry/index.rs

+        // of the name so old cargos can find it.
+        // This loop tries all possible combinations of switching
+        // hyphen and underscores to find the uncanonicalized one.
+        for hyphen_combination_num in 0u16..(1 << num_hyphen_underscore) {


This might be a pretty good candidate to refactor out into a custom iterator, and that way we could just do .take(1000) to avoid too many permutations from happening?

The min 10 above already limits this, but an additional .take(1024) to make it absolutely clear cant hurt. I'll look into making a dedicated iterator. And possibly having a smaller cut off with a fallback to glob.

Oh sure yeah what I mean is it'll be functionally equivalent but perhaps a bit cleaner on the implementation

Moved to a struct. It allowed adding more targeted tests!

Eh2406 · 2018-07-10T01:09:58Z

I think these test should go in even if we don't add the exponential search, to document the current behavior. I'd like to add test that insure that one can not publish as well, also as documentation. Additionally a test for what happens if we add an error to the index, before can not resolve now it works.

It looked to me like #2775 agreed this is the first step. I liked the implementation idea of using an int as a bit mask so much that I just had to try it. But,
Yes, as mentioned "It's elsewhere in Cargo's crate graph resolver that the mapping would need to happen" so I thought this would be a safe move.

Eh2406 · 2018-07-10T14:14:57Z

Attempted to add matching tests in publish.rs. There are no existing tests for how publish interacts with already published crates! Am I just looking in the wrong place! Adding a Package::new("init", "0.0.1").publish(); led to an error

thread 'publish::wrong_case' panicked at 'self.repo.commit(Some("HEAD"), &sig, &sig, "Initial commit", &tree, &[]) failed with failed to create commit: current tip is not the first parent; class=Object (11); code=Modified (-15)', tests\testsuite\cargotest\support\git.rs:61:13

alexcrichton · 2018-07-10T14:43:01Z

Hm I'm not sure what's up with that error, that's odd!

If the eventual goal here is #2775 though I think though this may not be the best place to solve it? This'll help provide support for crates in the registry but not for crates in places like git repositories or path dependencies. In that sense I wonder if it's better to place this logic in the resolver (and query multiple times) instead of searching the index?

Eh2406 · 2018-07-10T20:08:26Z

There are a lot of moving parts. It would be reasonable to decide not to make changes until we see the big picture and how it all fits together. If we decide that then I will remove the search from this PR and just leave the tests. (Although we should add test for publish.)

Not that I see all the parts but hear is why I think this makes the index more like path/git dependencies. But feel free to explain how I am wrong.

If I put in my Cargo.toml serde-derive = { git = "https://github.com/serde-rs/serde"}
then cargo will look at that path and create a list of packages that can be found there, like
vec["serde", "serde_derive", ...] then it will filter to find the ones that matche serde-derive. None match so we get an error. (OT, but what happens if a git dependency includes several projects by the same name?)

In pre-PR cargo if I put in my Cargo.toml serde-derive = "1.0" then cargo will look at the index for a list of packages but it will get vec[] and we get an error.

After this PR it gets back vec["serde_derive"] then it will filter to find the ones that matche serde-derive. None match so we get an error.

alexcrichton · 2018-07-10T21:23:47Z

Aha I forgot about that behavior! In that case I'd agree that the index is the best place to put this specific logic.

In that case this I think is almost good to go! The idea is that next up, given these vectors of lots of crates, we'd do like lehvenstein distance searching to return "did you mean" ?

OT, but what happens if a git dependency includes several projects by the same name?

Sadness. 😿

Eh2406 · 2018-07-11T16:52:31Z

I was trying to figure out how search actually work for path/git, and convince myself that this search is important to making implementing a "did you mean" suggestion. And well, the best way to prove it works is to do it.

alexcrichton · 2018-07-11T22:28:45Z

Looking plausible! I might recommend though still using lehvenstein distance in the end just to make sure we don't print too many suggestions (and maybe cap to 5 or so?)

Eh2406 · 2018-07-12T16:04:22Z

Do you have a preferred levenshtein sort library? fst, or something trie based, or strsim or edit-distance or role my own homework style?

Also fuzzy as an arg vs a separate fuzzy fn? Oddly I did one trait one way and the other the other. Thoughts?

matklad · 2018-07-12T16:20:14Z

@Eh2406 we have lev_distance here: https://github.com/rust-lang/cargo/blob/805fbebc830174ad34feb774e019769d6fe7bbe9/src/cargo/util/lev_distance.rs

alexcrichton · 2018-07-16T15:29:55Z

Looks great! One final thing I'd recommend is to filter the final list by lev_distance as well. Right now it says "no package baz, did you mean foo?" which is a little silly in the sense that it's so far in distance it's probably not right.

It looks like subcommands uses a cap of 4, so perhaps the same can be done here?

Eh2406 · 2018-07-16T16:13:20Z

Instead of returning only the first 3 or in addition? Also lots of sources will only return one, so maybe we could use it even if lev is over the filter. I.E.
Preference = { git = "https://github.com/Eh2406/preferences-rs", branch = "patch-1"} will have only one suggestion preferences and even though it is lev > 4 it is the only reasonable answer.

alexcrichton · 2018-07-16T18:21:25Z

Hm I think let's probably start out with 3 (or maybe 4/5?) and go from there. In any case if you type Preference and it should be preferences then we should definitely have that recommended in our lev threshold filter!

Eh2406 · 2018-07-16T19:35:12Z

Lol, my example did not make sense. Updated to only show suggestions with lev < 4.

alexcrichton · 2018-07-16T19:35:34Z

@bors: r+

👍

bors · 2018-07-16T19:35:35Z

📌 Commit 95b4640 has been approved by alexcrichton

bors · 2018-07-16T19:35:42Z

⌛ Testing commit 95b4640 with merge fefbb68...

Make index lookup robust to _ vs -, but don't let the user get it wrong. This does a brute force search thru combinations of hyphen and underscores to allow queries of crates to pass the wrong one. This is a small first step of fixing #2775 Where is best to add test?

bors · 2018-07-16T20:40:37Z

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing fefbb68 to master...

most sorts can be unstable Inspired by [this](https://github.com/rust-lang/cargo/blob/94f7058a483b05ad742da5efb66dd1c2d4b8619c/src/bin/cargo/main.rs#L112-L122) witch was improved in #5691, I did a quick review of `sort`s in the code. Most can be unstable, some can use a `_key` form, and none had unnecessary allocation.

rust-highfive assigned matklad Jul 6, 2018

Eh2406 force-pushed the hyphen-underscore branch from 157f88b to f8aaf85 Compare July 6, 2018 17:54

Eh2406 added 3 commits July 8, 2018 12:19

make index lookup robust to _ vs -

3b495a5

cargo +stable fmt

d0d9377

start with the name as given, and cap the number to be switched

78a0486

Eh2406 force-pushed the hyphen-underscore branch from cc60ae4 to 78a0486 Compare July 8, 2018 16:21

Eh2406 changed the title ~~[WIP] make index lookup robust to _ vs -~~ Make index lookup robust to _ vs -, but don't let the user get it wrong. Jul 9, 2018

add test for wrong_case in cargo.toml

3fafca1

alexcrichton reviewed Jul 9, 2018

View reviewed changes

add test for wrong_case in registry

8a717d3

allow each source to recommend packages that are close to a dependency

84cc3d8

update test to get ci passing

3f5036a

Eh2406 force-pushed the hyphen-underscore branch from bbdde4a to 3f5036a Compare July 12, 2018 03:27

Eh2406 added 4 commits July 12, 2018 13:25

sort suggestions by distance

2bac06c

move the Uncanonicalizing loop to a Iterator

335f641

add tests

fc7ef62

oops...

98e1bdb

only suggest lev_distance < 4

95b4640

Eh2406 mentioned this pull request Jul 16, 2018

most sorts can be unstable #5732

Merged

bors merged commit 95b4640 into rust-lang:master Jul 16, 2018

bors mentioned this pull request Jul 16, 2018

include version requirements in path for resolver error messages #5452

Closed

Eh2406 deleted the hyphen-underscore branch July 17, 2018 02:03

ehuss added this to the 1.29.0 milestone Feb 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691

Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691

Eh2406 commented Jul 6, 2018

rust-highfive commented Jul 6, 2018

matklad commented Jul 7, 2018

Eh2406 commented Jul 7, 2018

bors commented Jul 8, 2018

alexcrichton commented Jul 9, 2018

Eh2406 commented Jul 9, 2018

Eh2406 commented Jul 9, 2018

alexcrichton commented Jul 9, 2018

Eh2406 commented Jul 9, 2018

alexcrichton commented Jul 9, 2018

alexcrichton commented Jul 9, 2018

alexcrichton Jul 9, 2018

Eh2406 Jul 10, 2018

alexcrichton Jul 12, 2018

Eh2406 Jul 13, 2018

Eh2406 commented Jul 10, 2018

Eh2406 commented Jul 10, 2018

alexcrichton commented Jul 10, 2018

Eh2406 commented Jul 10, 2018

alexcrichton commented Jul 10, 2018

Eh2406 commented Jul 11, 2018

alexcrichton commented Jul 11, 2018

Eh2406 commented Jul 12, 2018

matklad commented Jul 12, 2018

alexcrichton commented Jul 16, 2018

Eh2406 commented Jul 16, 2018

alexcrichton commented Jul 16, 2018

Eh2406 commented Jul 16, 2018

alexcrichton commented Jul 16, 2018

bors commented Jul 16, 2018

bors commented Jul 16, 2018

bors commented Jul 16, 2018

Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691

Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691

Conversation

Eh2406 commented Jul 6, 2018

rust-highfive commented Jul 6, 2018

matklad commented Jul 7, 2018

Eh2406 commented Jul 7, 2018

bors commented Jul 8, 2018

alexcrichton commented Jul 9, 2018

Eh2406 commented Jul 9, 2018

Eh2406 commented Jul 9, 2018

alexcrichton commented Jul 9, 2018

Eh2406 commented Jul 9, 2018

alexcrichton commented Jul 9, 2018

alexcrichton commented Jul 9, 2018

alexcrichton Jul 9, 2018

Choose a reason for hiding this comment

Eh2406 Jul 10, 2018

Choose a reason for hiding this comment

alexcrichton Jul 12, 2018

Choose a reason for hiding this comment

Eh2406 Jul 13, 2018

Choose a reason for hiding this comment

Eh2406 commented Jul 10, 2018

Eh2406 commented Jul 10, 2018

alexcrichton commented Jul 10, 2018

Eh2406 commented Jul 10, 2018

alexcrichton commented Jul 10, 2018

Eh2406 commented Jul 11, 2018

alexcrichton commented Jul 11, 2018

Eh2406 commented Jul 12, 2018

matklad commented Jul 12, 2018

alexcrichton commented Jul 16, 2018

Eh2406 commented Jul 16, 2018

alexcrichton commented Jul 16, 2018

Eh2406 commented Jul 16, 2018

alexcrichton commented Jul 16, 2018

bors commented Jul 16, 2018

bors commented Jul 16, 2018

bors commented Jul 16, 2018