Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashes/encodings below the heuristic limit are treated as typos #415

Open
halkeye opened this issue Jan 28, 2022 · 17 comments
Open

Hashes/encodings below the heuristic limit are treated as typos #415

halkeye opened this issue Jan 28, 2022 · 17 comments
Labels
question Uncertainty is involved

Comments

@halkeye
Copy link
Contributor

halkeye commented Jan 28, 2022

error: `Ba` should be `By`, `Be`
  --> ./content/blog/2017/08/2017-08-08-introducing-jenkins-minute.adoc:39:14
   |
39 | video::FhDomw6BaHU[youtube, width=852, height=480]

I can add it to [default.extend-identifiers] so its not a blocker but figured you'd like another test case.

@epage epage added the question Uncertainty is involved label Jan 28, 2022
@epage
Copy link
Collaborator

epage commented Jan 28, 2022

The challenge is being able to identify that has a hash. How do we tell a hash from an identifier?

Right now, we support

  • SHA detection: must be 32+ characters long and consistent case
  • Base 64 detection: Must be 90 characters long or have + / / in it and must have the padding bytes (though there is uncertainty if the padding byte requirement will stay, see Unexpectedly treated base64 value as a typo #413)

@halkeye
Copy link
Contributor Author

halkeye commented Jan 28, 2022

yea that makes a lot of sense. I'm just starting to use the app and loving it so far, I've only had to whitelist two hashes that have ba in them so its not a big deal for me.

Maybe some sort of regex or something so I could whitelist video::[a-zA-Z0-9]\[

@pums974
Copy link

pums974 commented Mar 10, 2022

I think it causes also some problems with jupyter notebooks.

error: `ba` should be `by`, `be`
  --> jupyter.ipynb:661:11
    |
661 |    "id": "6ba7c279",
    |            ^^
    |
error: `ba` should be `by`, `be`
  --> jupyter.ipynb:784:15
    |
784 |    "id": "33088ba8",
    |                ^^
    |
error: `ba` should be `by`, `be`
  --> jupyter.ipynb:1029:10
     |
1029 |    "id": "ba6788ca",
     |           ^^
     |
error: `ba` should be `by`, `be`
  --> jupyter.ipynb:2029:10
     |
2029 |    "id": "ba638183",
     |           ^^
     |

@tspearconquest
Copy link

Hello,

Git commit hashes tend to run in the range [0-9a-fA-F]{7,} so that would be a useful addition.

@epage
Copy link
Collaborator

epage commented Jun 15, 2022

@tspearconquest for shorter git commit hashes, we'll need to rely on a heuristic like talked about in #484 because shorter commit hashes could just as easily be words.

@epage epage changed the title I'm not sure hash detection is working fully Hashes/encodings below the heuristic limit are treated as typos Aug 1, 2022
@jplatte
Copy link
Contributor

jplatte commented Sep 1, 2022

How about adding a heuristic "word contains characters preceded by numbers" (where "word" is a whitespace-separated segment, not a case-separated segment)? I don't think I've ever seen an identitifer be named foo1bar or 3foo, though foo3 or foo3_bar seem realistic (e.g. zip3, zip4).

@halkeye
Copy link
Contributor Author

halkeye commented Sep 1, 2022

sha1hash?

@jplatte
Copy link
Contributor

jplatte commented Sep 1, 2022

Right, there's a few exceptions (I also remembered there being 2to3), but maybe it's still a good heuristic? Personally, I consider false positives a bigger issue than false negatives, and I think that matches typos' overall approach.

@pums974
Copy link

pums974 commented Sep 1, 2022

There is also all the this2that and thing4stuff

@epage
Copy link
Collaborator

epage commented Sep 1, 2022

@jplatte I'd probably refine your comment to be "any identifier that exclusively word splits due to numbers and not any other separator (be it case or _)

The next question is the likelihood of a shortened sha having no numbers. I probably didn't bring this up in the other thread talking about heuristics but I suspect to have something always complain than it have it complain in a way people no longer expect.

@jplatte
Copy link
Contributor

jplatte commented Sep 1, 2022

In the case of a hex string, that would be (10/16) (since 6 out of the 16 possible chars are alphabetical) to the power of the string length. git short hashes are the shortest hash I see in practice, and they seem to start at 7 characters (longer in large repos), which puts the probability of such a hash having no digits at pretty much exactly 0.1%. Strings that contain no digits before any non-digit characters would be closer to 0.5% though (rough estimation, could also be >0.5%, but not <0.28%).

@epage
Copy link
Collaborator

epage commented Mar 22, 2023

FYI #695 provides a new workaround for false positives

@azzamsa
Copy link

azzamsa commented Jan 28, 2024

  --> ./content/n/rust-docker.md:52:44
   |
52 | hello         0.1.0                ac4e1a72ba05   2 minutes ago    1.38GB
   |                                            ^^
   |
error: `ba` should be `by`, `be`
  --> ./content/n/rust-docker.md:53:46
   |
53 | rust          1.52.1-slim-buster   61cb3c65a6ba   3 weeks ago      621MB
   |                                              ^^

The extend-ignore-re solved this issue.

[default]
extend-ignore-re = ["[0-9a-fA-F]{12}"]

I think we can safely close this issue.

@jplatte
Copy link
Contributor

jplatte commented Jan 28, 2024

@azzamsa that regex is much too generic, it disables spell-checking for all 12-letter identifiers as well.

@DeadNews
Copy link

DeadNews commented Apr 4, 2024

Description

The typos pre-commit hook fails on truncated commit hashes in CHANGELOG.md.

Environment

- repo: https://github.com/crate-ci/typos
  rev: v1.20.4
  hooks:
    - id: typos

Actual Behavior

$ pre-commit run --files CHANGELOG.md

typos....................................................................Failed
- hook id: typos
- exit code: 2

error: `ba` should be `by`, `be`
  --> CHANGELOG.md:100:28
    |
100 | - _(README)_ update - ([e84ba3e](https://github.com/DeadNews/firebirdsql-run/commit/e84ba3e8e2f72a8dcad43f8ac3c768527ca199bd))
    |                            ^^
    |
$ pre-commit run --files CHANGELOG.md

typos....................................................................Failed
- hook id: typos
- exit code: 2

error: `ba` should be `by`, `be`
  --> CHANGELOG.md:22:99
   |
22 | - update `mkdocs` config ([#127](https://github.com/DeadNews/encode-utils-cli/issues/127)) - ([c92ba20](https://github.com/DeadNews/encode-utils-cli/commit/c92ba2032ac0b492b390d45c50f7c57c2660df5c))
   |                                                                                                   ^^
   |
error: `ba` should be `by`, `be`
  --> CHANGELOG.md:62:43
   |
62 | - _(renovate)_ use shared config - ([693c3ba](https://github.com/DeadNews/encode-utils-cli/commit/693c3ba58822db45dd06a032ba1ce554db6deaf6))
   |                                           ^^
   |

original: #982

@DeadNews
Copy link

DeadNews commented Apr 4, 2024

ba should be by, be

↑ This ba is in all examples.
Maybe add it to the exceptions?

@dmadisetti
Copy link

Example of SHA-256 hash "typo"

typos tests/_save/test_hash.py                                                                                                                                                                                                                
error: `iy` should be `it`
  --> tests/_save/test_hash.py:785:51
    |
785 |             expected_hash = "rC6YiNsuaZKQ1JqMgSRa0iyEDwi7ZTl6InABjuM0RDY"
    |                                                   ^^
    |

Using typos-cli 1.24.5.

Seems like defaults might be a little loose

dmadisetti added a commit to marimo-team/marimo that referenced this issue Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Uncertainty is involved
Projects
None yet
Development

No branches or pull requests

8 participants