
🐛 noSecrets false positives #4113

Open
minht11 opened this issue Sep 28, 2024 · 9 comments
Labels
A-Linter Area: linter L-JavaScript Language: JavaScript and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug

Comments

@minht11
Contributor

minht11 commented Sep 28, 2024

Environment information

Biome 1.9.2

What happened?

Some false positives I found when turning the rule on:

  • Verify OTP Google Mobile Authenticator (2FAS) (playground)
  • 0 USD,,. for {bitlocus|string}. (playground); this one only occurs with an object property, not a variable, which is strange.
  • Verifying takes 15 approved the following 3. (playground)
  • facebook.com |console.aws.amazon.com (playground)
  • ISO-27001 information , GDPR (playground)

There are more, but I hope fixing these would prevent most others.

Expected result

No false positives, or at least fewer.

Code of Conduct

  • I agree to follow Biome's Code of Conduct
@minht11 minht11 added A-Linter Area: linter L-JavaScript Language: JavaScript and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug labels Sep 28, 2024
@minht11
Contributor Author

minht11 commented Sep 28, 2024

@SaadBazaz are you interested in looking into this?

@Conaclos
Member

Related to #3861

@SaadBazaz
Contributor

@SaadBazaz are you interested in looking into this?

Will look into these this weekend, tomorrow most likely. I think I can find a way around these.

My proposal: we should add another heuristic called "containsSpaces", applied only to the specific regexes that opt into it.

The high-entropy check should not match strings containing spaces.
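A minimal sketch of such a pre-filter; the names are illustrative, not Biome's actual internals:

```javascript
// Hypothetical pre-filter: secrets (API keys, tokens) are contiguous,
// so any candidate string containing whitespace can be skipped before
// the high-entropy check runs at all.
function containsSpaces(candidate) {
  return /\s/.test(candidate);
}

function shouldRunEntropyCheck(candidate) {
  return !containsSpaces(candidate);
}
```

This alone would clear false positives like "Verify OTP Google Mobile Authenticator (2FAS)" outright, since they all contain spaces.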

What do you think @Conaclos @minht11

@SaadBazaz
Contributor

only occurs with an object property and not a variable, which is strange.

Maybe JsStringIdentifier (or whichever node type I used in the code) needs some modification? Because I don't understand how the plugins would affect that.

@hilja

hilja commented Oct 1, 2024

Great lint rule!

Here are a few false positives I caught in my current project (links to playground):

@hilja

hilja commented Oct 1, 2024

My 2¢: character sets for custom IDs etc. That looks like a secret for sure, but it's actually a low-entropy string, high in alphanumeric order, if you will.

import { customAlphabet } from 'nanoid'

export const shortId = customAlphabet(
  // biome-ignore lint/nursery/noSecrets:
  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
  12
)

@Conaclos
Member

Conaclos commented Oct 1, 2024

Do you have some entropy algorithms to suggest?

@cr7pt0gr4ph7
Contributor

cr7pt0gr4ph7 commented Nov 10, 2024

Do you have some entropy algorithms to suggest?

After thinking about it for a bit, I would suggest turning the way we detect generic secrets on its head:

Secrets (by which we really mean API keys + tokens here, because human-readable passwords are basically impossible to detect except via contextual clues) are usually a contiguous string of text without spaces, so looking at segments delimited by r"[[:space:]]" seems to be the right way.
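A minimal sketch of that segmentation step (illustrative only, not the rule's actual code):

```javascript
// Split candidate text into whitespace-delimited segments, dropping
// empty pieces, so each segment can be classified on its own.
function segments(text) {
  return text.split(/\s+/).filter((s) => s.length > 0);
}
```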

Of those segments, most should fall into one of four classes:

  • Written by humans, for humans. Because we only consider tokens without spaces at this point, this class basically includes only code identifiers like someLongCodeIdentifier, this_is_another_identifier, some.other.identifier.
    • The defining characteristic of these strings could be formulated as "(Written by a human so it) looks orderly to another human (speaking the same language)"1.
    • I'm going to exclude the possibility of using a dictionary to detect English words because that is probably out of scope for this rule.
    • But even abstracting away from the specific language, there are probably some stochastic properties in how the underlying alphabet is used that we might be able to exploit, namely in how different "classes" of characters are distributed spatially within the string. Specifically, I am thinking about:
      • Uppercase vs. Lowercase
      • Consonants vs. Vowels
      • Numbers vs. Letters
  • Embedded code: JSONPath ($.phoneNumbers[:1].type), XPath /some:element/from[the-root()], CSS Selectors, Regexes (which might be the hardest to discern from random strings)
  • Encoded data: Base64, hex, long decimal numbers encoded as strings, which may be:
    • API Keys or other secret token material
      • Some parts of the token may be ordered information, but to work as an authentication measure, there has to be some random (or random-looking, because properly encrypted information should be indistinguishable from random bits) part of the string.
      • The "random" parts will either make up a contiguous subsection of the string (if the issuer of the token has joined it from multiple parts), or the whole string (if everything has been encrypted by the issuer, or everything is just random entropy). It is in theory possible that random and non-random parts are interspersed in a structured way, but the result should still "look" random.
      • It should thus hold that:
        • Every token is generated from some underlying alphabet, be it r"[A-Za-z0-9+/_-]" for Base64 & Base64 URL, or r"[a-f0-9]" / r"[A-F0-9]" for hex
        • Every letter of the token is picked from that alphabet essentially at random (according to a uniform probability distribution) and independently of the previous or next letter, without caring for character classes and/or how those character classes are distributed within the string.
    • Embedded data blobs that are known to be safe
      • These blobs may either be auto-generated from another source, in which case it makes sense to either make the generator emit // ignore directives or to exclude those files from scanning altogether.
      • Binary blobs in otherwise human-written code will be a minority in most codebases, so it seems sensible to require a manual exclusion in those cases.
  • Mixtures of the above: selector=$.phoneNumber[:1].type,api_token=ad56dt65daf66adsf
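The "picked uniformly at random" property above is what Shannon entropy measures. A minimal sketch (illustrative, not Biome's implementation) of computing it per character:

```javascript
// Shannon entropy in bits per character. A string drawn uniformly from
// an n-character alphabet approaches log2(n) bits per character, while
// human-written identifiers reuse few characters and score much lower.
function shannonEntropyPerChar(s) {
  const counts = new Map();
  for (const ch of s) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / s.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}
```

Note that character-frequency entropy alone would still flag the nanoid alphabet mentioned earlier, since every character in it appears exactly once; that is where positional heuristics have to help.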

So, instead of:

  • Assume everything is non-random (=safe) by default and trying to detect random strings (=secrets).

I would propose thinking about it as:

  • Assume every string is random (=secret) and try to detect non-random (=safe) strings using the properties of natural language (or rather, the way programmers use natural language in code).

Footnotes

  1. Another way to put it: It's not about entropy(potential_secret) but about entropy(prior_knowledge+potential_secret) - entropy(prior_knowledge), where prior_knowledge = word_dictionary(language_of_code_writer).
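In standard information-theoretic notation, the quantity in that footnote is the conditional entropy of the potential secret S given the prior knowledge K, by the chain rule of entropy:

```latex
H(S \mid K) = H(K, S) - H(K)
```

A dictionary-word identifier has low H(S | K) once K includes the dictionary, while a genuinely random token stays high.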

@cr7pt0gr4ph7
Contributor

cr7pt0gr4ph7 commented Nov 10, 2024

So, putting it all together:

  • Split the input text at whitespace as well as ", ', {, }, (, and ), except when the resulting subsections would fall below a certain length.
  • If a (sub-)string consists of alphanumeric characters ([A-Za-z0-9])...
    • ...that are distributed (in a spatial sense) pretty evenly throughout the string, it's probably a secret.
    • ...where most or all digits are bunched together (request50Timeout), it's probably safe.
    • ...where all digits are far apart (just0some1string2with3strange4separators), it might be a secret, or it might not be. I'd argue for reporting a secret in these cases, because such strings shouldn't be too common in codebases anyway.
  • If a (sub-)string consists of only digits and is sufficiently long...
    • ...and is just a list of ascending or descending digits, it should probably be safe.
    • ...everything else is basically a binary blob that should be reported as a secret, leaving it to the user to mark safe occurrences.
  • If a (sub-)string consists of both uppercase and lowercase characters
    • ...that are distributed (in a spatial sense) pretty evenly throughout the string, and the relative frequency of upper- and lowercase letters is similar, it's probably a secret.
    • ...otherwise, it is very likely an identifier, but we would need some stochastic analysis to be more sure.
  • If a (sub-)string contains only uppercase or only lowercase characters
    • ...it might be a secret, but it also might be a code identifier. We probably have to use some stochastic analysis to classify the string as "English/human word" vs. "random text".
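As a rough illustration of the digit-distribution heuristic above (hypothetical helper, not part of Biome):

```javascript
// Returns true when all digits in the string form one contiguous run
// (e.g. "request50Timeout"), which suggests an identifier rather than
// a token with digits spread evenly throughout.
function digitsAreBunched(s) {
  const positions = [];
  for (let i = 0; i < s.length; i++) {
    if (s[i] >= "0" && s[i] <= "9") positions.push(i);
  }
  if (positions.length < 2) return true; // zero or one digit counts as bunched
  // All digit positions are contiguous iff their span equals count - 1.
  return positions[positions.length - 1] - positions[0] === positions.length - 1;
}
```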

Plus some filter that suppresses occurrences that

  • contain an ascending or descending sequence of ABC..., abc..., 012... ...7890 (or some sufficiently long subsequence)...
  • ...and where the "secret detected" threshold would not have been reached without that sequence.
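The ascending-sequence part of that filter could be sketched like this (descending is symmetric; a hypothetical helper, not Biome's code):

```javascript
// Length of the longest run of consecutive, ascending character codes.
// The nanoid alphabet "0123456789ABC..." from the earlier comment scores
// very high here, which is what would let the filter suppress that report.
function longestAscendingRun(s) {
  if (s.length === 0) return 0;
  let best = 1;
  let run = 1;
  for (let i = 1; i < s.length; i++) {
    run = s.charCodeAt(i) === s.charCodeAt(i - 1) + 1 ? run + 1 : 1;
    best = Math.max(best, run);
  }
  return best;
}
```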

Further notes:

  • Strings that contain non-basic-Latin characters like äöüéèô etc. are probably localized strings, and therefore safe, but I may have to think about that a bit more.
