
🐛 noSecrets false positives #4113

Open
minht11 opened this issue Sep 28, 2024 · 9 comments
Labels
A-Linter Area: linter L-JavaScript Language: JavaScript and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug

Comments

@minht11
Contributor

minht11 commented Sep 28, 2024

Environment information

Biome 1.9.2

What happened?

Some false positives I found when turning the rule on:

  • Verify OTP Google Mobile Authenticator (2FAS) (playground)
  • 0 USD,,. for {bitlocus|string}. (playground); this one only occurs with an object property, not a variable, which is strange.
  • Verifying takes 15 approved the following 3. (playground)
  • facebook.com |console.aws.amazon.com (playground)
  • ISO-27001 information , GDPR (playground)

There are more, but I hope fixing these would prevent most others.

Expected result

No false positives, or at least fewer.

Code of Conduct

  • I agree to follow Biome's Code of Conduct
@minht11 minht11 added A-Linter Area: linter L-JavaScript Language: JavaScript and super languages S-Bug-confirmed Status: report has been confirmed as a valid bug labels Sep 28, 2024
@minht11
Contributor Author

minht11 commented Sep 28, 2024

@SaadBazaz are you interested in looking into this?

@Conaclos
Member

Related to #3861

@SaadBazaz
Contributor

@SaadBazaz are you interested in looking into this?

Will look into these this weekend, tomorrow most likely. I think I can find a way around these.

My proposal: we should add another heuristic called "containsSpaces", applied only to the specific regexes that opt into it.

The high-entropy check should not match strings containing spaces.
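A minimal sketch of such a pre-filter; the names are illustrative, not Biome's actual internals:

```javascript
// Hypothetical pre-filter: secrets (API keys, tokens) are contiguous,
// so any candidate string containing whitespace can be skipped before
// the high-entropy check runs at all.
function containsSpaces(candidate) {
  return /\s/.test(candidate);
}

function shouldRunEntropyCheck(candidate) {
  return !containsSpaces(candidate);
}
```

This alone would clear false positives like "Verify OTP Google Mobile Authenticator (2FAS)" outright, since they all contain spaces.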

What do you think @Conaclos @minht11

@SaadBazaz
Contributor

only occurs with an object property and not a variable, which is strange.

Maybe JsStringIdentifier (or whichever node type I used in the code) needs some modification? Because I don't understand how the plugins would affect that.

@hilja

hilja commented Oct 1, 2024

Great lint rule!

Here are a few false positives I caught in my current project (links to playground):

@hilja

hilja commented Oct 1, 2024

My 2¢: character sets for custom IDs etc. That looks like a secret for sure, but it's actually a low-entropy string, high in alphanumeric order, if you will.

import { customAlphabet } from 'nanoid'

export const shortId = customAlphabet(
  // biome-ignore lint/nursery/noSecrets:
  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
  12
)

@Conaclos
Member

Conaclos commented Oct 1, 2024

Do you have some entropy algorithms to suggest?

@cr7pt0gr4ph7
Contributor

cr7pt0gr4ph7 commented Nov 10, 2024

Do you have some entropy algorithms to suggest?

After thinking about it for a bit, I would suggest turning the way we detect generic secrets on its head:

Secrets (by which we really mean API keys + tokens here, because human-readable passwords are basically impossible to detect except via contextual clues) are usually a contiguous string of text without spaces, so looking at segments delimited by r"[[:space:]]" seems to be the right way.
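A minimal sketch of that segmentation step (illustrative only, not the rule's actual code):

```javascript
// Split candidate text into whitespace-delimited segments, dropping
// empty pieces, so each segment can be classified on its own.
function segments(text) {
  return text.split(/\s+/).filter((s) => s.length > 0);
}
```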

Of those segments, most should fall into one of four classes:

  • Written by humans, for humans. Because we only consider tokens without spaces at this point, this class basically includes only code identifiers like someLongCodeIdentifier, this_is_another_identifier, some.other.identifier.
    • The defining characteristic of these strings could be formulated as "(Written by a human so it) looks orderly to another human (speaking the same language)"1.
    • I'm going to exclude the possibility of using a dictionary to detect English words because that is probably out of scope for this rule.
    • But even abstracting away from the specific language, there are probably some stochastic properties in how the underlying alphabet is used that we might be able to exploit, namely in how different "classes" of characters are distributed spatially within the string. Specifically, I am thinking about:
      • Uppercase vs. Lowercase
      • Consonants vs. Vowels
      • Numbers vs. Letters
  • Embedded code: JSONPath ($.phoneNumbers[:1].type), XPath /some:element/from[the-root()], CSS Selectors, Regexes (which might be the hardest to discern from random strings)
  • Encoded data: Base64, hex, long decimal numbers encoded as strings, which may be:
    • API Keys or other secret token material
      • Some parts of the token may be ordered information, but to work as an authentication measure, there has to be some random (or random-looking, because properly encrypted information should be indistinguishable from random bits) part of the string.
      • The "random" parts will either make up a contiguous subsection of the string (if the issuer of the token has joined it from multiple parts), or the whole string (if everything has been encrypted by the issuer, or everything is just random entropy). It is in theory possible that random and non-random parts are interspersed in a structured way, but the result should still "look" random.
      • It should thus hold that:
        • Every token is generated from some underlying alphabet, be it r"[A-Za-z0-9+/_-]" for Base64 & Base64 URL, or r"[a-f0-9]" / r"[A-F0-9]" for hex
        • Every letter of the token is picked from that alphabet essentially at random (according to a uniform probability distribution) and independently of the previous or next letter, without caring for character classes and/or how those character classes are distributed within the string.
    • Embedded data blobs that are known to be safe
      • These blobs may either be auto-generated from another source, in which case it makes sense to either make the generator emit // ignore directives or to exclude those files from scanning altogether.
      • Binary blobs in otherwise human-written code will be a minority in most codebases, so it seems sensible to require a manual exclusion in those cases.
  • Mixtures of the above: selector=$.phoneNumber[:1].type,api_token=ad56dt65daf66adsf
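The "picked uniformly at random" property above is what Shannon entropy measures. A minimal sketch (illustrative, not Biome's implementation) of computing it per character:

```javascript
// Shannon entropy in bits per character. A string drawn uniformly from
// an n-character alphabet approaches log2(n) bits per character, while
// human-written identifiers reuse few characters and score much lower.
function shannonEntropyPerChar(s) {
  const counts = new Map();
  for (const ch of s) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / s.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}
```

Note that character-frequency entropy alone would still flag the nanoid alphabet mentioned earlier, since every character in it appears exactly once; that is where positional heuristics have to help.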

So, instead of:

  • Assume everything is non-random (=safe) by default and trying to detect random strings (=secrets).

I would propose thinking about it as:

  • Assume every string is random (=secret) and try to detect non-random (=safe) strings using the properties of natural language (or rather, the way programmers use natural language in code).

Footnotes

  1. Another way to put it: It's not about entropy(potential_secret) but about entropy(prior_knowledge+potential_secret) - entropy(prior_knowledge), where prior_knowledge = word_dictionary(language_of_code_writer).
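In standard information-theoretic notation, the quantity in that footnote is the conditional entropy of the potential secret S given the prior knowledge K, by the chain rule of entropy:

```latex
H(S \mid K) = H(K, S) - H(K)
```

A dictionary-word identifier has low H(S | K) once K includes the dictionary, while a genuinely random token stays high.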

@cr7pt0gr4ph7
Contributor

cr7pt0gr4ph7 commented Nov 10, 2024

So, putting it all together:

  • Split the input text at whitespace as well as ", ', {, }, (, and ), except when the resulting subsections would fall below a certain length.
  • If a (sub-)string consists of alphanumeric characters ([A-Za-z0-9])...
    • ...that are distributed (in a spatial sense) pretty evenly throughout the string, it's probably a secret.
    • ...where most or all digits are bunched together (request50Timeout), it's probably safe.
    • ...where all digits are far apart (just0some1string2with3strange4separators), it might be a secret, or it might not be. I'd argue for reporting a secret in these cases, because such strings shouldn't be too common in codebases anyway.
  • If a (sub-)string consists of only digits and is sufficiently long...
    • ...and is just a list of ascending or descending digits, it should probably be safe.
    • ...everything else is basically a binary blob that should be reported as a secret, leaving it to the user to mark safe occurrences.
  • If a (sub-)string consists of both uppercase and lowercase characters
    • ...that are distributed (in a spatial sense) pretty evenly throughout the string, and the relative frequency of upper- and lowercase letters is similar, it's probably a secret.
    • ...otherwise, it is very likely an identifier, but we would need some stochastic analysis to be more sure.
  • If a (sub-)string contains only uppercase or only lowercase characters
    • ...it might be a secret, but it also might be a code identifier. We probably have to use some stochastic analysis to classify the string as "English/human word" vs. "random text".
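As a rough illustration of the digit-distribution heuristic above (hypothetical helper, not part of Biome):

```javascript
// Returns true when all digits in the string form one contiguous run
// (e.g. "request50Timeout"), which suggests an identifier rather than
// a token with digits spread evenly throughout.
function digitsAreBunched(s) {
  const positions = [];
  for (let i = 0; i < s.length; i++) {
    if (s[i] >= "0" && s[i] <= "9") positions.push(i);
  }
  if (positions.length < 2) return true; // zero or one digit counts as bunched
  // All digit positions are contiguous iff their span equals count - 1.
  return positions[positions.length - 1] - positions[0] === positions.length - 1;
}
```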

Plus some filter that suppresses occurrences that

  • contain an ascending or descending sequence of ABC..., abc..., 012... ...7890 (or some sufficiently long subsequence)...
  • ...and where the "secret detected" threshold would not have been reached without that sequence.
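The ascending-sequence part of that filter could be sketched like this (descending is symmetric; a hypothetical helper, not Biome's code):

```javascript
// Length of the longest run of consecutive, ascending character codes.
// The nanoid alphabet "0123456789ABC..." from the earlier comment scores
// very high here, which is what would let the filter suppress that report.
function longestAscendingRun(s) {
  if (s.length === 0) return 0;
  let best = 1;
  let run = 1;
  for (let i = 1; i < s.length; i++) {
    run = s.charCodeAt(i) === s.charCodeAt(i - 1) + 1 ? run + 1 : 1;
    best = Math.max(best, run);
  }
  return best;
}
```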

Further notes:

  • Strings that contain non-basic-Latin characters like äöüéèô etc. are probably localized strings, and therefore safe, but I may have to think about that a bit more.
