Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(noSecrets): refine the entropy computation to avoid some false positives #4118

Merged
merged 18 commits into from
Oct 7, 2024

Conversation

SaadBazaz
Copy link
Contributor

@SaadBazaz SaadBazaz commented Sep 29, 2024

Tackles #4113 and #3861

Further improves #3823

Research

Ever since the creation of this biome plugin originally, I've found various other secret scanning solutions:

  1. The OG, gitleaks (Go): https://github.com/gitleaks/gitleaks/
  2. Trufflehog (Go): https://github.com/trufflesecurity/trufflehog
  3. Sensleak (Rust): https://github.com/crates-pro/sensleak-rs

I've been wondering; Can we consume either one of the above to use in our usecase? Do we really need to add another to the ecosystem?
This I'd like to ask the authors here.

In the meanwhile...

Fixes

  • Fix original eslint attribution
  • Add better tests for generic use cases

New features planned

Legend:
FP = Feature-Parity with eslint no-secrets (features listed here)

  • Add config:
    • Control the entropy (FP)
    • add excludeRegexes
    • add additionalRegexes (FP)
    • checkComments
    • ignoreIdentifiers (FP)
  • Add better entropy mechanism
  • Add more stopwords and heuristics
    • Contains space
    • Startswith check
  • Check for secrets in comments (FP)
  • Pass all current tests (FP)

I would love to hear strategies on how to achieve this.

@github-actions github-actions bot added A-Linter Area: linter L-JavaScript Language: JavaScript and super languages labels Sep 29, 2024
@SaadBazaz
Copy link
Contributor Author

@minht11 @Conaclos @dyc3 - Would appreciate if you can read the PR description and drop your thoughts.

@dyc3
Copy link
Contributor

dyc3 commented Sep 29, 2024

I'll defer to the other maintainers for their opinions on adding a dependency for this.

For the proposed options: We generally try to avoid adding options for the sake of adding options. We need an appropriate amount of demand (from users) and justification to determine the correct granularity and scope of those options.

Can we consume either one of the above to use in our usecase? Do we really need to add another to the ecosystem?

Personally, I think a dedicated tool is probably going to have better heuristics than we will, at least in the short term. Plus, we don't take commit history into account like some of those tools do (and we shouldn't, that's not what biome is for).

I think this rule will at least be good for picking out the most egregious cases, with the added benefit that adding the rule to their existing config is easier than adding a new tool to their chain. We could also consider pointing users to a dedicated tool in the diagnostic message and rule documentation. Gitleaks seems to be the most mature.

@minht11
Copy link
Contributor

minht11 commented Sep 29, 2024

To add to what @dyc3 said, in my opinion Biome likely isn't a right tool to fully replace full repo secret scanning. Biome does not support all the languages which those tools do and need to be a proper security solution, we have stricter perf constraints and so on.

Having this rule in Biome does help development velocity, no need to wait for CI, and majority of users who wouldn't think about secrets get a warning for common cases, though we definitely should document about full solutions in case users need those.

As for integrations listed:
Sensleak - does not have rust crate, only CLI.
Otherss are GO solutions. Not sure how portable they are, can you call GO lib from rust reasonably? How would the performance be?

@dyc3
Copy link
Contributor

dyc3 commented Sep 30, 2024

can you call GO lib from rust reasonably?

No. Turbopack had a whole zig helper library to interop between rust and golang while they were porting the project to rust. Definitely not worth the complexity.

@SaadBazaz SaadBazaz changed the title V2/no-secrets feat(no-secrets): V2 Sep 30, 2024
@SaadBazaz
Copy link
Contributor Author

For the proposed options: We generally try to avoid adding options for the sake of adding options. We need an appropriate amount of demand (from users) and justification to determine the correct granularity and scope of those options.

I agree!

There are some features which would make us feature-parity with eslint no-secrets, I've updated the description with a legend to reflect that.

We could also consider pointing users to a dedicated tool in the diagnostic message and rule documentation. Gitleaks seems to be the most mature.
... though we definitely should document about full solutions in case users need those.

Definitely going to do this!

Sensleak - does not have rust crate, only CLI.

I can port some code then 🤠 Or maybe ask their team to make a crate/API for reusability.

No. Turbopack had a whole zig helper library to interop between rust and golang while they were porting the project to rust. Definitely not worth the complexity.

Yikes.

@ematipico
Copy link
Member

We usually add options only when required or requested, and when there are valid use cases to cover.

Not sure there's enough value to add yet another dependency for this rule. Sure, it's an important rule, but as the others said, we don't cover all languages. Our documentation could actually propose alternatives to the users.

@Conaclos
Copy link
Member

I totally agree with others. The rule should find the most obvious secret leaks and avoid false positives because users might get annoyed and turn off the rule (that defeat its purpose).

We could add a disclaimer in the rule description saying that and point to relevant tools (as the one you cited) for advanced secret leak detections.

@SaadBazaz
Copy link
Contributor Author

I totally agree with others. The rule should find the most obvious secret leaks and avoid false positives because users might get annoyed and turn off the rule (that defeat its purpose).

We could add a disclaimer in the rule description saying that and point to relevant tools (as the one you cited) for advanced secret leak detections.

In that case, (which I agree with), we should just do two tasks for now:

  1. Add the containsSpace heuristic which will drastically reduce false positives
  2. Update the docs and error message to suggest relevant tools

However I do suggest 3) adding the option to control entropy as it's not only easy to add, but also useful for us to learn what people are usually comfortable with so we can improve the default option.

What do you think?

@Conaclos
Copy link
Member

Conaclos commented Oct 1, 2024

However I do suggest 3) adding the option to control entropy as it's not only easy to add, but also useful for us to learn what people are usually comfortable with so we can improve the default option.

This looks fair enough to me.

@SaadBazaz
Copy link
Contributor Author

@Conaclos - Noob question:

#[derive(Clone, Debug, Default, Deserialize, Deserializable, Eq, PartialEq, Serialize)]
#[cfg_attr(feature = "schema", derive(schemars::JsonSchema))]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct NoSecretsOptions {
    /// Set entropy threshold (default is 4.5).
    entropy_threshold: f64,
}

How can I get f32 or f64 options (i.e. any floating point option) in Options?

@Conaclos
Copy link
Member

Conaclos commented Oct 2, 2024

How can I get f32 or f64 options (i.e. any floating point option) in Options?

ctx.options().entropy_threshold in the run/diagnostic methods.

@SaadBazaz
Copy link
Contributor Author

@Conaclos - I'm getting the following error:

error[E0277]: the trait bound `f64: std::cmp::Eq` is not satisfied
   --> crates/biome_js_analyze/src/lint/nursery/no_secrets.rs:135:5
    |
130 | #[derive(Clone, Debug, Default, Deserialize, Deserializable, Eq, PartialEq, Serialize)]
    |                                                              -- in this derive macro expansion
...
135 |     entropy_threshold: f64, // @TODO: Doesn't work currently.
    |     ^^^^^^^^^^^^^^^^^^^^^^ the trait `std::cmp::Eq` is not implemented for `f64`
    |
    = help: the following other types implement trait `std::cmp::Eq`:
              i128
              i16
              i32
              i64
              i8
              isize
              u128
              u16
            and 4 others
note: required by a bound in `AssertParamIsEq`

Reading the docs:

The compiler should warn you that MyRuleOptions does not implement some required types. We currently require implementing serde's traits Deserialize/Serialize.

https://github.com/biomejs/biome/blob/main/crates/biome_analyze/CONTRIBUTING.md#:~:text=The%20compiler%20should%20warn%20you%20that%20MyRuleOptions%20does%20not%20implement%20some%20required%20types.%20We%20currently%20require%20implementing%20serde%27s%20traits%20Deserialize/Serialize.

It seems like f64 isn't supported in Options, am I missing out on something?

@Conaclos
Copy link
Member

Conaclos commented Oct 3, 2024

@SaadBazaz

f64 and f32 doesn't implement Eq/PartialEq because directly comparing floats is an error.
I am unsure why we need the options to be Eq/PartialEq.
I have to investigate if we can remove it.

For now, I see two possible approaches:

  1. Implement ourselves PartialEq.
    Usually we compare floats by taking the absolute value of their subtraction and comparing it as EPSILON.
    impl Eq for MyOptions {}
    impl PartialEq for MyOptions {
         fn eq(&self, other: &Self) -> bool {
             (self.entropy_threshold - other.entropy_threshold).abs() < f64:EPSILON
         }
    }
  2. Use an integer instead of float for the threshold.
    For example, we could use a u16.
    We could either divide the provided threshold by 10 (or even 100 if we need more precision: not sure if we need it) before comparing it against the computed threshold.
    Or we could make the reverse: multiply by 10 (or 100) the computed threshold and truncate it before comparing it against the provided threshold.

Personnaly I could choose (2).

@SaadBazaz
Copy link
Contributor Author

Personnaly I could choose (2).

Yeah I thought about (2) too, it'll allow us to have an abstraction for later on when/if we change the underlying entropy function. I'll go for that.

@SaadBazaz
Copy link
Contributor Author

  • Updated entropy function to add a few more metrics which I thought might be useful (i.e. checking consecutive upper/lower cases, checking presence of numbers and symbols)
  • Had to comment out some tests :( I would love to cater them all but too many hyperparameters to balance. Need to go deeper into entropy or have better regexes.
  • Added entropy_threshold as Option.
  • Updated docs
  • Updated error message
  • Updated inspiration to original plugin

@SaadBazaz SaadBazaz requested a review from dyc3 October 3, 2024 19:57
@SaadBazaz SaadBazaz changed the title feat(no-secrets): V2 feat(linter): implement noSecrets V2 Oct 3, 2024
Copy link

codspeed-hq bot commented Oct 3, 2024

CodSpeed Performance Report

Merging #4118 will degrade performances by 8.01%

Comparing SaadBazaz:v2/no-secrets (d65ac68) with main (055b0db)

Summary

❌ 1 regressions
✅ 104 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main SaadBazaz:v2/no-secrets Change
js_analyzer[lint_13640784270757307929.ts] 30.3 ms 32.9 ms -8.01%

@SaadBazaz
Copy link
Contributor Author

@Conaclos @dyc3 Would love a review on the work so far.

Copy link
Contributor

@dyc3 dyc3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick review.

// TODO: Remove these false positives, they unfortunately hurt the user experience.
// const NAMESPACE_CLASSNAME = 'Validation.JSONSchemaValidationUtilsImplFactory';
// const BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
// const webpackFriendlyConsole = require('./config/webpack/webpackFriendlyConsole');
Copy link
Contributor

@dyc3 dyc3 Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could easily exclude strings inside require() and import() calls, but I'm not blocking this PR for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that be by looking at the sibling / previous node in the AST?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent, but yeah.

Copy link
Contributor Author

@SaadBazaz SaadBazaz Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that in V3 alongside implementing JavaScript comments, what say? Because we haven't traversed the AST as of yet in this rule.

crates/biome_js_analyze/src/lint/nursery/no_secrets.rs Outdated Show resolved Hide resolved
crates/biome_js_analyze/src/lint/nursery/no_secrets.rs Outdated Show resolved Hide resolved
Comment on lines +391 to +392
For example, Continuous mixed cases (lIkE tHiS) are more likely to contribute to a higher score than single cases.
Symbols also contribute highly to secrets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you based this on existing works? This could be worth adding references.

Copy link
Contributor Author

@SaadBazaz SaadBazaz Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, I took assumptions and did some prompt engineering (attached in the ChatGPT chat). I'm considering reading some paper(s) on this topic to see if there's a better entropy function out in the wild. Do you have any recommendations for reads?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I have no expertise in that domain and I didn't take the time to read the literature. Have you tried the new version of the rule on some code bases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more test cases in valid.js, it was able to clear at least 1-2 more false positives (still leaves some leftover ones). Not a HUGE impact though.

I will see if I can find time to read through the literature, however, I might defer it to one of my company's team members if it takes too long.

@Conaclos Conaclos changed the title feat(linter): implement noSecrets V2 feat(noSecrets): refine the entropy computation to avoid some false positives Oct 7, 2024
@Conaclos Conaclos merged commit 731d00d into biomejs:main Oct 7, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Linter Area: linter L-JavaScript Language: JavaScript and super languages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants