Package duplicated by different cataloger #931
Comments
Hi @WhyJee, good eye! The short answer is: this is known behavior. And you're right that Syft's philosophy is to surface as much data as it's aware of. In this case it found evidence of a Python package and also of an RPM package. But Syft is also aware of the relationship between these two packages. For this "libcomps" example, you should see an item like the following in Syft's output:
{
"parent": "f3a95e529e656bd7",
"child": "56567855bd1c8c05",
"type": "ownership-by-file-overlap",
"metadata": {
"files": [
"/usr/lib64/python3.6/site-packages/libcomps-0.1.16-py3.6.egg-info"
]
}
}
This way, if a consumer of this data wanted to, they could intentionally and explicitly filter out child packages that are part of
I think this is a great idea for a new feature. As we saw above, Syft already has the data needed to do this. It would just need to expose the filtering functionality via a CLI flag or something. How does that sound? |
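For anyone who wants to try this on the consumer side today, here is a minimal sketch of that filtering. It assumes the Syft JSON document has top-level `artifacts` and `artifactRelationships` arrays and that each package carries an `id`; the function name `drop_owned_children` and the trimmed sample document are made up for illustration, and only the relationship entry is copied from the comment above.

```python
def drop_owned_children(sbom):
    """Return the packages that are not the child of an ownership-by-file-overlap relationship."""
    owned_ids = {
        rel["child"]
        for rel in sbom.get("artifactRelationships", [])
        if rel.get("type") == "ownership-by-file-overlap"
    }
    return [pkg for pkg in sbom.get("artifacts", []) if pkg.get("id") not in owned_ids]


# Hypothetical, heavily trimmed document: the two package stubs are assumptions,
# the relationship entry is the one quoted in the thread.
sbom = {
    "artifacts": [
        {"id": "f3a95e529e656bd7", "name": "libcomps", "type": "rpm"},
        {"id": "56567855bd1c8c05", "name": "libcomps", "type": "python"},
    ],
    "artifactRelationships": [
        {
            "parent": "f3a95e529e656bd7",
            "child": "56567855bd1c8c05",
            "type": "ownership-by-file-overlap",
            "metadata": {
                "files": [
                    "/usr/lib64/python3.6/site-packages/libcomps-0.1.16-py3.6.egg-info"
                ]
            },
        }
    ],
}

kept = drop_owned_children(sbom)
print([pkg["type"] for pkg in kept])
```

This always keeps the parent and drops the child, which matches case 1 discussed below but not case 2, where a consumer may prefer to keep what the package delivers.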
@luhring so it seems that the feature (or part of it) is already there. It is more a matter of SBOM processing in order to "minimize" it. Now we may say we do have two use-cases:
In case 1, you may want to eliminate one of the two (which one is another story, but normally you would pick the enclosing package). In case 2, I may be interested in eliminating my package and keeping only what it delivers. So in that case I need to find the ID of my package to decide whether to remove it. But if I have a standard cleanup mechanism, I will need a way to indicate which children (X, Y, Z, ...) must not be removed. In any case, we may say that the ball is in the court of the SBOM consumer (and not the producer, here Syft). Note that the picture is probably more complex, as there are more relationships that we (I) may want to see in an SBOM.
So we may have in the end something like this (not a complete picture, as I may have forgotten other relationship types; using SPDX relationship names):
graph TD
R1[pkg:rpm/myRpm-1.2.3] --> |contains| P1
R1 --> |hasPrerequisites| R2[pkg:rpm/python-3.y.z]
P1[pkg:pypi/myMod-2.3.4] --> |descendantOf| P2
P2[pkg:pypi/aMod-4.5.6] --> |dependsOn| P3
P3[pkg:pypi/otherMod-7.8.9]
There is a simpler case where you did not rename the module but just override the version such as This implies that :
Thanks for the thoughtful depiction here! I think I'm following. So it sounds like the actionable piece of this is that there are two potential feature enhancements we could add to Grype (the vulnerability scanner companion to Syft).
Does this align with your thinking? |
Also just quick notes: @WhyJee would you be ok with me updating the issue name to reflect the new enhancement here? |
@spiffcs sorry for the delay. I was on a trip and then got too caught up in day-to-day activities.
@luhring I believe there could be some enhancements on the Syft side as well. I would see its logic as a multistage pattern:
With this type of logic you can also assess the score of the findings:
Regarding the Grype enhancements, both sound sensible. For 1., people using such an option should be aware that, depending on which package identification the vulnerability database uses (the contained or the container), the final VEX may miss some vulnerabilities. But the problem lies more with vulnerability identification than with the SBOM.
An example of the consequence of this behavior in another context. An image is scanned with Syft and the generated SBOM is pushed to a vulnerability tool.
CVE-2021-3421 is reported against this product for the 2nd item (Python) whereas it is fixed as per RHSA-2021:2574 - Security Advisory since version 4.14.3-14. |
@luhring I have the same issue, but with different catalogers (binary and sbom), and in this case there is no direct relationship between them...
Will detect "Redis" and "redis" with respectively the
However there is no relationship between them... how should I filter these? |
cc: @wagoodman |
@matthyx I'm not certain, in the specific case you brought up, that there is enough information from syft's perspective to know that these are two different redis packages. That is, name and version might not be enough information in all cases; to create a relationship we need one package to claim ownership of a location that another package was defined by. For instance, a package found in an RPM DB lists out all files owned by a package... if that package is a python wheel, then the python cataloger will also pick up on it. The key to knowing if the packages describe the same thing is whether the RPM file ownership locations overlap with the python package evidence locations, in this case a python wheel metadata file. Then we'd be able to determine there is some overlap in ownership. Even though you provided a small snippet, the binary and sbom catalogers today do not list out owned files from packages they raise up, so it isn't possible yet to create ownership overlap relationships for these two packages.
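To make the overlap rule concrete, here is a rough sketch of the idea, not syft's actual implementation. The field names `ownedFiles` and `locations`, the package ids, and the non-RPM paths are all hypothetical; the point is only that a relationship can be derived when one package lists owned files and another package's evidence sits at one of those paths.

```python
def find_ownership_overlaps(packages):
    """Emit a relationship whenever a package's owned files include another package's evidence location."""
    relationships = []
    for parent in packages:
        owned = set(parent.get("ownedFiles", []))
        if not owned:
            # Catalogers that list no owned files can never be a parent.
            continue
        for child in packages:
            if child is parent:
                continue
            shared = sorted(owned.intersection(child.get("locations", [])))
            if shared:
                relationships.append(
                    {
                        "parent": parent["id"],
                        "child": child["id"],
                        "type": "ownership-by-file-overlap",
                        "metadata": {"files": shared},
                    }
                )
    return relationships


EGG_INFO = "/usr/lib64/python3.6/site-packages/libcomps-0.1.16-py3.6.egg-info"

# Hypothetical packages: only the RPM entry lists owned files.
packages = [
    {"id": "rpm-libcomps", "locations": ["/var/lib/rpm/Packages"], "ownedFiles": [EGG_INFO]},
    {"id": "py-libcomps", "locations": [EGG_INFO]},
    {"id": "bin-redis", "locations": ["/usr/local/bin/redis-server"]},
    {"id": "sbom-redis", "locations": ["/opt/app/sbom.json"]},
]

relationships = find_ownership_overlaps(packages)
print(relationships)
```

The two redis entries produce nothing because neither lists owned files, which is the situation described above for the binary and sbom catalogers.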
@wagoodman alright... so there is nothing we could do in that particular case (and other similar ones). |
Popping off the stack some, ideally syft does persist all information found for packages discovered. However, that goal might not be mutually exclusive with merging packages together (slightly different than deduplicating, which may be lossy to the full set of data on a package object). I feel that such an approach probably has tradeoffs. Today the syft package data model allows for a package to be found by a single cataloger and for tailored metadata to be persisted in the metadata field:
{
"id": "83df403875b8c91",
"name": "curl",
"version": "8.2.7",
"type": "binary",
"foundBy": "binary-cataloger",
"purl": "pkg:github/curl/curl@8.2.7",
"metadata": {
"matches": [
{
"classifier": "curl-binary",
"location": "..."
}
]
}
}
(omitting a few things...) But if we wanted to merge this definition with the dpkg one:
{
"id": "413d8f5b0378c98",
"name": "curl",
"version": "8.2.7",
"type": "dpkg",
"foundBy": "dpkg-cataloger",
"purl": "pkg:deb/curl@8.2.7",
"metadata": {
"architecture": "x86-86",
"installedSize": 8392,
"maintainer": "[email protected]"
}
}
... we'd be out of luck, since there are several singular fields that have different values, thus you'd need to drop one. But if the package object were modified to allow for merging logical packages found in different ways, then it would be possible:
{
"name": "curl",
"version": "8.2.7",
"evidence": [
{
"id": "83df403875b8c91",
"type": "binary",
"foundBy": "binary-cataloger",
"purl": "pkg:github/curl/curl@8.2.7",
"metadata": {
"matches": [
{
"classifier": "curl-binary",
"location": "..."
}
]
}
},
{
"id": "413d8f5b0378c98",
"type": "dpkg",
"foundBy": "dpkg-cataloger",
"purl": "pkg:deb/curl@8.2.7",
"metadata": {
"architecture": "x86-86",
"installedSize": 8392,
"maintainer": "[email protected]"
}
}
]
}
There are some important things to note with this hypothetical object. Today nodes in the graph are package objects, which is relatively simple. With this new approach the evidence array elements would be nodes in the graph, not the logical package. Why? We don't know if there are multiple dependency graphs in the greater SBOM being explained, where each node would be a part of a different dependency graph... making the logical package the node represented in the graph means that we'd be effectively merging currently separate dependency graphs. It's ok if the two separate evidence nodes are related across the two dependency graphs, as long as that relationship is not expressed as a dependency.
This new hypothetical object also makes it a little harder for consumers to use the data. Instead of a small jq command to understand answers to simple questions (what is the type of package name==x?
Popping out of the hypothetical package object example... I still think the default behavior of syft should raise as much of the raw information as possible, but I think there is room for allowing for opt-in filtering or deduplication logic. (needs more thought though...)
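As a rough illustration of the hypothetical merged object, the sketch below groups package records by name and version and moves every other field into an `evidence` entry. The function name `merge_into_logical_packages` is made up, the sample records are trimmed versions of the two curl objects above, and grouping on name and version alone is a simplifying assumption (as noted earlier in the thread, that pair is not always enough to say two packages are the same thing).

```python
def merge_into_logical_packages(packages):
    """Group package records by (name, version), keeping each record as an evidence entry."""
    logical = {}
    for pkg in packages:
        key = (pkg["name"], pkg["version"])
        entry = logical.setdefault(
            key, {"name": pkg["name"], "version": pkg["version"], "evidence": []}
        )
        # Everything except the shared identity fields stays with the evidence,
        # so nothing found by a cataloger is lost in the merge.
        evidence = {k: v for k, v in pkg.items() if k not in ("name", "version")}
        entry["evidence"].append(evidence)
    return list(logical.values())


# Trimmed versions of the two curl records from the comment above.
packages = [
    {"id": "83df403875b8c91", "name": "curl", "version": "8.2.7",
     "type": "binary", "foundBy": "binary-cataloger"},
    {"id": "413d8f5b0378c98", "name": "curl", "version": "8.2.7",
     "type": "dpkg", "foundBy": "dpkg-cataloger"},
]

merged = merge_into_logical_packages(packages)
print(len(merged), [e["type"] for e in merged[0]["evidence"]])
```

Each evidence entry keeps its own `id`, so relationships could still point at individual evidence nodes rather than at the logical package, which is the constraint described above.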
cc: @slashben |
We've had some internal discussion regarding the correctness of overlapping packages in an SBOM, so I wanted to get the ball rolling on this thread for what this kind of opt-in filtering enhancement would look like if we were to add it to the syft tool. Just to set a baseline, we are NOT talking about how to enhance the current SBOM for downstream tooling like grype. We're discussing opt-in behavior for syft (non-default flag or config) where it filters and decides a winner between packages that overlap via the
Feature
Before generating the SBOM, if syft detects a difference between the two packages' information that are related via the
Current Default State
The current philosophy for package overlap is as follows (chime in @anchore/tools if any of this seems wrong):
The above is in place because it's currently not clear 100% of the time that some other cataloger (SBOM, binary, ecosystem, etc) will be wrong relative to an OS-package cataloger. There is no one-size-fits-all cataloger hierarchy. Said another way, Syft currently has no mechanism for assessing correctness between two catalogers' output if a conflict in package information arises, but
Consider the following case of two packages:
{
"id": "83df403875b8c91",
"name": "curl",
"version": "8.2.7",
"type": "binary",
"foundBy": "binary-cataloger",
"purl": "pkg:github/curl/curl@8.2.7",
"metadata": {
"matches": [
{
"classifier": "curl-binary",
"location": "..."
}
]
}
}
{
"id": "413d8f5b0378c98",
"name": "curl",
"version": "8.2.7-rc1",
"type": "dpkg",
"foundBy": "dpkg-cataloger",
"purl": "pkg:deb/[email protected]",
"metadata": {
"architecture": "x86-86",
"installedSize": 8392,
"maintainer": "[email protected]"
}
}
These packages have the following relationship:
In the default SBOM, syft would surface both packages given the conflicting information. One of these packages would turn out to have incorrect version information upon investigation, but there is no current rule to say
Way forward
The thread is open for discussion on design/implementation on how we can best build this more advanced context into syft's current mechanisms so that users have more agency over filtering the SBOM they create. I'll follow up with my own proposal separate to this framing comment so that we can keep problem/solution discussion separate.
As this is related to https://support.anchore.com/hc/en-us/requests/4315 I was hoping you could tell me when this issue is planned to be worked on? If the
I can say from my experience, binary detections should always be at the bottom - they have the least accurate information. Merging/de-duping/otherwise removing the binary component is not a 'lossy' action, it's just a bad detection - the version is purely wrong. It may be the upstream version, but that doesn't make it correct for the package, as distros frequently apply patches and change the release string, which changes the component version. The package manager manifest has the correct information for that component, and where you know there is a relationship between the two components (a shared files list from the package manager metadata, e.g. /var/lib/dpkg/info/<package_name>.list), you should delete the bad detection in favor of the correct information from the package manager.
Hey @christinahaig! Thanks for following up here - The solution of:
Is a great suggestion. My thoughts here are adding a few things to the syft configuration that help accomplish this in a two pass approach. The first pass would just be:
This would allow syft to prune the generated SBOM before it's output, much in the same way that https://github.com/anchore/grype already filters packages using the
A follow up to that would be what you suggested, where a cataloger precedence construct is added that the user can configure. This one needs a bit more design work, as the catalogers are hierarchical by convention only right now. They would need some kind of additions or alterations that allow a user configuration to hook into / identify cataloger "types" (binary, distro, package manager etc) that could be assigned a precedence.
As to when this is being worked on: I can take a look at getting the above config option added this afternoon with a PR for review by the rest of the @anchore/tools
I'm starting to see the case for why a binary package in particular should probably not be included if there is an owning package... since the "binary packages" were entirely synthesized by syft. That may call for excluding them by default.
This resonates with me too. Here's an option of what that might look like in syft configuration:
drop-packages-with-ownership-overlap:
  - parent-type: class:os
    type: binary
Where class:os would be equivalent to:
drop-packages-with-ownership-overlap:
  - type: binary
    parent-type:
      - "apk"
      - "alpm"
      - "rpm"
      - "dpkg"
      - "portage"
This could be the default configuration. However, we could allow for simple expressions like:
drop-packages-with-ownership-overlap:
  # drop any python package that is owned by an RPM package
  - parent-type: rpm
    type: python
Alternatively we could allow for something as agnostic as dropping packages based off of more generic criteria:
drop-packages:
  - relationship-type: ownership-by-file-overlap
    parent-type: class:os
    type: binary
But this is really starting to get into something like #31 ... but I'd like to avoid this since
One thing that I feel is unanswered is should we be looking at exclusively the relationships and package types? Or should there be more to match on in order to drop a package? For instance, what if an OS package contains multiple binaries, should we suppress the binary packages then? Or what if a binary contained within an OS package does not logically represent the same package name as the OS package name (e.g. an RPM for |
This is a great point! Here is a list of other fields we can match on to make this more exact. The current implementation in #1948 is very bare bones, in that a match will be excluded based on the relationship existing and the types being of the correct orientation (from: OS, to: binary)
There was a suggestion in another issue that PURL could be a candidate to consider here. I do know that Example:
Other consideration:
^ My Opinion: Edit: Weston makes a good point below that more exact matching on Name would still keep some of the frustration persisting here |
I suspect that's not going to work particularly well because the package manager names will often not match the syft constructed names (for instance |
agreed that a specific approach would be needed (we can look for partial matches or similarity). The higher level question is should we be trying to determine if the binary package is being represented by the OS package? or not try and detect this? I feel that not accounting for this will filter out packages that should remain in the SBOM. |
Fixes #931
PR #1948 introduces a new implicit exclusion for binary packages that overlap by file ownership and have certain characteristics:
1) the relationship between packages is OwnershipByFileOverlap
2) the parent package is an "os" package - see changelog for included catalogers
3) the child is a synthetic package generated by the binary cataloger - see changelog for included catalogers
4) the package names are identical
---------
Signed-off-by: Christopher Phillips <[email protected]>
@christinahaig - we just merged #1948 and it should go into the next syft release - feedback is always welcome and we hope that this new default configuration reduces the noise you were seeing when synthetic packages were incorrectly constructed and had a valid OS overlap =) |
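For readers skimming the thread, the four conditions from the #1948 description can be sketched roughly as below. This is not the merged Go code: it matches on package `type` rather than on the cataloger, and the `OS_TYPES` set is an assumption borrowed from the list proposed earlier in the thread (the PR itself defers to the changelog for the included catalogers). The redis entries and their ids are hypothetical.

```python
# Assumption: treat these package types as "os" packages, per the earlier proposal.
OS_TYPES = {"apk", "alpm", "rpm", "dpkg", "portage"}


def is_excluded(rel, by_id):
    """True when a relationship satisfies all four conditions listed in the PR description."""
    parent = by_id.get(rel["parent"])
    child = by_id.get(rel["child"])
    if parent is None or child is None:
        return False
    return (
        rel["type"] == "ownership-by-file-overlap"  # 1) overlap relationship
        and parent["type"] in OS_TYPES              # 2) parent is an OS package
        and child["type"] == "binary"               # 3) child is a synthetic binary package
        and parent["name"] == child["name"]         # 4) names are identical
    )


def exclude_overlapping_binaries(packages, relationships):
    """Drop every child package whose relationship meets the exclusion conditions."""
    by_id = {pkg["id"]: pkg for pkg in packages}
    dropped = {rel["child"] for rel in relationships if is_excluded(rel, by_id)}
    return [pkg for pkg in packages if pkg["id"] not in dropped]


packages = [
    {"id": "413d8f5b0378c98", "name": "curl", "type": "dpkg"},
    {"id": "83df403875b8c91", "name": "curl", "type": "binary"},
    {"id": "f3a95e529e656bd7", "name": "libcomps", "type": "rpm"},
    {"id": "56567855bd1c8c05", "name": "libcomps", "type": "python"},
    {"id": "os-redis", "name": "redis-server", "type": "dpkg"},
    {"id": "bin-redis", "name": "redis", "type": "binary"},
]
relationships = [
    {"parent": "413d8f5b0378c98", "child": "83df403875b8c91", "type": "ownership-by-file-overlap"},
    {"parent": "f3a95e529e656bd7", "child": "56567855bd1c8c05", "type": "ownership-by-file-overlap"},
    {"parent": "os-redis", "child": "bin-redis", "type": "ownership-by-file-overlap"},
]

remaining = exclude_overlapping_binaries(packages, relationships)
print([pkg["id"] for pkg in remaining])
```

Only the binary curl record is dropped here. The python libcomps package stays because the child is not a binary package, and the redis binary stays because its name differs from the owning OS package, which is the name-matching limitation raised earlier in the thread.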
What happened:
Scanning the almalinux:latest image with various tools to compare the generated SBOMs. At first sight Syft looked better, as the total number of identified components was greater. But a deeper analysis showed that Syft had identified the same packages from both rpm and python.
Example Rpm cataloger finding:
Example Python Cataloger finding:
In fact this Python package is delivered by the RPM above, so both findings should point to the same package.
What you expected to happen:
In fact I am not sure whether it is good or bad to have duplicates for the same package. Note that the purl/cpe are different.
Searching on NVD (https://nvd.nist.gov/products/cpe/search/results?namingFormat=2.3&keyword=libcomps), the CPEs are only rpm based (cpe:2.3:a:rpm:libcomps:...). Thus we may think that only the rpm is necessary. Maybe a way to reduce the findings when one also belongs to another packager could be provided.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
syft version: 0.38.0 (and same result with 0.42.4)
OS (cat /etc/os-release or similar): N/A