
Add hash.* field set #426

Merged · 4 commits · May 23, 2019
Conversation

@cwurm commented Apr 10, 2019

Adds a hash.* field set containing various hash algorithms and their values. It is meant to be nested under other entities, notably file (file hashes) and process (hash of the process executable).

These fields are already used by the Auditbeat FIM (file_integrity) module (see docs), and in elastic/beats#11722 I'm proposing to add them to the Auditbeat System process dataset as well.

Note: The Auditbeat FIM module puts these in a top-level hash.* object. I personally think it makes more sense to have it nested under the entity that is hashed (there might even be multiple hashes in one event), which is why I marked it as reusable.top_level: false. File hashes should then go under file.hash.*.
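For illustration (hash values are placeholders), a file event with nested hashes would then look like:

{ "file":
  { "path": "/usr/bin/bash",
    "hash":
      { "md5": "...",
        "sha1": "..." } } }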

Resolves #388.

@cwurm added the review label Apr 10, 2019
@cwurm requested review from webmat and andrewkroh April 10, 2019 15:10
schemas/hash.yml
type: group
short: Hashes, usually file hashes.
description: >
The hash fields represent different hash algorithms and their values.
@andrewkroh (Member)

I recommend establishing an expected format of the hashes such as lowercase hex.

@webmat (Contributor)

Do different implementations for a given hash algorithm sometimes use upper and lower case? If this happens, perhaps we need to address it.

But we also have to think of people copying a hash in their search box. If we mandate simple lowercasing on ingest, searching for an uppercase hash will not yield results.

I think we may want to leave this unaddressed in this PR, and explore the idea of using the lowercase normalizer (normalizer, lowercase). It wasn't a big priority when we determined that protocol names should be lowercased, as they're typically only a few characters long, so no big deal re-typing them.

But lowercasing hashes by hand is another scale of annoyance :-)

Looking at our current lowercasing guidance, I realize that since the asciidoc move, it refers to a documentation section that no longer exists. We'll address that shortly.
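For reference, a minimal sketch of what a lowercase normalizer mapping could look like (index, field, and normalizer names are illustrative):

PUT hashes-demo
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": { "type": "custom", "filter": [ "lowercase" ] }
      }
    }
  },
  "mappings": {
    "properties": {
      "file": {
        "properties": {
          "hash": {
            "properties": {
              "md5": { "type": "keyword", "normalizer": "lowercase_normalizer" }
            }
          }
        }
      }
    }
  }
}

Since a normalizer is also applied to the query term in term-level queries on keyword fields, searching with an uppercase hash would still match the lowercased indexed value.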

@andrewkroh (Member)

How would the file hash be different from the process hash? Are they both hashes of files?

@webmat (Contributor) left a comment

Thanks @cwurm!

Looking at these field definitions, I wonder if we could trim the list down to only the most widely used hashes. We can document the field set so that it's clear people can add their own hash fields if they use other algorithms (guidance: lowercase, underscores between words). This way we avoid having to add more algorithms every month :-)

I think we should also consider giving guidance on how to query for a hash if someone is uncertain which algo was used to produce the hash. Of course in most cases (e.g. with a threat feed in hand) you'll know which type it is. But I'm thinking of people sending each other hashes. Not sure if this scenario is likely, though.

@andrewkroh also brings up a good point about differences in hashing different kinds of payloads. Within ECS, I definitely want hashes to be usable for files and processes, but also for network payloads. If FIM tells me the hash of a file that's available for download via a webserver: would a network monitoring agent compute the same hash for the payload of that file transfer?

If we agree to trim down the list of hashes a bit, we'd need to get the most widely used ones in there. MD5 and SHA1, I suppose. But other than that, I'll have to rely on other people's take.

Perhaps @MikePaquette you can pitch in on this?


@webmat commented Apr 17, 2019

The subtext of my question about FIM vs network monitoring agent is: do we need to support additional meta-data per hashing result?

{ "hash":
  { "md5":
    { "value": "deadbeef",
      "encoding": "latin1" }}}

@webmat mentioned this pull request Apr 17, 2019
@cwurm (Author) commented Apr 17, 2019

> How would the file hash be different from the process hash? Are they both hashes of files?

My idea was that process hashes would be the hashes of their executables (e.g. /usr/bin/bash) - so yes, they'd be file hashes.

To make the expectations clear, it would be nice if we could have separate documentation for file.hash and process.hash, but I don't think that's currently possible with a reused object like this?

> The subtext of my question about FIM vs network monitoring agent is: do we need to support additional meta-data per hashing result?
>
> { "hash":
>   { "md5":
>     { "value": "deadbeef",
>       "encoding": "latin1" }}}

I would think it's unlikely for hashes to be represented in different encodings. Hexadecimal is the only one I've ever seen, so from my perspective, it would be reasonable to expect that. Happy to add a note to the documentation as @andrewkroh suggested.

@webmat commented Apr 17, 2019

Ah, I should have been a bit clearer about what I meant here. I didn't mean the encoding in which the hash is displayed, but rather whether the encoding of the file itself can change as it's transferred (keeping the webserver download example).

I'm pretty sure it wouldn't for binary transfers, but perhaps this can happen for text files (think code files: js, php, etc).

@webmat commented Apr 24, 2019

Another thing I'm wondering is whether we already need to make hash reusable.

Is there a case where one would have both a file.hash.xx and a process.hash.xx in the same event?

@webmat commented Apr 25, 2019

@MikePaquette I'd like your take on a few things here, when you have a moment:

Official hashes

We want to trim down the list of hashes in this proposal, and tell people how to add additional hashes. The goal is we don't want to maintain a complete list of hashes (aka add new hashes every week). What's your take on which ones we should include?

The idea would also be to document the hash field set such that if people use a different hash, they just nest it as well under hash, with proper instructions on field naming (lowercase, underscores).

Reusable

The current proposal is to introduce hash as reusable at file.hash.* and process.hash.* (which is just a hash of the process.executable file). The proposal also specifies that hash is not expected at the top level.

I'm questioning this, and wondering whether we actually expect both file.hash and process.hash to be populated at the same time in one event. If not, I think we could introduce hash not as a reusable field set, but just at the top level.

Events would look like:

{ "process": { "executable": "/usr/bin/bash" },
  "hash": { "md5": "..." }
}

and

{ "file": { "path": "/usr/bin/bash" },
  "hash": { "md5": "..." }
}

@cwurm (Author) commented Apr 25, 2019

> Is there a case where one would have both a file.hash.xx and a process.hash.xx in the same event?

I think so. An example would be an event carrying both a process executable hash and a file hash, where the file is one specified on the command line, or one the process reads or writes while running. I'm not sure anything is generating this kind of data at the moment, but it's something I've thought about.

Just in general, there can be multiple files in an event and in fact we already have this situation in Auditbeat with auditd.paths.
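For illustration, such an event would carry both hashes unambiguously (paths and hash values are placeholders):

{ "process":
  { "executable": "/usr/bin/bash",
    "hash": { "sha256": "..." } },
  "file":
  { "path": "/etc/passwd",
    "hash": { "sha256": "..." } } }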

@MikePaquette (Contributor)

@webmat

> We want to trim down the list of hashes in this proposal, and tell people how to add additional hashes. The goal is we don't want to maintain a complete list of hashes (aka add new hashes every week). What's your take on which ones we should include?
>
> The idea would also be to document the hash field set such that if people use a different hash, they just nest it as well under hash, with proper instructions on field naming (lowercase, underscores).

Since these are all "extended" level fields, and beats implementations that use them already exist, I am not opposed to including the entire set in the ECS spec. I think this will result in more compatibility than if we encourage users to define their own custom hash.* fields.

> The current proposal is to introduce hash as reusable at file.hash.* and process.hash.* (which is just a hash of the process.executable file). The proposal also specifies that hash is not expected at the top level.
>
> I'm questioning this, and wondering whether we actually expect both file.hash and process.hash to be populated at the same time in one event. If not, I think we could introduce hash not as a reusable field set, but just at the top level.

From a search perspective, I can see both positions here. Allowing hash.* to be reusable does allow some granularity when searching, for example file.hash:"deadf00d", and you could still use the leading wildcard *.hash:"deadf00d" to search across all uses of hash. However, I think you could get similar granularity by including event.category or some other event categorization field in your search, so I'm not sure the re-usability is required.

I think the multiple-hashes-in-one-event case could tip the odds one way or the other. How are we going to handle that case? Is using the hash.* fields as arrays sufficient? What about related.hash.*?
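For context, one hypothetical shape for related.hash, following the pattern of related.ip (this field does not exist in ECS at this point; values are placeholders):

{ "file": { "hash": { "md5": "aabb..." } },
  "process": { "hash": { "md5": "ccdd..." } },
  "related": { "hash": [ "aabb...", "ccdd..." ] } }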

@webmat commented Apr 30, 2019

Heads up on the leading wildcard queries: they are purely a Kibana trick, not a core Elasticsearch feature. When querying e.g. *.hash.md5:deadf00d, Kibana turns this into an OR query that explicitly lists all field names matching the wildcard. But related.hash will be the "real" solution to this anyway, so that's no big deal.
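Concretely, assuming hash is reused only under file and process, the expansion would look like:

*.hash.md5:deadf00d
becomes
file.hash.md5:deadf00d OR process.hash.md5:deadf00d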

With that said, I think I've been convinced that we need the reusability. If an event describes a file interaction by a process, and a hash for each is calculated, it should not be ambiguous which hash matches which entity.

Finally, I think we still need the hash field set to document how to add additional hashes correctly, because the current list is missing a bunch more hashes that have already been mentioned in issue #388. The point I'm making is not that we should go add these ones now. Because even with those, the list is still not complete :-)

So with that said, I do think there would be value in trimming down the list somewhat, because the list is quite long, and I wonder how many of these hashes have been added only for completeness' sake. Or are all of these shades of SHA hashes actually being used by many threat indicator sources / detection systems?

If they are all legitimately used, I could go with adding them all. But I'd need to be convinced :-)

@webmat commented May 1, 2019

@cwurm When you fix the changelog conflict, please keep your line above the line for PR 439, to reduce chances of further conflicts, if we merge other things before this. It will also keep the entries loosely ordered by PR.

@MikePaquette (Contributor) left a comment

LGTM. I am OK with either the longer list or the shorter list.

@cwurm (Author) commented May 1, 2019

Ok, so my takeaway is three things to be done:

  1. Trim list down to most common hashes.
  2. Document how to customize with more hash fields.
  3. Fix changelog conflict but maintain order in the changelog.

I'll make those changes.

@cwurm (Author) commented May 20, 2019

  • Trim list down to most common hashes.

I've trimmed down the list of hashes to MD5, SHA1, SHA256, and SHA512.

  • Document how to customize with more hash fields.

Added: "Field names for common hashes (e.g. MD5, SHA1) are predefined. Add fields for other hashes by lowercasing the hash algorithm and using underscore separators as appropriate (snake case)."
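Taken together, a rough sketch of the resulting schemas/hash.yml (abbreviated; exact attributes and wording may differ from the final file):

- name: hash
  title: Hash
  type: group
  short: Hashes, usually file hashes.
  description: >
    The hash fields represent different hash algorithms and their values.
    Field names for common hashes (e.g. MD5, SHA1) are predefined. Add fields
    for other hashes by lowercasing the hash algorithm name and using
    underscore separators as appropriate (snake case, e.g. sha3_512).
  reusable:
    top_level: false
    expected:
      - file
      - process
  fields:
    - name: md5
      level: extended
      type: keyword
      description: MD5 hash.
    - name: sha1
      level: extended
      type: keyword
      description: SHA1 hash.
    - name: sha256
      level: extended
      type: keyword
      description: SHA256 hash.
    - name: sha512
      level: extended
      type: keyword
      description: SHA512 hash.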

@webmat @andrewkroh @MikePaquette Let me know if this looks good now.

schemas/hash.yml (outdated)
The hash fields represent different hash algorithms and their values.

Field names for common hashes (e.g. MD5, SHA1) are predefined. Add fields for other hashes
by lowercasing the hash algorithm and using underscore separators as appropriate (snake case).
@webmat (Contributor)

Small clarification: "the hash algorithm name". I'd also give an example at the end of the sentence, something like "for example sha3_512"

@webmat (Contributor) left a comment

I'm good with this trimmed down list. Thanks @cwurm!

One small nit above, not the end of the world. Then I'm good to merge this.

@cwurm (Author) commented May 22, 2019

Thanks @webmat, I fixed the description.

@MikePaquette you requested changes to this, does this look good now?

@webmat merged commit de2b8b5 into elastic:master May 23, 2019