
Add hash.* field set #426

Merged · 4 commits · May 23, 2019
Conversation

@cwurm commented Apr 10, 2019

Adds a hash.* field set containing various hash algorithms and their values. It is meant to be nested under other entities, notably file (file hashes) and process (hash of the process executable).

These fields are already used by the Auditbeat FIM (file_integrity) module (see docs), and in elastic/beats#11722 I'm proposing to add them to the Auditbeat System process dataset as well.

Note: The Auditbeat FIM module puts these in a top-level hash.* object. I personally think it makes more sense to have it nested under the entity that is hashed (there might even be multiple hashes in one event), which is why I marked it as reusable.top_level: false. File hashes should then go under file.hash.*.
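For illustration (hash values are placeholders), a file event with nested hashes would then look like:

{ "file":
  { "path": "/usr/bin/bash",
    "hash":
      { "md5": "...",
        "sha1": "..." } } }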

Resolves #388.

@cwurm added the review label Apr 10, 2019
@cwurm requested review from webmat and andrewkroh April 10, 2019 15:10
schemas/hash.yml
type: group
short: Hashes, usually file hashes.
description: >
The hash fields represent different hash algorithms and their values.
@andrewkroh (Member)

I recommend establishing an expected format of the hashes such as lowercase hex.

@webmat (Contributor)

Do different implementations for a given hash algorithm sometimes use upper and lower case? If this happens, perhaps we need to address it.

But we also have to think of people copying a hash in their search box. If we mandate simple lowercasing on ingest, searching for an uppercase hash will not yield results.

I think we may want to leave this unaddressed in this PR, and explore the idea of using the lowercase normalizer (normalizer, lowercase). It wasn't a big priority when we determined that protocol names should be lowercased, as they're typically only a few characters long, so no big deal re-typing them.

But lowercasing hashes by hand is another scale of annoyance :-)

Looking at our current lowercasing guidance, I realize that since the asciidoc move, it refers to a documentation section that no longer exists. We'll address that shortly.
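For reference, a minimal sketch of what a lowercase normalizer mapping could look like (index, field, and normalizer names are illustrative):

PUT hashes-demo
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": { "type": "custom", "filter": [ "lowercase" ] }
      }
    }
  },
  "mappings": {
    "properties": {
      "file": {
        "properties": {
          "hash": {
            "properties": {
              "md5": { "type": "keyword", "normalizer": "lowercase_normalizer" }
            }
          }
        }
      }
    }
  }
}

Since a normalizer is also applied to the query term in term-level queries on keyword fields, searching with an uppercase hash would still match the lowercased indexed value.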

@andrewkroh (Member)

How would the file hash be different from the process hash? Are they both hashes of files?

@webmat (Contributor) left a comment

Thanks @cwurm!

Looking at these field definitions, I wonder if we could trim the list down to only the most widely used hashes. We can document the field set so that it's clear people can add their own hash fields if they use other algorithms (guidance: lowercase, underscores between words). This way we avoid having to add more algorithms every month :-)

I think we should also consider giving guidance on how to query for a hash if someone is uncertain which algo was used to produce the hash. Of course in most cases (e.g. with a threat feed in hand) you'll know which type it is. But I'm thinking of people sending each other hashes. Not sure if this scenario is likely, though.

@andrewkroh also brings up a good point about differences in hashing different kinds of payloads. Within ECS, I definitely want hashes to be usable for files and processes, but also for network payloads. If FIM tells me the hash of a file that's available for download via a webserver: would a network monitoring agent compute the same hash for the payload of that file transfer?

If we agree to trim down the list of hashes a bit, we'd need to get the most widely used ones in there. MD5 and SHA1, I suppose. But other than that, I'll have to rely on other people's take.

Perhaps @MikePaquette you can pitch in on this?


@webmat commented Apr 17, 2019

The subtext of my question about FIM vs network monitoring agent is: do we need to support additional meta-data per hashing result?

{ "hash":
  { "md5":
    { "value": "deadbeef",
      "encoding": "latin1" }}}

@webmat mentioned this pull request Apr 17, 2019
@cwurm (Author) commented Apr 17, 2019

> How would the file hash be different from the process hash? Are they both hashes of files?

My idea was that process hashes would be the hashes of their executables (e.g. /usr/bin/bash) - so yes, they'd be file hashes.

To make the expectations clear, it would be nice if we could have separate documentation for file.hash and process.hash, but I don't think that's currently possible with a reused object like this?

> The subtext of my question about FIM vs network monitoring agent is: do we need to support additional meta-data per hashing result?
>
> { "hash":
>   { "md5":
>     { "value": "deadbeef",
>       "encoding": "latin1" }}}

I would think it's unlikely for hashes to be represented in different encodings. Hexadecimal is the only one I've ever seen, so from my perspective, it would be reasonable to expect that. Happy to add a note to the documentation as @andrewkroh suggested.

@webmat commented Apr 17, 2019

Ah, I should have been a bit clearer about what I meant here. I didn't mean the encoding in which the hash is displayed, but rather whether the encoding of the file itself can change as it's transferred (keeping the webserver download example).

I'm pretty sure it wouldn't for binary transfers, but perhaps this can happen for text files (think code files: js, php, etc).

@webmat commented Apr 24, 2019

Another thing I'm wondering is whether we already need to make hash reusable.

Is there a case where one would have both a file.hash.xx and a process.hash.xx in the same event?

@webmat commented Apr 25, 2019

@MikePaquette I'd like your take on a few things here, when you have a moment:

Official hashes

We want to trim down the list of hashes in this proposal, and tell people how to add additional hashes. The goal is we don't want to maintain a complete list of hashes (aka add new hashes every week). What's your take on which ones we should include?

The idea would also be to document the hash field set such that if people use a different hash, they just nest it as well under hash, with proper instructions on field naming (lowercase, underscores).

Reusable

The current proposal is to introduce hash as reusable at file.hash.* and process.hash.* (which is just a hash of the process.executable file). The proposal also specifies that hash is not expected at the top level.

I'm questioning this, and wondering whether we actually expect both file.hash and process.hash to be populated at the same time in one event. If not, I think we could introduce hash not as a reusable field set, but just at the top level.

Events would look like:

{ "process": { "executable": "/usr/bin/bash" },
  "hash": { "md5": "..." }
}

and

{ "file": { "path": "/usr/bin/bash" },
  "hash": { "md5": "..." }
}

@cwurm (Author) commented Apr 25, 2019

> Is there a case where one would have both a file.hash.xx and a process.hash.xx in the same event?

I think so. An example would be an event carrying both a process executable hash and a file hash, where the file is one specified on the command line, or one the process reads or writes while running. I'm not sure anything is generating this kind of data at the moment, but it's something I've thought about.

Just in general, there can be multiple files in an event and in fact we already have this situation in Auditbeat with auditd.paths.
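For illustration, such an event would carry both hashes unambiguously (paths and hash values are placeholders):

{ "process":
  { "executable": "/usr/bin/bash",
    "hash": { "sha256": "..." } },
  "file":
  { "path": "/etc/passwd",
    "hash": { "sha256": "..." } } }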

@MikePaquette (Contributor)

@webmat

> We want to trim down the list of hashes in this proposal, and tell people how to add additional hashes. The goal is we don't want to maintain a complete list of hashes (aka add new hashes every week). What's your take on which ones we should include?
>
> The idea would also be to document the hash field set such that if people use a different hash, they just nest it as well under hash, with proper instructions on field naming (lowercase, underscores).

Since these are all "extended" level fields, and beats implementations that use them already exist, I am not opposed to including the entire set in the ECS spec. I think this will result in more compatibility than if we encourage users to define their own custom hash.* fields.

> The current proposal is to introduce hash as reusable at file.hash.* and process.hash.* (which is just a hash of the process.executable file). The proposal also specifies that hash is not expected at the top level.
>
> I'm questioning this, and wondering whether we actually expect both file.hash and process.hash to be populated at the same time in one event. If not, I think we could introduce hash not as a reusable field set, but just at the top level.

From a search perspective, I can see both positions here. Allowing hash.* to be reusable does allow some granularity when searching, for example file.hash:"deadf00d", and you could still use the leading wildcard *.hash:"deadf00d" to search across all uses of hash. However, I think you could get similar granularity by including event.category or some other event categorization field in your search, so I'm not sure the re-usability is required.

I think the multiple-hashes-in-one-event case could tip the odds one way or the other. How are we going to handle that case? Is using the hash.* fields as arrays sufficient? What about related.hash.*?
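For context, one hypothetical shape for related.hash, following the pattern of related.ip (this field does not exist in ECS at this point; values are placeholders):

{ "file": { "hash": { "md5": "aabb..." } },
  "process": { "hash": { "md5": "ccdd..." } },
  "related": { "hash": [ "aabb...", "ccdd..." ] } }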

@webmat commented Apr 30, 2019

Heads up on the leading wildcard queries: they are purely a Kibana trick, not a core Elasticsearch feature. When querying e.g. *.hash.md5:deadf00d, Kibana turns this into an OR query that explicitly lists all field names matching the wildcard. But related.hash will be the "real" solution to this anyway, so that's no big deal.
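Concretely, assuming hash is reused only under file and process, the expansion would look like:

*.hash.md5:deadf00d
becomes
file.hash.md5:deadf00d OR process.hash.md5:deadf00d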

With that said, I think I've been convinced that we need the reusability. If an event describes a file interaction by a process, and a hash for each is calculated, it should not be ambiguous which hash matches which entity.

Finally, I think we still need the hash field set to document how to add additional hashes correctly, because the current list is missing a bunch more hashes that have already been mentioned in issue #388. The point I'm making is not that we should go add these ones now. Because even with those, the list is still not complete :-)

So with that said, I do think there would be value in trimming down the list somewhat, because the list is quite long, and I wonder how many of these hashes have been added only for completeness' sake. Or are all of these shades of SHA hashes actually being used by many threat indicator sources / detection systems?

If they are all legitimately used, I could go with adding them all. But I'd need to be convinced :-)

@webmat commented May 1, 2019

@cwurm When you fix the changelog conflict, please keep your line above the line for PR 439, to reduce chances of further conflicts, if we merge other things before this. It will also keep the entries loosely ordered by PR.

@MikePaquette (Contributor) left a comment

LGTM. I am OK with either the longer list or the shorter list.

@cwurm (Author) commented May 1, 2019

Ok, so my takeaway is three things to be done:

  1. Trim list down to most common hashes.
  2. Document how to customize with more hash fields.
  3. Fix changelog conflict but maintain order in the changelog.

I'll make those changes.

@cwurm (Author) commented May 20, 2019

  • Trim list down to most common hashes.

I've trimmed down the list of hashes to MD5, SHA1, SHA256, and SHA512.

  • Document how to customize with more hash fields.

Added: "Field names for common hashes (e.g. MD5, SHA1) are predefined. Add fields for other hashes by lowercasing the hash algorithm and using underscore separators as appropriate (snake case)."
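Taken together, a rough sketch of the resulting schemas/hash.yml (abbreviated; exact attributes and wording may differ from the final file):

- name: hash
  title: Hash
  type: group
  short: Hashes, usually file hashes.
  description: >
    The hash fields represent different hash algorithms and their values.
    Field names for common hashes (e.g. MD5, SHA1) are predefined. Add fields
    for other hashes by lowercasing the hash algorithm name and using
    underscore separators as appropriate (snake case, e.g. sha3_512).
  reusable:
    top_level: false
    expected:
      - file
      - process
  fields:
    - name: md5
      level: extended
      type: keyword
      description: MD5 hash.
    - name: sha1
      level: extended
      type: keyword
      description: SHA1 hash.
    - name: sha256
      level: extended
      type: keyword
      description: SHA256 hash.
    - name: sha512
      level: extended
      type: keyword
      description: SHA512 hash.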

@webmat @andrewkroh @MikePaquette Let me know if this looks good now.

schemas/hash.yml (outdated)
The hash fields represent different hash algorithms and their values.

Field names for common hashes (e.g. MD5, SHA1) are predefined. Add fields for other hashes
by lowercasing the hash algorithm and using underscore separators as appropriate (snake case).
@webmat (Contributor)

Small clarification: "the hash algorithm name". I'd also give an example at the end of the sentence, something like "for example sha3_512"

@webmat (Contributor) left a comment

I'm good with this trimmed down list. Thanks @cwurm!

One small nit above, not the end of the world. Then I'm good to merge this.

@cwurm (Author) commented May 22, 2019

Thanks @webmat, I fixed the description.

@MikePaquette you requested changes to this, does this look good now?

@webmat merged commit de2b8b5 into elastic:master May 23, 2019