
[ML] Reverse engineer Grok patterns from categorization results #30125

Merged

Conversation

@droberts195 (Contributor) commented Apr 25, 2018

It has been noted that the regexes we produce in our categorization
results are not that far away from Grok patterns that could be used
in Logstash to categorize messages at ingest time and do better
field extraction for log formats that do not have out-of-the-box
patterns.

This change adds a `grok_pattern` field to our GET categories API
output. It's calculated by taking the regex and examples in the
categorization result and applying a list of candidate Grok
patterns to the bits in between the tokens that are considered to
define the category.
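
As a rough illustration of the idea (the class, variable names, and example message below are mine, not the code in this PR), the categorization regex is turned into a capturing pattern, each example is matched against it, and whatever the wildcard groups captured is the material the candidate Grok patterns get applied to:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only - not the actual implementation.
public class GrokPatternSketch {

    public static void main(String[] args) {
        // A categorization regex and one stored example for that category.
        String categoryRegex = ".*?cat.+?sat.+?mat.*";
        String example = "the cat definitely sat on the mat today";

        // Wrap each wildcard section in a capture group so we can see what it matched.
        Pattern processor = Pattern.compile(categoryRegex.replaceAll("(\\.[*+]\\??)", "($1)"), Pattern.DOTALL);
        Matcher matcher = processor.matcher(example);

        List<String> inBetweenBits = new ArrayList<>();
        if (matcher.matches()) {
            for (int group = 1; group <= matcher.groupCount(); ++group) {
                inBetweenBits.add(matcher.group(group));
            }
        }

        // Prints [the ,  definitely ,  on the ,  today] - these gaps are what the candidate
        // Grok patterns are tried against, with ".*?" as the fallback when none of them fit.
        System.out.println(inBetweenBits);
    }
}
```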

This can currently be considered a prototype, as the Grok patterns
it produces are not optimal. However, enough people have said it
would be useful for it to be worthwhile exposing it as experimental
functionality for interested parties to try out.

@droberts195 droberts195 added WIP :ml Machine learning labels Apr 25, 2018

There is also an outstanding question about how the functionality should work:

* Is calculating the Grok patterns on the fly the best thing to do?
  It might be better to calculate them when categorization results
  are created/updated, and store the patterns in a new type of ML
  results document. Then we could let users manually improve the
  patterns and remember their edits.  But the decision here needs
  to tie in with the end-to-end story for this functionality. If
  the intended flow is
  `ML -> User edits in the Grok debugger -> Logstash config` then
  maybe there's no need for ML to remember the user edits.
@davidkyle (Member) left a comment

Looks good. I left a few comments, possibly based on my own misunderstanding.

} else {
// We should never get here. If we do it implies a bug in the original categorization,
// as it's produced a regex that doesn't match the examples.
assert matcher.matches() : exampleProcessor.pattern() + " did not match " + example;
Member

This should be logged for the case where assertions are not enabled
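
A minimal sketch of that suggestion, assuming the enclosing class has (or gains) a standard Log4j 2 logger field; the field name and the GrokPatternCreator class name are assumptions, not taken from the PR:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical logger field for the enclosing class.
private static final Logger logger = LogManager.getLogger(GrokPatternCreator.class);

// ... inside the example-processing code quoted above ...
} else {
    // We should never get here. If we do it implies a bug in the original categorization,
    // as it's produced a regex that doesn't match the examples.
    assert matcher.matches() : exampleProcessor.pattern() + " did not match " + example;
    // Also log, so the mismatch is still visible when assertions are disabled, i.e. in production.
    logger.warn("Pattern [{}] did not match example [{}]", exampleProcessor.pattern(), example);
}
```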

// "the cat sat on the mat" will result in "the ", " ", " on the ", and ""
// being added to the 4 "in between" collections in that order
for (int groupNum = 1; groupNum <= matcher.groupCount(); ++groupNum) {
if (inBetweenBits.size() < groupNum) {
Member

After the first pass of the for (String example : examples) loop the array will be populated. Maybe pre-populate inBetweenBits with the new ArrayLists once it is created.
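
A sketch of that suggestion, reusing the exampleProcessor and inBetweenBits names from the diff above; sizing the list from the pattern's group count is an assumption about how it could be done:

```java
// Create one collection per capture group up front, so the loop over examples
// never needs the inBetweenBits.size() < groupNum check while it runs.
int groupCount = exampleProcessor.matcher("").groupCount();
List<Collection<String>> inBetweenBits = new ArrayList<>(groupCount);
for (int groupNum = 0; groupNum < groupCount; ++groupNum) {
    inBetweenBits.add(new ArrayList<>());
}
```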

// Finally, for each collection of "in between" bits we look for the best Grok pattern and incorporate
// it into the overall Grok pattern that will match each example in its entirety
for (int inBetweenBitNum = 0; inBetweenBitNum < inBetweenBits.size(); ++inBetweenBitNum) {
// Remember (from the first comment in this method) that the first element in this array is
Member

My understanding is that this causes every generated Grok pattern to start with '.*?' unless one of ORDERED_CANDIDATE_GROK_PATTERNS matches the empty string. Is that the intention?

Contributor Author

Because the Grok patterns are built from the regex that categorization churns out, there's no way to be sure what comes before the first common token for the category. Therefore all the regexes will begin with .*?. (It doesn't make any difference if one of the ORDERED_CANDIDATE_GROK_PATTERNS matches the empty string - this prefix will always be on the regex.)

I agree it could be inefficient and look silly to a human who knows the data, but this is a limitation from trying to reverse engineer Grok patterns from the categorization output. An alternative would be to try to find fields as part of the main categorization algorithm. But that would imply either moving the whole categorization algorithm into Java, or adding Grok functionality to the C++, or using something other than Grok to define how to extract the fields.

Possibly the best thing to do is just release this functionality in its current form as experimental, because several people have said they could get value from this prototype, and switch to something completely different and more complex in the future.
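
To make the trade-off concrete, here is a rough sketch of the candidate-matching idea: for each gap, the first candidate whose regex matches that gap's text in every example wins, otherwise the non-greedy wildcard is used. The candidate names and regexes below are simplified stand-ins, not the real ORDERED_CANDIDATE_GROK_PATTERNS, and as noted above the real pattern always keeps the leading .*? regardless of what the candidates match:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Simplified stand-ins for illustration only.
final class CandidateGrokSketch {

    private static final Map<String, Pattern> ORDERED_CANDIDATES = new LinkedHashMap<>();
    static {
        ORDERED_CANDIDATES.put("INT", Pattern.compile("[+-]?\\d+"));
        ORDERED_CANDIDATES.put("IP", Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}"));
        ORDERED_CANDIDATES.put("WORD", Pattern.compile("\\w+"));
    }

    // Pick a Grok snippet for one gap: the first candidate that matches the gap's text
    // (ignoring surrounding whitespace) in every example wins; otherwise fall back to ".*?".
    static String grokForGap(Collection<String> bitsFromAllExamples) {
        if (bitsFromAllExamples.isEmpty() == false) {
            for (Map.Entry<String, Pattern> candidate : ORDERED_CANDIDATES.entrySet()) {
                boolean matchesAll = bitsFromAllExamples.stream()
                        .allMatch(bit -> candidate.getValue().matcher(bit.trim()).matches());
                if (matchesAll) {
                    return "%{" + candidate.getKey() + "}";
                }
            }
        }
        return ".*?";
    }
}
```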

// E.g., ".*?cat.+?sat.+?mat.*" -> Pattern (.*?)cat(.+?)sat(.+?)mat(.*)
Pattern exampleProcessor = Pattern.compile(regex.replaceAll("(\\.[*+]\\??)", "($1)"), Pattern.DOTALL);

List<Collection<String>> inBetweenBits = new ArrayList<>(fixedRegexBits.length);
Member

Maybe groupsMatches or matchesForGroups instead of inBetweenBits. This is an array of example matches for each group.

@droberts195 droberts195 added review and removed WIP labels May 14, 2018
`grok_pattern`::
(string) A Grok pattern that could be used in Logstash or an Ingest Pipeline
to extract fields from messages that match the category. This field is
experimental and may be changed or removed in a future version. The Grok
Contributor Author

@lcawl do you know if there are any precedents for a single field that is experimental within a feature that is generally fully supported? Is the way I've documented it here right?

Contributor

@droberts195 I'd suggest putting "experimental[]" after the data type, per https://github.com/elastic/docs#experimental-and-beta

@davidkyle (Member) left a comment

LGTM. I left a couple of comments.

@@ -629,7 +630,7 @@ public void bucketRecords(String jobId, Bucket bucket, int from, int size, boole
* @param from Skip the first N categories. This parameter is for paging
* @param size Take only this number of categories
*/
- public void categoryDefinitions(String jobId, Long categoryId, Integer from, Integer size,
+ public void categoryDefinitions(String jobId, Long categoryId, boolean augment, Integer from, Integer size,
Member

The javadoc will need updating for the new parameter
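
For example, the Javadoc could gain a line along these lines (the wording is only a suggestion):

```java
/**
 * ...
 * @param augment Whether to add a reverse-engineered Grok pattern to each category definition
 * @param from Skip the first N categories. This parameter is for paging
 * @param size Take only this number of categories
 */
public void categoryDefinitions(String jobId, Long categoryId, boolean augment, Integer from, Integer size,
```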

@@ -41,7 +41,7 @@ protected void doExecute(GetCategoriesAction.Request request, ActionListener<Get

Integer from = request.getPageParams() != null ? request.getPageParams().getFrom() : null;
Integer size = request.getPageParams() != null ? request.getPageParams().getSize() : null;
- jobProvider.categoryDefinitions(request.getJobId(), request.getCategoryId(), from, size,
+ jobProvider.categoryDefinitions(request.getJobId(), request.getCategoryId(), true, from, size,
Member

Should this be exposed as a request parameter?

Contributor Author

I'm not sure it's worth it. People can easily ignore the Grok patterns if they're not interested.

@droberts195 droberts195 merged commit 50c34b2 into elastic:master May 15, 2018
@droberts195 droberts195 deleted the prototype_grok_pattern_creator branch May 15, 2018 08:02
droberts195 added a commit that referenced this pull request May 15, 2018
droberts195 added a commit that referenced this pull request May 15, 2018
dnhatn added a commit that referenced this pull request May 15, 2018
* 6.x:
  Revert "Silence IndexUpgradeIT test failures. (#30430)"
  [DOCS] Remove references to changelog and to highlights
  Revert "Mute ML upgrade test (#30458)"
  [ML] Fix BWC version for backport of #30125
  [Docs] Improve section detailing translog usage (#30573)
  [Tests] Relax allowed delta in extended_stats aggregation (#30569)
  Fail if reading from closed KeyStoreWrapper (#30394)
  [ML] Reverse engineer Grok patterns from categorization results (#30125)
  Derive max composite buffers from max content len
  Update build file due to doc file rename
  SQL: Extract SQL request and response classes (#30457)
  Remove the changelog (#30593)
  Revert "Add deprecation warning for default shards (#30587)"
  Silence IndexUpgradeIT test failures. (#30430)
  Add deprecation warning for default shards (#30587)
  [DOCS] Adds 6.4.0 release highlight pages
  [DOCS] Adds release highlight pages (#30590)
  Docs: Document how to rebuild analyzers (#30498)
  [DOCS] Fixes title capitalization in security content
  LLRest: Add equals and hashcode tests for Request (#30584)
  [DOCS] Fix realm setting names (#30499)
  [DOCS] Fix path info for various security files (#30502)
  Docs: document precision limitations of geo_bounding_box (#30540)
  Fix non existing javadocs link in RestClientTests
  Auto-expand replicas only after failing nodes (#30553)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 15, 2018
…ngs-to-true

* elastic/master:
  [DOCS] Restores 7.0.0 release notes and highlights
  Remove assert statements from field caps documentation. (elastic#30601)
  Repository GCS plugin new client library (elastic#30168)
  HLRestClient: Follow-up for put index template api (elastic#30592)
  Unmute IndexUpgradeIT tests
  [DOCS] Remove references to changelog and to highlights
  Side-step pending deletes check (elastic#30571)
  [DOCS] Remove references to removed changelog
  Revert "Mute ML upgrade test (elastic#30458)"
  [ML] Adjust BWC version following backport of elastic#30125
  [Docs] Improve section detailing translog usage (elastic#30573)
  [Tests] Relax allowed delta in extended_stats aggregation (elastic#30569)
  [ML] Reverse engineer Grok patterns from categorization results (elastic#30125)
  Update build file due to doc file rename
  Remove the changelog (elastic#30593)
  Fix issue with finishing handshake in ssl driver (elastic#30580)
  Fail if reading from closed KeyStoreWrapper (elastic#30394)
  Silence IndexUpgradeIT test failures. (elastic#30430)