Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Allow overrides for some file structure detection decisions #33630

Merged
merged 3 commits into from
Sep 14, 2018

Conversation

droberts195
Copy link
Contributor

This change modifies the file structure detection functionality
such that some of the decisions can be overridden with user
supplied values.

The fields that can be overridden are:

  • charset
  • format
  • has_header_row
  • column_names
  • delimiter
  • quote
  • should_trim_fields
  • grok_pattern
  • timestamp_field
  • timestamp_format

If an override makes finding the file structure impossible then
the endpoint will return an exception.

This change modifies the file structure detection functionality
such that some of the decisions can be overridden with user
supplied values.

The fields that can be overridden are:

- charset
- format
- has_header_row
- column_names
- delimiter
- quote
- should_trim_fields
- grok_pattern
- timestamp_field
- timestamp_format

If an override makes finding the file structure impossible then
the endpoint will return an exception.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments, nothing major

FORMAT.getPreferredName() + " is " + FileStructure.Format.DELIMITED, validationException);
}
if (shouldTrimFields != null) {
validationException = addValidationError(SHOULD_TRIM_FIELDS.getPreferredName() + " may only be specified if " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Excluding the first word the error message is duplicated and could be extracted.

@@ -116,8 +131,11 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
}

if (isHeaderInFile) {
String quote = String.valueOf(csvPreference.getQuoteChar());
String twoQuotes = quote + quote;
String optQuote = quote.replaceAll("([\\\\|()\\[\\]{}^$*?])", "\\\\$1") + "?";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like you are escaping the set of regex special characters but some are missing. ., +, :?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think : is OK as it's only special within contexts that won't arise if the other characters are escaped. But I'll add . and + and remove the duplication with the following line.

@@ -39,6 +39,17 @@ protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient
FindFileStructureAction.Request request = new FindFileStructureAction.Request();
request.setLinesToSample(restRequest.paramAsInt(FindFileStructureAction.Request.LINES_TO_SAMPLE.getPreferredName(),
FileStructureFinderManager.DEFAULT_IDEAL_SAMPLE_LINE_COUNT));
request.setCharset(restRequest.param(FindFileStructureAction.Request.CHARSET.getPreferredName()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xpack.ml.find_file_structure.json should be updated with the new params and perhaps a yml test if possible

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@droberts195 droberts195 merged commit 568ac10 into elastic:master Sep 14, 2018
@droberts195 droberts195 deleted the file_structure_overrides branch September 14, 2018 08:29
droberts195 added a commit that referenced this pull request Sep 14, 2018
)

This change modifies the file structure detection functionality
such that some of the decisions can be overridden with user
supplied values.

The fields that can be overridden are:

- charset
- format
- has_header_row
- column_names
- delimiter
- quote
- should_trim_fields
- grok_pattern
- timestamp_field
- timestamp_format

If an override makes finding the file structure impossible then
the endpoint will return an exception.
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 14, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 14, 2018
* master: (24 commits)
  Only notify ready global checkpoint listeners (elastic#33690)
  Don't count hits via the collector if the hit count can be computed from index stats. (elastic#33701)
  Expose retries for CCR fetch failures (elastic#33694)
  Test fix - Graph vertices could appear in different orders based on map insertion sequence (elastic#33709)
  Structured audit logging (elastic#31931)
  Core: Add DateFormatter interface for java time parsing (elastic#33467)
  [CCR] Check whether the rejected execution exception has the shutdown flag set (elastic#33703)
  Mute ClusterDisruptionIT#testSendingShardFailure
  Revert "Mute FullClusterRestartSettingsUpgradeIT"
  Adjust BWC version on settings upgrade test (elastic#33650)
  [ML] Allow overrides for some file structure detection decisions (elastic#33630)
  Adapt skip version for doc_values format deprecation
  [TEST] wait for no initializing shards
  [Docs] Minor fix in `has_child` javadoc comment (elastic#33674)
  Mute FullClusterRestartSettingsUpgradeIT
  [Kerberos] Add realm name & UPN to user metadata (elastic#33338)
  [TESTS] Disable specific locales for RestrictedTrustManagerTest (elastic#33299)
  SQL: Return functions in JDBC driver metadata (elastic#33672)
  SCRIPTING: Move terms_set Context to its Own Class (elastic#33602)
  AwaitsFix testRestoreMinmal
  ...
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 14, 2018
droberts195 added a commit that referenced this pull request Sep 20, 2018
droberts195 added a commit that referenced this pull request Sep 20, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants