Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Adds setting to configure threads used by DF analytics job #57274

Conversation

dimitris-athanasiou
Copy link
Contributor

This commit adds a max_num_threads setting to data frame analytics
jobs. The setting value defaults to 1 which, for now, preserves
behaviour. Setting it to a higher value allows the analytics process
to utilize more threads in order to speed up the analysis.

We introduce this setting here in order to be able to test node load
when multiple threads are used. Depending on the feedback the setting
may be changed. Thus, this commit adds the minimum to allow testing.

This commit adds a `max_num_threads` setting to data frame analytics
jobs. The setting value defaults to 1 which, for now, preserves
behaviour. Setting it to a higher value allows the analytics process
to utilize more threads in order to speed up the analysis.

We introduce this setting here in order to be able to test node load
when multiple threads are used. Depending on the feedback the setting
may be changed. Thus, this commit adds the minimum to allow testing.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@@ -142,6 +146,7 @@ private DataFrameAnalyticsConfig(String id, String description, DataFrameAnalyti
this.createTime = createTime == null ? null : Instant.ofEpochMilli(createTime.toEpochMilli());
this.version = version;
this.allowLazyStart = allowLazyStart;
this.maxNumThreads = maxNumThreads;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should validate that this setting is >= 1 when it is provided. I did not see this check anywhere (unless you were planning on adding it to the rest layer).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally avoided adding validation as I believe the way we implement the setting may change based on the feedback we get from testing performance behaviour. If we end up leaving it as is, I'll add validation plus other things that are missing in a following PR.

Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -438,12 +443,13 @@ private AnalyticsProcessConfig createProcessConfig(DataFrameDataExtractor dataEx
ExtractedFields extractedFields) {
DataFrameDataExtractor.DataSummary dataSummary = dataExtractor.collectDataSummary();
Set<String> categoricalFields = dataExtractor.getCategoricalFields(config.getAnalysis());
int threads = config.getMaxNumThreads() == null ? 1 : Math.min(config.getMaxNumThreads(), numAllocatedProcessors);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering where we should apply the default: here or in DataFrameAnalyticsConfig.
If the calling code doesn't care whether the user has set max_num_threads, the default of 1 could be as well applied in DataFrameAnalyticsConfig's constructor.
Then we wouldn't need null handling here.
I don't have a strong opinion though so feel free to disregard the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally left it there as it's minimum effort. I anticipate we might change the way we implement this setting depending on feedback we'll get from testing.

@dimitris-athanasiou dimitris-athanasiou merged commit 105b585 into elastic:master May 28, 2020
@dimitris-athanasiou dimitris-athanasiou deleted the add-df-analytics-threads-setting branch May 28, 2020 13:02
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jul 9, 2020
This adds a setting to data frame analytics jobs called
`max_number_threads`. The setting expects a positive integer.
When used the user specifies the max number of threads that may
be used by the analysis. Note that the actual number of threads
used is limited by the number of processors on the node where
the job is assigned. Also, the process may use a couple more threads
for operational functionality that is not the analysis itself.

This setting may also be updated for a stopped job.

More threads may reduce the time it takes to complete the job at the cost
of using more CPU.

Backport of elastic#59254 and elastic#57274
dimitris-athanasiou added a commit that referenced this pull request Jul 9, 2020
This adds a setting to data frame analytics jobs called
`max_number_threads`. The setting expects a positive integer.
When used the user specifies the max number of threads that may
be used by the analysis. Note that the actual number of threads
used is limited by the number of processors on the node where
the job is assigned. Also, the process may use a couple more threads
for operational functionality that is not the analysis itself.

This setting may also be updated for a stopped job.

More threads may reduce the time it takes to complete the job at the cost
of using more CPU.

Backport of #59254 and #57274
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants