-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Adds setting to configure threads used by DF analytics job #57274
[ML] Adds setting to configure threads used by DF analytics job #57274
Conversation
This commit adds a `max_num_threads` setting to data frame analytics jobs. The setting value defaults to 1 which, for now, preserves behaviour. Setting it to a higher value allows the analytics process to utilize more threads in order to speed up the analysis. We introduce this setting here in order to be able to test node load when multiple threads are used. Depending on the feedback the setting may be changed. Thus, this commit adds the minimum to allow testing.
Pinging @elastic/ml-core (:ml) |
@@ -142,6 +146,7 @@ private DataFrameAnalyticsConfig(String id, String description, DataFrameAnalyti | |||
this.createTime = createTime == null ? null : Instant.ofEpochMilli(createTime.toEpochMilli()); | |||
this.version = version; | |||
this.allowLazyStart = allowLazyStart; | |||
this.maxNumThreads = maxNumThreads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should validate that this setting is >= 1
when it is provided. I did not see this check anywhere (unless you were planning on adding it to the rest layer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally avoided adding validation as I believe the way we implement the setting may change based on the feedback we get from testing performance behaviour. If we end up leaving it as is, I'll add validation plus other things that are missing in a following PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -438,12 +443,13 @@ private AnalyticsProcessConfig createProcessConfig(DataFrameDataExtractor dataEx | |||
ExtractedFields extractedFields) { | |||
DataFrameDataExtractor.DataSummary dataSummary = dataExtractor.collectDataSummary(); | |||
Set<String> categoricalFields = dataExtractor.getCategoricalFields(config.getAnalysis()); | |||
int threads = config.getMaxNumThreads() == null ? 1 : Math.min(config.getMaxNumThreads(), numAllocatedProcessors); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering where we should apply the default: here or in DataFrameAnalyticsConfig
.
If the calling code doesn't care whether the user has set max_num_threads
, the default of 1
could be as well applied in DataFrameAnalyticsConfig
's constructor.
Then we wouldn't need null
handling here.
I don't have a strong opinion though so feel free to disregard the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally left it there as it's minimum effort. I anticipate we might change the way we implement this setting depending on feedback we'll get from testing.
This adds a setting to data frame analytics jobs called `max_number_threads`. The setting expects a positive integer. When used the user specifies the max number of threads that may be used by the analysis. Note that the actual number of threads used is limited by the number of processors on the node where the job is assigned. Also, the process may use a couple more threads for operational functionality that is not the analysis itself. This setting may also be updated for a stopped job. More threads may reduce the time it takes to complete the job at the cost of using more CPU. Backport of elastic#59254 and elastic#57274
This adds a setting to data frame analytics jobs called `max_number_threads`. The setting expects a positive integer. When used the user specifies the max number of threads that may be used by the analysis. Note that the actual number of threads used is limited by the number of processors on the node where the job is assigned. Also, the process may use a couple more threads for operational functionality that is not the analysis itself. This setting may also be updated for a stopped job. More threads may reduce the time it takes to complete the job at the cost of using more CPU. Backport of #59254 and #57274
This commit adds a
max_num_threads
setting to data frame analyticsjobs. The setting value defaults to 1 which, for now, preserves
behaviour. Setting it to a higher value allows the analytics process
to utilize more threads in order to speed up the analysis.
We introduce this setting here in order to be able to test node load
when multiple threads are used. Depending on the feedback the setting
may be changed. Thus, this commit adds the minimum to allow testing.