api: enable support for setting original job source #16763
Conversation
Note to reviewers: this PR isn't as large as it looks - most file changes are just due to `UpsertJob` getting a new signature, and that is used in many, many tests.

This PR adds support for setting job source material along with the registration of a job.

This includes a new HTTP endpoint and a new RPC endpoint for making queries for the original source of a job. The HTTP endpoint is `/v1/job/<id>/submission?version=<version>` and the RPC method is `Job.GetJobSubmission`. This endpoint is limited by ACLs with the `read-job` capability. It is quite similar to the JobVersions endpoint, in that data is upserted and removed along with a Job, but queryable through a separate endpoint.

The job source (if submitted, and doing so is always optional) is stored in the `job_submission` memdb table, separately from the actual job. This way we do not incur the overhead of reading the potentially huge string field throughout normal job operations.

The server config now includes `job_max_source_size` for configuring the maximum size the job source may be before the server simply drops the source material. This should help prevent Bad Things from happening when huge jobs are submitted. If the value is set to 0, all job source material will be dropped.

Will add e2e tests in a followup PR; this one is already too large.

No backports, this is for a 1.6 feature.
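For a sense of how a consumer might use the new read path, here is a minimal sketch that fetches the stored source for a job version over plain HTTP. Only the endpoint path and `version` query parameter come from the PR description; the response schema and the local address are assumptions, so the body is decoded generically.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// fetchJobSubmission retrieves the stored source material for one version of a
// job via the new /v1/job/<id>/submission endpoint.
func fetchJobSubmission(nomadAddr, jobID string, version uint64) (map[string]any, error) {
	endpoint := fmt.Sprintf("%s/v1/job/%s/submission?version=%d",
		nomadAddr, url.PathEscape(jobID), version)

	resp, err := http.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	// The exact response fields are not spelled out here, so decode generically.
	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	sub, err := fetchJobSubmission("http://127.0.0.1:4646", "example", 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sub)
}
```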
This is really exciting! Great work.
// Variables contains the opaque variables configuration as coming from
// a var-file or the WebUI variables input (hcl2 only).
Variables string
Should we make this a `map[string]string` to preserve each filename and its contents?
Since the parse endpoint is really only used by the webUI, and the webUI's variable content comes from an HTML form, I don't think it quite makes sense to associate the content with a file name.
command/agent/job_endpoint.go
Outdated
// writeVariablesFile writes content to a temporary file that is to be read by
// the hcl parser. If content is empty nothing is written and nil is returned.
// The return value is otherwise a one element slice with the filename of the
// temporary file. Also returned is a cleanup function that must be called by
// the caller for removing the temporary file once it is no longer needed.
func writeVariablesFile(content string) ([]string, func(), error) {
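For context, a helper like the one described by that doc comment boils down to an `os.CreateTemp` call. This is a rough sketch of the described behavior only (the PR's actual implementation may differ, and as discussed below the approach was ultimately dropped):

```go
package agent

import "os"

// writeVariablesFile is a sketch of the behavior the doc comment describes,
// not the PR's exact code: write content to a temp file and hand back its
// name plus a cleanup function.
func writeVariablesFile(content string) ([]string, func(), error) {
	if content == "" {
		return nil, func() {}, nil
	}
	f, err := os.CreateTemp("", "nomad-variables-*.hcl")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() { _ = os.Remove(f.Name()) }
	if _, err := f.WriteString(content); err != nil {
		cleanup()
		return nil, nil, err
	}
	if err := f.Close(); err != nil {
		cleanup()
		return nil, nil, err
	}
	return []string{f.Name()}, cleanup, nil
}
```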
I would love to avoid disk writes. We've been very careful to restrict those to Raft on Servers, and AFAICT these files would be written before any size limitations were applied. It shouldn't matter, but at the very least their existence introduces a new consideration when diagnosing performance.
If `Variables` became a `map[string]string` of filenames -> contents, then we could add a new parser entrypoint func in `jobspec2/parse.go` to skip the actual filesystem bits. AFAICT the hcl library itself only wants a filename and does not perform any IO: only our wrapper funcs in `jobspec2` perform IO. (I'm going to be sad if I'm wrong.)
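For what it's worth, the underlying hcl library does accept raw bytes plus a display-only filename, so a map of filename -> contents never has to touch disk. A minimal sketch against `hclparse` directly (not Nomad's `jobspec2` wrapper; the example content is made up):

```go
package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2/hclparse"
)

func main() {
	// Variable contents as they would arrive in an HTTP request body; the map
	// key is only used as a diagnostic label, no file is ever opened.
	varFiles := map[string]string{
		"web.vars.hcl": "datacenter = \"dc1\"\n",
	}

	parser := hclparse.NewParser()
	for name, src := range varFiles {
		if _, diags := parser.ParseHCL([]byte(src), name); diags.HasErrors() {
			fmt.Println(diags.Error())
		}
	}
}
```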
Oh right, and this is for the Parse endpoint, which only requires `read-job`, so it seems bad form to allow it to perform disk IO.
For some reason I thought it was the hcl library requiring file names, but yeah, it's just our own `jobspec2` package. I expanded the `ParseOptions` to also enable setting raw variable file content, and now just use that, eliminating the need for files: eaddac3
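Roughly speaking, the idea is that the parse options carry raw var-file content keyed by a display name, so the parser never reads from disk. The field name below is hypothetical; see eaddac3 for the real change.

```go
// Hypothetical shape of the change, for illustration only.
type ParseOptions struct {
	// ...existing options elided...

	// VarContent maps a display name to raw HCL variable file content,
	// used in place of reading var-files from disk.
	VarContent map[string]string
}
```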
command/agent/job_endpoint.go
Outdated
}

if totalSize > maxSize {
	return nil
It would be nice to give users some indication of what happened to their submission (if maxSize is > 0).
In 1699883 I moved this check out of the HTTP layer and into the RPC layer, where we do lots of job admission controller things. That way we get to set a warning here, and avoid reading server config on clients, which, as you point out below, doesn't exist.
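The shape of that server-side check is roughly the following. This is a sketch only: the function name and the `Source`/`Variables` fields on `structs.JobSubmission` are assumptions, and the real admission code in 1699883 may differ. The point is that oversized source is dropped and a warning is attached to the register response rather than failing the request.

```go
// checkSubmissionSize drops oversized (or disabled) job source material and
// returns a human-readable warning instead of an error. Illustrative only.
func checkSubmissionSize(sub *structs.JobSubmission, maxSize int) (*structs.JobSubmission, string) {
	if sub == nil || maxSize <= 0 {
		// A limit of 0 means source material is intentionally never stored.
		return nil, ""
	}
	size := len(sub.Source) + len(sub.Variables) // assumed fields
	if size > maxSize {
		return nil, fmt.Sprintf(
			"job source of %d bytes exceeds job_max_source_size of %d; source not stored",
			size, maxSize)
	}
	return sub, ""
}
```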
command/agent/job_endpoint.go
Outdated
@@ -410,8 +450,13 @@ func (s *HTTPServer) jobUpdate(resp http.ResponseWriter, req *http.Request, jobI
}

sJob, writeReq := s.apiJobAndRequestToStructs(args.Job, req, args.WriteRequest)
maxSubmissionSize := s.agent.Server().GetConfig().JobMaxSourceSize
I think this will panic on Clients.
@@ -1631,23 +1631,23 @@ func (s *StateStore) Nodes(ws memdb.WatchSet) (memdb.ResultIterator, error) {
}

// UpsertJob is used to register a job or update a job definition
- func (s *StateStore) UpsertJob(msgType structs.MessageType, index uint64, job *structs.Job) error {
+ func (s *StateStore) UpsertJob(msgType structs.MessageType, index uint64, sub *structs.JobSubmission, job *structs.Job) error {
Could have saved yourself a lot of code churn by just adding a new upsertJobImpl wrapper that took this, but nbd. 😅
nomad/state/state_store.go
Outdated
// job structure originates from. It is up to the job submitter to include the source
// material, and as such sub may be nil, in which case nothing is stored.
func (s *StateStore) updateJobSubmission(index uint64, sub *structs.JobSubmission, namespace, jobID string, version uint64, txn *txn) error {
if sub == nil || namespace == "" || jobID == "" {
Shouldn't namespace or jobID being empty be an error? Readers might assume ignoring all 3 cases implies some behavior that I don't think it's meant to imply.
In f19721c we now return an error if `namespace` or `jobID` are not set.
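In other words, the guard now distinguishes the optional-submission case from genuinely invalid arguments, roughly like the helper below. This is a sketch only: the function name and error text are assumptions, and the real check lives inside `StateStore.updateJobSubmission`.

```go
// validateSubmissionArgs mirrors the revised guard: a nil submission is simply
// skipped (source material is optional), while a missing namespace or job ID
// is an error. Illustrative only; see f19721c for the real change.
func validateSubmissionArgs(sub *structs.JobSubmission, namespace, jobID string) error {
	if sub == nil {
		return nil
	}
	if namespace == "" {
		return errors.New("job submission requires a namespace")
	}
	if jobID == "" {
		return errors.New("job submission requires a job ID")
	}
	return nil
}
```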
nomad/structs/structs.go
Outdated
// JobIndex is managed internally, not set.
//
// The raft index the Job this submission is associated with.
JobIndex uint64
}
The JobModifyIndex specifically, right? I don't think it matters, but since Jobs have a whopping 3 indexes on them, it might be nice to specify if it's well known.
Renamed to clarify in 37e71b2
website/content/api-docs/jobs.mdx
Outdated
@@ -122,6 +122,10 @@ The table below shows this endpoint's support for

- `Job` `(Job: <required>)` - Specifies the JSON definition of the job.

- `Submission` `(JobSubmission: <optional>)` - Specifies the original HCL/HCL2/JSON
  definition of the job. This data is useful for reference only, it is not considered
  for the actual registration of `Job`.
Suggested change:
- for the actual registration of `Job`.
+ for the actual scheduling of `Job`.
A bit pedantic so nbd either way. I think arguably it is considered by the Job.Register operation, just not for any operational logic. Just saying "scheduling" might make the distinction more clear even if it's ignored by more than just scheduling logic.
@@ -317,6 +317,7 @@ func (c *JobRunCommand) Run(args []string) int {
PolicyOverride: override,
PreserveCounts: preserveCounts,
EvalPriority: evalPriority,
Submission: sub,
Could we add a `-drop-source` flag (or similar) to allow people to opt out of submitting the source? I'm thinking it might be useful for batch/sysbatch jobs where you might be submitting a relatively large number in an automated way and don't want the overhead of the source.
From offline discussion it's become apparent we need to add something to jobspecs to say "Don't save me!" so that operators don't have to change how they submit jobs depending on whether or not they want the source persisted.
queryMeta: &reply.QueryMeta,
run: func(ws memdb.WatchSet, state *state.StateStore) error {
// Look for the submission
out, err := state.JobSubmission(ws, args.RequestNamespace(), args.JobID, args.Version)
We're submitting the jobspec with the parent job and there's no pointer from `structs.Job` to the `structs.JobSubmission` (which is totally reasonable). But that means for periodic jobs when we call `deriveJob` (and similarly for dispatch jobs) we're copying the `structs.Job` and changing the job ID, which leaves no way to get from the derived job to the job submission anymore.

Maybe we should detect the case where there's a parent job ID and pass that as the job ID parameter here instead?
I like keeping this simple and leaving that up to higher layers (UI and API). I think job invocations (aka child dispatch or periodic jobs) should be as minimal as possible, and I think reflecting that in the low-level RPCs/APIs is important to let users know they're minimal.

I would hate to include the parent source and give a user the impression we're "wasting" space saving the same job submission for every invocation of a job. That could lead them to disable the feature erroneously, thinking they'll get a big performance improvement.
That's reasonable, but we'll need to let the UI folks know that, and we should document in the API docs that you're responsible for sending the parent ID.
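For API consumers, that responsibility looks roughly like the helper below, using the `github.com/hashicorp/nomad/api` package. This is an illustrative helper, not part of the PR: when asking for the submission of a dispatched or periodic child, query with the parent's job ID.

```go
// submissionJobID returns the job ID whose submission should be queried.
// Child jobs created by dispatch or periodic launches do not carry their own
// source, so fall back to the parent ID when one is set.
func submissionJobID(job *api.Job) string {
	if job.ParentID != nil && *job.ParentID != "" {
		return *job.ParentID
	}
	return *job.ID
}
```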
In ca315b5 the state store is updated to prune all but the last

Should be good for another 👀
website/content/api-docs/jobs.mdx
Outdated
the job in the var file format.

- `VariableFlags` `(map[string]string: nil)` - Specifies HCL2 variables to use
  during parsing of the job in key = value format.q
Suggested change:
- during parsing of the job in key = value format.q
+ during parsing of the job in key = value format.
@@ -1248,7 +1310,8 @@ type JobRevertRequest struct {
}

// JobRegisterRequest is used to update a job
type JobRegisterRequest struct {
- Job *Job
+ Submission *JobSubmission
+ Job *Job