
api: enable support for setting original job source #16763

Merged · 10 commits · Apr 11, 2023

Conversation


@shoenig commented Apr 3, 2023

Note to reviewers: this PR isn't as large as it looks - most file changes are just due to UpsertJob getting a new signature, and that is used in many, many tests.

This PR adds support for setting job source material along with the registration of a job.

This includes a new HTTP endpoint and a new RPC endpoint for making queries for the original source of a job. The HTTP endpoint is /v1/job/<id>/submission?version=<version> and the RPC method is Job.GetJobSubmission. This endpoint is limited by ACLs with read-job capability. It is quite similar to the JobVersions endpoint, in that data is upserted and removed along with a Job, but queryable through a separate endpoint.
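For illustration only (not code from this PR), fetching the stored source over the new endpoint might look like the sketch below; the agent address, job ID "example", and version 0 are placeholders.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Query the submission endpoint for version 0 of the job "example".
	// 4646 is the default Nomad HTTP port; adjust for a real cluster.
	url := "http://127.0.0.1:4646/v1/job/example/submission?version=0"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The response body is the JSON-encoded job submission, if one was stored.
	fmt.Println(string(body))
}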

The job source (if submitted; doing so is always optional) is stored in the job_submission memdb table, separately from the actual job. This way we do not incur the overhead of reading the potentially huge string field throughout normal job operations.

The server config now includes job_max_source_size for configuring the maximum size the job source may be, before the server simply drops the source material. This should help prevent Bad Things from happening when huge jobs are submitted. If the value is set to 0, all job source material will be dropped.

Will add e2e tests in a followup PR; this one is already too large.

No backports; this is a 1.6 feature.

@shoenig changed the title from "wip: api support for hcl in ui" to "api: enable support for setting original source alongside job" on Apr 4, 2023
@shoenig changed the title from "api: enable support for setting original source alongside job" to "api: enable support for setting original job source" on Apr 5, 2023

@schmichael left a comment


This is really exciting! Great work.

Comment on lines +900 to +902
// Variables contains the opaque variables configuration as coming from
// a var-file or the WebUI variables input (hcl2 only).
Variables string
Member

Should we make this a map[string]string to preserve each filename and its contents?

Member Author

Since the parse endpoint is really only used by the webUI, and the webUI's variable content comes from an HTML form, I don't think it quite makes sense to associate the content with a file name.

Comment on lines 750 to 755
// writeVariablesFile writes content to a temporary file that is to be read by
// the hcl parser. If content is empty nothing is written and nil is returned.
// The return value is otherwise a one element slice with the filename of the
// temporary file. Also returned is a cleanup function that must be called by
// the caller for removing the temporary file once it is no longer needed.
func writeVariablesFile(content string) ([]string, func(), error) {
Member

I would love to avoid disk writes. We've been very careful to restrict those to Raft on Servers, and AFAICT these files would be written before any size limitations were applied. It shouldn't matter, but at the very least its existence introduces a new consideration when diagnosing performance.

If Variables became a map[string]string of filenames -> contents, then we could add a new parser entrypoint func in jobspec2/parse.go to skip the actual filesystem bits. AFAICT the hcl library itself only wants a filename and does not perform any IO: only our wrapper funcs in jobspec2 perform IO. (I'm going to be sad if I'm wrong)
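A minimal sketch of that point (illustrative, not code from this PR or the jobspec2 package): the upstream hcl library parses in-memory bytes and uses the filename argument only as a label for diagnostics, so no file IO is required.

package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2/hclparse"
)

func main() {
	// Variable content as it might arrive from the web UI form; the
	// "filename" below is only a diagnostic label, nothing is read from disk.
	src := []byte(`region = "us-east-1"`)

	parser := hclparse.NewParser()
	file, diags := parser.ParseHCL(src, "webui-vars.hcl")
	if diags.HasErrors() {
		fmt.Println(diags.Error())
		return
	}
	fmt.Printf("parsed %d bytes without touching the filesystem\n", len(file.Bytes))
}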

Member

Oh right, and this is for the Parse endpoint, which only requires read-job, so it seems bad form to allow it to perform disk IO.

Member Author

For some reason I thought it was the hcl library requiring file names, but yeah, it's just our own jobspec2 package. I expanded ParseOptions to also enable setting raw variable file content, and now just use that, eliminating the need for files. eaddac3

}

if totalSize > maxSize {
return nil
Member

It would be nice to give users some indication of what happened to their submission (if maxSize is > 0).

Member Author

In 1699883 I moved this check out of the HTTP layer and into the RPC layer, where we do lots of job admission controller things. That way we get to set a warning here, and avoid reading server config on clients, which, as you point out below, doesn't exist.
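Roughly, the idea is something like the sketch below (names, field layout, and the warning text are assumptions for illustration, not the actual admission hook from 1699883):

package main

import "fmt"

// checkSourceSize is a hypothetical stand-in for the admission-time check:
// if the submitted source material exceeds the configured maximum (or the
// feature is disabled with a max of 0), drop it and return a warning rather
// than failing the registration.
func checkSourceSize(source, variables string, maxSize int) (keep bool, warning string) {
	total := len(source) + len(variables)
	if maxSize == 0 {
		return false, "job source material is disabled (max source size is 0); submission dropped"
	}
	if total > maxSize {
		return false, fmt.Sprintf("job source of %d bytes exceeds maximum of %d bytes; submission dropped", total, maxSize)
	}
	return true, ""
}

func main() {
	keep, warn := checkSourceSize(`job "example" {}`, "", 8)
	fmt.Println(keep, warn) // false job source of 16 bytes exceeds maximum of 8 bytes; submission dropped
}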

@@ -410,8 +450,13 @@ func (s *HTTPServer) jobUpdate(resp http.ResponseWriter, req *http.Request, jobI
}

sJob, writeReq := s.apiJobAndRequestToStructs(args.Job, req, args.WriteRequest)
maxSubmissionSize := s.agent.Server().GetConfig().JobMaxSourceSize
Member

I think this will panic on Clients.

command/agent/job_endpoint.go (resolved)
@@ -1631,23 +1631,23 @@ func (s *StateStore) Nodes(ws memdb.WatchSet) (memdb.ResultIterator, error) {
}

// UpsertJob is used to register a job or update a job definition
-func (s *StateStore) UpsertJob(msgType structs.MessageType, index uint64, job *structs.Job) error {
+func (s *StateStore) UpsertJob(msgType structs.MessageType, index uint64, sub *structs.JobSubmission, job *structs.Job) error {
Member

Could have saved yourself a lot of code churn by just adding a new upsertJobImpl wrapper that took this, but nbd. 😅

// job structure originates from. It is up to the job submitter to include the source
// material, and as such sub may be nil, in which case nothing is stored.
func (s *StateStore) updateJobSubmission(index uint64, sub *structs.JobSubmission, namespace, jobID string, version uint64, txn *txn) error {
if sub == nil || namespace == "" || jobID == "" {
Member

Shouldn't namespace or jobID being empty be an error? Readers might assume ignoring all 3 cases implies some behavior that I don't think it's meant to imply.

Member Author

In f19721c we now return an error if namespace or jobID are not set

Comment on lines 4301 to 4305
// JobIndex is managed internally, not set.
//
// The raft index the Job this submission is associated with.
JobIndex uint64
}
Member

The JobModifyIndex specifically, right? I don't think it matters, but since Jobs have a whopping 3 indexes on them, it might be nice to specify if it's well known.

Member Author

Renamed to clarify in 37e71b2

@@ -122,6 +122,10 @@ The table below shows this endpoint's support for

- `Job` `(Job: <required>)` - Specifies the JSON definition of the job.

- `Submission` `(JobSubmission: <optional>)` - Specifies the original HCL/HCL2/JSON
definition of the job. This data is useful for reference only, it is not considered
for the actual registration of `Job`.
Member

Suggested change:
- for the actual registration of `Job`.
+ for the actual scheduling of `Job`.

A bit pedantic so nbd either way. I think arguably it is considered by the Job.Register operation, just not for any operational logic. Just saying "scheduling" might make the distinction more clear even if it's ignored by more than just scheduling logic.

@@ -317,6 +317,7 @@ func (c *JobRunCommand) Run(args []string) int {
PolicyOverride: override,
PreserveCounts: preserveCounts,
EvalPriority: evalPriority,
Submission: sub,
Member

Could we add a -drop-source flag (or similar) to allow people to opt out of submitting the source? I'm thinking it might be useful for batch/sysbatch jobs where you might be submitting a relatively large number in an automated way and don't want the overhead of the source.

Member

From offline discussion it's become apparent we need to add something to jobspecs to say "Don't save me!" so that operators don't have to change how they submit jobs depending on whether or not they want the source persisted.

queryMeta: &reply.QueryMeta,
run: func(ws memdb.WatchSet, state *state.StateStore) error {
// Look for the submission
out, err := state.JobSubmission(ws, args.RequestNamespace(), args.JobID, args.Version)
Member

We're submitting the jobspec with the parent job and there's no pointer from structs.Job to the structs.JobSubmission (which is totally reasonable). But that means for periodic jobs when we call deriveJob (and similarly for dispatch jobs) we're copying the structs.Job and changing the job ID, which leaves no way to get from the derived job to the job submission anymore.

Maybe we should detect the case where there's a parent job ID and pass that as the job ID parameter here instead?

Member

I like keeping this simple and leaving that up to higher layers (UI and API). I think job invocations (aka child dispatch or periodic jobs) should be as minimal as possible, and I think reflecting that in the low-level RPCs/APIs is important to let users know they're minimal.

I would hate to include the parent source and give a user the impression we're "wasting" space saving the same job submission for every invocation of a job. That could lead them to disable the feature erroneously thinking they'll get a big performance improvement.

Member

That's reasonable, but we'll need to let the UI folks know that and we should document that you're responsible for sending the parent ID in the API docs.

@shoenig commented Apr 10, 2023

In ca315b5 the state store is updated to prune all but the last structs.JobTrackedVersions (6) job submissions, ensuring we do not balloon the state store on stored job submissions.
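As a rough illustration of that pruning behavior (not the state store code itself; the constant value 6 comes from the comment above):

package main

import (
	"fmt"
	"sort"
)

// jobTrackedVersions mirrors the tracked-version count mentioned above (6).
const jobTrackedVersions = 6

// pruneSubmissionVersions keeps only the newest jobTrackedVersions versions
// and reports which older versions would be deleted.
func pruneSubmissionVersions(versions []uint64) (keep, drop []uint64) {
	sort.Slice(versions, func(i, j int) bool { return versions[i] > versions[j] })
	if len(versions) <= jobTrackedVersions {
		return versions, nil
	}
	return versions[:jobTrackedVersions], versions[jobTrackedVersions:]
}

func main() {
	keep, drop := pruneSubmissionVersions([]uint64{0, 1, 2, 3, 4, 5, 6, 7})
	fmt.Println(keep) // [7 6 5 4 3 2]
	fmt.Println(drop) // [1 0]
}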

@shoenig commented Apr 10, 2023

Should be good for another 👀

the job in the var file format.

- `VariableFlags` `(map[string]string: nil)` - Specifies HCL2 variables to use
during parsing of the job in key = value format.q
Member

Suggested change:
- during parsing of the job in key = value format.q
+ during parsing of the job in key = value format.

website/content/api-docs/jobs.mdx (resolved, outdated)
command/agent/config.go (resolved, outdated)
@@ -1248,7 +1310,8 @@ type JobRevertRequest struct {

// JobRegisterRequest is used to update a job
type JobRegisterRequest struct {
Job *Job
Submission *JobSubmission
Job *Job
Member

Suggested change:
- Job *Job
+ Job *Job

command/agent/job_endpoint.go (resolved)
3 participants