rfcs: Task groups in Go: per-session resource usage tracking #60589
base: master
Conversation
Force-pushed from e07d169 to 3cf68cf
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @bdarnell, @knz, @nvanbenschoten, @RaduBerinde, and @tbg)
docs/RFCS/20210215_per_session_cpu_usage.md, line 155 at r1 (raw file):
who / what is responsible for resource usage. The reason why we prefer an ID-to-state mapping at the application
We could provide a key-value storage facility on the task group similar to what exists on `context.Context`.
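A hypothetical sketch of such a facility, mirroring `context.Context.Value` (none of these types or methods exist in the runtime; they are assumptions for illustration):

```go
package main

import "fmt"

// TaskGroup is a hypothetical handle to the runtime's internal task
// group struct, extended with context.Context-style key/value storage.
type TaskGroup struct {
	parent *TaskGroup
	key    any
	val    any
}

// WithValue returns a derived task group carrying the key/value pair,
// mirroring context.WithValue.
func (tg *TaskGroup) WithValue(key, val any) *TaskGroup {
	return &TaskGroup{parent: tg, key: key, val: val}
}

// Value walks up the chain of derived task groups, like context.Value.
func (tg *TaskGroup) Value(key any) any {
	for g := tg; g != nil; g = g.parent {
		if g.key == key {
			return g.val
		}
	}
	return nil
}

func main() {
	root := &TaskGroup{}
	tg := root.WithValue("session", "app-1234")
	fmt.Println(tg.Value("session")) // prints: app-1234
}
```

As with `context.Context`, lookups walk the derivation chain, so nested task groups inherit their parents' values.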
docs/RFCS/20210215_per_session_cpu_usage.md, line 183 at r1 (raw file):
Context: The Go scheduler is preemptive and runs goroutines in segments of time called "ticks". The specific time length of a tick is variable but is
The phrasing "a tick is variable" confused me. The unit of a tick seems to depend on the architecture, but is not variable within the lifetime of a process. I was interpreting "a tick is variable" to indicate that the unit of a tick was variable within the lifetime of a process.
Hmm, or maybe ticks are variable length within a process because `execute` is called at arbitrary times. This all makes me quite anxious about using ticks as the unit for reporting CPU usage.
docs/RFCS/20210215_per_session_cpu_usage.md, line 187 at r1 (raw file):
second, TBD how much precisely). For CPU-heavy loads every tick tends to use its fully allowed length before control is switched over to a different goroutine, to ensure fair scheduling.
What happens when control switches due to a blocking operation such as a lock acquisition?
docs/RFCS/20210215_per_session_cpu_usage.md, line 202 at r1 (raw file):
The RFC proposes to extend the internal task group struct with a new `schedticks` field, incremented upon every scheduler tick for the
I think this metric should expose seconds or nanoseconds. Otherwise the value feels difficult to use. I thought initially that `tickspersecond()` could be used to do a conversion, but I don't think the ticks in `tickspersecond()` are the same as scheduler ticks. Confusing!
The GC records stats about how much time is spent in assist vs dedicated-mark vs idle-mark by recording entry and exit to these states with calls to `nanotime()`. See `runtime/mgc.go`.
docs/RFCS/20210215_per_session_cpu_usage.md, line 221 at r1 (raw file):
```go
// GetInternalTaskGroupSchedTicks retrieves the number of scheduler ticks for
// all goroutines in the given task group.
func GetInternalTaskGroupSchedTicks(taskGroup TaskGroup) uint64 {
```
Have you seen https://tip.golang.org/pkg/runtime/metrics/ which is new in go1.16? I point this out because the Go runtime itself is moving towards having metrics accessible by string keys. The `schedticks` seems like a metric associated with a task group, but otherwise not significantly different than this new `runtime/metrics` package.
docs/RFCS/20210215_per_session_cpu_usage.md, line 377 at r1 (raw file):
instrument every single entry point in Go where "CPU time" or "memory usage" can be incurred. This has been attempted in the past and was largely unsuccessful.
Could we use a sampling-based approach for determining CPU usage? The CPU profiler is arriving at some estimate of CPU usage, but we can't easily map that back to goroutines. Seems like that distance could be closed with extensions or enhancements to the Go runtime.
There is another approach that should probably be listed as an alternative, which is to "make do" with what Go gives us out of the box: we use profiler labels, and you use the CPU profile endpoint to get a breakdown of CPU usage mapped to queries. A prototype of that is here: The main motivation in this RFC could be satisfactorily handled with the profiling approach alone. But beyond that, if we're looking towards something like active session history, it gets more sketchy and we're possibly looking at periodically running a CPU profile, or "background tracing" which in itself requires changes to the Go runtime (suggested in golang/go#41554 (comment)). So I'm not suggesting that the idea presented in this RFC does not have merit.
a task group to "pages" (blocks) of system memory that are used
to carve heap allocations from.

When a heap allocation is performed, the RAM usage counter on the task
consider trying to introduce an indirection here by tying allocations to "allocators" (like in other languages), not directly to task groups. And so a task group, or individual goroutines, could specify what their current allocator is. So for example userspace could write `allocWidget(allocator) *widget` constructors that allocate with a particular allocator.
I just learned that tailscale also maintains a fork of the runtime: https://twitter.com/bradfitz/status/1360293997038637058
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @bdarnell, @knz, @petermattis, @RaduBerinde, and @tbg)
docs/RFCS/20210215_per_session_cpu_usage.md, line 50 at r1 (raw file):
that can be inspected separately (e.g. via `crdb_internal.jobs`? TBD)

A preliminary version of the proposed Go runtime changes is available
I really like that you did this prototyping and linked to it from the RFC. It makes everything else that follows significantly easier to understand. We should continue to do this.
docs/RFCS/20210215_per_session_cpu_usage.md, line 187 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What happens when control switches due to a blocking operation such as a lock acquisition?
I'm also left wondering this. Does this result in more frequent ticks, or work that is not accounted for? Either would pose challenges for meaningful reporting. The first would make ticks incomparable in terms of being a measurement of CPU utilization. The second would lead to inaccuracy, especially with the coarse-grained 100ms time quanta (TBD from above) - though I have to imagine it's much smaller than this. EDIT: it looks like it's 10ms: `const forcePreemptNS = 10 * 1000 * 1000 // 10ms`.
docs/RFCS/20210215_per_session_cpu_usage.md, line 305 at r1 (raw file):
To achieve this we change the heap allocator structures to attach a task group to "pages" (blocks) of system memory that are used to carve heap allocations from.
Have you taken a look into what kinds of changes will be needed in the runtime to support this, as you did with CPU utilization above? Without looking myself, I fear that we may not have access to the `taskGroupCtx` structures in all of the places we need them to perform the proper accounting. We certainly will have access to the task group structure when performing a heap allocation, but what about when freeing it? This would imply associating an owner goroutine/task group to each heap allocation and keeping a record of that somewhere, wouldn't it? Is there such a notion of heap memory ownership in the runtime today?

If not, new bookkeeping here sounds expensive, which may be why you're discussing some loss of precision for small objects down below. Is that how I should be understanding that suggestion?
Force-pushed from 3cf68cf to 12bf659
Thanks for your reviews all!
Here's a summary of my changes since the last time you read the RFC:
- included the pprof label approach as an alternative, with discussion of cons
- detailed the manual instrumentation alternative
- spelled out that CPU ticks were merely a PoC, and proposed how to use actual time instead
- researched and detailed how to get precise RAM usage from Go's current heap alloc
- added mention of Andrei's idea for custom allocators.
Answers to Tobias' top level comments above:
There is another approach that should probably be listed as an alternative
Done
could be satisfactorily handled with the profiling approach alone
This is incorrect. As now stated in the text of the RFC, the pprof based approach is unsuitable because:
- The profiling causes frequent interruptions of the goroutines, which impairs performance. It also thrashes CPU caches and generally slows down processing for memory-heavy workloads.
- It only measures CPU usage during the goroutine interrupts. To get a more precise measurement requires increasing the profiling frequency, which impairs performance more.
- The CPU usage output is formatted for consumption by human eyes. It is difficult and expensive to re-format so that it can be inspected using an API.
I just learned that tailscale also maintains a fork of the runtime
Added a note of this in the RFC.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, @bdarnell, @nvanbenschoten, @petermattis, @RaduBerinde, and @tbg)
docs/RFCS/20210215_per_session_cpu_usage.md, line 155 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
We could provide a key-value storage facility on the task group similar to what exists on `context.Context`.
Added a note about that. Thanks.
docs/RFCS/20210215_per_session_cpu_usage.md, line 183 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The phrasing "a tick is variable" confused me. The unit of a tick seems to depend on the architecture, but is not variable within the lifetime of a process. I was interpreting "a tick is variable" to indicate that the unit of a tick was variable within the lifetime of a process.

Hmm, or maybe ticks are variable length within a process because `execute` is called at arbitrary times. This all makes me quite anxious about using ticks as the unit for reporting CPU usage.
Reworded to clarify. And yes, I agree ticks do not report CPU usage accurately. Clarified in the section title that this is merely a proof of concept. Added a new section with actual CPU time below.
docs/RFCS/20210215_per_session_cpu_usage.md, line 187 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm also left wondering this. Does this result in more frequent ticks, or work that is not accounted for? Either would pose challenges for meaningful reporting. The first would make ticks incomparable in terms of being a measurement of CPU utilization. The second would lead to inaccuracy, especially with the coarse-grained 100ms time quanta (TBD from above) - though I have to imagine it's much smaller than this. EDIT: it looks like it's 10ms: `const forcePreemptNS = 10 * 1000 * 1000 // 10ms`.
Yes agreed, CPU yields cause ticks to be shorter. Added a clarification in the text as well as another section for actual CPU utilization.
docs/RFCS/20210215_per_session_cpu_usage.md, line 202 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think this metric should expose seconds or nanoseconds. Otherwise the value feels difficult to use. I thought initially that `tickspersecond()` could be used to do a conversion, but I don't think the ticks in `tickspersecond()` are the same as scheduler ticks. Confusing!

The GC records stats about how much time is spent in assist vs dedicated-mark vs idle-mark by recording entry and exit to these states with calls to `nanotime()`. See `runtime/mgc.go`.
Done.
docs/RFCS/20210215_per_session_cpu_usage.md, line 221 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Have you seen https://tip.golang.org/pkg/runtime/metrics/ which is new in go1.16? I point this out because the Go runtime itself is moving towards having metrics accessible by string keys. The `schedticks` seems like a metric associated with a task group, but otherwise not significantly different than this new `runtime/metrics` package.
Very good point. Added a reference to `metrics` here with a note that the API needs to be refined further.
docs/RFCS/20210215_per_session_cpu_usage.md, line 305 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Have you taken a look into what kinds of changes will be needed in the runtime to support this, as you did with CPU utilization above? Without looking myself, I fear that we may not have access to the `taskGroupCtx` structures in all of the places we need them to perform the proper accounting. We certainly will have access to the task group structure when performing a heap allocation, but what about when freeing it? This would imply associating an owner goroutine/task group to each heap allocation and keeping a record of that somewhere, wouldn't it? Is there such a notion of heap memory ownership in the runtime today?

If not, new bookkeeping here sounds expensive, which may be why you're discussing some loss of precision for small objects down below. Is that how I should be understanding that suggestion?
I did some more research and the situation is actually simpler than I had feared. Now explained in the text. PTAL.
docs/RFCS/20210215_per_session_cpu_usage.md, line 307 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider trying to introduce an indirection here by tying allocations to "allocators" (like in other languages), not directly to task groups. And so a task group, or individual goroutines, could specify what their current allocator is. So for example userspace could write `allocWidget(allocator) *widget` constructors that allocate with a particular allocator.
Good point! Added a section to explain your idea, although this would be a different enhancement that solves a different problem from the one that motivated this RFC.
docs/RFCS/20210215_per_session_cpu_usage.md, line 377 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Could we use a sampling-based approach for determining CPU usage? The CPU profiler is arriving at some estimate of CPU usage, but we can't easily map that back to goroutines. Seems like that distance could be closed with extensions or enhancements to the Go runtime.
CPU usage in profiles can be traced back to task groups via the profiler labels. Added a section to explain, as per Tobias' suggestion.
Force-pushed from 19decc4 to 9d246e4
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, @bdarnell, @knz, @nvanbenschoten, @petermattis, @RaduBerinde, and @tbg)
docs/RFCS/20210215_per_session_cpu_usage.md, line 419 at r2 (raw file):
allocations were made.

#### Tweaking the Go heap allocator for precise measurement
Goroutine allocations are performed both for temporary state needed during execution, and for persistent state stored in caches. For example, the SQL execution goroutines will allocate memory to populate the replica cache (or whatever that structure is called in DistSender nowadays). I believe there are also metadata caches present in SQL-land itself. Is there an intention to distinguish between this cache memory allocation and temporary state needed during execution?
docs/RFCS/20210215_per_session_cpu_usage.md, line 434 at r2 (raw file):
- we change the mapping of P to `mcache`. Instead of one `mcache` per P, we use one `mcache` per P per `taskGroupCtx` -- i.e. we implement a separate small heap per task group.
How heavyweight are `mcache` structures? This will significantly increase the number of them in a process, which is worrisome.
This is incorrect. As now stated in the text of the RFC, the pprof based approach is unsuitable because:
I left inline comments but as a meta comment, in the discussions of alternatives I perceive you as biased. For example, off-hand dismissing a profiling-based approach as "too expensive" doesn't seem constructive. Someone could similarly dismiss your proposal as "too invasive". I agree that there will be judgment calls (for example, what is the price of a subtle bug introduced by switching to custom heap arenas that we ship to customers?) but the profiler overhead is very simple to measure. I'm not even suggesting you do this, just please try to root both against and for the alternatives when you consider them and reflect that in the verbiage.
By the way, if it seems like I'm rooting only for the profiling approach, that is incorrect. I think that if we're forking the runtime it has to be for very well-substantiated reasons, and the possible alternatives factor heavily into that.
Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, @bdarnell, @knz, @nvanbenschoten, @petermattis, and @RaduBerinde)
docs/RFCS/20210215_per_session_cpu_usage.md, line 526 at r2 (raw file):
manage a custom fork of Go. For reference, Tailscale was founded in 2019 and has a smaller
You may want to reference the details from the thread though:
- they are upstreaming everything, so they wouldn't make any of the changes proposed here. Most of their fork is "this will be in the next Go version but I wanted it now", so
- their fork is correspondingly small
It is unlikely that what we are building here will ever be upstreamed. Our fork will possibly be much larger than theirs.
Tailscale has a smaller engineering team, but they have more Go core expertise.
docs/RFCS/20210215_per_session_cpu_usage.md, line 656 at r2 (raw file):
2. to start a CPU usage report, the program requests `StartCPUProfile` on every server process simultaneously. An output file / buffer
No, this is not necessary. What you want (and this is the same with your approach) is that when a SQL gateway delegates work to another node, it submits metadata that allows identifying the source of the work. For this approach, this means that labels have to be propagated across RPC boundaries at all times (OR we profile all nodes at the same time, but this is brittle).
docs/RFCS/20210215_per_session_cpu_usage.md, line 678 at r2 (raw file):
- The profiling causes frequent interruptions of the goroutines, which impairs performance. It also thrashes CPU caches and generally slows down processing for memory-heavy workloads.
Please give nuance to this argument. It is fair to be concerned about the overhead of the approach, but it is unfair to dismiss it off-hand as "too ineffective". The sampling frequency is configurable and a certain loss of fidelity might be acceptable.
Please also mention an upside of this approach: it is the mainstream Go way. Profiler labels will eventually be supported by the heap profiles as well. There is no need to fork, and tools will become better without requiring resources on our end.
docs/RFCS/20210215_per_session_cpu_usage.md, line 680 at r2 (raw file):
down processing for memory-heavy workloads.
- It is only able to *observe* CPU usage. It cannot stop tasks from
It could be the basis of a system that can. The alternative you are suggesting also cannot stop tasks from consuming more CPU. I don't think you're suggesting that the runtime changes will block allocations or "fail" goroutines (or are you?)
docs/RFCS/20210215_per_session_cpu_usage.md, line 687 at r2 (raw file):
frequency, which impairs performance more.
- The CPU usage output is formatted for consumption by human eyes. It
This is not correct. Profiles are protocol buffers and you can write any analysis on top of them that you would like. Most of it can actually be cribbed from `pprof` itself, if you can't outright use it as a library.
Force-pushed from 9d246e4 to 791c417
This has probably been discussed, but what about forking a separate OS process for each session? Not necessarily a real fork (I can't imagine the Go runtime could be made to handle that) but a subprocess that handles just SQL execution. Like a detached DistSQL processor that is just handed a full query execution spec and routes KV requests to its parent/gateway process and/or makes direct DistSQL connections to other nodes. Then managing the memory/CPU usage of the SQL session would get really simple, and we'd get rid of the complexity of thinking about in-process/same-goroutine kvclient/kvserver situations (if that even happens). It could be an opt-in thing too, where only heavy aggregation queries got booted to their own subprocess. One could even imagine implementing these little SQL workers in multiple languages (Rust, JS, .NET, Java) for extra efficiency, or to open the door for UDFs (maybe even stored procedures) in developers' language of choice (sorry, runaway brain!). This is essentially what browsers do with plugins; although security was the main driver, the security implications might appeal to some customers/use cases. We could also conceivably take advantage of OS features like nice, cgroups, et al. Not a small lift, but I imagine a Go-based prototype could be slapped together fairly quickly. One could also imagine something like this enabling interesting CC free tier implementation options where each tenant gets its own process or something.
That's an intriguing suggestion. Would be curious to hear thoughts from @RaduBerinde and @jordanlewis about that idea.
+1, it is an intriguing suggestion.
In CC free tier, each tenant does get its own process. In fact, each tenant SQL process is in its own k8s pod, with various resource limits applied to it.
The process-per-session approach would simplify the resource tracking problem, but then we'd have the problem of deciding what memory limits to place on each of these processes. There are also various other practical problems. Would the sub-process talk to the parent process for caches, or would we replicate the caches? I'm thinking of caches like the replica cache used by distsender to locate where replicas are in the cluster. Those concerns aside, it is certainly worth thinking about this more.
Another practical problem is the increased CPU and memory cost per session, which necessitates aggressive connection pooling in systems like Postgres.
Agreed this is an interesting idea. It could be challenging to get the IPC right. I'd imagine the child process would send all of its requests back to the parent process to avoid any challenges with cache replication. I'd hope the child processes would be stateless.
Sending all RPCs through the parent process would be quite costly. Reading @cucaroach's original proposal again, I see he was imagining doing this at the DistSQL processor level. I think that could eliminate some of the cache concerns I have.
But the TableReaders need to figure out where to send their KV requests. That's the part I don't quite see how we'd accomplish without something trickier.
Sure, the OOM story gets a lot better though, since the whole node doesn't go down when inevitable OOMs occur. It's been a while since I've looked into how the oomkiller works, but we'd need to be able to influence the scoring to make the SQL workers sacrificial and make the parent process less so. But you can't beat the simplicity of calling getrusage to access resource usage. I'm not sure which caches are important; we can probably get away with IPC back to the parent for most things. Wonder if anyone has taken gRPC to the limit with a shared memory implementation; cutting out the serialization/deserialization might be necessary to make it practical. Unfortunately I don't really have enough of the architecture in my brain yet to think clearly about how feasible this is, but it seems like a lot of pieces are already in place with DistSQL and the kvclient/kvserver split.
Will there be hard memory limits for each process? I am trying to understand if this will result in fragmentation, in that ample unused memory in one process will be unavailable for use in another process.
Isn't that what happens today with free tier CC (pods have hard limits that are a fixed fraction of available space)? Tempting to avoid hard limits and let the workers rip, on the theory that a) oomkiller scoring can be made to preserve the parent process and b) spill-to-disk limits keep most workflows in line. RAM is a terrible thing to waste.
Here's another experiment on that topic. IMO if we go this route, this sort of connection pooling should be built in to CockroachDB. In particular, pgbouncer only helps with this problem if it's in "transaction pooling" mode, but that breaks prepared statements / extended protocol since pgbouncer is not smart enough to keep track of that state. And also in addition to issues like that, I think it's too much overhead to set up and configure this separate service.
OK, so to synthesize the discussion so far:
Meanwhile, regardless of the approach considered, we have a discussion about memory budgets. When using multiple processes, do we set limits and how?
Finally, something not discussed so far but must be mentioned for any process-based solutions: we still also need to think about CPU accounting and budgeting:
Did I miss anything? I can mention this alternative in the RFC, although at this point it's becoming pretty clear that we need two separate RFCs: one for CPU accounting/budgets, and one for memory accounting/budgets.
In addition to connection pooling, another technique we could leverage would be to not do strictly one process per session, i.e. allow multiple client connections per process. For instance, would it make sense to allow one process per user (role?) or database (ignoring the fact that multiple databases can be queried in one session)? In addition to efficiency gains from having multiple connections served by one process, grouping things this way seems like it could match how operators might want to control/monitor resource usage.
There are other variants on this idea, like the one mentioned 2 weeks ago on an internal thread in the context of admission control https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1617571577065900?thread_ts=1617224758.005400&cid=C01SRKWGHG8
Different CPU shares in the scheduler tend to work well enough from what I've observed in the past, and have the advantage of being work-conserving compared to a fixed limit.
Force-pushed from bc5bde2 to 55be05b
I was able to add precise tracking for "large allocations" (above 32KB); see the new explanatory section here. If we look at cockroachdb with this patch in place, we see the
Release note: None
Force-pushed from 55be05b to 943ee5a
Following up on the above:
(the links to the 3 docs are now available in the PR description at the top)
Hi @knz, this PR is of particular interest to me; I believe that resource controls are still important. I see the PR is still open: is this going to move forward in the future?
we want to, yes |
Connecting some dots: see #82356 which proposes similar mechanisms (an instrumented runtime) to get fine-grained CPU attribution, a subset of what this RFC helped define. We're hoping to do something here for 22.2. |
RFC text here:
Shared mini-RFC on task groups: https://github.com/knz/cockroach/blob/20210215-task-group-rfc/docs/RFCS/20210215_task_groups.md
RFC on CPU usage: https://github.com/knz/cockroach/blob/20210215-task-group-rfc/docs/RFCS/20210215_task_group_resource_tracking_cpu.md
RFC on RAM usage: https://github.com/knz/cockroach/blob/20210215-task-group-rfc/docs/RFCS/20210215_task_group_resource_tracking_ram.md
Release note: None