Lazy memory allocation in channel pushers. #394

Open · ryzhyk opened this issue Jun 11, 2021 · 6 comments

@ryzhyk (Contributor) commented Jun 11, 2021

We have an application that builds a very large dataflow graph (thousands of operators) and creates hundreds of concurrently running instances of this graph. Heap profiling showed that the application wastes a lot of memory in channel pushers, which each maintain a buffer of `Message::default_length` entries even when the system is idle with no outstanding messages.

I understand that this is a performance optimization that reduces the number of malloc calls, but the memory overhead was not acceptable for our use case. I created a patch that changes the allocation policy to allocate channel memory (in chunks of `Message::default_length` entries) only when there are messages to send and to deallocate it when there are none, reducing the memory footprint to zero in the idle state: ddlog-dev@a94e176
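
For illustration, here is a minimal sketch of the policy (not the actual timely pusher code; `BUFFER_LEN` stands in for `Message::default_length`, and all names here are made up):

```rust
/// Stand-in for `Message::default_length` (not timely's real constant).
const BUFFER_LEN: usize = 1024;

/// A toy buffer mimicking the lazy policy: no capacity while idle, one chunk
/// allocated on the first record of a burst, and released when the burst ends.
struct LazyBuffer<T> {
    buffer: Vec<T>, // capacity 0 whenever the channel is idle
}

impl<T> LazyBuffer<T> {
    fn new() -> Self {
        LazyBuffer { buffer: Vec::new() } // nothing allocated up front
    }

    /// Accept one record; returns a full chunk to ship downstream, if any.
    fn push(&mut self, item: T) -> Option<Vec<T>> {
        if self.buffer.capacity() == 0 {
            // First record since the last idle period: allocate one chunk now.
            self.buffer.reserve_exact(BUFFER_LEN);
        }
        self.buffer.push(item);
        if self.buffer.len() == BUFFER_LEN {
            // Ship the full chunk; `take` leaves behind an unallocated Vec.
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }

    /// End of a burst: flush anything pending and drop spare capacity,
    /// so the idle footprint returns to zero.
    fn flush(&mut self) -> Option<Vec<T>> {
        if self.buffer.is_empty() {
            self.buffer = Vec::new(); // release any reserved capacity
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}
```

A real pusher would hand the full chunk to the next pusher in the chain and recycle returned buffers; the point of the sketch is only that nothing is held while the channel is idle.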

@frankmcsherry, any chance you'd accept a cleaned-up version of this patch as a PR? I realize that it caters to a somewhat idiosyncratic use case, but it doesn't seem to cause a measurable slowdown in my experiments and it does save memory.

@Kixiron (Contributor) commented Jun 12, 2021

If this being the default behavior is a deal breaker, it can always be gated either behind a feature flag or behind a timely `Config` option.
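
As a rough sketch of the feature-flag variant (the feature name `lazy_pusher_buffers` is made up, it would be declared under `[features]` in Cargo.toml, and the real decision point in timely's pushers may look different):

```rust
/// Choose the initial buffer for a pusher depending on a hypothetical
/// `lazy_pusher_buffers` Cargo feature.
fn initial_buffer<T>(default_len: usize) -> Vec<T> {
    if cfg!(feature = "lazy_pusher_buffers") {
        // Lazy policy: start unallocated; capacity is acquired on first push
        // and released when the channel goes idle.
        Vec::new()
    } else {
        // Current policy: preallocate the full chunk up front.
        Vec::with_capacity(default_len)
    }
}
```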

@frankmcsherry (Member) commented:

Sorry for the delay, I was out / am overwhelmed. :)

The thing you describe sounds like an issue we should fix. If you have a patch, we could totally look at it. It would be a performance regression for some use cases (those that fill these buffers in steady state). Do you have a take on whether your graphs 1. alternate between filling the buffers and then idling on empty buffers, or 2. don't fill the buffers, so that a smaller default / adaptive approach would be better, or maybe a bit of both?
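
To make option 2 concrete, an "adaptive" policy could start small and grow the chunk size only under sustained load; a rough sketch, with all names and numbers purely illustrative:

```rust
const MAX_LEN: usize = 1024; // stand-in for `Message::default_length`
const MIN_LEN: usize = 16;   // small initial allocation

/// A toy buffer whose capacity target grows only when buffers actually fill,
/// so lightly used channels never pay for a full default-sized chunk.
struct AdaptiveBuffer<T> {
    buffer: Vec<T>,
    target: usize, // current chunk size; doubles up to MAX_LEN
}

impl<T> AdaptiveBuffer<T> {
    fn new() -> Self {
        AdaptiveBuffer { buffer: Vec::with_capacity(MIN_LEN), target: MIN_LEN }
    }

    /// Accept one record; ship the buffer when it reaches the current target
    /// and double the target for the next chunk.
    fn push(&mut self, item: T) -> Option<Vec<T>> {
        self.buffer.push(item);
        if self.buffer.len() >= self.target {
            self.target = (self.target * 2).min(MAX_LEN);
            Some(std::mem::replace(&mut self.buffer, Vec::with_capacity(self.target)))
        } else {
            None
        }
    }
}
```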

@frankmcsherry (Member) commented:

Another option, though idk for certain: this could just be a different pact implementation. `Exchange` has an implementation, but it doesn't have to be the pact that you use (we could have both implementations, and users could choose between them as appropriate). That only really makes sense if the exchange channel is where the empty buffers languish, rather than at other moments in the dataflow.
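
For context on where that choice would surface: the pact is the first argument when building a generic operator, so a hypothetical lazily allocating pact could be passed there in place of `Exchange` without touching the rest of the dataflow. A sketch against the 2021-era operator API (the operator itself just passes data through):

```rust
use timely::dataflow::channels::pact::Exchange;
use timely::dataflow::operators::generic::operator::Operator;
use timely::dataflow::operators::{Inspect, ToStream};

fn main() {
    timely::example(|scope| {
        (0u64..10)
            .to_stream(scope)
            // The pact is chosen here; a lazily allocating alternative could be
            // passed in place of `Exchange::new(...)` with no other changes.
            .unary(Exchange::new(|x: &u64| *x), "PassThrough", |_cap, _info| {
                let mut buffer = Vec::new();
                move |input, output| {
                    while let Some((time, data)) = input.next() {
                        data.swap(&mut buffer);
                        let mut session = output.session(&time);
                        for datum in buffer.drain(..) {
                            session.give(datum);
                        }
                    }
                }
            })
            .inspect(|x| println!("saw: {}", x));
    });
}
```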

@frankmcsherry (Member) commented:

There's also, and I'm not sure I want to be that guy, but the possibility that making hundreds of independent copies of a large dataflow graph is just an anti-pattern in TD. I apologize if you've explained this before, but can you explain why this can't be one copy of the large dataflow graph, with keys on the data to indicate which logical instance they belong to?

@ryzhyk (Contributor, author) commented Jun 23, 2021

> It would be a performance regression for some use cases (those that fill these buffers in steady state)

I understand this is true in theory, but I wonder if this actually happens in practice.

> If you have a patch

I do. I'll clean it up and submit as PR, thanks!

I can also try implementing @Kixiron's suggestion and make this a configuration option (which will break backwards compatibility for existing timely applications and add some extra forks in the code).

> 1. alternate between filling the buffers and then idling on empty buffers, or 2. don't fill the buffers and so a smaller default / adaptive approach would be better, or maybe a bit of both?

It's definitely option 1. And I did try a smaller default, which helps, but a lot of memory still gets wasted.

> but can you explain why this can't be one copy of the large dataflow graph, with keys on the data to indicate which logical instance they belong to?

There are several reasons. The most important one is programmability/security. Our users write hundreds of (ddlog) rules and it would be a pain for them to use the extra key in all rules. And even if we provided some idioms to help them get this right, a small typo would cause data to flow across different logical instances, which in the real world translates to sensitive data leaking across different external customers, which is the absolute worst nightmare for the folks we're working with. In fact, we are planning to implement even stronger isolation by encapsulating each dataflow in a separate process.

Aside from this, scalability is another concern. @Kixiron managed to improve the scalability of DDlog-generated code with the number of workers, but it's still nowhere near as good as the near-linear scaling we get when using multiple separate dataflows.

ryzhyk added a commit to ddlog-dev/timely-dataflow that referenced this issue on Jun 25, 2021:

The current implementation of channel pushers preallocates
`Message::default_length` entries per pusher.  For very large graphs this
adds up to a lot of memory being allocated even when the system is idle
with no outstanding messages.  This patch changes the allocation policy
to only allocate channel memory when there are messages to send and to
deallocate it when there are no messages, reducing the memory footprint
to 0 in idle state at the cost of some potential slow-down due to a
larger number of allocations.

See TimelyDataflow#394.
@ryzhyk (Contributor, author) commented Jun 25, 2021

Created a draft PR: #397

ryzhyk added a commit to ddlog-dev/timely-dataflow that referenced this issue on Jun 28, 2021:

The current implementation of channel pushers preallocates
`Message::default_length` entries per pusher.  For very large graphs this
adds up to a lot of memory being allocated even when the system is idle
with no outstanding messages.  This patch changes the allocation policy
to only allocate channel memory when there are messages to send and to
deallocate it at the end of a burst of messages (signaled by pushing
a `None` message), reducing the memory footprint to 0 in idle state at
the cost of some potential slow-down due to a larger number of allocations.

See TimelyDataflow#394.