
Accelerate Scheduler with Cython, PyPy, or C #854

Closed
mrocklin opened this issue Feb 3, 2017 · 105 comments

Comments

@mrocklin
Member

mrocklin commented Feb 3, 2017

We are sometimes bound by the administrative overhead of the distributed scheduler. The scheduler is pure Python and a bundle of core data structures (lists, sets, dicts). It generally has an overhead of a few hundred microseconds per task. When graphs become large (hundreds of thousands of tasks) this overhead can become troublesome.

There are a few potential solutions:

  1. Use Cython in a few places
  2. Run the entire scheduler in PyPy (workers, clients, and user code can still be in CPython)
  3. Rewrite everything in C/Go/Julia/whatever

Generally efforts here have to be balanced with the fact that the scheduler will continue to change, and we're likely to continue writing it in Python, so any performance improvement would have the extra constraint that it can't add significant development inertia or friction.

Here are a couple of cProfile-able scripts that stress scheduler performance: https://gist.github.com/mrocklin/eb9ca64813f98946896ec646f0e4a43b
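For a rough sense of the kind of workload those scripts stress, here is a minimal many-small-tasks timing sketch (illustrative only, not the gist's contents; the in-process cluster and task count are assumptions):

import time

from distributed import Client

def inc(x):
    return x + 1

if __name__ == "__main__":
    client = Client(processes=False)  # local, in-process cluster for convenience
    n = 100_000
    start = time.perf_counter()
    futures = client.map(inc, range(n))
    client.gather(futures)
    elapsed = time.perf_counter() - start
    print(f"{n / elapsed:.0f} tasks/second")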

@mrocklin
Member Author

mrocklin commented Feb 3, 2017

Here is another, more real-world benchmark: https://gist.github.com/48b7c4b610db63b2ee816bd387b5a328

Though I plan to try to clean up performance on this one a bit in pure Python first.

@kszucs
Contributor

kszucs commented Feb 3, 2017

Any experience running with PyPy? How big is the difference?

I think it's a bit complicated to run the scheduler on PyPy and the workers with CPython (if you want numpy or pandas).

According to @GaelVaroquaux 's testimonial on Cython:

You guys rock! In scikit-learn, we have decided early on to do Cython, rather than C or C++. That decision has been a clear win because the code is way more maintainable. We have had to convince new contributors that Cython was better for them, but the readability of the code, and the capacity to support multiple Python versions, was worth it.

With Cython it's easier to achieve C-like performance without leaving Python. I've had great experience with it too.

@mrocklin
Member Author

mrocklin commented Feb 3, 2017

My perspective is the following:

The advantage of PyPy is that we would maintain a single codebase. We would also speed up everything for free (including tornado). I'm not normally excited about PyPy; I think that dask's distributed scheduler is an interesting exception. PyPy seems to be fairly common among web projects, and the distributed scheduler looks much more like a web project than a data science project.

The advantage of Cython is that users can install and run it normally using their normal tool chain. Also the community that works on Dask has more experience with Cython than with PyPy. The disadvantage is that we will need to maintain two copies of several functions to support non-Cython users, and things will grow out of date. I do not intend to make Cython a hard dependency.

The advantage of C is that it would force us to reconsider our data structures.

@Kobzol
Contributor

Kobzol commented Feb 3, 2017

I'm using PyPy for the client, workers, and scheduler and it works fine. I also tried running the scheduler on PyPy and the client with CPython, and that worked too.
The only problems arose when the workers and client didn't both use PyPy or CPython, because the serialization/deserialization wasn't compatible (which makes sense).

I think that accelerating the scheduler in this way is a good idea; I just wanted to note that, at least with PyPy, you can try it right away, without any changes to distributed itself :-)

@hussainsultan
Contributor

cc: @marshyski this might be of interest to you from a Go perspective.

@kszucs
Contributor

kszucs commented Feb 4, 2017

@mrocklin

The disadvantage is that we will need to maintain two copies of several functions to support non-Cython users

Did you mean non-CPython users? It doesn't seem impossible to stay compatible with PyPy without code duplication.

I guess Numba is out of scope - the scheduler depends (and probably will continue to depend) on hash-table structures, right?

@mrocklin
Member Author

mrocklin commented Feb 4, 2017

No, I mean non-Cython users. I don't intend to make Dask depend on Cython in the near term.

@mrocklin
Member Author

mrocklin commented Feb 4, 2017

Numba is unlikely to accelerate the scheduler code.

@GaelVaroquaux

GaelVaroquaux commented Feb 4, 2017 via email

@mrocklin
Member Author

mrocklin commented Feb 7, 2017

cc @anton-malakhov

@mrocklin
Member Author

Quick timing update showing PyPy vs CPython for creating typical dask graphs. We find that PyPy falls in the expected 2-5x speedup range we've seen with Cython in the past:

PyPy

>>>> import time
>>>> start = time.time(); d = {('x', i): (apply, lambda x: x + 1, [1, 2, 3, i], {}) for i in range(100000)}; end = time.time(); print(end - start)
0.11324095726

CPython

In [1]: import time

In [2]: start = time.time(); d = {('x', i): (apply, lambda x: x + 1, [1, 2, 3, i], {}) for i in range(100000)}; end = time.time(); print(end - start)
0.357743024826

Depending on what turns up as a bottleneck, I can imagine wanting to Cythonize core parts of dask.array as well (like top).

@GaelVaroquaux

GaelVaroquaux commented Feb 12, 2017 via email

@mrocklin
Member Author

Usually that's because there is some type information that should be added to make Cython faster.

I'm aware. To be clear, my previous comment was comparing PyPy to CPython, not Cython. We observe speedups similar to what we've seen when accelerating data-structure-dominated Python code with Cython in the past, namely 2-5x. In my experience, 100x speedups with Cython only occur when accelerating numeric code, where Python's dynamic dispatch is a much larger fraction of the cost.
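For contrast, a small hypothetical Cython snippet of the numeric case, where static typing removes most of the dynamic dispatch (illustrative only):

# A tight numeric loop: typing the index and accumulator lets Cython emit plain C.
cpdef double total(double[:] values):
    cdef Py_ssize_t i
    cdef double acc = 0.0
    for i in range(values.shape[0]):
        acc += values[i]
    return acc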

@GaelVaroquaux

OK, I had misunderstood your comment.

@honnibal

I'm not sure what you mean by "making Dask depend on Cython". Cython would only be a build-time dependency --- you'd either ship the generated C files via PyPI, Conda, etc., or the user would need a C compiler. It's similar to writing a C extension, just in a better language.

@mrocklin
Member Author

@honnibal of course you're correct. I should have said something like "making the dask development process depend on Cython and the dask source installation process depend on a C compiler" both of which add a non-trivial cost.

@jkterry1

jkterry1 commented Nov 1, 2017

@mrocklin

Couldn't you integrate the cython building into travis before shipping to PyPI, and have a dask and dask-c version in PyPI? That seems like it has all the advantages and none of the costs.

Also, you're completely right that Cython won't be faster than PyPy for your scheduler (or only a tiny bit faster), but my argument against relying on PyPy is that it isn't production-ready for, and/or doesn't support, a lot of important things, including much of the data science ecosystem that people would use Dask with. Cython is already compatible with everything and is used in massive deployments by Google and others.

I also have a related question to all this, which is the reason I stumbled across this thread in the first place:
I'm trying to have dask compute a giant custom DAG of small numeric operators (something Cython will give 100x improvements on). How could I best implement this with dask?

@mrocklin
Member Author

mrocklin commented Nov 1, 2017

Couldn't you integrate the cython building into travis before shipping to PyPI, and have a dask and dask-c version in PyPI?

Yes.

That seems like it has all the advantages and none of the costs.

This adds a significant cost in development and build maintenance.

Also, you're completely right that Cython won't be faster than PyPy for your scheduler (or only a tiny bit faster), but my argument against relying on PyPy is that it isn't production-ready for, and/or doesn't support, a lot of important things, including much of the data science ecosystem that people would use Dask with. Cython is already compatible with everything and is used in massive deployments by Google and others.

It is very hard (and often incorrect) to make claims about one being faster or slower than the other generally. Things are more complex than that.

I also have a related question to all this, which is the reason I stumbled across this thread in the first place: I'm trying to have dask compute a giant custom DAG of small numeric operators (something Cython will give 100x improvements on). How could I best implement this with dask?

I'm going to claim that this is unrelated. Please open a separate issue if you have a bug or ask a question on stack overflow (preferably with a minimal example) if you have a usage question.

@honnibal

honnibal commented Nov 1, 2017

@justinkterry You won't really see a speed benefit from Cython unless you plan out your data structures in C. That's not nearly as hard as people suggest, and I think it actually makes code better, not worse. But it does mean maintaining a separate dask-c fork is really costly.

As far as the Travis build process goes: Yeah, that does work. But the effort of automating the artifact release is really a lot. You have to use both Travis and Appveyor, and Travis's OSX stuff is not very nice, because the problem is hard. I'm also not sure Travis will stay so free for so long. I suspect they're losing a tonne of money.

In general the effort of shipping a Python C extension is really quite a lot. Sometimes I feel like it's harder to build, package and ship my NLP and ML libraries than it is to write them.

If it included C extensions, a library like Dask would have a build matrix with the following dimensions:

  • OS: Windows, Linux, OSX. There's now also tonnes of less standard Linuxes, because of Docker
  • Compiler: GCC, CLang, MinGW, MSVC (various), ICC
  • Python version: 2.7, 3.5, 3.6, 3.7
  • Installation: pip, conda, pip system installation, pip local directory
  • Artifact type: sdist, wheel, build repository
  • Architecture: 32 bit, 64 bit

That's over 1,000 combinations, so you can't test the matrix exhaustively. Building and shipping wheels for all combinations is also really difficult, so a lot of users will have to source install. This means that a lot of things that shouldn't matter do. For instance, the peak memory usage might spike during compilation for some compilers, bringing down small nodes (obviously an important use-case for Dask!). This might happen on some platforms, but not others --- and the breakage might be introduced in a point release when Dask upgraded the version of Cython used to generate the code.

Is it worth it? Well, for me the choice is between writing extensions and going off and doing something completely different. My libraries couldn't exist in pure Python. But for a very marginal benefit, I think you'd rather be shipping a pure-Python library.

Btw, I also think PyPy might not be that helpful? The workers would have to be running CPython, right? Most tasks you want to schedule with Dask will run poorly on PyPy.

@jkterry1

jkterry1 commented Nov 1, 2017

@honnibal Thank you very much for your detailed explanation; that actually helps a lot. What would you recommend doing if I want to call a bunch of functions in dask that would be highly accelerated by C? Just package each one with Cython and call it from the script using dask? My only concern with doing that is the time it takes to cross between Python and C, because it's a very large number of very short functions.

@mrocklin
Member Author

mrocklin commented Nov 1, 2017

What would you recommend doing if I want to call a bunch of functions in dask that would be highly accelerated by C?

@justinkterry I recommend raising a question on Stack Overflow using the #dask tag.

@fijal

fijal commented Nov 11, 2017

Hi Everyone.

So Matt and I did some benchmarks, looked at PyPy, and here are the takeaways:

  • PyPy gives about 40% speedup for free
  • The remaining time is spent, predominantly:
    • 15% doing bytearray += stuff, which relies on refcounting to be fast.
    • 10% looking up stuff in dictionaries. The real figure is probably higher as it would show up in the GC. Using objects here instead of small dicts with known keys would speed things up considerably.
    • 25% of the time in the GC - likely mostly the stuff above, but there is also quite a bit of list copying and resizing.

I think we can get another 2x speedup from PyPy with moderate effort. I can't promise I'll find time immediately, but if someone pesters me at some stage in the near to mid future, I can have a look.

Cheers,
fijal

@mrocklin
Member Author

This provides some motivation to arrange per-task information into objects. It is currently spread across ~20 or so dictionaries. This would have the extra advantage of maybe being clearer to new developers (although this is subjective). We would also have to see what effect this would have on CPython performance.
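As a rough sketch of that direction (attribute names here are illustrative, not the scheduler's actual fields):

# Today: per-task information spread across many dicts keyed by task key.
task_state = {"x-1": "waiting"}
task_priority = {"x-1": 0}
task_nbytes = {"x-1": None}

# Proposed: one object per task holding the same fields.
class TaskState:
    __slots__ = ("state", "priority", "nbytes")

    def __init__(self, state="waiting", priority=0, nbytes=None):
        self.state = state
        self.priority = priority
        self.nbytes = nbytes

tasks = {"x-1": TaskState()}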

A rewrite of this size is feasible, but is also something that we would want to discuss heavily before performing. Presumably there are a number of other improvements that we might want to implement at the same time.

I can't promise I'll find time immediately, but if someone pesters me at some stage in the near to mid future, I can have a look.

My guess is that your time would be better spent providing us with advice on how best to use PyPy during this process; I suspect it would be unpleasant for anyone not already familiar with Dask's task scheduler to actually perform this rewrite.

@njsmith

njsmith commented Nov 11, 2017

@mrocklin as a general note, if you're thinking about moving things from dicts to objects, then the attrs library is really excellent
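For illustration, a minimal attrs-based version of such an object might look like this (field names are hypothetical):

import attr

@attr.s(slots=True)
class TaskState:
    key = attr.ib()
    state = attr.ib(default="waiting")
    priority = attr.ib(default=0)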

@fijal

fijal commented Nov 13, 2017

So on the plus side, I added some small PyPy improvements so it's not as bad when handling bytearray (and it has nothing to do with the handling of refcounts). Where are the bytearrays constructed? It might be better (for CPython too) to create a list and use b"".join instead of having a gigantic bytearray.
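A tiny sketch of the pattern being suggested (the chunks are illustrative):

chunks = [b"a" * 100, b"b" * 100, b"c" * 100]  # illustrative payloads

# Growing a bytearray in place leans on CPython's refcount-based resize trick:
buf = bytearray()
for chunk in chunks:
    buf += chunk

# Collecting pieces in a list and joining once avoids repeated reallocation:
message = b"".join(chunks)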

As for how to get some basics. I do the following:

  • run the program with PYPYLOG=jit-summary:- pypy program. That creates a short summary. What to look for:

    • number of aborts. If there are a lot of aborts, we need to look where. PYPYLOG=jit:log would create a massive log, where ABORT strings can be found. ABORT because trace is too long is normal - don't worry. ABORT because quasi immutable is forced is bad. I looked through the traceback and found it comes from calling a function. Then Matt told me that function comes from cloudpickle, so presumably cloudpickle modifies some function parameters that are supposed to not be modified much.
    • tracing time and backend time. This gives you some indication of warmup - if tracing + backend is a significant part of the total time, either run for longer or PyPy is struggling to warm up.
  • I run PYPYLOG=log pypy program and then run PYTHONPATH=~/pypy python ~/pypy/rpython/tool/logparser.py print-summary - which gives me some basics: 8% of the time in GC (acceptable), 1% JIT tracing (good), the rest execution

  • python -m vmprof --web would upload the vmprof run.

At this stage, I think what we need to do is to take the core functions and make smaller benchmarks, otherwise it's a touch hard to do anything.

My hunch is that a lot of dicts and forest-like code spread across a few functions make it hard for the JIT to make sense of it, and it thus ends up too slow, but it's just a hunch for now.

@pitrou
Member

pitrou commented Nov 13, 2017

I'm not sure we want to go too deep into PyPy-specific tuning. The idea of converting the forest-of-dicts approach to a per-task object scheme sounds reasonable in principle, though I'm not sure how much it would speed things up. We also don't want to risk making CPython slower by accident, as it is our primary platform (and our users' as well).

@fijal

fijal commented Nov 13, 2017

Right, I'm pretty sure PyPy-specific tuning is a terrible idea. It's also rather unlikely to make a measurable difference on CPython. Measuring on PyPy, though, DOES make sense (especially since it runs almost 2x faster in the first place). Do you have benchmarks running on CPython all the time? Because if not, then "making it slower by accident" is a completely moot point.

@pitrou
Member

pitrou commented Nov 13, 2017

We do have a benchmarks suite (and some of us have individual benchmarks they run on a casual basis), but unfortunately we haven't automated its running (yet?).

@pitrou
Member

pitrou commented Nov 13, 2017

Where are the bytearrays constructed? It might be better (for CPython too) to create a list and use b"".join instead of having a gigantic bytearray.

If I'm guessing correctly, it is in Tornado and should be fixed by tornadoweb/tornado#2169

@fijal

fijal commented Nov 13, 2017

Replacing OrderedDict with dict helps a bit on PyPy. I presume we can do that always on PyPy and on CPython >= 3.6? It should help there too.
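A minimal sketch of how that swap could be gated (the guard shown here is an assumption, not a proposal for distributed's actual code):

import sys
import platform
from collections import OrderedDict

# Plain dicts preserve insertion order on PyPy and on CPython >= 3.6.
if platform.python_implementation() == "PyPy" or sys.version_info >= (3, 6):
    ordered_dict = dict
else:
    ordered_dict = OrderedDict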

@mrocklin
Member Author

mrocklin commented May 5, 2020

In all benchmarks, 7-15% of time is spent in dask.sizeof. Seems like a lot. Here are some flamegraphs:

cc @fjetter nothing to do here, but I wanted you to be aware.

It looks like this is happening whenever the scheduler writes a message. We check whether the message is large enough that we should serialize it in a separate thread. In the case of the scheduler all messages should already be pre-serialized, so this check should probably be skipped. This could be solved by passing through some offload= keyword or something similar, although doing this in a way that respects Comms that don't offload (like inproc) might require some care.
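A very rough sketch of the offload= idea (the helper names are hypothetical; this is not distributed's actual comm API):

from dask.sizeof import sizeof

OFFLOAD_THRESHOLD = 10_000_000  # bytes; illustrative value only

async def write_message(comm, msg, serialize, offload_to_thread, offload=True):
    # Hypothetical write path: when the caller passes offload=False because the
    # message is already pre-serialized, skip the sizeof-based check entirely.
    if offload and sizeof(msg) > OFFLOAD_THRESHOLD:
        frames = await offload_to_thread(serialize, msg)  # serialize off-thread
    else:
        frames = serialize(msg)  # serialize inline
    await comm.write(frames)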

For the other components I'm still curious what within those functions is slow. Is it data structure access? Manipulating Task objects? It's hard to dive more deeply to figure out what about CPython itself is slow in particular.

@mrocklin
Member Author

I wanted to check in here and bump this a bit. Two comments.

  1. It seems like a task that we can do right now is to come up with a benchmark suite that stresses the client/scheduler in a variety of ways. I'll list a few here (a rough sketch of the first appears after this list):

    • roundtrip time for a single task
    • bandwidth of many embarrassingly parallel tasks
    • fully sequential workloads
    • random graphs that are both sparsely and densely connected
    • dataframe shuffle
    • array rechunking with x.rechunk((1, 1000)).rechunk((1000, 1))

    Last time we spoke it sounded like this was something that the folks at NVIDIA felt comfortable contributing.

  2. The results by @shwina with Cythonization are very cool. I would be curious to know some of the following:

    • Are the speedups similar if we're not running cProfile, but just timing the results? (Sometimes cProfile can make pure-Python code a bit slower than it would otherwise be.)
    • Are there things that we can do in the scheduler/client to make Cythonizing more effective? If so, what?
    • What would our build process look like with a bit of Cython code?
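As a rough illustration of the first bullet above (roundtrip time for a single task), a minimal sketch; the in-process cluster and iteration count are assumptions:

import time

from distributed import Client

def inc(x):
    return x + 1

if __name__ == "__main__":
    client = Client(processes=False)
    client.submit(inc, 0, pure=False).result()  # warm up
    n = 100
    start = time.perf_counter()
    for i in range(n):
        client.submit(inc, i, pure=False).result()
    print(f"roundtrip: {(time.perf_counter() - start) / n * 1000:.2f} ms/task")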

@quasiben
Member

quasiben commented Sep 8, 2020

@Kobzol did you ever end up profiling things with perf, or did you only use py-spy in the end? No worries if you didn't do a perf run.

@Kobzol
Contributor

Kobzol commented Sep 8, 2020

I didn't use perf on Dask itself, just py-spy.

@mrocklin
Member Author

I have some questions about Cython.

In @shwina's work here, he Cythonized the TaskState and other classes in the Scheduler:

cdef public:
# === General description ===
cdef object actor
# Key name
cdef object key
# Key prefix (see key_split())
cdef object prefix
# How to run the task (None if pure data)
cdef object run_spec
# Alive dependents and dependencies
cdef object dependencies
cdef object dependents
# Compute priority
cdef object priority
# Restrictions
cdef object host_restrictions
cdef object worker_restrictions
cdef object resource_restrictions
cdef object loose_restrictions
# === Task state ===
cdef object _state
# Whether some dependencies were forgotten
cdef object has_lost_dependencies
# If in 'waiting' state, which tasks need to complete
# before we can run
cdef object waiting_on
# If in 'waiting' or 'processing' state, which tasks need us
# to complete before they can run
cdef object waiters
# If in 'processing' state, which worker we are processing on
cdef object processing_on
# If in 'memory' state, which workers have us
cdef object who_has
# Which clients want us
cdef object who_wants
cdef object exception
cdef object traceback
cdef object exception_blame
cdef object suspicious
cdef object retries
cdef object nbytes
cdef object type
cdef object group_key
cdef object group

This yielded only a modest speedup.

I'm curious if it is possible to take this further, and modify the types in the class itself away from object type into something more compound, like the following:

dependencies: Dict[str, TaskState]

More specifically, some technical questions:

Does Cython have the necessary infrastructure to understand compound types like this? Do we instead need to declare types on the various methods like Scheduler.transition_*?

It looks like the previous effort didn't attempt to Cythonize the Scheduler.transition_* methods. This might be a good place to start with future efforts?

cc @jakirkham who I think might be able to answer the Cython questions above and has, if I recall correctly, recent experience doing this with UCX-Py. Also cc @quasiben who has been doing profiling here recently.

@scoder

scoder commented Oct 26, 2020

You can use the dict type, but there is currently no dedicated support for containers with typed values like Dict[str, TaskState]. One of the issues is tracking usages of the "typing" module; another is integrating container item types into the type system. Neither is probably difficult to implement (there is precedent for both, e.g. C++ template support), but it hasn't been implemented yet.

Usually, people type the variables that they assign the results to. Or use type casts. But that introduces either a requirement for Cython syntax or some runtime overhead in pure Python (due to a function call to cython.cast()).

Even without looking at how the transition_*() methods are used, I can say that reducing call overhead is a very worthwhile goal for function/class-heavy code. Nothing you can do in Python comes close to a C call.
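As a small hypothetical illustration of that point, a cdef method reached through a statically typed reference is dispatched as a plain C call (names are illustrative):

cdef class SchedulerState:
    cdef int counter

    cdef void _transition(self, str key):
        # C-level call: no argument tuple packing, no dynamic dispatch.
        self.counter += 1

    def transition(self, str key):
        # Python-visible entry point; internally this is a cheap C call.
        self._transition(key)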

@kkraus14
Member

Given that the dict is used with a pre-defined set of keys, I think we could define a big if, elif, elif, elif, etc. block, which I believe Cython will optimize into a switch case for us. It would look a bit gross code-wise and be a bit unintuitive from a Python perspective.

@scoder

scoder commented Oct 26, 2020

which I believe Cython will optimize into a switch case for us

switch doesn't work with string values. And a Python dict lookup with a str (especially a literal) is actually plenty fast. I doubt that this is a bottleneck here.

@mrocklin
Member Author

Usually, people type the variables that they assign the results to. Or use type casts. But that introduces either a requirement for Cython syntax or some runtime overhead in pure Python (due to a function call to cython.cast()).

Thanks for the response @scoder . I think that we're becoming more comfortable with switching the entire file to Cython. So what I'm hearing is that we'll do something like the following:

cdef class TaskState:
    dependencies: dict

cdef transition_foo_bar(self, ts: TaskState):
    key: str
    value: TaskState
    for key, value in ts.dependencies.items():
        ...

This is more c-like in that we're declaring up-front the types of various variables used within a function. Am I right in understanding that Cython will use these type hints effectively when unpacking key, value in the for loop, or do we need to do more here?

@jakirkham
Member

I'm curious if it is possible to take this further, and modify the types in the class itself away from object type into something more compound, like the following:

dependencies: Dict[str, TaskState]

If we are comfortable moving to more typical Cython syntax, are we also comfortable making use of C++ objects in Cython? For example we could do things like this...

from libcpp.map cimport map
from libcpp.string cimport string

cdef class Obj:
    cdef map[string, int] data

Asking as this would allow us to make the kind of optimizations referred to above.

@mrocklin
Member Author

I would be inclined to do this incrementally and see what is needed. My guess is that there will be value in the attributes of a TaskState object being easy to manipulate in Python as well. This will probably be useful when we engage the networking side of the scheduler, or the Bokeh dashboard.

To me the following somewhat incremental path seems good:

  1. Cythonize the entire scheduler.py file, without using C++
  2. Iterate on that for a while and see how fast we can get while collecting all of the low hanging fruit
  3. If this is fast enough then we stop here.
  4. Split the scheduler.pyx file out into a state machine part and a networking part, which gets converted back into pure Python. In doing so we figure out the right split in the scheduler.
  5. Now that the computational parts are well separated and there is a nice protocol boundary we have a bit more freedom to play with different technologies like C++, Rust, ...

@pitrou
Member

pitrou commented Oct 26, 2020

I'm skeptical that using a C++ map would bring anything. First, you probably want an unordered_map (which is a hash table, rather than a tree as in map). Second, Python dicts are quite optimized as far as hash tables go (for example, the hash value of a string is interned and needn't be recomputed, and strings can often be compared by pointer...).

@jakirkham
Member

Sure, the code above was intended as a sample, not necessarily the optimal solution.

@pitrou
Member

pitrou commented Nov 4, 2020

Slightly orthogonal, but I think the biggest potential speedup for the scheduler would come from vectorizing scheduler operations. That is, instead of having dicts and sets that are iterated on, find a way to express the scheduling algorithm in terms of vector/matrix operations, and use Numpy to accelerate them. Regardless of the implementation language, implementing set operations in terms of hash table operations will always be costly.

(this has several implications, such as having to identify tasks and other entities by integer ids rather than arbitrary dict keys)
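A very rough sketch of what that could look like (integer task ids and the arrays shown are illustrative assumptions):

import numpy as np

n_tasks = 1_000_000
# One entry of state per task, indexed by integer task id.
remaining_deps = np.random.randint(0, 4, size=n_tasks)  # unfinished dependencies
completed = np.zeros(n_tasks, dtype=bool)

# Tasks that become runnable, found in one vectorized pass instead of
# iterating over dicts and sets.
runnable = np.flatnonzero((remaining_deps == 0) & ~completed)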

@fijal

fijal commented Nov 4, 2020 via email

@mrocklin
Member Author

mrocklin commented Nov 4, 2020

Thanks for the comments @pitrou and @fijal (also, it's good to hear from both of you after so long)

I agree that vectorization would probably push us well into the millions-of-tasks-per-second range. If you look at HPC schedulers in academia, you see this kind of throughput. At some point we'll have to think about this, and I look forward to that. I still think that we should be able to do better than the 5k tasks-per-second limit that we have today. Dicts are slow, yes, but not that slow.

@quasiben did some low-level profiling with NVIDIA profilers and found that the majority of our time in CPython wasn't in PyDict_GetItem, but had more to do with attribute access (I think). Previous Cythonization efforts left, I think, some performance on the table. I'd like to explore that to see if we can get up to 50k or so (which would be a huge win for us) before moving on to larger architectural changes.

@fijal

fijal commented Nov 4, 2020 via email

@mrocklin
Member Author

mrocklin commented Nov 4, 2020

Yeah, I'm hoping that we could get the same performance result with Cython. My hope (perhaps naive) is that there are large optimizations here that PyPy wasn't able to find, and that by writing this code manually and inspecting the generated C we might be able to find something that PyPy missed. I'm making that bet mostly based on the belief that dict/attribute access isn't that slow, but I'll admit that I'm making it in ignorance of how awesome PyPy is.

Also, to be clear, I'm not saying "we're going with Cython" but rather "we should explore diving more deeply into Cython and see how it goes"

@fijal

fijal commented Nov 4, 2020 via email

@mrocklin
Member Author

mrocklin commented Nov 4, 2020

Ah! Got it.

@jakirkham
Member

On the interpreter side, conda-forge and Anaconda both build CPython with profile-guided optimizations (PGO). So at least on that side, we are probably as well optimized as we can hope.

@jakirkham
Member

Another option to consider, if this were revisited, would be mypyc, which mypy dogfoods (i.e., mypy compiles itself with mypyc). This is likely a better fit than other approaches, as it uses standard Python type hints, which have increasingly been added to the codebase, and it targets server applications, which Distributed basically is.

That said, mypyc itself is considered alpha (though it has been around for a little while) ( mypyc/mypyc#780 ). Also (and this may be inexperience on my part), it seems to look at all files instead of just the one being requested, and it does not report all errors at once (so one incrementally fixes issues, which can be a bit slow). However, given the other points above, these may improve over time, so it's worth keeping an eye on.
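For context, mypyc compiles ordinary type-annotated Python, so (as a rough, hypothetical illustration) the kind of annotations already being added to the codebase are all it needs:

class TaskState:
    key: str
    priority: int

    def __init__(self, key: str, priority: int = 0) -> None:
        self.key = key
        self.priority = priority

def transition(ts: TaskState) -> str:
    # Plain annotated Python; mypyc uses these hints to generate a C extension.
    return ts.key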

@pitrou
Member

pitrou commented Jun 11, 2022

Given that it's considered alpha, I'm not sure you want to go through debugging potential mypyc bugs on the Distributed source code. Also, I don't see any benchmark results on non-trivial use cases, though I may be missing something.
