
How to speed up object conversion? #404

Closed
bluesmoon opened this issue Jun 22, 2017 · 14 comments
@bluesmoon

I'm using PyCall (1.7.2) to run an SQL query against a database and then get the results into Julia. Converting the Python list of tuples to a Julia array of tuples appears to be very slow, and all of the slowness seems to be in iterating through the list elements, i.e., the time is O(n) in the number of items in the list.

Here's some example code. The SQL statement returns exactly 100,000 rows:

Using automatic type conversion

@time rows = cs.cursor[:fetchall]()     # PyCall automatically converts to a Julia array of tuples
157.480336 seconds (24.49 M allocations: 649.669 MB, 0.86% gc time)

@time rowarray = map(collect, rows)     # Convert the tuples to arrays
  0.338664 seconds (1.86 M allocations: 65.706 MB, 37.67% gc time)

length(rowarray)
100000

Getting a PyObject and then converting with map

@time rows = pycall(cs.cursor[:fetchall], PyObject)
  7.685769 seconds (73 allocations: 4.031 KB)

@time rowarray = map(collect, rows)
119.437264 seconds (27.05 M allocations: 745.472 MB, 1.29% gc time)

length(rowarray)
100000

As you can see, calling fetchall() takes 157 seconds and the subsequent map is very fast, whereas calling pycall(fetchall, PyObject) takes 7 seconds and the subsequent map is very slow.

So, wise PyCall devs, is there a way for me to combine the fastest parts of the two approaches? I'm not averse to going as low level as necessary as this level of database slowness is causing us a lot of grief.

PS: I've tried parallelising this with pmap and other low-level Julia parallel functions, but this involves copying the object, which has the same issue.
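For context, the Python side here is just a DB-API cursor: fetchall() hands back a plain list of tuples, and the column metadata lives in cursor.description. A minimal sketch, using an in-memory sqlite3 database as a stand-in for the actual backend (the table and data are invented for illustration):

```python
import sqlite3

# In-memory database as a stand-in for the real SQL backend (assumption:
# the actual driver is DB-API 2.0 compliant, like sqlite3).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, name TEXT, score REAL)")
cur.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(i, f"row{i}", i * 0.5) for i in range(5)])

cur.execute("SELECT id, name, score FROM t")
rows = cur.fetchall()                   # a list of tuples: what PyCall must convert
print(type(rows), type(rows[0]))
print([d[0] for d in cur.description])  # column names, as used via cs.description
```

Every element of every tuple is a separate Python object, which is why any element-by-element conversion on the Julia side is inherently O(n) in the total number of cells.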

@stevengj
Member

If you know that fetchall produces an array of tuples, then you can specify the exact return type by using pycall or @pycall. This will speed things up considerably by eliminating the type introspection. But it will always be O(n), since it has to convert each object.

@bluesmoon
Author

When I do this:

@time rows = pycall(cs.cursor[:fetchall], Array{Tuple})

I get this:

BoundsError

 in convert at /home/ubuntu/.julia/v0.4/PyCall/src/conversions.jl:178
 in py2array at /home/ubuntu/.julia/v0.4/PyCall/src/conversions.jl:312
 in convert at /home/ubuntu/.julia/v0.4/PyCall/src/conversions.jl:377
 in pycall at /home/ubuntu/.julia/v0.4/PyCall/src/PyCall.jl:565

What should I pass in as the type?

@bluesmoon
Author

Actually it looks like fetchall returns an iterator, so I guess I can't just convert it to a Tuple

@bluesmoon
Author

Ok, so what I have is an iterator that correctly returns a Tuple, but it's doing type introspection to determine that it needs to return a tuple. How do I pass a hint about the type to speed that up?

@stevengj
Member

What does the iterator return? A tuple of what size/type?

@bluesmoon
Author

It depends on the query, but for this particular query it is: Tuple{Int64, Int64, Int64, AbstractString, AbstractString, Float64}

@stevengj
Member

stevengj commented Jun 23, 2017

You can do collect(Tuple{Int64, Int64, Int64, String, String, Float64}, pycall(cs.cursor["fetchall"], PyObject)) (passing the desired element type to collect).

@stevengj
Member

With the latest PyCall master, you should be able to do collect(Tuple{Vararg{PyAny}}, pycall(cs.cursor["fetchall"], PyObject)) if you don't know the tuple types. (But in this case it will still do type introspection on each tuple element.)

One thing that has been on my to-do list for a while is to speed up type introspection by caching types in a hash table.
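That caching idea can be sketched generically: memoize the Python-type-to-Julia-type decision so the expensive introspection runs once per distinct type rather than once per element. A hypothetical Python-side illustration (the names and the type table are invented; a real implementation would live inside PyCall's conversion layer):

```python
# Hypothetical sketch of type-introspection caching: the expensive
# per-object type decision is computed once per Python type and reused.
_type_cache = {}
introspections = 0  # counter just to show how often the slow path runs

def slow_introspect(tp):
    """Stand-in for the expensive conversion-rule lookup."""
    global introspections
    introspections += 1
    return {int: "Int64", float: "Float64", str: "String"}.get(tp, "PyObject")

def julia_type_of(obj):
    tp = type(obj)
    if tp not in _type_cache:          # slow path: once per distinct type
        _type_cache[tp] = slow_introspect(tp)
    return _type_cache[tp]             # fast path: plain dict lookup

rows = [(1, "a", 0.5)] * 100_000       # 300,000 cells, but only 3 distinct types
kinds = [tuple(julia_type_of(x) for x in row) for row in rows]
print(introspections)  # 3, not 300000
```

The hash-table lookup stays O(1) per element, so the total cost of deciding types drops from O(cells × introspection cost) to O(cells + distinct types × introspection cost).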

@bluesmoon
Author

Thanks. That didn't change the time it takes, presumably because the call to collect is already operating on the iterator. I've also tried calling pycall(rows[:__getitem__], Tuple{...}, i), but this appears to be slower for the moment. Will let you know if I get anything further.

@bluesmoon
Author

Ok, so I've managed to get huge speedups (90%) by trying several different things:

  1. Instead of cs.cursor[:fetchall](), which returns a list iterator, I use pycall(cs.cursor[:fetchall], PyObject), which returns a PyObject that behaves like a list iterator

  2. Instead of iterating through each row and type converting individual elements using collect or a loop, I do this:

     rowarray = map(1:length(rows)) do i
         row = get(rows, PyObject, i-1)
         return map( j -> get(row, PyObject, j-1), 1:length(cs.description) )
     end
    

    This returns an Array{Array{PyObject, 1}, 1}, which is about 500-1000x faster than doing the type conversion to get Array{Tuple{...}} or Array{Array{Any}}

  3. I then call hcat(rowarray...), which converts the above Array{Array{PyObject, 1}, 1} into an Array{PyObject, 2}

  4. I then call transpose on the resulting array to convert it from Python's row-major format to Julia's column-major format.

  5. I then do the type conversion one column at a time, which is extremely fast, again 500-1000x faster than doing it via get or pycall or type introspection. This is easy to do since I know the type of every column in advance (it's available via cs.description)

Overall, on average, for 100,000 rows and 40 columns I've seen a change from approximately 200 seconds to 17 seconds. Larger datasets have seen even bigger improvements, going from several hours of running time to under 10 minutes.

Hope this helps others.
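The transpose-then-convert-per-column idea in steps 3-5 has a direct Python-side analogue: zip(*rows) regroups a row-major list of tuples into columns, after which each column can be converted in one typed pass. A small sketch with invented data, where the hypothetical coltypes tuple stands in for the per-column type information that cs.description provides:

```python
# Row-major query results, as fetchall() would return them (invented data).
rows = [(1, "3.5", "a"), (2, "4.5", "b"), (3, "5.5", "c")]

# Hypothetical per-column converters, standing in for the column types
# known in advance from cs.description.
coltypes = (int, float, str)

# zip(*rows) is the Python analogue of the transpose in steps 3-4:
# it regroups row-major tuples into columns, so each column can be
# converted with a single typed pass instead of per-element introspection.
columns = [list(map(conv, col)) for conv, col in zip(coltypes, zip(*rows))]
print(columns)
```

The win is the same in both languages: one type decision per column instead of one per cell.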

@stevengj
Member

Why not just do

m = length(rows)
n = length(cs.description)
a = Array{PyObject}(m, n)                # preallocate the full m×n result
for i = 1:m
    row = get(rows, PyObject, i-1)       # row i (0-based on the Python side)
    for j = 1:n
        a[i,j] = get(row, PyObject, j-1)
    end
end

That way, you eliminate all of the constructions of temporary arrays, transposition, etcetera.
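The same eliminate-temporaries pattern, sketched on the Python side with invented stand-in data (hypothetical rows, not the actual query results): preallocate the destination once and fill it in place, instead of building per-row lists and concatenating them afterwards.

```python
rows = [(i, i * 2, i * 3) for i in range(4)]  # stand-in query results
m, n = len(rows), 3

# Preallocate the m-by-n destination once (analogue of Array{PyObject}(m, n)),
# then fill it in place: no per-row temporaries, no hcat/transpose at the end.
a = [[None] * n for _ in range(m)]
for i, row in enumerate(rows):
    for j in range(n):
        a[i][j] = row[j]
print(a)
```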

@bluesmoon
Author

Nice, that does improve performance a little and reduces memory usage.

@bluesmoon
Author

I made a small change to your code and got it a little faster:

m = length(rows)
n = length(cs.description)
a = Array{PyObject}(n, m)        # note: n×m, so the inner loop writes down a column
for i = 1:m
    row = get(rows, PyObject, i-1)
    for j = 1:n
        a[j,i] = get(row, PyObject, j-1)
    end
end

b = transpose(a)                 # back to m×n at the end

This is faster because filling a Julia array one column at a time (in memory order) is faster than filling it one row at a time, and the final transpose costs less than the column-order writes save.

@bluesmoon
Author

Slightly faster and more concise with this:

m = length(rows)
n = length(cs.description)
a = Array{PyObject}(n, m)
for i = 1:m
    a[:, i] = get(rows, PyVector{PyObject}, i-1)
end
