Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EnsembleGPUKernel + Texture Memory Support #224

Open
agerlach opened this issue Jan 20, 2023 · 20 comments
Open

EnsembleGPUKernel + Texture Memory Support #224

agerlach opened this issue Jan 20, 2023 · 20 comments

Comments

@agerlach
Copy link

@ChrisRackauckas @utkarsh530

Per our conversation here is an example demonstrating how texture memory interpolation could be use.

I would like to be able to leverage CUDA.jl's texture memory support for interpolation of data in the EOM and/or in a callback. A use case could be dropping a ball in a wind field with ground impact termination for a non-flat terrain. Here, one would want to interpolate the wind field as a function of state in the eom as a forcing term and an elevation map as a function of altitude.

Below is an initial prototype. This includes a CPU implementation that leverages DataInterpolations.jl to demonstrate the functionality desired using this data forecast.txt I also included an initial non-working prototype using texture memory.

No interpolation

Working model for CPU and GPU w/o interpolation

using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations

# Ballistic model
function ballistic(u, p, t, weather)
    CdS, mass, g = p
    vel = @view u[4:6]

    wind, ρ = get_weather(weather, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -* CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

# constant wind and airdensity.
function get_weather(weather, z)
    wind = SVector{3}(1f0, 0f0, 0f0)
    ρ = 1.27f0
    wind, ρ
end

trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

eom_nowind = (u, p, t) -> ballistic(u,p,t,nothing)
prob = ODEProblem{false}(eom_nowind, u0, tspan, p)
sol = solve(prob, Tsit5())

prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(CdS_dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

esol = solve(eprob, Tsit5(); trajectories, saveat)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat)

DataInterpolations.jl CPU Example

This demonstrates the basic capability I would like to replicate in w/ EnsembleGPUKernel using CUDA.CuTexture

# Ballistic w/ Interpolated winds via DataInterpolations.jl
data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

interp = DI.LinearInterpolation(weather_sa,data.altitude)

function get_weather(itp::DI.LinearInterpolation, z)
    weather = itp(u[3])
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

eom_di = (u,p,t) -> ballistic(u,p,t,interp)
prob = ODEProblem{false}(eom_di, u0, tspan, p)
sol = solve(prob, Tsit5())

prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)

GPU Texture Interpolation Validation

Demonstrate usage of CuTexture for interpolation. Note, here I index into the texture memory by broadcasting over a CuArray{Float32} of indices via dst_gpu .= getindex.(Ref(texture), idx_tlu)

weather = map(weather_sa) do w
    (w...,)
end

weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N  # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu)  # interpolation ℝ->ℝ⁴

begin
    p1 = scatter(data.altitude, data.altitude, ylabel = "Altitude", label = "Raw", ms = 2, marker = :x)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 1), label = "Texture",lw = 2)

    p2 = scatter(data.altitude, data.windx, ylabel = "Wind X", label = "Raw", ms = 2, marker = :x, leg = false)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 2), label = "Texture",lw = 2)

    p3 = scatter(data.altitude, data.windy, ylabel = "Wind Y", label = "Raw", ms = 2, marker = :x, leg = false)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 3), label = "Texture",lw = 2)

    p4 = scatter(data.altitude, data.density, ylabel = "Density", label = "Raw", ms = 2, marker = :x, leg = false)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 4), label = "Texture",lw = 2)

    plot(p1, p2, p3, p4, link=:x, xlabel = "Altitude")
end

EnsembleGPUKernel + CuTexture prototype

Non-working prototype. Note here the get_weather function is indexing the texture at a single index for a single trajectory which isn't supported by CUDA.jl. Although this is scalar indexing it should actually be occurring for each trajectory in the ensemble.

function get_weather(tex::CuTexture, z, zmax = data.altitude[end], N = length(data.altitude))
    idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
    tex[idx]
end

eom_tex = (u,p,t) -> ballistic(u,p,t,texture)
prob = ODEProblem{false}(eom_tex, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat) #InvalidIR 

ContinousCallback Prototype

The above example only does interpolation in the eom. However, interpolation could also occur in evaluating a callback. e.g. something like this

terrain_texture = CuTexture(...)  # ℝ² -> ℝ, map xy position to ground elevation

condition(u,t,integrator) = u[3] - terrain_texture[@view u[1:2]]. # check height above elevation
cb = ContinuousCallback(condition, terminate!)
@ChrisRackauckas
Copy link
Member

This looks great!

@agerlach
Copy link
Author

agerlach commented Jan 20, 2023

The texture interpolation results should like this

image

@utkarsh530
Copy link
Member

utkarsh530 commented Jan 23, 2023

EnsembleGPUKernel + DataInterpolations.jl

As a first attempt to try out interpolation within GPU ODE solves, I got the MWE working with EnsembleGPUKernel + DataInterpolations.jl.


using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations

## CPU Working Example

# Ballistic model

function ballistic(u, p, t, weather)
    CdS, mass, g = p
    vel = @view u[4:6]

    wind, ρ = get_weather(weather, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end


trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

# Ballistic w/ Interpolated winds via DataInterpolations.jl

data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather_sa = SVector{length(weather_sa)}(weather_sa)

interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)

function get_weather(itp::DI.LinearInterpolation, z)
    weather = itp(z)
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

eom_di = (u,p,t) -> ballistic(u,p,t,interp)
prob = ODEProblem{false}(eom_di, u0, tspan, p)
sol = solve(prob, Tsit5())

prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(CdS_dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)


### solving the ODE on GPU + Interpolation using DataInterpolations

function ballistic_gpu(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    vel = @view u[4:6]

    # tt = interp(0.1f0)[1]

    # @cushow tt

    # wind, ρ = get_weather(interp, u[3], zmax, N)
    wind, ρ = get_weather(interp, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end


prob = ODEProblem(ballistic_gpu, u0, tspan, (p, interp))

prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), interp))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat)

Workaround: Using the default LinearInterpolation object caused some dynamic function invocation. Hence, I changed the interp.u to be a matrix instead of a vector of SVector. Also, I clubbed the interp buffer with parameters as using the lambda function caused some dynamic function invocation.

Attempt to use Texture Memory

I tried to modify the problem definition and investigated where dynamic invocation was happening. I had to change the building of the problem, as mentioned above. I tried to isolate some pieces here:

using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations




trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

###

data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather = map(weather_sa) do w
    (w...,)
end

weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N  # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu)  # interpolation ℝ->ℝ⁴


def_zmax = data.altitude[end]
N = length(data.altitude)
@inline function get_weather(tex, z, zmax, N)
    idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
    weather = tex[idx]
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

### Experimentation

function kernel(texture, idx)
    tid = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @cushow texture[idx[tid]][1]
    return
end

@cuda kernel(texture, idx_tlu)

function ballistic_t(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    zmax = p[3]
    N = p[4]
    vel = @view u[4:6]

    tt = interp[zmax][1]

    @cushow tt

    wind, ρ = get_weather(interp, u[3], zmax, N)

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

prob = ODEProblem(ballistic_t, u0, tspan, (p, texture, def_zmax, N))

function kernel_prob(prob)
    tt = prob.f(prob.u0, prob.p, 1.f0)
    return
end

@cuda kernel_prob(prob) # Works


eprob = EnsembleProblem(prob, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, adptive = false, dt = 0.001f0) #InvalidIR

After performing the solve, I get this error:

ERROR: InvalidIRError: compiling kernel #tsit5_kernel(CuDeviceVector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{6, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{true}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase at /home/gridsan/utkarsh/.julia/packages/SciMLBase/QqtZA/src/scimlfunctions.jl:2096)
Stacktrace:
 [1] tsit5_kernel
   @ ~/.julia/dev/DiffEqGPU/src/perform_step/gpu_tsit5_perform_step.jl:81

This seems a bit strange to me; a simple kernel having prob as a parameter just worked. So I'm not sure what's breaking here?. The dispatch to solve essentially is cu(probs), which is roughly mentioned here: https://docs.sciml.ai/DiffEqGPU/stable/tutorials/lower_level_api/. I can isolate more pieces here, but I'll need a different set of eyes to have a look at it.
Maybe @maleadt can you look at this and help figure out how to use Texture Memory with EnsembleGPUKernel?

@maleadt
Copy link
Contributor

maleadt commented Jan 30, 2023

Inspecting this kernel with Cthulhu (using Cthulthu, @device_code_warntype main(), press h to hide non-problematic code and navigate to the problematic code) reveals that the dynamic IR comes from the following call:

%419 = invoke ODEFunction(::SVector{6, Float32},::Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64},::Float32)::Union{}

That immediately shows the problem: your kernels still contain CPU datastructures (CuTexture, CuTextureArray) which first need to be converted to their GPU counterparts (CuDeviceTextire). Normally this conversion happens automatically when passing such objects to a kernel. In the case of structs containing GPU objects you need to define Adapt rules. It seems that ODEProblem already does so, because the kernel_prob invocation does already convert ODEProblem-contained GPU objects:

#kernel_prob#23(ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}, Float32, Int64}, ODEFunction{false, true, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem})

However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements (because it would otherwise require a download from GPU->CPU, perform the conversion there, allocate a new array, upload again; making GPU kernel launches unacceptably expensive):

#atsit5_kernel(CuDeviceVector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{6, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, Nothing, Val{false})

The simplest solution here would be to pass a tuple of ODEProblems, because for tuples we can do the conversion efficiently. I'm not sure where that change would be needed, but @utkarsh530 or @ChrisRackauckas probably know where this comes from.

@maleadt
Copy link
Contributor

maleadt commented Jan 30, 2023

Hmm, passing these ODE problems as a tuple isn't going to work because of their size:

using Adapt

# HACK: force a GPU array of ODE problems to be passed as a tuple
Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem}) =
    tuple(adapt.(Ref(to), Array(x))...)

... yields uses too much parameter space (0x22c4 bytes, 0x1100 max). So we can keep the array, but as I mentioned that conversion is expensive:

function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
    # first convert the contained ODE problems
    y = CuArray(adapt.(Ref(to), Array(x)))
    # continue doing what the default method does
    Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end

And with that, the kernel launches and runs 🙂

@utkarsh530
Copy link
Member

The array of problems are built here:

if ensembleprob.safetycopy
probs = map(I) do i
ensembleprob.prob_func(deepcopy(ensembleprob.prob), i, 1)
end
else
probs = map(I) do i
ensembleprob.prob_func(ensembleprob.prob, i, 1)
end
end
, and the cu(probs) dispatch is here:
if adaptive
ts, us = vectorized_asolve(cu(probs), ensembleprob.prob, alg;
kwargs..., callback = _callback)
else
ts, us = vectorized_solve(cu(probs), ensembleprob.prob, alg;
kwargs..., callback = _callback)
end

But yes, I understood your point as to why that wasn't working. We can definitely try to figure out how "expensive" it is, but it works fine currently

@utkarsh530
Copy link
Member

utkarsh530 commented Jan 30, 2023

Here's the comparison with interp as DataInterpolations.jl vs CUDA texture memory:

Script
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations

trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

###

data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather = map(weather_sa) do w
    (w...,)
end

weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N  # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu)  # interpolation ℝ->ℝ⁴


def_zmax = data.altitude[end]
N = length(data.altitude)
@inline function get_weather(tex, z, zmax, N)
    idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
    weather = tex[idx]
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

### Experimentation

function ballistic_t(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    zmax = p[3]
    N = p[4]
    vel = @view u[4:6]

    wind, ρ = get_weather(interp, u[3], zmax, N)

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

prob = ODEProblem(ballistic_t, u0, tspan, (p, texture, def_zmax, N))

using Adapt

function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
    # first convert the contained ODE problems
    y = CuArray(adapt.(Ref(to), Array(x)))
    # continue doing what the default method does
    Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end

prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), texture, def_zmax, N))
eprob_texture = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)


## CPU Working Example

# Ballistic model

trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

# Ballistic w/ Interpolated winds via DataInterpolations.jl

data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather_sa = SVector{length(weather_sa)}(weather_sa)

interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)

function get_weather(itp::DI.LinearInterpolation, z)
    weather = itp(z)
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end


### solving the ODE on GPU + Interpolation using DataInterpolations

function ballistic_gpu(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    vel = @view u[4:6]

    wind, ρ = get_weather(interp, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end


prob_interp = ODEProblem(ballistic_gpu, u0, tspan, (p, interp))

prob_func = (prob, i, repeat) -> remake(prob_interp, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), interp))
eprob_interp = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)

Benchmarking:

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 705 samples with 1 evaluation.
 Range (min … max):  5.657 ms … 34.047 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.443 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.090 ms ±  3.021 ms  ┊ GC (mean ± σ):  0.81% ± 3.84%

  ▃▇█▇▄▁
  ███████▆▆▅▆▇▅▅▄▄▄▁▁▁▁▁▁▄▁▁▁▄▁▁▄▁▁▄▁▄▁▁▁▁▁▄▁▄▁▄▄▆▁▁▁▁▁▄▁▁▅▅ ▇
  5.66 ms      Histogram: log(frequency) by time     23.9 ms <

 Memory estimate: 850.06 KiB, allocs estimate: 12149.

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 393 samples with 1 evaluation.
 Range (min … max):   9.966 ms … 48.706 ms  ┊ GC (min … max): 0.00% … 48.42%
 Time  (median):     11.320 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.722 ms ±  5.161 ms  ┊ GC (mean ± σ):  3.32% ±  8.46%

  ▅▇▇█▆
  █████▅▇▅▇▆▄▄▁▁▅▅▄▁▁▄▆▄▅▁▁▁▄▁▄▁▁▄▁▁▁▄▄▄▄▄▁▄▄▄▄▆▄▄▁▁▁▄▄▁▁▁▁▄▅ ▆
  9.97 ms      Histogram: log(frequency) by time      35.7 ms <

 Memory estimate: 4.84 MiB, allocs estimate: 39854.

@agerlach
Copy link
Author

@utkarsh530 is that the right script? There is no eprob_texture in the script you included.

@utkarsh530
Copy link
Member

Sorry, I just updated it.

@agerlach
Copy link
Author

agerlach commented Mar 3, 2023

I am trying to test this on an A100 and I get the following on all the solves.

julia> esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
ERROR: UndefKeywordError: keyword argument dt not assigned
Stacktrace:
 [1] batch_solve_up_kernel(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, DataInterpolations.LinearInterpolation{SMatrix{4, 64, Float32, 256}, LinRange{Float32, Int64}, true, Float32}}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_gpu), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
   @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:382
 [2] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
   @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:345
 [3] macro expansion
   @ ./timing.jl:382 [inlined]
 [4] __solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel; trajectories::Int64, batch_size::Int64, unstable_check::Function, adaptive::Bool, kwargs::Base.Pairs{Symbol, LinRange{Float32, Int64}, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{LinRange{Float32, Int64}}}})
   @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:254
 [5] #solve#33
   @ ~/.julia/packages/DiffEqBase/Lq1gG/src/solve.jl:851 [inlined]
 [6] top-level scope
   @ ~/GPUODEBenchmarks/GPU_ODE_SciML/Texture/wind.jl:132

@agerlach
Copy link
Author

agerlach commented Mar 3, 2023

nvm, up took care of the issue

@agerlach
Copy link
Author

agerlach commented Mar 3, 2023

A100 results for comparison.

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 849 samples with 1 evaluation.
 Range (min … max):  5.573 ms … 33.261 ms  ┊ GC (min … max): 0.00% … 69.47%
 Time  (median):     5.712 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.887 ms ±  1.619 ms  ┊ GC (mean ± σ):  1.40% ±  4.25%

    ▃█▇▅▁          ▁▁▁                                        
  ▃▆█████▆▃▃▄▃▅▅▅▇▇███▆▄▃▄▃▃▃▂▂▁▁▁▁▂▁▁▁▁▁▁▁▂▁▂▃▂▂▃▂▃▂▂▂▃▂▂▁▂ ▃
  5.57 ms        Histogram: frequency by time        6.53 ms <

 Memory estimate: 848.19 KiB, allocs estimate: 12097.

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 379 samples with 1 evaluation.
 Range (min … max):  11.605 ms … 47.908 ms  ┊ GC (min … max): 0.00% … 70.28%
 Time  (median):     12.987 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.203 ms ±  4.656 ms  ┊ GC (mean ± σ):  4.67% ±  9.64%

  ▇ █                                                          
  █▇█▄▄▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▅ ▆
  11.6 ms      Histogram: log(frequency) by time      47.1 ms <

 Memory estimate: 4.84 MiB, allocs estimate: 39802.

@agerlach
Copy link
Author

agerlach commented Mar 3, 2023

@utkarsh530 I am trying to benchmark the Texture memory for different number of trajectories (above only uses 100) trajectories. However, I am noticing when doing GPUTsit5() for large numbers, e.g. 8388608. I have essentially 0% GPU usage according to watch -n 0.2 nvidia-smi with random blips to ~25%, however I essentially have 100% CPU usage all the time. Is it possible that it is spending almost all its time constructing the solution on the CPU? If so, can that be multi-threaded?

@utkarsh530
Copy link
Member

utkarsh530 commented Mar 3, 2023

Generally, that's the case with the EnsembleGPUKernel that competes to solve on GPU is faster than converting back to CPU arrays + building SciMLSolution. Can you try to use https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/ ? It does not perform any expensive operations required on the CPU.

@agerlach
Copy link
Author

agerlach commented Mar 3, 2023

script
using Pkg; cd(@__DIR__); Pkg.activate(".")

using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra, Adapt
import DataInterpolations
const DI = DataInterpolations

function ballistic(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    zmax = p[3]
    N = p[4]
    vel = @view u[4:6]

    wind, ρ = get_weather(interp, u[3], zmax, N)

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -* CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

@inline function get_weather(tex, z, zmax, N)
    idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
    weather = tex[idx]
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

@inline function get_weather(itp::DI.LinearInterpolation, z, zmax, N)
    weather = itp(z)
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
    # first convert the contained ODE problems
    y = CuArray(adapt.(Ref(to), Array(x)))
    # continue doing what the default method does
    Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end

# build interpolants
data = deserialize(joinpath(@__DIR__,"data","forecast.txt"))
N = length(data.altitude)
def_zmax = data.altitude[end]

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather_sa = SVector{length(weather_sa)}(weather_sa)
interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)

weather = map(weather_sa) do w
    (w...,)
end

weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())


### Simulation parameters
trajectories = 10_000
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
p_tx = (p, texture, def_zmax, N)
p_di = (p, interp, def_zmax, N)
CdS_dist = Normal(0f0, 1f0)
prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), prob.p[2:end]...))
### Texture Solve Test
# High Level

prob_tx = ODEProblem(ballistic, u0, tspan, p_tx)
eprob_tx = EnsembleProblem(prob_tx; prob_func, safetycopy = false)
@time esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)

# Low Level
@time begin
    probs_tex = map(1:trajectories) do i
        prob_func(prob_tx, i, false)
    end |> cu

    ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat)
end

### DI Solve Test
# High Level

prob_di = ODEProblem(ballistic, u0, tspan, p_di)
eprob_di = EnsembleProblem(prob_di; prob_func, safetycopy = false)
@time esol_di = solve(eprob_di, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)

# Low Level
@time begin
    probs_di = map(1:trajectories) do i
        prob_func(prob_di, i, false)
    end |> cu

    ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); saveat)
end

# trajs = 8*4 .^(0:9)
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.samples = 3
trajs = 8*4 .^(0:8)
times = map(trajs[end]) do traj

    @show traj
    tx_lo = @benchmark @CUDA.sync begin
        probs_tex = map(1:$traj) do i
            prob_func(prob_tx, i, false)
        end |> cu
    
        ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); $saveat)
    end
    display(tx_lo)

    di_lo = @benchmark @CUDA.sync  begin
        probs_di = map(1:$traj) do i
            prob_func(prob_di, i, false)
        end |> cu
    
        ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); $saveat)
    end
    display(di_lo)
    
    
    tx = @benchmark @CUDA.sync esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat)
    display(tx)
    di = @benchmark @CUDA.sync esol_gpu = solve(eprob_di , GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat)
    display(di)
    dicpu = @benchmark esol_cpu = solve(eprob_di , Tsit5(), EnsembleThreads(); trajectories = $traj, saveat = $saveat)
    display(dicpu)
    (tx    = minimum(tx.times) / 1e6, 
     di    = minimum(di.times) / 1e6, 
     dicpu = minimum(dicpu.times) / 1e6,
     tx_lo = minimum(tx_lo.times) / 1e6,
     di_lo = minimum(di_lo.times) / 1e6)
end

using Plots
begin
    plt = plot(trajs, getindex.(times, :tx), label = "Texture", marker = :utriangle, legend = :topleft, yaxis = :log, xaxis=:log)
    plot!(trajs, getindex.(times, :di), label = "Software", marker = :ltriangle)
    plot!(trajs, getindex.(times, :dicpu), label = "CPU", marker = :square)
    plot!(trajs, getindex.(times, :tx_lo), label = "Texture Low-Level", marker = :dtriangle, legend = :topleft)
    plot!(trajs, getindex.(times, :di_lo), label = "Software Low-Level", marker = :rtriangle)
    xlabel!("Trajectories")
    ylabel!("(ms)")
    plt
end

image

The low-level times includes the time to create and copy the probs. CPU times is using EnsembleThreads() on a 12 core machine. GPU is A100 40GB.

For say 524288 trajectories, if I time just the solve with the low level interface, I get

@time ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat);
 14.335337 seconds (51.38 M allocations: 1.992 GiB)

During this low-level solve I get 100% CPU usage on a single core and 0% usage on the GPU until right before the solve completes where it blips to ~30%.

@utkarsh530
Copy link
Member

However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements (because it would otherwise require a download from GPU->CPU, perform the conversion there, allocate a new array, upload again; making GPU kernel launches unacceptably expensive):

I am not sure, but the 100% CPU usage could be due to expensive GPU kernel launches requiring multiple uploads to GPU and CPU due to the reason Tim pointed out earlier. Even the scaling of the plot with trajectories seems a bit off, compared to plain ODE solves benchmarks. IMHO, we should only compare the time spent on solving the ODE rather than setup (which is generating GPU arrays of ODEProblem being parallelized before the solve, similar to just generating an array of parameters before and uploading them on GPU. (A common setup in other programs)).

@utkarsh530
Copy link
Member

I think @maleadt might be able to comment better here.

@agerlach
Copy link
Author

agerlach commented Mar 3, 2023

After reviewing @maleadt's earlier comments, I wonder if makes sense to have a way to pass some parameters that are "global" to all ensembles or as Refs, so that the conversion needs to only happen once.

@agerlach
Copy link
Author

agerlach commented Mar 6, 2023

So, I've been experimenting with different options for passing the CuTexture w/o incurring the huge conversion overhead and I am at a bit of a loss for what I am observing. Because the texture is constant for all problems in the ensemble, I would think that I should be able to just write an ode in the form eom(u, p, t, texture) and do a closure over the texture. However, I still get a dynamic invocation with CuTexture instead of CuDeviceTexture.

NOTE: I am trying to avoid using the Adapt rule above due to the huge conversion overhead.

Here is a MWE demonstrating the issue

First, create a texture that is just 0 everywhere and verify that it can be used in a closure w/ proper conversions. This works as expected.

using CUDA, DiffEqGPU, OrdinaryDiffEq, StaticArrays, Adapt

data = map(1:5000) do w
    (zeros(Float32, 4)...,)
end

texture = CuTexture( CuTextureArray(data); address_mode = CUDA.ADDRESS_MODE_CLAMP, 
                        normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

idx_gpu = CuArray(LinRange(0f0, 1f0, 4000))
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx_gpu))

function interp!(dst, idx, tex)
    dst .= getindex.(Ref(tex), idx)
end

cl_let = let tex = texture
    (d,i) -> interp!(d, i, tex)
end
cl_let(dst_gpu, idx_gpu);

Next, lets define a simple ODE that accepts a texture as an argument but does nothing with it and solves over a closure of tex

function eom(u, p, t, tex)
    return @SVector zeros(Float32, 4)
end

trajectories = 8192
u0 = @SVector zeros(Float32, 4)
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)

prob_func = (prob, i, repeat) -> remake(prob, u0 = @SVector randn(Float32, 4))

cl = let tex = texture
        (x,p,t)->eom(x, p, t, tex)
end

prob = ODEProblem(cl, u0, tspan)
eprob = EnsembleProblem(prob; prob_func, safetycopy = false)
esol = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)

This solves with no issue. Note: no adapt rule was defined.

Next, lets solve similarly but with an ODE that uses the texture

function eom_tex(u, p, t, tex)
    @inbounds w = tex[0.5] #NTuple of zeros
    return SVector(w...)
end

cl_tex = let tex = texture
    (x,p,t)->eom_tex(x, p, t, tex)
end

prob_tex = ODEProblem(cl_tex, u0, tspan)
eprob_tex = EnsembleProblem(prob_tex; prob_func, safetycopy = false)
esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)

leading to

julia> esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
ERROR: InvalidIRError: compiling kernel #atsit5_kernel(CuDeviceVector{ODEProblem{false,SVector{4, Float32},Tuple{Float32, Float32},…}, 1}, CuDeviceMatrix{SVector{4, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, CuDeviceVector{Float32, 1}, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase 

On inspection I see we still have a CuTexture in the function call

%2 = invoke eom_tex(::SVector{4, Float32},::SciMLBase.NullParameters,::Float32,::CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}})::Union{}

Why? It seems like the proper conversion occurred in the other examples using a closure over the texture. If I add the adapt rule, then it works. However it spends almost all the time converting. I don't understand why it is needed though if the texture is not an explicit parameter and is used in a closure instead.

@maleadt
Copy link
Contributor

maleadt commented Mar 15, 2023

Sorry for the slow response, I needed some time to catch up with my GH notifications 🙂

Why? It seems like the proper conversion occurred in the other examples using a closure over the texture.

That's because in those cases you were invoking a closure directly, and Adapt.jl has rules (Adapt.adapt_structure methods) to recursively convert the captures of a closure. Those rules are then used by CUDA.jl to convert the function (closure) you're invoking, as well as any of its arguments, to a GPU-compatible representation (CuTexture->CuDeviceTexture, etc).

Here, however, a kernel is invoked with an EnsembleProblem argument, which contains an ODEProblem, which contains an ODEFunction, which contains a closure that captures a CuTexture and a CuTextureArray. Although CUDA will use Adapt to try and convert such an argument to a device-compatible representation, there are no Adapt rules defined for these types of objects, so the conversion is a no-op.

So basically, there would need to be Adapt rules for each of these types so that kernel conversion recurses into the objects when the EnsembleProblem is passed to a kernel. Alternatively, the code constructing an EnsembleProblem could manually call cudaconvert on the fields that need to be converted, but depending on how that is done it would still need rules for all but the outermost object type (and also breaks platform portability).

One simple way to add Adapt rules is to use @adapt_structure, but that relies on default constructors being present, and that's not the case for ODEFunction:

julia> Adapt.@adapt_structure EnsembleProblem

julia> Adapt.@adapt_structure ODEProblem

julia> Adapt.@adapt_structure ODEFunction

julia> cudaconvert(eprob_tex)
ERROR: MethodError: no method matching ODEFunction(::var"#14#15"{CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}}, ::LinearAlgebra.UniformScaling{Bool}, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::typeof(SciMLBase.DEFAULT_OBSERVED), ::Nothing, ::Nothing)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants