Use generalized dot product in Mahalanobis distance. #242

bvdmitri · 2022-08-08T09:04:01Z

It would be better to replace dot(z, Q * z) (here https://github.com/JuliaStats/Distances.jl/blob/master/src/mahalanobis.jl#L83 and in other places) with dot(z, Q, z) without storing the intermediate result of Q*z. The problem here is that Q * z allocates and sometimes this allocation might slowdown the overall execution by a lot.

This is the small benchmark:

julia> @btime dot($x, $Q*$x)
  87.190 ns (1 allocation: 80 bytes)
0.850287710619438

julia> @btime dot($x, $Q, $x)
  13.774 ns (0 allocations: 0 bytes)
0.8502877106194378

Another nice addition would be to provide a tmp storage for inplace z = a - b operation (which also allocates). That would help if some executes the mahalanobis distance in a tight for loop.

In our application, simply by changing the dot and providing tmp storage we reduced allocations from 1.107791 seconds (20.21 M allocations: 1.141 GiB, 18.80% gc time) to 0.532219 seconds (10.12 M allocations: 393.585 MiB, 17.05% gc time).

This is the sketch of our implementation (without checks):

struct FastSqMahalanobis{ M <: AbstractMatrix, V <: AbstractVector }
    Q :: M
    z :: V
end

function FastSqMahalanobis(Q::AbstractMatrix) 
    return FastSqMahalanobis(Q, zeros(eltype(Q), size(Q, 1)))
end

function (distance::FastSqMahalanobis)(a::AbstractVector, b::AbstractVector)
    if length(a) != length(b)
        throw(DimensionMismatch("first array has length $(length(a)) which does not match the length of the second, $(length(b))."))
    end

    Q = distance.Q
    z = distance.z

    # inplace z = a - b
    map!(-, z, a, b)

    return dot(z, Q, z)
end

The text was updated successfully, but these errors were encountered:

dkarrasch · 2022-09-14T12:30:27Z

This has been proposed earlier, but it is somewhat controversial. The 3-arg dot has specializations for structured matrices and outperform the 2-arg dot with product in small cases. Unfortunately, for large dense Q, the latter outperforms the former, so it's not clear what to do generically. But having a FastSqMahalanobis type may be a way out, or replace the call to dot by _dot, which generically computes dot(z, Q*z), and in structured cases dot(z, Q, z)?

dkarrasch · 2022-09-14T12:31:38Z

x-ref #173

bvdmitri · 2022-09-14T17:05:54Z

I see you reasoning, but in the link you've provided I see only a small isolated benchmark. It can be true, that a 2 argument dot function might be faster on large inputs (which I suppose can be filed as a performance bug in the Julia itself). The problem, however, is the fact that 2-argument function allocates and creates a lot of unnecessary pressure on the garbage collector in real applications. While it can be faster in an isolated single test, I doubt it will be faster in a real application where you need to perform this operation hundreds or thousands of times and allocate gigabytes of memory (that is what happened to us). Ineffective RAM usage slows down this kind of operation a lot as it requires to run GC more often.

I don't have a computer right now to test how it performs in case when you need to perform dot many many times, but I can benchmark it later,

dkarrasch · 2024-06-10T10:00:22Z

The issue with temporal storage in a generic context is that you never know whether the storage has the right eltype. For instance, you could use a real Q, but multiply with complex vectors. Nothing stops you from doing that. However, in a specific use case, where you have control over the input/intermediate variables, you can define your own metric struct as you have done. But look, you may wish to also compute the colwise distance, then you'll need an intermediate matrix. In a generic context like this package, it's just not clear what storage may be needed for specific computations.

(which I suppose can be filed as a performance bug in the Julia itself)

It's not a performance bug. It is because the 2-arg dot can use BLAS for appropriate eltypes, which performs the intermediate multiplication much faster.

bvdmitri · 2024-06-10T10:38:42Z

It is because the 2-arg dot can use BLAS for appropriate eltypes, which performs the intermediate multiplication much faster.

But nothing really stops the 3-arg dot to also use BLAS given appropriate eltypes or even call 2-arg version in some specific situations (e.g. as you suggested for _dot). Maybe the real winner here is the simd/vectorization of the 2-arg version, but I'm not very experienced in it to comment on this.

dkarrasch · 2024-06-10T11:03:37Z

The promise of 3-arg dot in Base is not to allocate an intermediate vector, so you can't compute first z = Q*y and then dot(x,z). If users don't mind the allocation, they write dot(x, Q*y), if they do, they use dot(x, Q, y). It was the desire to remove intermediate allocations that triggered the 3-arg implementation. dot(x, Q*y) has been available ever since.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use generalized dot product in Mahalanobis distance. #242

Use generalized dot product in Mahalanobis distance. #242

bvdmitri commented Aug 8, 2022

dkarrasch commented Sep 14, 2022

dkarrasch commented Sep 14, 2022

bvdmitri commented Sep 14, 2022

dkarrasch commented Jun 10, 2024

bvdmitri commented Jun 10, 2024

dkarrasch commented Jun 10, 2024

Use generalized dot product in Mahalanobis distance. #242

Use generalized dot product in Mahalanobis distance. #242

Comments

bvdmitri commented Aug 8, 2022

dkarrasch commented Sep 14, 2022

dkarrasch commented Sep 14, 2022

bvdmitri commented Sep 14, 2022

dkarrasch commented Jun 10, 2024

bvdmitri commented Jun 10, 2024

dkarrasch commented Jun 10, 2024