-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make pairwise
multi-threaded and make corkendall
and corspearman
wrap pairwise
.
#923
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great. Thanks for contributing such a PR. I'm happy with it but lets keep if open for a couple of days to see if anybody else would like to comment.
src/rankcorr.jl
Outdated
Xjrank = tiedrank(Xj) | ||
C[j,1] = cor(Xjrank, yrank) | ||
function corspearman(x::AbstractMatrix, y::AbstractVector; skipmissing::Symbol=:none) | ||
return corspearman(x, reshape(y, (length(y), 1)); skipmissing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the 1.0 tests are failing because of this way of passing the keyword argument. I'd be okay with bumping the minimum version for StatsBase since 1.0 is ancient but maybe it's better to just make it skipmissing=skipmissing
here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When writing and testing the code (in KendallTau) I was unaware that support for Julia 1.0.5 was necessary, so the test failures when I submitted the PR came as a surprise. Changing how keyword arguments are handled would be easy, as would avoiding trailing backslashes within string literals. But the fact that 1.0.5 does not support eachcol
is a bit more difficult. eachcol
was added in 1.1. For the time being I will work on responding to @devmotion's comments and wait for your guidance as to whether bumping the minimum version might happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done some more investigation:
Other popular packages (the compat section of Project.toml) have the following minimum versions:
1.9.4 for Statistics
1.6 for Plots, DataFrames, Revise, Optim, ForwardDiff, OhMyREPL, CSV
1.3 for Distributions
1 for StatsBase
So upping the minimum version for StatsBase seems sensible.
But as I remarked above, eachcol
is crucial to how corkendall
and corspearman
are mere wrappers to pairwise
. eachcol
was introduced in Julia 1.1, so I imagined that upping the minimum Julia version for StatsBase to 1.1 would suffice. But no - the code appears to require Julia 1.9.4, and as of now does not work with Julia 1.8.5. For example the method check_vectors
fails because:
# Julia 1.9.4
julia> keys((eachcol([1 2;3 4]))[2])
2-element LinearIndices{1, Tuple{Base.OneTo{Int64}}}:
1
2
# Julia 1.8.5
julia> keys((eachcol([1 2;3 4]))[2])
ERROR: MethodError: no method matching getindex(::Base.Generator{Base.OneTo{Int64}, Base.var"#242#243"{Matrix{Int64}}}, ::Int64)
Stacktrace:
[1] top-level scope
@ REPL[1]:1
So the alternatives are:
- A. We increase the minimum version to 1.9.4, which happens to be that for Statistics, but is more modern than the current LTS.
- B. I find a workaround so that this PR requires either no increase in the minimum version number or an increase to 1.6 which is more in line with other popular packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's now the case that increasing the minimum version to 1.6 would work. See here.
src/pairwise.jl
Outdated
#cor and friends are special-cased. | ||
iscor = (f in (corkendall, corspearman, cor)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit unfortunate that this means other functions can't benefit from the improvements (?) since it is not extendable. Maybe an additional argument would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I check that you've understood the purpose of the iscor
variable (which might be better named diag_is_1
)?
It's a Boolean that answers the question "Does x === y
imply f(x,y) == 1.0
, irrespective of NaN
, missing
and Inf
entries, and that therefore there is a speedup to be had by not bothering to call f
when x[i] === y[i]
".
The upshot is that if f
was some new correlation estimator for which the answer to the question above was "yes" but where we failed to update line 15 then we'd miss out on the "diagonal elements are near-zero-cost to calculate" but still have the benefit of multithreading. Of course, an additional argument is possible, but it makes the interface to pairwise
a bit more complicated.
There's also the fact that the author of funky_cor
could simply include the line x===y && return 1.0
in its definition, which remark has got me wondering whether I should do that for corspearman
and corkendall
...
src/pairwise.jl
Outdated
[26, 25, 16, 15, 6, 5] | ||
``` | ||
""" | ||
function equal_sum_subsets(n::Int, num_subsets::Int)::Vector{Vector{Int}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case this function is kept (see comments above), is it necessary to return a vector of vectors or could one return an iterator of vectors instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reasonably sure that equal_sum_subsets
works well in the sense of doing a good job of load balancing. The only reason it currently returns a vector of vectors is that's what I know how to do, so I would need to read up on iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit a1d9ae9 I have replaced the function equal_sum_subsets
with an iterator EqualSumSubsets
whose element type is another iterator TwoStepRange
. It seems to work well and leads to a small reduction in allocations. But please review as this is the first time I've written code to create an iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, didn't work on 32 bit. Maybe I need to replace use of Int64 with Int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit 951b391 fixes EqualSumSubsets
on 32-bit.
No longer use eachcol when VERSION < v"1.9"
I've now made changes so that Would this meet with approval from the StatsBase maintainers? |
The failed checks after commit [46d5655] are all in the codecov step, with error:
IIUC (this page) I need to wait and try again. |
It's been a while... May I ask why this PR has not been merged? Is it: |
I know people have many obligations and PRs/emails get ignored sometimes, but this is a bit of an unfortunate situation. @PGS62 was rebuffed from registering a new package, but now this PR sits languishing. |
This pull request follows up on #849. In that discussion I envisaged amending
corkendall
to be multi-threaded. Instead I have done the following:pairwise
andpairwise!
multi-threaded, and in the case ofskipmissing = :pairwise
, greatly reduced allocations which is needed for good multi-threaded performance. A nice illustration of the performance improvements this leads to is obtained withf = LinearAlgebra.dot
. The example here shows a 31.5 times speedup on a 12 core PC.corkendall
now simply wrapspairwise
but with added methods of_pairwise!
fortypeof(corkendall)
to retain the speedups tocorkendall
that were a result of Speeding up Kendall Correlationcorkendall
#634.corspearman
also wrapspairwise
with added methods of_pairwise!
fortypeof(corspearman)
. The purpose of these added methods is to reduce the number of calls totiedrank
from2nm
to2(n+m)
whenn
andm
are respectively the number of columns in the argumentsx
andy
to the call tocorspearman(x,y...
.Since
corkendall
andcorspearman
now wrappairwise
they have askipmissing
argument, which introduces an additional inconsistency betweencorkandall
/corspearman
andcov
/cor
.I developed the new code in my package KendallTau. Please see the README file for further examples of the performance improvements achieved, which for all but the smallest input data exceed the number of cores on the PC (thanks to reduced allocations).
I have amended the tests as needed, and I believe that test coverage is as close to 100% as possible.
I apologise that there is quite a lot for maintainers to review, and I'll be happy to answer questions.