Extracting a band diagonal with KeOps? #24
@bcharlier, @jeanfeydy, reading more about the syntax here https://www.kernel-operations.io/keops/api/math-operations.html, it looks like the operation I need doesn't fit the format you expect. Is there a way around that? Also, does keops support fp16? Thanks
Hi @ibeltagy , Happy to see that you could install everything!

```python
for j in range(-c, c):
    r[:, j] = (t1 * t2[ j : M+j, :]).sum(dim=1)
```

would be a much simpler option, and reasonably efficient. Of course, you would need to zero-pad `t2` appropriately.

As for your second question: today, KeOps only supports fp32 ("float") and fp64 ("double") precisions. Adding support for other float numbers shouldn't be too hard, but we simply never thought about it. What do you think, @bcharlier ?

Best regards, Jean
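A minimal, self-contained version of this zero-padded loop (hypothetical sizes; the column index is shifted so that the diagonal at offset `d` lands in column `c + d`) might look like:

```python
import torch

# Hypothetical sizes; r[i, c + d] = <t1[i], t2[i + d]> for d in [-c, c],
# with zeros whenever i + d falls outside [0, M).
M, D, c = 1000, 64, 16
t1 = torch.randn(M, D)
t2 = torch.randn(M, D)

t2_pad = torch.nn.functional.pad(t2, (0, 0, c, c))   # zero-pad the rows: (M + 2c, D)
r = torch.empty(M, 2 * c + 1)
for k, d in enumerate(range(-c, c + 1)):
    # t2_pad[c + d + i] == t2[i + d] (or zeros outside the valid range)
    r[:, k] = (t1 * t2_pad[c + d : c + d + M, :]).sum(dim=1)
```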
This PyTorch solution is easy, but it is super slow because of the large number of slicing operations.
Another way of doing the same thing is using
As far as I can tell,
But would it be fast? Also, about this:
I saw one example in your tutorials that has loops. Can this solution be implemented in keops? I guess it will result in a very long formula (with lots of sub-formulas, one for each value of
Hi @ibeltagy , I've been thinking about it during the night: thanks for the interesting problem!

```python
import torch
from pykeops.torch import LazyTensor

M, D, c = 100000, 64, 100
t1 = torch.randn(M, D).cuda()
t2 = torch.randn(M, D).cuda()
indices = torch.arange(M).float().cuda()

i = LazyTensor( indices[:,None,None] )
j = LazyTensor( indices[None,:,None] )
x = LazyTensor( t1[:,None,:] )
y = LazyTensor( t2[None,:,:] )

K_ij = (j - i + c).one_hot(2*c+1) * (x | y)
K_ij.ranges = blablabla  # block-sparsity pattern around the band diagonal
r = K_ij.sum(dim=1)
```

will probably be good enough, with streamlined computations and relatively few wasted operations.

Best regards, Jean
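To see what this formula computes, a hypothetical dense PyTorch check (small sizes only, since it materializes the full `M × M` product) can mimic the `one_hot` trick: multiplying by `(j - i + c).one_hot(2*c+1)` and summing over `j` scatters each dot product `<t1[i], t2[j]>` into column `c + j - i` of the result.

```python
import torch

# Small, hypothetical sizes for a brute-force check of the one_hot trick.
M, D, c = 512, 16, 8
t1 = torch.randn(M, D)
t2 = torch.randn(M, D)

dots = t1 @ t2.t()                                                   # (M, M): <t1[i], t2[j]>
offsets = torch.arange(M)[None, :] - torch.arange(M)[:, None] + c   # j - i + c
mask = (offsets >= 0) & (offsets < 2 * c + 1)                        # keep |j - i| <= c
r_dense = torch.zeros(M, 2 * c + 1)
r_dense.scatter_add_(1, offsets.clamp(0, 2 * c), dots * mask)        # scatter into column c + j - i

I, J = 34, 38                          # any pair with |I - J| <= c
print(r_dense[I, c + J - I])           # band entry
print(torch.dot(t1[I], t2[J]))         # should match the direct dot product
```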
@jeanfeydy, that's very cool. Thanks for helping. Let me add two more requirements to make sure that your solution is going to address my use case.

1- A way to handle the boundary conditions. For example,

2- The real use case is that the diagonals I need are not necessarily a contiguous band around the main diagonal. Would it be possible to provide a list of diagonal indices that I care about? The list is fixed and known at compilation time. This is probably your

Thanks again for doing this.
Hi @ibeltagy , I think that 26b4829 does the trick! I'm pretty busy right now and could not fully test everything, but I think that the code below answers your question:

```python
import torch
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
load = lambda x : x.cuda() if use_cuda else x
M, D, c = 100000, 64, 100
t1 = load(torch.randn(M, D))
t2 = load(torch.randn(M, D))
indices = load(torch.arange(M).float())
i = LazyTensor( indices[:,None,None] )
j = LazyTensor( indices[None,:,None] )
x = LazyTensor( t1[:,None,:] )
y = LazyTensor( t2[None,:,:] )
K_ij = (j - i + c).one_hot(2*c+1) * (x | y) # "Kernel" matrix
#r = K_ij.sum(dim=1) # Bruteforce, quadratic computation
if True: # Use a block-sparsity pattern around the band diagonal
    from math import ceil
    Mc = ceil(M / c)  # Number of blocks of size (c,c)
    I = load(torch.arange(Mc).int())
    # The lines below describe a block-sparse pattern of width
    # 3 around the diagonal, with blocks of size (c,c):
    # 1. "i" indices are grouped in clusters of size c:
    ranges_i = torch.stack( (I * c, (I+1) * c) ).clamp(0, M).t().contiguous()  # [[0, c], [c, 2*c], ...]
    # 2. each "i" cluster is associated to one range over "j":
    slices_i = load(torch.arange(1, Mc + 1).int())  # [1, 2, ..., Mc]
    # 3. the reduction ranges are given by [[0, 2*c], [0, 3*c], [c, 4*c], ...]:
    ranges_j = torch.stack( ((I-1) * c, (I+2) * c) ).clamp(0, M).t().contiguous()
    # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
    K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)
r = K_ij.sum(dim=1) # Actual computation
# Test:
I, J = 34, 58
print(r[I, c + J - I])
print(torch.dot( t1[I], t2[J] ))
```

You can try it on Google Colab with the following imports in your first and second cells:

```python
# Install the dependencies for sphinx and KeOps
!pip install numpy GPUtil cmake ninja > install.log
!pip install sphinx-gallery recommonmark sphinxcontrib-httpdomain sphinx_rtd_theme >> install.log
# Download KeOps... And don't forget to update pybind11.
!git clone --recursive https://github.com/getkeops/keops.git keops/ >> install.log
```

```python
# Put KeOps in the current environment
import sys
sys.path.append('/content/keops/')
```

As far as I can tell, this runs pretty fast. What do you think?

Best,
Oooops, I just noticed a small bug in my C++ implementation:
Very cool. I will try this today and get back to you. As for the diagonals, they are evenly spaced and the spacing is an argument. The example above is with spacing=0.
I see! I have updated the instructions in the post above: everything should be fine now.
I ran the code above and it does work. I will test the gradients and do some profiling for the performance and get back to you.
I ran the code vs the naive pytorch solution mentioned earlier, and it seems that

```python
import torch
from math import ceil
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
load = lambda x : x.cuda() if use_cuda else x
import time
# this function is called many times
def compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j):
    i = LazyTensor( indices[:,None,None] )
    j = LazyTensor( indices[None,:,None] )
    x = LazyTensor( t1[:,None,:] )
    y = LazyTensor( t2[None,:,:] )
    K_ij = (j - i + c).one_hot(2 * c + 1) * (x | y)  # "Kernel" matrix
    # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
    K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)
    r = K_ij.sum(dim=1)  # Actual computation
    return r

def compute_mm_diagonals_pytorch(t1, t2, c):
    seq_len = t1.size()[0]
    diagonals_list = []
    for i, d in enumerate(range(-c, c+1)):
        if d >= 0:
            slice_1 = t1.narrow(dim=0, start=0, length=seq_len - d)
            slice_2 = t2.narrow(dim=0, start=d, length=seq_len - d)
            pad_left = 0
            pad_right = d
        else:
            slice_1 = t1.narrow(dim=0, start=-d, length=seq_len + d)
            slice_2 = t2.narrow(dim=0, start=0, length=seq_len + d)
            pad_left = -d
            pad_right = 0
        diagonals_list.append(torch.nn.functional.pad((slice_1 * slice_2).sum(-1), pad=(pad_left, pad_right), value=0))
    r = torch.cat(diagonals_list, dim=-1).view(c * 2 + 1, seq_len).transpose(dim0=0, dim1=1)
    return r
# building the indices and slices happens only once
M, D, c = 16000, 64, 128
indices = load(torch.arange(M).float())
Mc = ceil(M / c) # Number of blocks of size (c,c)
I = load(torch.arange(Mc).int())
# The lines below describe a block-sparse pattern of width
# 3 around the diagonal, with blocks of size (c,c):
# 1. "i" indices are grouped in clusters of size c:
ranges_i = torch.stack( (I * c, (I+1) * c) ).clamp(0, M).t().contiguous() # [[0, c], [c, 2*c], ...]
# 2. each "i" cluster is associated to one range over "j":
slices_i = load(torch.arange(1, Mc + 1).int()) # [1, 2, ..., Mc]
# 3. the reduction ranges are given by [[0, 2*c], [0, 3*c], [c, 4*c], ...]:
ranges_j = torch.stack( ((I-1) * c, (I+2) * c) ).clamp(0, M).t().contiguous()
# call it once to compile
t1 = load(torch.randn(M, D))
t2 = load(torch.randn(M, D))
r = compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j)
# call many times and measure time
time1 = 0
time2 = 0
for i in range(100):
    t1 = load(torch.randn(M, D))
    t2 = load(torch.randn(M, D))
    torch.cuda.synchronize()
    start = time.time()
    r = compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j)
    torch.cuda.synchronize()
    time1 += time.time() - start
    start = time.time()
    r2 = compute_mm_diagonals_pytorch(t1, t2, c)
    torch.cuda.synchronize()
    time2 += time.time() - start
    if not torch.allclose(r, r2, atol=1e-05):
        print('not equal')
print(f'Time keops: {time1} s')
print(f'Time pytorch: {time2} s')
```
Hi @ibeltagy , Indeed! Running some benchmarks (

Best regards,
That would be cool. Thanks
Hi @ibeltagy ,

```python
def compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j):
    i = LazyTensor( indices[:,None,None] )
    j = LazyTensor( indices[None,:,None] )
    x = LazyTensor( t1[:,None,:] )
    y = LazyTensor( t2[None,:,:] )
    K_ij = (x | y)
    # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
    K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)
    r = K_ij.scattered_sum(j - i + c, 2*c+1, dim=1)  # Actual computation
    return r
```

Performances vary with

```
Time keops: 0.7824671268463135 s
Time pytorch: 4.656574964523315 s
```

Now, I "just" have to document all this, implement the gradient and think a bit about the

Best regards,
The gradient is implemented now (9c61d63): I just have to document everything properly, add a unit test and implement the spacing!
The problem seems to be solved. Feel free to re-open the issue if there is any question regarding band diagonal extraction.
Hi @bcharlier and @jeanfeydy. Are you planning to merge the scattered_sum branch into master and add it to the next release?
Hi @awav , Indeed! The main cause for this delay has been the CUDA 11 release this summer, which introduced a significant regression in compilation times for KeOps formulas and made our continuous integration pipeline really impractical to run (from ~1 hour to ~5 hours per commit at some point in September...). From September to December, we prioritized this issue and the release of our NeurIPS paper over the integration of new operations. Out of curiosity, could you maybe tell us a little bit more about your intended use case? I understand that you are mostly working with TensorFlow: if you believe that writing a KeOps binder for TF is doable and worth doing, we'd be glad to get your opinion on this!

Best regards,
Hey @jeanfeydy , I'm trying to get the same diagonal band operation as @ibeltagy . @ibeltagy has in the meantime found a solution using plain pytorch by chunking matrices. This, however, uses 2x more memory than necessary and requires padding the matrices to the window size. I've now tried to get the

Would you recommend using the old branch, or have the changes brought considerable improvements? If so, what would the effort be for integrating the scattered sum reduction in the new codebase? Could you give me pointers on where to start?

Best regards, Ferdinand
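For reference, one padding-based way to compute the same band in plain PyTorch (a sketch only; not necessarily the chunked implementation mentioned above) is to take a strided view of a zero-padded `t2` with `unfold` and contract it against `t1`:

```python
import torch

def band_mm_unfold(t1, t2, c):
    # r[i, c + d] = <t1[i], t2[i + d]> for d in [-c, c], zeros outside bounds.
    M, D = t1.shape
    t2_pad = torch.nn.functional.pad(t2, (0, 0, c, c))  # zero-pad the rows: (M + 2c, D)
    windows = t2_pad.unfold(0, 2 * c + 1, 1)             # strided view: (M, D, 2c+1)
    return torch.einsum('md,mdk->mk', t1, windows)       # (M, 2c+1)
```

`unfold` returns a strided view, so the main extra memory is the padded copy of `t2`; whether the `einsum` materializes the windows depends on the backend, so this is only a starting point for profiling.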
Additionally, when testing the benchmarking code above, I get a "stack smashing detected" error:
I want to do matrix multiplication of 2D tensors where I only care about a few diagonals of the resulting matrix, and I want to run this on GPU in PyTorch. Is this something that can be done with `keops`? For illustration, here's the numpy code, but what I really need is PyTorch/GPU.
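A minimal sketch of the kind of computation intended (hypothetical sizes, assuming a band of half-width `c` around the main diagonal) would form the full `M × M` product and then keep only the central diagonals, which is exactly the wasteful pattern to avoid:

```python
import numpy as np

# Hypothetical sizes: M rows, D features, band half-width c.
M, D, c = 1000, 64, 16
t1 = np.random.randn(M, D)
t2 = np.random.randn(M, D)

full = t1 @ t2.T                          # full (M, M) product -- wasteful
r = np.zeros((M, 2 * c + 1))              # only the 2c+1 central diagonals are kept
for d in range(-c, c + 1):
    diag = np.diagonal(full, offset=d)             # entries full[i, i + d]
    rows = np.arange(max(0, -d), min(M, M - d))    # valid row indices i
    r[rows, d + c] = diag
```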
PS: I am running into compilation errors with the tutorial examples, but will ask about that later.