diff --git a/doc/api/math-operations.rst b/doc/api/math-operations.rst
index 52ca5e495..aae4ad0d5 100644
--- a/doc/api/math-operations.rst
+++ b/doc/api/math-operations.rst
@@ -17,10 +17,10 @@ or
    \beta_j = \operatorname{Reduction}_i\limits \big[ F(x^0_{\iota_0}, ... , x^{n-1}_{\iota_{n-1}})  \big]
 
 where :math:`F` is a symbolic formula, the :math:`x^k_{\iota_k}`'s are vector variables
-and 
+and
 :math:`\text{Reduction}` is a Sum, LogSumExp or any other standard operation (see :ref:`part.reduction` for the full list of supported reductions).
 
-We now describe the symbolic syntax that 
+We now describe the symbolic syntax that
 can be used through all KeOps bindings.
 
 .. _`part.varCategory`:
@@ -29,7 +29,7 @@ Variables: category, index and dimension
 ========================================
 
 
-At a low level, every variable :math:`x^k_{\iota_k}` is specified by its **category** :math:`\iota_k\in\{i,j,\emptyset\}` (meaning that the variable is indexed by :math:`i`, by :math:`j`, or is a fixed parameter across indices), its **positional index** :math:`k` and its **dimension** :math:`d_k`. 
+At a low level, every variable :math:`x^k_{\iota_k}` is specified by its **category** :math:`\iota_k\in\{i,j,\emptyset\}` (meaning that the variable is indexed by :math:`i`, by :math:`j`, or is a fixed parameter across indices), its **positional index** :math:`k` and its **dimension** :math:`d_k`.
 
 In practice, the category :math:`\iota_k` is given through a keyword
 
@@ -78,10 +78,11 @@ To define formulas with KeOps, you can use simple arithmetics:
 Elementary functions:
 
 ======================   =========================================================================================================
+``Minus(f)``              element-wise opposite of ``f``
 ``Inv(f)``                element-wise inverse ``1 ./ f``
 ``Exp(f)``                element-wise exponential function
 ``Log(f)``                element-wise natural logarithm
-``XLogX(f)``              computes ``f * log(f)`` element-wise (with value ``0`` at ``0``) 
+``XLogX(f)``              computes ``f * log(f)`` element-wise (with value ``0`` at ``0``)
 ``Sin(f)``                element-wise sine function
 ``SinXDivX(f)``           function ``sin(f)/f`` element-wise (with value ``1`` at ``0``)
 ``Asin(f)``               element-wise arc-sine function
@@ -172,7 +173,7 @@ Elementary dot products:
 ``MatVecMult(f, g)``                                matrix-vector product ``f x g``: ``f`` is vector interpreted as matrix (column-major), ``g`` is vector
 ``VecMatMult(f, g)``                                vector-matrix product ``f x g``: ``f`` is vector, ``g`` is vector interpreted as matrix (column-major)
 ``TensorProd(f, g)``                                tensor cross product ``f x g^T``: ``f`` and ``g`` are vectors of sizes M and N, output is of size MN.
-``TensorDot(f, g, dimf, dimg, contf, contg)``       tensordot product ``f : g``(similar to `numpy's tensordot <https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html>`_ in the spirit): ``f`` and ``g`` are tensors of sizes listed in ``dimf`` and ``dimg`` :ref:`index sequences <part.reservedWord>` and contracted along the dimensions listed in ``contf`` and ``contg`` :ref:`index sequences <part.reservedWord>`. The ``MatVecMult``, ``VecMatMult`` and ``TensorProd`` operations are special cases of ``TensorDot``.
+``TensorDot(f, g, dimf, dimg, contf, contg)``       tensordot product ``f : g``(similar to `numpy\'s tensordot <https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html>`_ in the spirit): ``f`` and ``g`` are tensors of sizes listed in ``dimf`` and ``dimg`` :ref:`index sequences <part.reservedWord>` and contracted along the dimensions listed in ``contf`` and ``contg`` :ref:`index sequences <part.reservedWord>`. The ``MatVecMult``, ``VecMatMult`` and ``TensorProd`` operations are special cases of ``TensorDot``.
 ==============================================     ====================================================================================================================================================================================================================================================================================
 
 Symbolic gradients:
@@ -194,7 +195,7 @@ The operations that can be used to reduce an array are described in the followin
 code name                    arguments              mathematical expression                                                                                                       remarks
                                                     (reduction over j)
 =========================    =====================  ============================================================================================================================  =========================================================================
-``Sum``                      ``f``                  :math:`\sum_j f_{ij}`                                                                                        
+``Sum``                      ``f``                  :math:`\sum_j f_{ij}`
 ``Max_SumShiftExp``          ``f`` (scalar)         :math:`(m_i,s_i)` with :math:`\left\{\begin{array}{l}m_i=\max_j f_{ij}\\s_i=\sum_j\exp(f_{ij}-m_i)\end{array}\right.`         - core KeOps reduction for ``LogSumExp``.
                                                                                                                                                                                   - gradient is a pseudo-gradient, should not be used by itself
 ``LogSumExp``                ``f`` (scalar)         :math:`\log\left(\sum_j\exp(f_{ij})\right)`                                                                                   only in Python bindings
@@ -209,7 +210,7 @@ code name                    arguments              mathematical expression
 ``ArgMax``                   ``f``                  :math:`\text{argmax}_j f_{ij}`                                                                                                gradient returns zeros
 ``Max_ArgMax``               ``f``                  :math:`\left(\max_j f_{ij},\text{argmax}_j f_{ij}\right)`                                                                     no gradient
 ``KMin``                     ``f``, ``K`` (int)     :math:`\begin{array}{l}\left[\min_j f_{ij},\ldots,\min^{(K)}_jf_{ij}\right]                                                   no gradient
-                                                    \\(\min^{(k)}\text{means k-th smallest value})\end{array}`                                                                     
+                                                    \\(\min^{(k)}\text{means k-th smallest value})\end{array}`
 ``ArgKMin``                  ``f``, ``K`` (int)     :math:`\left[\text{argmin}_jf_{ij},\ldots,\text{argmin}^{(K)}_j f_{ij}\right]`                                                gradient returns zeros
 ``KMin_ArgKMin``             ``f``, ``K`` (int)     :math:`\left([\min^{(1...K)}_j f_{ij} ],[\text{argmin}^{(1...K)}_j f_{ij}]\right)`                                            no gradient
 =========================    =====================  ============================================================================================================================  =========================================================================
diff --git a/doc/cpp/generic-syntax.rst b/doc/cpp/generic-syntax.rst
index ebe2b9b5f..8ebf54eb1 100644
--- a/doc/cpp/generic-syntax.rst
+++ b/doc/cpp/generic-syntax.rst
@@ -26,7 +26,7 @@ Next we define the type of reduction that we want to perform, as follows:
    auto Sum_f = Sum_Reduction(f,0);
 
 
-This means that we want to perform a sum reduction over the "j" indices, resulting in a "i"-indexed output. 
+This means that we want to perform a sum reduction over the "j" indices, resulting in an "i"-indexed output. 
 We would use ``Sum_Reduction(f,1)`` for a reduction over the "i" indices.
 
 The list of available reductions is given in the following table: :ref:`part.reduction`. The code name of the reduction must be followed by ``_Reduction`` in C++ code.
@@ -39,7 +39,7 @@ The convolution operation is then performed using one of these three calls:
    EvalRed<GpuConv1D_FromHost>(Sum_f,Nx, Ny, pres, params, px, py, pu, pv, pb);
    EvalRed<GpuConv2D_FromHost>(Sum_f,Nx, Ny, pres, params, px, py, pu, pv, pb);
 
-where ``pc``, ``pp``, ``pa``, ``px``, and ``py`` are pointers to their respective arrays in (Cpu) memory, ``pc`` denoting the output. These three functions correspond to computations performed repectively on the Cpu, on the Gpu with a "1D" tiling algorithm, and with a "2D" tiling algorithm.
+where ``pc``, ``pp``, ``pa``, ``px``, and ``py`` are pointers to their respective arrays in (Cpu) memory, ``pc`` denoting the output. These three functions correspond to computations performed respectively on the Cpu, on the Gpu with a "1D" tiling algorithm, and with a "2D" tiling algorithm.
 
 For a minimal working example code, see the files
 `./keops/examples/test_simple.cpp <https://github.com/getkeops/keops/tree/master/keops/examples/test_simple.cpp>`_ and
diff --git a/doc/formulas/index.rst b/doc/formulas/index.rst
index 9f9d28feb..5488644ab 100644
--- a/doc/formulas/index.rst
+++ b/doc/formulas/index.rst
@@ -4,7 +4,7 @@
 Generic formulas
 ========================
 
-The two previous sections have higlighted the need for **efficient
+The two previous sections have highlighted the need for **efficient
 Map-Reduce GPU routines** in data sciences. To complete our guided tour of
 the inner workings of the KeOps library, we now explain how
 **generic reductions and formulas** are encoded within our C++
diff --git a/doc/introduction/why_using_keops.rst b/doc/introduction/why_using_keops.rst
index ab9a5fa93..090f8c74c 100644
--- a/doc/introduction/why_using_keops.rst
+++ b/doc/introduction/why_using_keops.rst
@@ -6,14 +6,14 @@ Scalable kernel operations
 
 KeOps can be used on a broad class of problems (:ref:`part.formula`).
 But the first motivation behind this library is simple:
-we intend to accelerate the computation of Gaussian convolutions on point clouds, 
-also known as **RBF kernel products** on sampled data. 
+we intend to accelerate the computation of Gaussian convolutions on point clouds,
+also known as **RBF kernel products** on sampled data.
 
 Working in a vector space :math:`\mathbb{R}^{\mathrm{D}}`, let
 us consider for **large values** of :math:`\mathrm{M}` and :math:`\mathrm{N}`:
 
-- a **target** point cloud :math:`x_1, \dots, x_{\mathrm{M}} \in  \mathbb{R}^{\mathrm{D}}`, encoded as a :math:`\mathrm{M}\times\mathrm{D}` array;
-- a **source** point cloud :math:`y_1, \dots, y_{\mathrm{N}} \in  \mathbb{R}^{\mathrm{D}}`, encoded as a :math:`\mathrm{N}\times\mathrm{D}` array;
+- a **target** point cloud :math:`x_1, \dots, x_{\mathrm{M}} \in  \mathbb{R}^{\mathrm{D}}`, encoded as an :math:`\mathrm{M}\times\mathrm{D}` array;
+- a **source** point cloud :math:`y_1, \dots, y_{\mathrm{N}} \in  \mathbb{R}^{\mathrm{D}}`, encoded as an :math:`\mathrm{N}\times\mathrm{D}` array;
 - a **signal** :math:`b_1, \dots, b_{\mathrm{N}} \in  \mathbb{R}`, attached to the :math:`y_j`'s and encoded as a :math:`\mathrm{N}\times 1` vector.
 
 Then, KeOps allows us to compute efficiently
@@ -40,7 +40,7 @@ in our programs.
 A generic framework
 =========================================
 
-Going further, KeOps supports a wide range of **mathematical** and 
+Going further, KeOps supports a wide range of **mathematical** and
 **deep learning computations**. Let's say that we have at hand:
 
 - a collection :math:`p^1, p^2, ..., p^{\mathrm{P}}` of vectors;
@@ -85,19 +85,19 @@ This type of computation is common in machine learning and applied mathematics:
   K-NN search is a key building block for numerous methods in data sciences, from `simple classifiers <https://scikit-learn.org/stable/modules/neighbors.html>`_ to advanced methods in `topological data analysis <https://en.wikipedia.org/wiki/Topological_data_analysis>`_ and `dimensionality reduction <https://umap-learn.readthedocs.io/en/latest/>`_. KeOps intends to provide fast runtimes for **all types of metrics**, beyond the standard Euclidean distance and cosine similarity: we refer to our :doc:`benchmarks <../_auto_benchmarks/plot_benchmark_KNN>` for an extensive discussion. |br|  |br|
 
 - In **computer graphics** and **geometric deep learning**, we implement
-  **point cloud convolutions** and 
+  **point cloud convolutions** and
   **message passing layers** using a function:
-  
+
   .. math::
     F(p,x_i,y_j,f_j)=\text{Window}(x_i,y_j)\cdot \text{Filter}(p,x_i,y_j,f_j)
-    
+
   that is the product of a neighborhood :math:`\text{Window}(x_i,y_j)` between point positions :math:`x_i`, :math:`y_j` and of a parametric filter that is applied to a collection of feature vectors :math:`f_j`. The reduction or "pooling" operator is usually a (weighted) sum or a maximum.
 
   Most architectures in computer vision rely on K-Nearest Neighbors graphs (":math:`x_i \leftrightarrow y_j`") to define sparse neighborhood windows. These are equal to 1 if :math:`y_j` is one of the closest neighbors of :math:`x_i` and 0 otherwise. The point convolution then reads:
 
   .. math::
     a_i \gets \sum_{\substack{j \text{ such that }\\ x_i \leftrightarrow y_j}} \text{Filter}(p,x_i,y_j,f_j) ~.
-  
+
   Crucially, KeOps now also lets users work with **global point convolutions** without compromising on performances: we refer to the Section 5.3 of our `NeurIPS 2020 paper <http://jeanfeydy.com/Papers/KeOps_NeurIPS_2020.pdf>`_ and to `this presentation <https://www.biorxiv.org/content/10.1101/2020.12.28.424589v1.full.pdf>`_ of quasi-geodesic convolutions on protein surfaces for a detailed discussion. |br|  |br|
 
 - In **natural language processing**,
@@ -117,10 +117,10 @@ This type of computation is common in machine learning and applied mathematics:
 
 - We implement the **Fourier transform** with
   a sum reduction and a complex exponential:
-  
+
   .. math::
     \widehat{f_i} = \widehat{f}(\omega_i)
-    ~\gets~ 
+    ~\gets~
     \sum_{j=1}^{\mathrm{N}}
     \exp(i\langle \omega_i,x_j\rangle)~\cdot~ f_j ~.
 
@@ -141,15 +141,15 @@ This type of computation is common in machine learning and applied mathematics:
     \max_{j=1,\, \dots\,,\,\mathrm{N}} \langle u_i, x_j\rangle - f_j.
 
 
-- In **imaging sciences**, 
-  we implement the `distance transform <https://en.wikipedia.org/wiki/Distance_transform>`_ 
+- In **imaging sciences**,
+  we implement the `distance transform <https://en.wikipedia.org/wiki/Distance_transform>`_
   of a binary mask :math:`m_j = m(y_j) \in \{0, 1\}`
   that is defined on the rectangle domain :math:`[\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!]`
   using a minimum reduction and a squared distance function:
-  
+
   .. math::
     \forall x_i \in [\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!],~~
-    d_i = d(x_i) ~\gets~ 
+    d_i = d(x_i) ~\gets~
     \min_{y_j \in [\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!]}
     \|x_i-y_j\|^2 - \log(m_j) .
 
@@ -157,20 +157,20 @@ This type of computation is common in machine learning and applied mathematics:
   the distance transform is **separable** and can be implemented
   efficiently on 2D and 3D grids.
   Just as with `separable Gaussian convolution <https://en.wikipedia.org/wiki/Gaussian_blur#Implementation>`_,
-  the trick is to apply the transform **successively** on the lines 
+  the trick is to apply the transform **successively** on the lines
   and columns of the image.
   Thanks to its native support for batch processing,
   KeOps is ideally suited to these manipulations:
-  it can be used to implement all types of fast separable transforms 
+  it can be used to implement all types of fast separable transforms
   on the GPU. |br|  |br|
 
 
-- In `optimal transport theory <https://optimaltransport.github.io/book/>`_, 
+- In `optimal transport theory <https://optimaltransport.github.io/book/>`_,
   we implement the **C-transform** using a "min" reduction and a formula :math:`F(x_i,y_j,g_j)=\text{C}(x_i,y_j) -g_j` that penalizes the value of the ground cost function :math:`\text{C}` by that of the dual potential :math:`g` :
 
   .. math::
     a_i \gets \min_{j=1,\, \dots\,,\,\mathrm{N}} \big[ \text{C}(x_i,y_j) - g_j \big],  \qquad i=1,\dots,\mathrm{M}~.
-  
+
   Going further, numerically stable **Sinkhorn iterations** correspond to the case where the minimum in the C-transform is replaced by a (rescaled) log-sum-exp reduction, known as a **soft minimum** at temperature :math:`\varepsilon > 0`:
 
   .. math::
@@ -179,14 +179,14 @@ This type of computation is common in machine learning and applied mathematics:
   As detailed in our `NeurIPS 2020 paper <https://www.jeanfeydy.com/Papers/KeOps_NeurIPS_2020.pdf>`_, KeOps speeds up modern optimal transport solvers by **one to three orders of magnitude**, from standard auction iterations to multiscale Sinkhorn loops. A collection of reference solvers is provided by the `GeomLoss library <https://www.kernel-operations.io/geomloss>`_, that now scales up to millions of samples in seconds. |br|  |br|
 
 - Numerous **particle** and **swarming** models
-  rely on **interaction steps** that fit this template to update the positions and inner states of their agents. For instance, on modest gaming hardware, KeOps can scale up simulations of `Vicsek-like systems <https://en.wikipedia.org/wiki/Vicsek_model>`_ to 
+  rely on **interaction steps** that fit this template to update the positions and inner states of their agents. For instance, on modest gaming hardware, KeOps can scale up simulations of `Vicsek-like systems <https://en.wikipedia.org/wiki/Vicsek_model>`_ to
   `millions of active swimmers and flyers <https://arxiv.org/pdf/2101.10864.pdf>`_: this allows researchers to make original conjectures on their models with a minimal amount of programming effort.
 
 
 
 Crucially, we can understand all these computations as **reductions of "symbolic" matrices** whose coefficients are given, for all indices :math:`i` and :math:`j`, by a mathematical formula :math:`F(p, x_i, y_j)`.
 As detailed on the :doc:`front page <../index>` of this website,
-**the KeOps library is built around this remark**. We introduce a new type of "symbolic" tensor that lets users implement all these operations efficiently, with a small memory footprint. 
+**the KeOps library is built around this remark**. We introduce a new type of "symbolic" tensor that lets users implement all these operations efficiently, with a small memory footprint.
 Under the hood, operations on KeOps :mod:`LazyTensors <pykeops.common.lazy_tensor.GenericLazyTensor>` avoid storing in memory the matrix of values :math:`F(p,x_i,y_j)` and rely instead on fast C++/CUDA routines that are compiled on demand.
 We refer to our :doc:`guided tour of the KeOps++ engine <../engine/index>` for more details.
 
@@ -194,11 +194,11 @@ We refer to our :doc:`guided tour of the KeOps++ engine <../engine/index>` for m
 High performances
 =================
 
-KeOps fits within a thriving ecosystem of Python/C++ libraries for scientific computing. So how does it compare with other acceleration frameworks such as 
-`Numba <https://numba.pydata.org>`_, 
-`Halide <https://halide-lang.org>`_, 
+KeOps fits within a thriving ecosystem of Python/C++ libraries for scientific computing. So how does it compare with other acceleration frameworks such as
+`Numba <https://numba.pydata.org>`_,
+`Halide <https://halide-lang.org>`_,
 `TVM <https://tvm.apache.org>`_,
-`Julia <https://julialang.org>`_ or 
+`Julia <https://julialang.org>`_ or
 `JAX/XLA <https://github.com/google/jax>`_?
 To answer this question, let us now briefly explain the relationship between our library and the wider software environment for tensor computing.
 
@@ -206,26 +206,26 @@ To answer this question, let us now briefly explain the relationship between our
 Tensor computing on the GPU
 ----------------------------
 
-**Fast numerical methods are the fuel of machine learning research.** 
+**Fast numerical methods are the fuel of machine learning research.**
 Over the last decade, the sustained
-development of the CUDA ecosystem has driven the progress in the field: 
+development of the CUDA ecosystem has driven the progress in the field:
 though Python is the lingua
-franca of data science and machine learning, 
+franca of data science and machine learning,
 most frameworks rely on **efficient C++ backends** to
-leverage the computing power of GPUs. 
+leverage the computing power of GPUs.
 Recent advances in computer vision or natural
-language processing attest to the fitness of modern libraries: 
-they stem from the **mix of power and flexibility** 
-that is provided by `PyTorch  <http://pytorch.org>`_, 
-`TensorFlow <http://www.tensorflow.org>`_ and general purpose accelerators such 
+language processing attest to the fitness of modern libraries:
+they stem from the **mix of power and flexibility**
+that is provided by `PyTorch  <http://pytorch.org>`_,
+`TensorFlow <http://www.tensorflow.org>`_ and general purpose accelerators such
 as `JAX/XLA <https://github.com/google/jax>`_.
 
 Nevertheless, **important work remains to be done.** Geometric computations present a clear gap
 in performances between Python and C++: notable examples are implementations of point cloud
-convolutions or of the nearest neighbor search that is discussed above. 
+convolutions or of the nearest neighbor search that is discussed above.
 To scale up geometric computations to
 real-world data, a common practice is therefore to replace the compute-intensive parts of a Python
-code by **handcrafted CUDA kernels**. 
+code by **handcrafted CUDA kernels**.
 These are expensive to develop and maintain, which
 leads to an unfortunate need to **compromise between ease of development and scalability**.
 
@@ -233,8 +233,8 @@ leads to an unfortunate need to **compromise between ease of development and sca
 Related works
 ---------------
 
-**KeOps fixes this issue** for computations that fit 
-the Map-Reduce template of the above section. 
+**KeOps fixes this issue** for computations that fit
+the Map-Reduce template of the above section.
 It is part of a large body of work
 that lowers the :math:`O(\text{N}\text{M})` computational cost of
 such an operation.
@@ -246,8 +246,8 @@ let us now recall the main approaches to this problem.
 
 **Sparse matrices.**
 A first strategy is to prune out negligible terms:
-for every index :math:`i`, we perform the reduction 
-on a subset of neighbors 
+for every index :math:`i`, we perform the reduction
+on a subset of neighbors
 :math:`\mathcal{N}(i)\subset [\![1,\text{N} ]\!]`.
 As illustrated on our :doc:`front page <../index.rst>`,
 this method is akin to using sparse matrices:
@@ -260,7 +260,7 @@ at a low level, truncated reductions rely on
 random memory accesses that **do not stream well on GPUs**.
 Consequently, speed-ups are only achieved if the
 neighborhoods :math:`\mathcal{N}(i)` are orders of magnitude smaller
-than the full set of indices :math:`[\![1,\text{N} ]\!]` 
+than the full set of indices :math:`[\![1,\text{N} ]\!]`
 - a condition that is often too restrictive and cannot be satisfied.
 
 
@@ -268,10 +268,10 @@ than the full set of indices :math:`[\![1,\text{N} ]\!]`
 Going further, the implementation
 of KNN queries is itself a geometric
 problem that fits the "KeOps template".
-When the datasets 
-:math:`(x_i)` and 
+When the datasets
+:math:`(x_i)` and
 :math:`(y_j)` have a small
-intrinsic dimension, 
+intrinsic dimension,
 `efficient approximate schemes <http://ann-benchmarks.com>`_
 can outperform brute-force approaches by a wide margin.
 Unfortunately, these methods tend to rely
@@ -279,8 +279,8 @@ on **pre-computations** that are too expensive to
 be performed at every iteration of a "training loop".
 Reference implementations also tend to lack flexibility
 and only support a **handful of metrics**:
-for instance, in spite of a strong interest for 
-`hyperbolic embeddings <https://hazyresearch.stanford.edu/hyperE/>`_ 
+for instance, in spite of a strong interest for
+`hyperbolic embeddings <https://hazyresearch.stanford.edu/hyperE/>`_
 in the machine learning literature,
 `Poincaré metrics <https://en.wikipedia.org/wiki/Poincaré_metric>`_ are not supported out-of-the-box
 by standard libraries.
@@ -295,20 +295,20 @@ we understand the interaction:
 
 
 as a discrete convolution.
-To speed up this operation, a first idea is to rely on 
-`low-rank decompositions <https://en.wikipedia.org/wiki/Low-rank_matrix_approximations>`_ 
+To speed up this operation, a first idea is to rely on
+`low-rank decompositions <https://en.wikipedia.org/wiki/Low-rank_matrix_approximations>`_
 of the kernel matrix :math:`(K_{i,j})`.
-`Multiscale schemes <https://math.nyu.edu/faculty/greengar/shortcourse_fmm.pdf>`_ 
+`Multiscale schemes <https://math.nyu.edu/faculty/greengar/shortcourse_fmm.pdf>`_
 can be used to handle singular kernels such as the Newton potential
-or 
-`compress generic operators <https://en.wikipedia.org/wiki/Hierarchical_matrix>`_. 
+or
+`compress generic operators <https://en.wikipedia.org/wiki/Hierarchical_matrix>`_.
 Alternatively, semi-Eulerian methods rely on
 intermediate grid representations to leverage
-`fast Fourier transforms <https://www-user.tu-chemnitz.de/~potts/nfft/>`_ 
+`fast Fourier transforms <https://www-user.tu-chemnitz.de/~potts/nfft/>`_
 or convolution routines on grids.
 These approaches can achieve dramatic speed-ups but
 tend to require a significant amount
-of tuning for each kernel :math:`k`. 
+of tuning for each kernel :math:`k`.
 They work best when the
 latter is smooth or is defined on a space of dimension :math:`\text{D} \leqslant 3`.
 
@@ -318,19 +318,19 @@ In contrast to mathematical approaches,
 several compilation frameworks have
 been designed to speed-up machine learning architectures.
 Modern toolboxes accelerate a wide range of operations but
-are **not geared towards geometric problems**: 
-most of them keep a focus on 
-`distributed learning <https://github.com/tensorflow/mesh>`_ 
-or 
+are **not geared towards geometric problems**:
+most of them keep a focus on
+`distributed learning <https://github.com/tensorflow/mesh>`_
+or
 `image processing <https://halide-lang.org>`_
-and 
+and
 `dense tensor manipulations <https://github.com/plaidml/plaidml>`_.
-`TVM <https://tvm.apache.org>`_ and 
+`TVM <https://tvm.apache.org>`_ and
 `CuPy <https://cupy.dev>`_
 are the two libraries which are closer to KeOps:
 they both provide **partial support for symbolic tensors**.
 However, they have limited support for
-automatic differentiation and require the use of a 
+automatic differentiation and require the use of a
 custom low-level syntax to produce efficient binaries.
 
 
@@ -341,26 +341,26 @@ KeOps: a specialized tool
 **Requirements for geometric data analysis and learning.**
 None of the aforementioned methods are fully suited
 for exploratory research in geometric data analysis and machine learning.
-Let us briefly explain why: 
+Let us briefly explain why:
 
-1. First of all, some acceleration schemes 
+1. First of all, some acceleration schemes
    **do not stream well on GPUs** or have to rely on **expensive pre-computations**:
    `hierarchical matrices <https://en.wikipedia.org/wiki/Hierarchical_matrix>`_
-   or `advanced nearest neighbor finders <https://github.com/nmslib/hnswlib>`_ 
+   or `advanced nearest neighbor finders <https://github.com/nmslib/hnswlib>`_
    can hardly be used in the training loop of a neural network.
 
 2. Other strategies make **strong assumptions** on the properties
-   of the convolution filter :math:`k` or on 
+   of the convolution filter :math:`k` or on
    the dimension and geometry of the
-   ambient feature space. 
-   These restrictions make existing tools cumbersome 
-   to use in e.g. deep learning, where one 
+   ambient feature space.
+   These restrictions make existing tools cumbersome
+   to use in e.g. deep learning, where one
    wishes to have **modelling freedom**
-   with respect to the choice of the embedding space geometry and dimension. 
+   with respect to the choice of the embedding space geometry and dimension.
 
 3. Finally, most acceleration frameworks for Python
    expect users to be **knowledgeable on GPU parallelism**
-   or do not support **automatic differentiation**. 
+   or do not support **automatic differentiation**.
 
 The bottomline is that most existing tools are not ready to be used by a majority
 of researchers in the community.
@@ -368,21 +368,21 @@ of researchers in the community.
 **A gap in the literature.**
 In order to tackle these issues,
 the developers of deep learning libraries
-have recently put an emphasis on 
-**just-in-time compilation for neural networks**. 
-For instance, the recent 
-`PyTorch JIT <https://pytorch.org/docs/stable/jit.html>`_ and 
+have recently put an emphasis on
+**just-in-time compilation for neural networks**.
+For instance, the recent
+`PyTorch JIT <https://pytorch.org/docs/stable/jit.html>`_ and
 `XLA <https://www.tensorflow.org/xla>`_ engines enable operator
 fusion and unlock performance speed-ups for research code.
 These **general purpose compilers** are fully transparent to users
-and show promise for a wide range of applications. 
-Nevertheless, 
+and show promise for a wide range of applications.
+Nevertheless,
 **they fall short** on the type of **geometric computations** that are discussed above.
 This is most apparent for
 :doc:`nearest neighbor search <../_auto_benchmarks/plot_benchmark_KNN>`,
-:doc:`matrix-vector products <../_auto_benchmarks/plot_benchmark_convolutions>` 
+:doc:`matrix-vector products <../_auto_benchmarks/plot_benchmark_convolutions>`
 with kernel matrices
-and `message passing methods <https://pytorch-geometric.readthedocs.io/en/latest/>`_ on point clouds, 
+and `message passing methods <https://pytorch-geometric.readthedocs.io/en/latest/>`_ on point clouds,
 where one still has to develop and maintain custom CUDA kernels to achieve state-of-the-art performance.
 
 **A unique position.**
@@ -390,7 +390,7 @@ KeOps intends to fix this
 **specific but important problem** with all the convenient
 features of a modern library.
 We present examples of applications
-in our 
+in our
 :doc:`gallery of tutorials <../_auto_tutorials/index>`
 and discuss its inner workings
 in our
@@ -399,7 +399,7 @@ As evidenced by our :doc:`benchmarks <../_auto_benchmarks/index>`,
 the KeOps routines **outperform** their standard counterparts
 **by two orders of magnitude** in many settings.
 On top of a reduced memory usage, they can thus bring
-a considerable speed-up to numerous methods 
+a considerable speed-up to numerous methods
 in machine learning, computational physics and other applied fields.
 
 
@@ -408,21 +408,21 @@ Is KeOps going to speed-up your program?
 -----------------------------------------
 
 **Strengths.**
-At its heart, KeOps leverages the low 
+At its heart, KeOps leverages the low
 `Kolmogorov complexity <https://en.wikipedia.org/wiki/Kolmogorov_complexity>`_ of symbolic arrays: it can be used when the computational bottleneck
-of a method is an interaction step 
-that fits a simple Map-Reduce template. 
-In practice, it is thus likely to offer gains on runtime and memory usage when 
-the formula :math:`F(x_i,y_j)` is compact 
+of a method is an interaction step
+that fits a simple Map-Reduce template.
+In practice, it is thus likely to offer gains on runtime and memory usage when
+the formula :math:`F(x_i,y_j)` is compact
 and the numbers of samples :math:`\text{M}` and :math:`\text{N}` range from :math:`10^3` to :math:`10^7`.
 
 **Limitations.**
 On the other hand, the main limitations of KeOps stem from the overflow of CUDA registers in the computation of the formula :math:`F(x_i,y_j)`.
 These result in decreased performances on large feature vectors
-of dimension D > 100. 
-The problem is known as 
+of dimension D > 100.
+The problem is known as
 `register spilling <https://en.wikipedia.org/wiki/Register_allocation>`_,
-with some documented but non-trivial work-arounds. 
+with some documented but non-trivial work-arounds.
 Another drawback is that we do not pre-ship binaries
 but instead rely on C++/CUDA compilers to run our kernels.
 
@@ -443,4 +443,3 @@ Among other features, KeOps supports:
 - Batch processing and :doc:`block-wise sparsity masks <../python/sparsity>`.
 - :doc:`High-order derivatives <../_auto_tutorials/surface_registration/plot_LDDMM_Surface>` with respect to all parameters and variables.
 - The resolution of positive definite linear systems using a :doc:`conjugate gradient solver <../_auto_benchmarks/plot_benchmark_invkernel>`.
-
diff --git a/doc/matlab/generic-syntax.rst b/doc/matlab/generic-syntax.rst
index 93e7561a8..4f826bd7f 100644
--- a/doc/matlab/generic-syntax.rst
+++ b/doc/matlab/generic-syntax.rst
@@ -3,7 +3,7 @@ Matlab API
 
 The example described below is implemented in the example Matlab script `script_GenericSyntax.m <https://github.com/getkeops/keops/blob/master/keopslab/examples/script_GenericSyntax.m>`_ located in directory ``keopslab/examples``.
 
-The Matlab bindings provide a function `keops_kernel <https://github.com/getkeops/keops/blob/master/keopslab/generic/keops_kernel.m>_` which can be used to define the corresponding convolution operations. Following the previous example, one may write
+The Matlab bindings provide a function `keops_kernel <https://github.com/getkeops/keops/blob/master/keopslab/generic/keops_kernel.m>`_ which can be used to define the corresponding convolution operations. Following the previous example, one may write
 
 .. code-block:: matlab
      
@@ -16,7 +16,7 @@ which defines a Matlab function handler ``f`` which can be used to perform a sum
     c = f(p,a,x,y);
 
 
-where ``p``, ``a``, ``x``, ``y`` must be arrays with compatible dimensions as previously explained. A gradient function `keops_grad <https://github.com/getkeops/keops/blob/master/keopslab/generic/keops_grad.m>_` is also provided. For example, to get the gradient with respect to ``y`` of the previously defined function ``f``, one needs to write:
+where ``p``, ``a``, ``x``, ``y`` must be arrays with compatible dimensions as previously explained. A gradient function `keops_grad <https://github.com/getkeops/keops/blob/master/keopslab/generic/keops_grad.m>`_ is also provided. For example, to get the gradient with respect to ``y`` of the previously defined function ``f``, one needs to write:
 
 .. code-block:: matlab
     
diff --git a/doc/matlab/installation.rst b/doc/matlab/installation.rst
index f9088c435..8c4158f0e 100644
--- a/doc/matlab/installation.rst
+++ b/doc/matlab/installation.rst
@@ -23,7 +23,7 @@ Packaged version (recommended)
 
 Note that temporary files will be written into the ``/path/to/keops-master/keopslab/build`` folder: this directory must have write permissions.
 
-2. Manually add the directory ``/path/to/keops-master/keopslab`` to you Matlab path, as documented :ref:`below <part.path>`.
+2. Manually add the directory ``/path/to/keops-master/keopslab`` to your Matlab path, as documented :ref:`below <part.path>`.
 
 3. :ref:`Test your installation <part.test>`.
 
@@ -40,7 +40,7 @@ From source using git
 
 Note that temporary files will be written into the ``/path/to/keops/keopslab/build`` folder: this directory must have write permissions.
 
-2. Manually add the directory ``/path/to/keops/keopslab`` to you matlab path: see :ref:`part.path`
+2. Manually add the directory ``/path/to/keops/keopslab`` to your Matlab path: see :ref:`part.path`
 
 3. :ref:`Test your installation <part.test>`.
 
diff --git a/pykeops/common/lazy_tensor.py b/pykeops/common/lazy_tensor.py
index 30c9b6174..baf57bacf 100644
--- a/pykeops/common/lazy_tensor.py
+++ b/pykeops/common/lazy_tensor.py
@@ -261,7 +261,7 @@ def lt_constructor(self, x=None, axis=None):
 
     def get_tools(self):
         r"""This method is specialized in :class:`pykeops.numpy.LazyTensor` and :class:`pykeops.torch.LazyTensor`. It
-        populate the tools class."""
+        populates the tools class."""
         pass
 
     def fixvariables(self):
@@ -317,7 +317,7 @@ def fixvariables(self):
     def separate_kwargs(self, kwargs):
         # separating keyword arguments for Genred init vs Genred call...
         # Currently the only four additional optional keyword arguments that are passed to Genred init
-        # are accuracy options: dtype_acc, use_double_acc and sum_cheme,
+        # are accuracy options: dtype_acc, use_double_acc and sum_scheme,
         # chunk mode option enable_chunks,
         # and compiler option optional_flags.
         kwargs_init = []
@@ -873,7 +873,7 @@ def __call__(self, *args, **kwargs):
         """
         if not hasattr(self, "reduction_op"):
             raise ValueError(
-                "A LazyTensor object may be called only if it corresponds to the ouput of a reduction operation or solve operation."
+                "A LazyTensor object may be called only if it corresponds to the output of a reduction operation or solve operation."
             )
 
         self.kwargs.update(kwargs)
@@ -964,7 +964,7 @@ def dtype(self):
 
     @property
     def _shape(self):
-        r"""returns the internal shape of the LazyTensor."""
+        r"""Returns the internal shape of the LazyTensor."""
         btch = () if self.batchdims is None else self.batchdims
         ni = 1 if self.ni is None else self.ni
         nj = 1 if self.nj is None else self.nj
@@ -973,7 +973,7 @@ def _shape(self):
 
     @property
     def shape(self):
-        r"""returns the shape of the LazyTensor"""
+        r"""Returns the shape of the LazyTensor"""
         s = self._shape
         if s[-1] == 1:
             return s[:-1]
@@ -1064,7 +1064,7 @@ def mulop(self, other, **kwargs):
 
     def __mul__(self, other):
         r"""
-        Broadcasted elementwise product - a binary operation.
+        Broadcasted element-wise product - a binary operation.
 
         ``x * y`` returns a :class:`LazyTensor` that encodes, symbolically,
         the elementwise product of ``x`` and ``y``.
@@ -1084,7 +1084,7 @@ def __mul__(self, other):
 
     def __rmul__(self, other):
         r"""
-        Broadcasted elementwise product - a binary operation.
+        Broadcasted element-wise product - a binary operation.
 
         ``x * y`` returns a :class:`LazyTensor` that encodes, symbolically,
         the elementwise product of ``x`` and ``y``.
@@ -1105,7 +1105,7 @@ def divop(self, other, **kwargs):
 
     def __truediv__(self, other):
         r"""
-        Broadcasted elementwise division - a binary operation.
+        Broadcasted element-wise division - a binary operation.
 
         ``x / y`` returns a :class:`LazyTensor` that encodes, symbolically,
         the elementwise division of ``x`` by ``y``.
@@ -1119,7 +1119,7 @@ def __truediv__(self, other):
 
     def __rtruediv__(self, other):
         r"""
-        Broadcasted elementwise division - a binary operation.
+        Broadcasted element-wise division - a binary operation.
 
         ``x / y`` returns a :class:`LazyTensor` that encodes, symbolically,
         the elementwise division of ``x`` by ``y``.
diff --git a/pykeops/numpy/lazytensor/LazyTensor.py b/pykeops/numpy/lazytensor/LazyTensor.py
index 0b6044abf..ff7eaa49f 100644
--- a/pykeops/numpy/lazytensor/LazyTensor.py
+++ b/pykeops/numpy/lazytensor/LazyTensor.py
@@ -17,21 +17,21 @@ def Var(x_or_ind, dim=None, cat=None):
 
 def Vi(x_or_ind, dim=None):
     r"""
-    Simple wrapper that return an instantiation of :class:`LazyTensor` of type 0.
+    Simple wrapper that returns an instantiation of :class:`LazyTensor` of type 0.
     """
     return Var(x_or_ind, dim, 0)
 
 
 def Vj(x_or_ind, dim=None):
     r"""
-    Simple wrapper that return an instantiation of :class:`LazyTensor` of type 1.
+    Simple wrapper that returns an instantiation of :class:`LazyTensor` of type 1.
     """
     return Var(x_or_ind, dim, 1)
 
 
 def Pm(x_or_ind, dim=None):
     r"""
-    Simple wrapper that return an instantiation of :class:`LazyTensor` of type 2.
+    Simple wrapper that returns an instantiation of :class:`LazyTensor` of type 2.
     """
     return Var(x_or_ind, dim, 2)
 
diff --git a/pykeops/torch/lazytensor/LazyTensor.py b/pykeops/torch/lazytensor/LazyTensor.py
index b7a422d75..656b7cbc2 100644
--- a/pykeops/torch/lazytensor/LazyTensor.py
+++ b/pykeops/torch/lazytensor/LazyTensor.py
@@ -18,21 +18,21 @@ def Var(x_or_ind, dim=None, cat=None):
 
 def Vi(x_or_ind, dim=None):
     r"""
-    Simple wrapper that return an instantiation of :class:`LazyTensor` of type 0.
+    Simple wrapper that returns an instantiation of :class:`LazyTensor` of type 0.
     """
     return Var(x_or_ind, dim, 0)
 
 
 def Vj(x_or_ind, dim=None):
     r"""
-    Simple wrapper that return an instantiation of :class:`LazyTensor` of type 1.
+    Simple wrapper that returns an instantiation of :class:`LazyTensor` of type 1.
     """
     return Var(x_or_ind, dim, 1)
 
 
 def Pm(x_or_ind, dim=None):
     r"""
-    Simple wrapper that return an instantiation of :class:`LazyTensor` of type 2.
+    Simple wrapper that returns an instantiation of :class:`LazyTensor` of type 2.
     """
     return Var(x_or_ind, dim, 2)
 
diff --git a/rkeops/doc/using_rkeops.rst b/rkeops/doc/using_rkeops.rst
index ce3e122a5..173acccdf 100644
--- a/rkeops/doc/using_rkeops.rst
+++ b/rkeops/doc/using_rkeops.rst
@@ -122,8 +122,8 @@ it was only tested on Linux and MacOS.
 
 .. code:: r
 
-    devtools::install_git("https://github.com/getkeops/keops", 
-                          subdir = "rkeops", 
+    devtools::install_git("https://github.com/getkeops/keops",
+                          subdir = "rkeops",
                           args="--recursive")
     # not possible to use `devtools::intall_github()` because of the required submodule
 
@@ -203,7 +203,7 @@ Load RKeOps in R:
 .. code:: r
 
     library(rkeops)
-    ## 
+    ##
     ## You are using rkeops version 1.4.2
 
 .. raw:: html
@@ -232,7 +232,7 @@ on GPU.
     args = c("x = Vi(3)",      # vector indexed by i (of dim 3)
              "y = Vj(3)",      # vector indexed by j (of dim 3)
              "b = Vj(6)",      # vector indexed by j (of dim 6)
-             "s = Pm(1)")      # parameter (scalar) 
+             "s = Pm(1)")      # parameter (scalar)
     # compilation
     op <- keops_kernel(formula, args)
     # data and parameter values
@@ -282,7 +282,7 @@ mathematical operations. It should be characterized by two elements:
    entries are defined by 1.
 
 | RKeOps implements a wide range of mathematical operators and
-  reduction: please refers to
+  reduction: please refer to
   https://www.kernel-operations.io/keops/api/math-operations.html for
   more details.
 
@@ -333,7 +333,7 @@ string**:
 
 The formula describing your computation can take several input
 arguments: variables and parameters. The input variables will generally
-corresponds to rows or columns of your data matrices, you need to be
+correspond to rows or columns of your data matrices, you need to be
 cautious with their dimensions.
 
 .. raw:: html
@@ -343,7 +343,7 @@ cautious with their dimensions.
 .. rubric:: Input matrix
    :name: input-matrix
 
-| You can use two type of input matrices with RKeOps:
+| You can use two types of input matrices with RKeOps:
 
 -  | ones whose rows (or columns) are indexed by \\(i=1,...,M\\) such as
      \\(\\mathbf X = [x\_{ik}]\_{M \\times D}\\)
@@ -453,7 +453,7 @@ or \\(j\\) with their corresponding inner dimensions:
     args = c("x = Vi(3)",      # vector indexed by i (of dim 3)
              "y = Vj(3)",      # vector indexed by j (of dim 3)
              "b = Vj(6)",      # vector indexed by j (of dim 6)
-             "s = Pm(1)")      # parameter (scalar) 
+             "s = Pm(1)")      # parameter (scalar)
 
 .. raw:: html
 
@@ -516,7 +516,7 @@ precision), to avoid useless recompilation.
    :name: run-computations
 
 We generate data with inner dimensions (number of columns) corresponding
-to each arguments expected by the operator ``op``. The function ``op``
+to each argument expected by the operator ``op``. The function ``op``
 takes in input a list of input arguments. If the list if named, ``op``
 checks the association between the supplied names and the names of the
 formula arguments. In this case only, it can also correct the order of
@@ -839,7 +839,7 @@ Thus, you can keep the default GPU device id = 0 in RKeOps.
 
 By default, RKeOps uses float 32bits precision for computations. Since R
 only considers 64bits floating point numbers, if you want to use float
-32bits, input data and output results will be casted befors and after
+32bits, input data and output results will be casted before and after
 computations respectively in your RKeOps operator. If your application
 requires to use float 64bits (double) precision, keep in mind that you
 will suffer a performance loss (potentially not an issue on high-end
@@ -857,7 +857,7 @@ KeOps to correct for 32bits floating point arithmetic errors.
 .. rubric:: Data storage orientation
    :name: data-storage-orientation
 
-| In R, matrices are stored using a column-major order, meaning that a
+| In R, matrices are stored using a column-major order, meaning that an
   \\(M \\times D\\) matrix is stored in memory as a succession of
   \\(D\\) vectors of length \\(M\\) representing each of its columns. A
   consequence is that two successive entries of a column are contiguous
@@ -869,7 +869,7 @@ KeOps to correct for 32bits floating point arithmetic errors.
 For RKeOps to be computationnally efficient, it is important that
 elements of the input matrices are contiguous along the inner dimensions
 \\(D\\) (or \\(D'\\)). Thus, it is recommended to use input matrices
-where the outer dimension (i.e. indexes \\(i\\) or \\(j\\)) are the
+where the outer dimensions (i.e. indexes \\(i\\) or \\(j\\)) are the
 columns, and inner dimensions the rows, e.g. transpose matrices
 \\(\\mathbf X^{t} = [x\_{ki}]\_{D \\times M}\\) or \\(\\mathbf Y^{t} =
 [y\_{k'i}]\_{D' \\times N}\\).
@@ -907,7 +907,7 @@ Example:
     X <- matrix(runif(nx*3), nrow=nx, ncol=3)
     # y_j = rows of the matrix Y
     Y <- matrix(runif(ny*3), nrow=ny, ncol=3)
-    # computing the result (here, by default `inner_dim=1` and columns corresponds
+    # computing the result (here, by default `inner_dim=1` and columns correspond
     # to the inner dimension)
     res <- op(list(X,Y))
 
@@ -919,7 +919,7 @@ Example:
     # y_j = columns of the matrix Y
     Y <- matrix(runif(ny*3), nrow=3, ncol=ny)
     # computing the result (we specify `inner_dim=0` to indicate that rows
-    # corresponds to the inner dimension)
+    # correspond to the inner dimension)
     res <- op(list(X,Y), inner_dim=0)
 
 .. raw:: html
diff --git a/rkeops/vignettes/using_rkeops.Rmd b/rkeops/vignettes/using_rkeops.Rmd
index 8c09ca77c..87807f7b3 100644
--- a/rkeops/vignettes/using_rkeops.Rmd
+++ b/rkeops/vignettes/using_rkeops.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "Using RKeOps"
-output: 
+output:
   rmarkdown::html_vignette:
     toc: true
   pdf_document:
@@ -50,8 +50,8 @@ install.packages("rkeops")
 
 * Install directly from Github (requires `git`)
 ```{r install_github, eval=FALSE}
-devtools::install_git("https://github.com/getkeops/keops", 
-                      subdir = "rkeops", 
+devtools::install_git("https://github.com/getkeops/keops",
+                      subdir = "rkeops",
                       args="--recursive")
 # not possible to use `devtools::intall_github()` because of the required submodule
 ```
@@ -93,7 +93,7 @@ formula = "Sum_Reduction(Exp(-s * SqNorm2(x - y)) * b, 0)"
 args = c("x = Vi(3)",      # vector indexed by i (of dim 3)
          "y = Vj(3)",      # vector indexed by j (of dim 3)
          "b = Vj(6)",      # vector indexed by j (of dim 6)
-         "s = Pm(1)")      # parameter (scalar) 
+         "s = Pm(1)")      # parameter (scalar)
 # compilation
 op <- keops_kernel(formula, args)
 # data and parameter values
@@ -123,7 +123,7 @@ To use RKeOps and define new operators, you need to write the corresponding _for
 
 2. a reduction over indexes $i=1,...,M$ (row-wise) or $j=1,...,N$ (column-wise) of the $M \times N$ matrix whose entries are defined by 1.
 
-RKeOps implements a wide range of mathematical operators and reduction: please refers to 
+RKeOps implements a wide range of mathematical operators and reduction: please refers to
 <https://www.kernel-operations.io/keops/api/math-operations.html> for more details.<br><br>
 
 **Example:** We want to implement the following kernel-based reduction (convolution with a Gaussian kernel): $$\sum_{j=1}^{N} \exp\Big(-\sigma || \mathbf x_i - \mathbf y_j ||_2^{\,2}\Big)\,\mathbf b_j$$
@@ -157,7 +157,7 @@ You can use two type of input matrices with RKeOps:<br>
 
 The dimensions over indexes $i$ or $j$ are called the **outer dimensions** (i.e. $M$ or $N$). The other dimensions (i.e. $D$ or $D'$) are called the **inner dimensions**. These terms refer to the contiguity of the data in memory:<br>
 
-* **Outer dimensions** $M$ and $N$ (over indexes $i$ and $j$ respectively) can be **very large**, even to large for GPU memory.<br>
+* **Outer dimensions** $M$ and $N$ (over indexes $i$ and $j$ respectively) can be **very large**, even too large for GPU memory.<br>
 
 * **Inner dimensions** $D$ and $D'$ should be **small** enough to fit in GPU memory, in particular to ensure data colocality and avoid useless memory transfers. Corresponding columns (or rows) should be contiguous in memory (this point is handled for you in RKeOps, see this [section](#data-storage-orientation)).<br>
 
@@ -167,7 +167,7 @@ The dimensions over indexes $i$ or $j$ are called the **outer dimensions** (i.e.
 
 ### Notations
 
-Input arguments of the formula are defined by using keywords, they can be of different types: 
+Input arguments of the formula are defined by using keywords, they can be of different types:
 
 | keyword | meaning                 |
 |:--------|:------------------------|
@@ -198,7 +198,7 @@ of the `X`$^\text{th}$ variable in the formula.<br><br>
 args = c("x = Vi(3)",      # vector indexed by i (of dim 3)
          "y = Vj(3)",      # vector indexed by j (of dim 3)
          "b = Vj(6)",      # vector indexed by j (of dim 6)
-         "s = Pm(1)")      # parameter (scalar) 
+         "s = Pm(1)")      # parameter (scalar)
 ```
 
 
@@ -326,11 +326,11 @@ set_rkeops_option("precision", "float")    # float 32bits (default)
 set_rkeops_option("precision", "double")   # float 64bits
 ```
 
-You can directly change the precision used in compiled operators with the 
-functions `compile4float32` and `compile4float64` which respectively 
+You can directly change the precision used in compiled operators with the
+functions `compile4float32` and `compile4float64` which respectively
 enable float 32bits precision (default) and float 64bits (or double) precision.
 
-* other compile options (including boolean value to enable verbosity or to 
+* other compile options (including boolean value to enable verbosity or to
 add debugging flag), see `?compile_options`
 
 
@@ -341,7 +341,7 @@ By default, RKeOps runs computations on CPU (even for GPU-compiled operators). T
 use_gpu()
 # see `?runtime_options` for a more advanced use of GPU inside RKeOps
 ```
-You can also specify the GPU id that you want to use, e.g. `use_gpu(device=0)` 
+You can also specify the GPU id that you want to use, e.g. `use_gpu(device=0)`
 to use GPU 0 (default) for instance.
 
 To deactivate GPU computations, you can run `use_cpu()`.
@@ -407,4 +407,3 @@ res <- op(list(X,Y), inner_dim=0)
 The compilation of new operators produces shared library (or share object `.so`) files stored in a `build` sub-directory of the package installation directory, to be reused and avoid recompilation of already defined operators.
 
 You can check where your compiled operators are stored by running `get_build_dir()`. To clean RKeOps install and remove all shared library files, you can run `clean_rkeops()`.
-