Cem Bassoy edited this page Jul 20, 2018 · 4 revisions

Tensor Project Proposal

It would be nice if every kind of numeric software could be written in C++ without loss of efficiency, but unless something can be found that achieves this without compromising the C++ type system it may be preferable to rely on Fortran, assembler or architecture-specific extensions (Bjarne Stroustrup).

Rationale

Tensors provide a natural and compact representation for massive multidimensional data of high dimensionality, which occurs in disciplines like computational neuroscience, neuroinformatics, pattern/image recognition, signal processing and machine learning [14][15]. Common numerical methods are, for instance, the higher-order singular value decomposition (HOSVD) or the higher-order orthogonal iteration (HOOI), with no restrictions on the number of tensor dimensions [18]. Our tensor project proposal is based on the foundations of Boost’s uBLAS implementation. Extending uBLAS with tensors shall therefore primarily focus on scientific computing with basic multilinear algebra operations on tensors, matrices and vectors. The extension shall primarily support dense tensors that can be projected onto subtensors with ranges or slices. We want to provide expression templates for basic tensor operations using static polymorphism with generic lambdas. The library shall support all basic multilinear algebra operations, such as modal tensor contractions and tensor transpositions where the mode is a runtime variable. Tensor objects shall be able to communicate with Boost’s matrix and vector objects through expression templates and free functions. Repeating the primary design goals of uBLAS, the tensor library shall be efficient with almost no abstraction penalties, functional, compatible, and shall provide a convenient mathematical notation using new language features of the C++14 and C++17 standards.

Notation

In the realm of numerical multilinear algebra, a tensor (or hypermatrix) refers to a multiway array. The number of dimensions of a tensor is referred to as its order. Therefore, a 0th-order tensor is a scalar, a 1st-order tensor a vector and a 2nd-order tensor a matrix. An element of a pth-order tensor is accessible with a multi-index of length p. Basic multilinear algebra operations according to [10][11] are:

  • reshaping a tensor into a matrix/vector depending on the mode k (mode-k unfolding)
  • tensor transposition depending on a permutation tuple
  • tensor-tensor, tensor-matrix and tensor-vector multiplication depending on the mode k
  • tensor times multiple vectors and matrices
  • inner and outer product of tensors
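To make the index conventions concrete, the following minimal sketch (illustrative only, not part of the proposed interface) shows how an element of a 3rd-order tensor in first-order (column-major) storage maps to a linear offset, and where it lands under a mode-1 unfolding:

```cpp
#include <array>
#include <cstddef>

// Linear offset of multi-index (i,j,k) in a 3rd-order tensor with extents
// {n0,n1,n2}, stored contiguously in first-order (column-major) layout.
std::size_t offset(std::array<std::size_t,3> const& idx,
                   std::array<std::size_t,3> const& n)
{
    return idx[0] + idx[1]*n[0] + idx[2]*n[0]*n[1];
}

// Mode-1 unfolding: element (i,j,k) appears at row i, column j + k*n1
// of the resulting n0 x (n1*n2) matrix.
std::array<std::size_t,2> unfold_mode1(std::array<std::size_t,3> const& idx,
                                       std::array<std::size_t,3> const& n)
{
    return { idx[0], idx[1] + idx[2]*n[1] };
}
```

For extents (2,3,4), for instance, element (1,2,0) sits at linear offset 1 + 2·2 = 5, and element (1,2,3) appears in column 2 + 3·3 = 11 of the 2×12 mode-1 unfolding.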

Primary Goals regarding Functionality

  1. Make use of my findings described in [5][12]
  2. Description of concepts for tensors, tensor expressions and multidimensional iterators
  3. Implement generic tensor types, expression templates and entrywise tensor operations
  4. Implement tensor multiplication algorithms with variable mode k

Primary Goals regarding Programming

  1. Make use of the C++14 and C++17 standard, partly shown here
  2. Integration of generic lambda for expression templates
  3. Utilization of auto and decltype(auto) type deduction possibilities
  4. Utilization of variadic templates for convenient initialization of tensor parameters

Preliminaries and Limitations

Following the premises of the existing uBLAS implementation, we also want to store tensor elements within a contiguously allocated memory space. We limit ourselves to the implementation of dense tensors and subtensors. Tuning will be based on dense tensors and performed without machine-dependent assembly instructions or intrinsics. All types shall be in the namespace boost::numeric. All extensions and descriptions are based on the terminology and definitions that are given here. Following these definitions, our proposed tensor template types will follow uBLAS’s generalization from vectors to matrices.

Extending uBLAS for Tensors with Container Concepts

We want to extend uBLAS with the following container concepts, which are based on existing ones except for the multidimensional iterator.

  • A Dimension-specific Iterator is an iterator of a Tensor and allows iterating over one dimension of a Tensor.
  • A Multidimensional Iterator shall encapsulate and create dimension-specific iterators. This might be a new concept in Boost.
  • A Tensor Expression is an expression evaluatable to a Tensor. It shall provide element access with multi-indices and assignments. The definition of a Tensor Expression shall be based on the definition provided here, with templated iterator types that depend on the specified rank. All valid mutable expressions shall therefore have range functions that also depend on the specified rank. It shall also have access and assignment functions with the corresponding types.
  • A Tensor describes common aspects of dense, packed and sparse tensors. It shall be default constructible and also be a Tensor Expression, similarly defined to the container concepts provided here. In addition to the expressions defined in Tensor Expression, it shall have a sizing constructor, insert_element(), erase_element(), clear(), resize() and storage functions.

Extending uBLAS for Tensors with Generic Class Types

We want to extend uBLAS with the following generic class types.

  • The template class tensor<T,S,A,P> is a base container for dense tensors, where T is the element type, S is any non-hierarchical storage format given by a permutation tuple, A is the storage or allocator type and P is the order of the tensor. It shall provide overloaded operators for element access with multi-indices and assignment operators including tensor expressions as their arguments. The tensor shall be able to transform itself into a uBLAS vector or matrix to support the unfolding operation. It shall inherit from tensor_expression<E> for the CRTP.
  • A templated class such as multi_iterator<T> shall model Multidimensional Iterator, where T models Tensor. The user shall be able to create multidimensional and dimension-specific iterators from tensors, tensor proxies and tensor expressions. Possible implementations of both templated types are given in [5][16].
  • A templated class such as tensor_expression<E> shall model Tensor Expression to support lazy evaluation of tensor operations and to compute tensor expressions without temporary objects, where E is the type of the tensor expression. The user shall be able to force the expression to be evaluated.
  • A template class such as tensor_lambda<E> shall model Tensor Expression and is a proxy class for generic lambda functions for unary and binary tensor operations. Its base type shall be tensor_expression<tensor_lambda<E>>.
  • [optional] Tensor proxies such as tensor_view<T>, tensor_slice<T> and tensor_fiber<T> reference a subdomain of a tensor, where T models Tensor. They can be generated by projection functions or overloaded member functions of T using uBLAS’ range and slice types. A tensor proxy can be mutable or constant, exposing almost the same interface as its referenced tensor type T. They shall inherit from tensor_expression<E> for the CRTP.
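The static polymorphism with the CRTP underlying these class types can be sketched minimally as follows. The sketch uses hypothetical, simplified names and a fixed double element type; it is not the proposed uBLAS interface:

```cpp
#include <cstddef>

// CRTP base: E is the derived expression type. Element access is forwarded
// to the derived class without virtual dispatch (static polymorphism).
template<class E>
struct tensor_expression {
    E const& self() const { return static_cast<E const&>(*this); }
    double operator()(std::size_t i) const { return self()(i); }
};

// A toy leaf expression wrapping raw contiguous storage.
struct leaf : tensor_expression<leaf> {
    double const* data;
    explicit leaf(double const* d) : data(d) {}
    double operator()(std::size_t i) const { return data[i]; }
};

// A toy binary expression representing lazy entrywise addition.
template<class L, class R>
struct add_expr : tensor_expression<add_expr<L,R>> {
    L const& l; R const& r;
    add_expr(L const& l, R const& r) : l(l), r(r) {}
    double operator()(std::size_t i) const { return l(i) + r(i); }
};

// Overloaded operator+ builds the expression tree; no temporary is created,
// and evaluation happens only on element access.
template<class L, class R>
add_expr<L,R> operator+(tensor_expression<L> const& l,
                        tensor_expression<R> const& r)
{ return add_expr<L,R>(l.self(), r.self()); }
```

Accessing an element of the returned `add_expr` pulls the operands through their `operator()` on demand, which is the mechanism the proposed tensor_expression<E> relies on.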

Including new tensor template functions:

  • Extension of uBLAS with the following generic function types by operator overloading:

    • Assignment operators for tensors such as
      tensor::operator=(tensor_expression const&).
    • Unary operators for tensors such as
      tensor_expression operator-(tensor_expression const&).
    • Binary operators for tensors such as
      tensor_expression operator+(tensor_expression const&, tensor_expression const&).
    • Multiplication of tensors with a scalar such as
      tensor_expression operator*(scalar_expression const&, tensor_expression const&).
  • Extension of uBLAS with the following generic function types as free functions:

    • Tensor-vector- and vector-tensor multiplication
      tensor_expression prod(mode const&, tensor_expression const&, vector_expression const&)
      tensor_expression prod(mode const&, vector_expression const&, tensor_expression const&)
    • Tensor-matrix- and matrix-tensor multiplication
      tensor_expression prod(mode const&, tensor_expression const&, matrix_expression const&)
      tensor_expression prod(mode const&, matrix_expression const&, tensor_expression const&)
    • Tensor-tensor multiplication with multiple multiplication modes
      tensor_expression prod(modes const&, tensor_expression const&, tensor_expression const&)
    • Inner and outer product of a tensor
      tensor_expression inner_prod(tensor_expression const&, tensor_expression const&)
      tensor_expression outer_prod(tensor_expression const&, tensor_expression const&)
    • Transpose of a tensor
      tensor_expression trans(tensor_expression const&)
    • Vectorization and matricization of a tensor
      vector_type vectorize(mode const&, tensor_expression const&)
      matrix_type matricize(mode const&, tensor_expression const&)

Many of the numerical multilinear algebra methods and algorithms contract tensors with matrices or vectors within a loop according to an induction variable, which is why we want the contraction modes to be runtime-variable for convenience.

  • Interaction with boost::numeric::matrix and boost::numeric::vector: Although a tensor is a generalization of a matrix or vector, we do not intend to replace or change any of the existing template classes. We only extend the library with the previously introduced concepts, template classes and overloaded functions. Moreover, tensor and uBLAS matrix or vector objects will only communicate through the contraction operations and through the vectorize or matricize operations. If a tensor is unfolded, the user is able to work with the vectorized or matricized tensor as a Boost vector or matrix type, respectively.

Implementation Detail:

  • Implementation of entry-wise tensor operations for tensor proxies. Assuming that dense tensors are contiguously stored in memory, entry-wise tensor operations for the template class tensor, such as tensor times scalar, can be implemented just like entry-wise matrix operations. However, operations on the tensor proxies tensor_view<T>, tensor_slice<T> and tensor_fiber<T> require a different algorithm than dense tensors, as only a subdomain of a full dense tensor is selected. Moreover, the projection depends on the state of the range or slice objects for each dimension. A recursive implementation is a convenient approach, where the maximum recursion depth equals the order of the tensor. Depending on the implementation and the selection of template parameters of the tensor template class, index computations might greatly affect the runtime. The same applies to the implementation of the tensor multiplication algorithms. A discussion of different implementation methods for entry-wise tensor operations is given in [12].
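A minimal sketch of this recursive approach, assuming first-order storage and a per-dimension stride vector so that the same routine covers both a full tensor and a strided subtensor (names and signature are illustrative):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Recursive entry-wise apply over a p-th order dense (sub)tensor.
// 'extents' are the dimensions of the (sub)tensor; 'strides' map each
// dimension to its step in the underlying memory, which is how a strided
// subtensor is handled. Recursion depth equals the tensor order: 'r' is
// the current (highest remaining) dimension, so the initial call passes
// extents.size() - 1.
void apply_recursive(double* data,
                     std::vector<std::size_t> const& extents,
                     std::vector<std::size_t> const& strides,
                     std::size_t r,
                     std::function<double(double)> const& f)
{
    if (r == 0) {
        // innermost dimension: a single strided loop
        for (std::size_t i = 0; i < extents[0]; ++i)
            data[i*strides[0]] = f(data[i*strides[0]]);
        return;
    }
    // outer dimensions: recurse into each slice along dimension r
    for (std::size_t i = 0; i < extents[r]; ++i)
        apply_recursive(data + i*strides[r], extents, strides, r-1, f);
}
```

With extents {2,3} and strides {1,2} the routine visits all six elements of a 2×3 first-order tensor; shrinking the extents while keeping the strides restricts the same code to a subtensor.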

  • Implementation of tensor contractions with variable contraction modes

    • Similar to the implementation approach for tensor proxies, tensor contraction algorithms with variable contraction modes will be implemented with recursion. While the storage format does not affect the performance of entry-wise tensor operations, it does affect the runtime of tensor contraction operations such as the mode-k tensor-vector multiplication, where k is the variable contraction mode. If the tensor is stored along the first order, any mode-k multiplication except for the first mode will result in uncoalesced memory accesses. It is therefore desirable to design tensor contraction algorithms that provide coalesced memory access independent of the storage format. The same holds for the tensor transposition.
    • A second approach is to vectorize or matricize the tensors (tensor unfolding) according to the mode of the multiplication and to perform the multiplication with matrices, which requires additional memory space for the matrix. The arithmetic intensity of the tensor unfolding is comparable to that of the tensor transposition and is therefore bandwidth-limited. We refer to [13] for a detailed discussion. It should be noted that an efficient multiplication of tensors is a non-trivial task and a recent research topic. We intend to implement the in-place approaches with coalesced memory access independent of the storage format and contraction mode for the tensor-vector and tensor-matrix multiplication. Depending on the schedule, we might try to provide an efficient implementation of the tensor-tensor product based on the findings in [7][8]. We might also try to implement out-of-place approaches.
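As a baseline for the in-place approach, a naive mode-k tensor-vector multiplication with a runtime-variable mode can be sketched as follows. This is illustrative only: it streams through the tensor in storage order, decoding each multi-index on the fly, and makes no claim about the optimized, cache-aware algorithms the proposal targets:

```cpp
#include <cstddef>
#include <vector>

// Naive mode-k tensor-vector multiplication c = A x_k b for a dense tensor A
// in first-order storage with extents n. The result tensor c has dimension k
// removed; its remaining dimensions keep their original order and layout.
std::vector<double> ttv(std::size_t k,
                        std::vector<double> const& A,
                        std::vector<std::size_t> const& n,
                        std::vector<double> const& b)
{
    // size of the result tensor (dimension k removed)
    std::size_t csize = 1;
    for (std::size_t r = 0; r < n.size(); ++r)
        if (r != k) csize *= n[r];
    std::vector<double> c(csize, 0.0);

    // iterate over all elements of A via the linear index
    for (std::size_t j = 0; j < A.size(); ++j) {
        // decode the multi-index of j and accumulate the result offset
        std::size_t rem = j, coff = 0, cw = 1, ik = 0;
        for (std::size_t r = 0; r < n.size(); ++r) {
            std::size_t ir = rem % n[r];
            rem /= n[r];
            if (r == k) ik = ir;            // index along the contracted mode
            else { coff += ir*cw; cw *= n[r]; } // index within the result
        }
        c[coff] += A[j] * b[ik];
    }
    return c;
}
```

For a 2×2 tensor (a matrix in column-major order), mode-1 contraction with the all-ones vector sums the columns, while mode-0 contraction sums the rows, matching the usual matrix-vector products.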
  • Expressing tensor contractions. Tensor contraction, i.e. the tensor-tensor multiplication, requires the definition of the contraction dimensions. We have identified different approaches to specify the contraction.

    • Operator overloading with constant contraction string objects (compile-time indexing)
      e.g. C["ijk"] = A["kab"] * B["akb"]; // see [4]
    • Free functions with explicit indexing with index objects (compile-time indexing): e.g. einsum<Index<k,a,b>,Index<a,k,b>,Index<i,j,k>>(A,B,C); // see [17]
    • Member functions with explicit indexing with a vector of integers (runtime indexing): e.g. auto C = A.times_tensor(B,2,vector{1,2,3},vector{2,1,3}); // see [5] or [3]

Deciding on the template parameters of the template class tensor

The selection of the template parameters of the template class tensor will affect the compile time and the quality of the generated assembly code, and thus the execution time of tensor expressions. Setting the dimensions as compile-time parameters using variadic templates enables the compiler to appropriately use its cost model to unroll loops and generate vector instructions for all tensor operations. Setting the order as a compile-time variable will allow functions to be inlined appropriately. For multilinear algebra algorithms such as the higher-order singular value decomposition, runtime-variable dimensions are the preferred option.
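The two parameterizations discussed above can be contrasted with a hypothetical sketch (not the proposed template signature):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Compile-time order P: extents live in a std::array, so the number of loop
// nests is fixed per instantiation and the compiler can unroll and vectorize.
template<class T, std::size_t P>
struct static_order_tensor {
    std::array<std::size_t, P> extents;
    std::vector<T> data;
};

// Runtime-variable order: extents live in a std::vector. This is more
// flexible for algorithms like the HOSVD that loop over all modes, at the
// cost of optimization opportunities.
template<class T>
struct dynamic_order_tensor {
    std::vector<std::size_t> extents;
    std::vector<T> data;
};
```

In both variants the element data stays in one contiguous buffer; only the representation of the shape, and hence what the compiler can prove about the loops, differs.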

Conceptualizing Multidimensional Iterator

Conceptualizing the Multidimensional Iterator and multidimensional ranges can be cumbersome. It is questionable whether it is efficient to iterate over the multi-index space with multidimensional iterators. It might be advantageous to use a factory class to generate dimension-specific iterators and to have a random-access iterator for each recursion level of the tensor algorithm. We will follow the design choices that have been made in [16] and [5].
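A dimension-specific iterator as described in the concepts section can be sketched as a strided iterator; the class name and the fixed double element type are illustrative simplifications:

```cpp
#include <cstddef>

// Minimal dimension-specific iterator: walks one mode of a dense tensor by
// stepping through memory with that mode's stride. A multidimensional
// iterator (or factory) would create one of these per dimension.
class dim_iterator {
    double* ptr;
    std::size_t stride;
public:
    dim_iterator(double* p, std::size_t w) : ptr(p), stride(w) {}
    double& operator*() const { return *ptr; }
    dim_iterator& operator++() { ptr += stride; return *this; }
    bool operator!=(dim_iterator const& o) const { return ptr != o.ptr; }
};
```

For a 2×3 tensor in first-order layout, iterating with stride 2 from the first element visits exactly the first row (the mode-2 fiber at index 0).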

Verification

Testing of C++-specific functionality will be performed with the Boost Test Library. We might also consider validating the tensor multiplication results against the output of the MATLAB toolbox presented in [6].

Related Work

We have identified the following libraries that are publicly available on GitHub.

The libraries are designed for different application areas or computing systems and therefore exhibit different features. We plan to investigate them and to identify similarities within the libraries. Important aspects for numerical multilinear algebra will be integrated into our work.

Future Work

In the future we plan to accelerate tensor operations for data- and thread-parallel execution. Another possible feature of the library would be support for sparse tensors and their related operations.

Literature

[1] Garcia, Ronald, and Andrew Lumsdaine. "MultiArray: a C++ library for generic programming with arrays." Software: Practice and Experience 35.2 (2005): 159-188.
[2] Veldhuizen, Todd L. "Blitz++: The library that thinks it is a compiler." Advances in Software tools for scientific computing. Springer, Berlin, Heidelberg, 2000. 57-87.
[3] Abadi, Martín, et al. "TensorFlow: A System for Large-Scale Machine Learning." OSDI. Vol. 16. 2016.
[4] Solomonik, Edgar, et al. "Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions." Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, 2013.
[5] Bassoy, C. "TLib: A Flexible C++ Tensor Framework for Numerical Tensor Calculus." arXiv preprint arXiv:1711.10912 (2017).
[6] Bader, Brett W., and Tamara G. Kolda. "Algorithm 862: MATLAB tensor classes for fast algorithm prototyping." ACM Transactions on Mathematical Software (TOMS) 32.4 (2006): 635-653.
[7] Springer, Paul, and Paolo Bientinesi. "Design of a High-Performance GEMM-like Tensor–Tensor Multiplication." ACM Transactions on Mathematical Software (TOMS) 44.3 (2018): 28.
[8] Matthews, Devin A. "High-performance tensor contraction without BLAS." arXiv preprint arXiv:1607.00291 (2016).
[9] Matthews, Devin A. "High-performance tensor contraction without transposition." SIAM Journal on Scientific Computing 40.1 (2018): C1-C24.
[10] Cichocki, Andrzej, et al. "Tensor decompositions for signal processing applications: From two-way to multiway component analysis." IEEE Signal Processing Magazine 32.2 (2015): 145-163.
[11] Wei, Yimin, and Weiyang Ding. Theory and computation of tensors: multi-dimensional arrays. Academic Press, 2016.
[12] Bassoy, C., et al. "Fast Higher-Order Functions for Tensor Calculus with Tensors and Subtensors." To appear in Procedia Computer Science (2018).
[13] Springer, Paul, Jeff R. Hammond, and Paolo Bientinesi. "TTC: A high-performance compiler for tensor transpositions." ACM Transactions on Mathematical Software (TOMS) 44.2 (2017): 15.
[14] Cichocki, Andrzej. "Tensor Networks for Dimensionality Reduction, Big Data and Deep Learning." Advances in Data Analysis with Computational Intelligence Methods (2018).
[15] Cichocki, Andrzej, et al. "Tensor decompositions for signal processing applications: From two-way to multiway component analysis." IEEE Signal Processing Magazine 32.2 (2015): 145-163.
[16] Aragón, Alejandro M. "A C++ 11 implementation of arbitrary-rank tensors for high-performance computing." Computer Physics Communications 185.6 (2014): 1681-1696.
[17] Poya, Roman, Antonio J. Gil, and Rogelio Ortigosa. "A high performance data parallel tensor contraction framework: Application to coupled electro-mechanics." Computer Physics Communications 216 (2017): 35-52.
[18] Kolda, Tamara G., and Brett W. Bader. "Tensor decompositions and applications." SIAM Review 51.3 (2009): 455-500.