Bad performance of views/slicing #2734
A little bit more detailed:

```cpp
#include <chrono>
#include <iostream>

#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
    std::chrono::duration<double, std::milli> total_ms = total;
    return total_ms.count() / static_cast<double>(num_repeats);
}

int main() {
    size_t num_repeats = 100;
    xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
    xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
    xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

    // case 1: full tensor
    auto started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        c = a + b;
    auto finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 1: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 2a: view of tensor and expression with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::view(c, xt::all()) = xt::view(a + b, xt::all());
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 2a: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 2b: view of only tensor with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::view(c, xt::all()) = a + b;
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 2b: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 2c: view of only expression with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        c = xt::view(a + b, xt::all());
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 2c: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 3a: view of tensor and expression with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::view(c, xt::range(0, c.size())) =
            xt::view(a + b, xt::range(0, c.size()));
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 3a: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 3b: view of only tensor with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::view(c, xt::range(0, c.size())) = a + b;
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 3b: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 3c: view of only expression with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        c = xt::view(a + b, xt::range(0, c.size()));
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 3c: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    return 0;
}
```

Result:
Accessing the expression through a view seems to be the expensive part.
This is actually a compiler issue. I posted this in the Gitter channel as well. If Intel can make sense of the view objects, then the other implementations should be able to as well. If you're using MSVC, Clang, or GCC that doesn't help much, but it's good to know the optimizations exist.

MSVC with /O2 /Ob2 /arch:avx512, using XSIMD:
case 1: 38.3063ms

Intel 2023 DPC++/C++ Optimizing Compiler with equivalent flags:
case 1: 38.7353ms
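For reference, the flag sets mentioned above might look roughly like this on the command line. The file and include paths are placeholders, and the Intel options are assumed equivalents rather than taken from the comment, so check your compiler's documentation before relying on them:

```shell
# MSVC (flags quoted from the comment above; paths are placeholders)
cl /O2 /Ob2 /arch:AVX512 /DXTENSOR_USE_XSIMD /I path\to\xtensor\include bench.cpp

# Intel DPC++/C++ (assumed equivalent options; verify against the icx docs)
icx -O2 -xCORE-AVX512 -DXTENSOR_USE_XSIMD -I path/to/xtensor/include bench.cpp
```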
@tdegeus @razorx89 I have done more digging and the root cause of the bad performance is actually here:

xtensor/include/xtensor/xview.hpp, line 226 (commit 8c0a484)

Thus, all views of … This also eliminates the possibility of using SIMD acceleration. I don't see any reason why …
Hi,
I am still getting used to the library, but was able to isolate an unexpected performance hit. I want to update just a subregion of a pre-allocated 1D tensor. Maybe there is a better pattern to achieve the same result?
Result:
I understand that introducing views may have a performance cost, but for what is essentially the same task (same memory layout, contiguous memory, same range, step size of one) it is quite a big hit. Is this expected behavior, or am I doing something wrong?
Thanks.
Versions: