Bug: CLBlast: Run-time error: -55 (500 is not divisible by 8) #340

Closed
quanghuy1258 opened this issue Nov 30, 2018 · 3 comments

@quanghuy1258

When I run Xgemm, I get this error:
CLBlast: Run-time error: -55 (500 is not divisible by 8)
Here is the source code I used to reproduce it:

#include "common.h"

#include "gtest/gtest.h"

#include "cl.hpp"

#include <clblast.h>

TEST(TestCLBlast, Bug_kInvalidLocalThreadsDim) {
  // Use command clinfo to select the most suitable platform and device
  int platform_id = 1;
  int device_id = 0;

  // Initializes the OpenCL platform
  auto platforms = std::vector<cl::Platform>();
  cl::Platform::get(&platforms);
  ASSERT_FALSE(platforms.size() == 0 || platform_id >= platforms.size());
  auto platform = platforms[platform_id];

  // Initializes the OpenCL device
  auto devices = std::vector<cl::Device>();
  platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
  ASSERT_FALSE(devices.size() == 0 || device_id >= devices.size());
  auto device = devices[device_id];

  // Creates the OpenCL context, queue, and an event
  auto device_as_vector = std::vector<cl::Device>{device};
  auto context = cl::Context(device_as_vector);
  auto queue = cl::CommandQueue(context, device);
  auto event = cl_event{nullptr};

  int M = 4000;
  int K = 6000;
  int N = 8000;
  std::unique_ptr<float[]> A(new float[M * K]);
  std::unique_ptr<float[]> B(new float[K * N]);
  std::unique_ptr<float[]> C(new float[M * N]);
  int count = 1;
  for (int i = 0; i < M * K; i++)
    A[i] = 1.0 / (count++);
  for (int i = 0; i < K * N; i++)
    B[i] = 1.0 / (count++);
  for (int i = 0; i < M * N; i++)
    C[i] = 0.0;

  auto device_a =
      cl::Buffer(context, CL_MEM_READ_WRITE, (M * K) * sizeof(float));
  auto device_b =
      cl::Buffer(context, CL_MEM_READ_WRITE, (K * N) * sizeof(float));
  auto device_c =
      cl::Buffer(context, CL_MEM_READ_WRITE, (M * N) * sizeof(float));
  queue.enqueueWriteBuffer(device_a, CL_TRUE, 0, (M * K) * sizeof(float),
                           A.get());
  queue.enqueueWriteBuffer(device_b, CL_TRUE, 0, (K * N) * sizeof(float),
                           B.get());
  queue.enqueueWriteBuffer(device_c, CL_TRUE, 0, (M * N) * sizeof(float),
                           C.get());

  auto queue_plain = queue();
  auto status = clblast::Gemm<float>(
      clblast::Layout::kColMajor, clblast::Transpose::kNo,
      clblast::Transpose::kNo, M, N, K, 1.0, device_a(), 0, M, device_b(), 0, K,
      0.0, device_c(), 0, M, &queue_plain, &event);

  ASSERT_TRUE(status == clblast::StatusCode::kSuccess);
  clWaitForEvents(1, &event);
  clReleaseEvent(event);
  queue.enqueueReadBuffer(device_c, CL_TRUE, 0, (M * N) * sizeof(float),
                          C.get());
  clblast::ClearCache();

  for (int i = 0; i < M * N; i++) {
    int r = i % M;
    int c = i / M;
    float val = 0.0;
    for (int j = 0; j < K; j++) {
      val += A[j * M + r] * B[c * K + j];
    }
    ASSERT_TRUE(std::abs(C[i] - val) < 1e-10);
  }
}

Here, common.h only includes C/C++ standard library headers, and gtest/gtest.h is the Google Test library.
Here are the parameters I retrieved:

VWM : 2
STRM : 0
SB : 0
VWN : 2
SA : 0
KWI : 1
NDIMB : 8
MWG : 32
KWG : 1
KREG : 2
GEMMK : 1
MDIMA : 4
MDIMC : 4
STRN : 0
NDIMC : 8
NWG : 64

I think this bug comes from the file CLBlast/src/routines/level3/xgemm.cpp:

  CalculateInternalDimensions(m, n, k, db_["MWG"], db_["NWG"], db_["KWG"] * db_["KREG"],
                              a_one_i, a_two_i, b_one_i, b_two_i, c_one_i, c_two_i,
                              db_["GEMMK"]);

and

  const auto global = std::vector<size_t>{
    (c_one_i * db_["MDIMC"]) / db_["MWG"],
    (c_two_i * db_["NDIMC"]) / db_["NWG"]
  };

because c_one_i and c_two_i are rotated in CalculateInternalDimensions, but nothing accounts for that rotation when global is computed. Please check the source code here. Thank you.

@quanghuy1258
Author

I think this bug can be fixed like this:

  const auto global = std::vector<size_t>{
    (c_one_i * db_["MDIMC"]) / ((c_one_i == m_ceiled) ? db_["MWG"] : db_["NWG"]),
    (c_two_i * db_["NDIMC"]) / ((c_two_i == n_ceiled) ? db_["NWG"] : db_["MWG"])
  };

It works for me.
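
The arithmetic behind the error message can be checked by hand with the parameters listed above. The following is only a minimal sketch, not CLBlast code: it assumes that with GEMMK = 1 the C dimensions are rotated (c_one_i holds the ceiled N, c_two_i the ceiled M) and that the local work size is {MDIMC, NDIMC}. The names Params, CeilTo, GlobalBefore, GlobalAfter and Divisible are hypothetical helpers made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Parameter values copied from the report above (MWG, NWG, MDIMC, NDIMC).
struct Params {
  std::size_t MWG = 32, NWG = 64, MDIMC = 4, NDIMC = 8;
};

// Rounds x up to the next multiple of y.
std::size_t CeilTo(std::size_t x, std::size_t y) { return ((x + y - 1) / y) * y; }

// Global sizes with the divisors fixed to MWG and NWG (the reported expression).
std::vector<std::size_t> GlobalBefore(std::size_t c_one_i, std::size_t c_two_i,
                                      const Params& p) {
  return {(c_one_i * p.MDIMC) / p.MWG, (c_two_i * p.NDIMC) / p.NWG};
}

// Global sizes with the divisor chosen according to which dimension each
// c_*_i actually holds (the fix proposed in this comment).
std::vector<std::size_t> GlobalAfter(std::size_t c_one_i, std::size_t c_two_i,
                                     std::size_t m_ceiled, std::size_t n_ceiled,
                                     const Params& p) {
  return {(c_one_i * p.MDIMC) / ((c_one_i == m_ceiled) ? p.MWG : p.NWG),
          (c_two_i * p.NDIMC) / ((c_two_i == n_ceiled) ? p.NWG : p.MWG)};
}

// True when every global size is a multiple of the matching local size.
bool Divisible(const std::vector<std::size_t>& global,
               const std::vector<std::size_t>& local) {
  for (std::size_t i = 0; i < global.size(); ++i) {
    if (global[i] % local[i] != 0) return false;
  }
  return true;
}
```

Under those assumptions, M = 4000 and N = 8000 give c_one_i = 8000 and c_two_i = 4000. GlobalBefore returns {1000, 500}, and 500 is not a multiple of NDIMC = 8, which matches the error text. GlobalAfter returns {500, 1000}, which is divisible by the local size {4, 8}.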

@CNugteren
Owner

CNugteren commented Nov 30, 2018

Thanks a lot for your careful observation, code, and suggested solution. With your example I managed to reproduce it, and I fixed it more or less as you suggested:
#341
(this is now merged into master)

Could you have a look and verify that it indeed fixes it for you as well? Thanks!

@CNugteren
Owner

Closing this as I believe it is solved; feel free to reopen if you think otherwise.
