
Increased number of timeouts after DBConnection change in Xandra.Connection #356

Closed
harunzengin opened this issue Feb 9, 2024 · 13 comments

@harunzengin
Contributor

After trying out 0.18.1, we noticed an increased number of timeouts, both in our tests (in CI and locally) and in our staging environment.

It is unclear to me why this is the case. I suspected it was somehow a race condition with the stream IDs, since we create them with MapSet.new(1..5000) and fetch them with Enum.at(stream_ids, 1), meaning we always get the stream IDs in the same deterministic order. I tried fetching random IDs instead, but the timeouts are still there, so that is ruled out.

@whatyouhide
Owner

I tried to fetch random IDs

This shouldn't matter, as the IDs are just IDs and you could always be using the same ones and be fine.

Do you ever go over 5000 concurrent requests, as far as you know? Because I’m looking at the code (which, to be clear, I wrote...) and I don't see any handling of reaching the maximum number of concurrent requests. Instead, I see my silly self just doing

    {stream_id, data} =
      get_and_update_in(data.free_stream_ids, fn ids ->
        # Enum.at/2 returns nil once the set of free stream IDs is empty
        id = Enum.at(ids, 0)
        {id, MapSet.delete(ids, id)}
      end)

which, even worse, would result in storing in_flight_requests[nil] 🙈
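
For illustration, a minimal sketch of what guarding that checkout against an exhausted ID set could look like (hypothetical, not the actual Xandra code; it reuses the field names from the snippet above):

    # Hypothetical sketch: bail out when the set of free stream IDs is empty,
    # instead of letting Enum.at/2 hand back nil.
    case Enum.at(data.free_stream_ids, 0) do
      nil ->
        # no free stream IDs left: surface an error instead of
        # registering the request under in_flight_requests[nil]
        {:error, :no_free_stream_ids}

      stream_id ->
        data = update_in(data.free_stream_ids, &MapSet.delete(&1, stream_id))
        {:ok, stream_id, data}
    end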

@harunzengin
Contributor Author

Do you ever get over the 5000 concurrent requests, as far as you know?

Yeah, this stacktrace (#354 (comment)) is exactly what happens when we run out of concurrent request slots.

@peixian

peixian commented Feb 23, 2024

@harunzengin we saw this as well when we canary deployed 0.18.0 a few months ago. We're still looking into the root cause though.

@harunzengin
Contributor Author

harunzengin commented Feb 23, 2024

@harunzengin we saw this as well when we canary deployed 0.18.0 a few months ago. We're still looking into the root cause though.

@peixian Cool, please post if you find out more. I already asked a question on Stack Overflow: https://stackoverflow.com/questions/78035081/concurrent-cassandra-async-writes-leading-for-some-packages-to-get-lost

@peixian

peixian commented Feb 23, 2024

@harunzengin ah, we're on Scylla Enterprise, so it's a little different. Still, do you have a repro for your issue, like a minimal set of queries or open streams? I can try it on my end to see whether Scylla has the same problem.

I think the thing I'm seeing is slightly different but possibly related (#357)

@whatyouhide
Owner

@harunzengin are you still seeing this? Did you have any feedback from the Scylla community here?

@harunzengin
Contributor Author

@peixian In our case, we insert several hundred times a second into our Cassandra cluster. I guess a minimal reproducible example would be something like this:

{:ok, conn} = Xandra.start_link(nodes: ["localhost"])
query = "insert into keyspace.table(id) values (:id)"
{:ok, prepared} = Xandra.prepare(conn, query)

# 100 rounds of 200 concurrent inserts on a single connection, one round per second
Enum.each(1..100, fn _ ->
  Enum.each(1..200, fn _ ->
    Task.start(fn ->
      case Xandra.execute(conn, prepared, %{"id" => 12}, timeout: 5000) do
        {:error, %Xandra.ConnectionError{action: "execute", reason: :timeout}} -> IO.puts("timeout")
        _ -> :ok
      end
    end)
  end)

  Process.sleep(1000)
end)

As said, in comparison to Xandra v0.12, this version is causing way more timeouts. I also created a Grafana dashboard and deployed it to our staging environment; this is how it looks:

[Screenshot: Grafana dashboard showing timeouts in the staging environment, 2024-05-03]

However, I have implemented a RetryStrategy, so this is not too bad. Unfortunately, I couldn't find anything about the async protocol causing an increased number of timeouts on Cassandra 4.0.10.
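
For anyone else working around this, a retry strategy for these timeouts could look roughly like the sketch below. This is illustrative only: it assumes the Xandra.RetryStrategy behaviour with new/1 and retry/3 callbacks (check the docs of the Xandra version you run, the callbacks have been reworked across releases), and the :retry_count option name is made up.

defmodule MyApp.TimeoutRetryStrategy do
  # Illustrative sketch, not the strategy actually deployed here.
  @behaviour Xandra.RetryStrategy

  @impl true
  def new(options) do
    # :retry_count is a made-up option, passed alongside the query options
    Keyword.get(options, :retry_count, 3)
  end

  @impl true
  def retry(_error, _options, 0), do: :error

  def retry(%Xandra.ConnectionError{reason: :timeout}, options, retries_left) do
    # retry timeouts with the same options, decrementing the budget
    {:retry, options, retries_left - 1}
  end

  def retry(_error, _options, _retries_left), do: :error
end

# per-query usage:
# Xandra.execute(conn, prepared, %{"id" => 12},
#   retry_strategy: MyApp.TimeoutRetryStrategy, retry_count: 3)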

@peixian

peixian commented May 3, 2024

@harunzengin did you see #362? Changing this value fixed the problem for us.

@harunzengin
Contributor Author

@peixian Yeah, the version that I deployed to our staging environment already has commit 6aedc4b, so it didn't really fix the timeout problem.

@jvf
Contributor

jvf commented Jun 7, 2024

I was able to reproduce the timeouts (at least I think this is the same problem) with very simple concurrent queries on a single connection:

iex(1)> Application.spec(:xandra)[:vsn]
~c"0.19.0"
# open a connection with a single connection process
iex(2)> {:ok, conn} = Xandra.Cluster.start_link(nodes: ["localhost"], authentication: {Xandra.Authenticator.Password, username: "cassandra", password: "cassandra"})
{:ok, #PID<0.349.0>}

# sequentially, everything works fine and is very fast
iex(3)> Enum.each(1..3, fn x -> Xandra.Cluster.execute(conn, "SELECT cluster_name FROM system.local", [], timeout: 1000) |> IO.inspect(width: 200, label: x) end)
1: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}
2: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}
3: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}
:ok

# concurrently, things get slow and we get errors
iex(4)> Enum.each(1..3, fn x -> Task.start(fn -> Xandra.Cluster.execute(conn, "SELECT cluster_name FROM system.local", [], timeout: 1000) |> IO.inspect(width: 200, label: x) end) end)
:ok
1: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}
2: {:error, %Xandra.ConnectionError{action: "execute", reason: :timeout}}
3: {:error, %Xandra.ConnectionError{action: "execute", reason: :timeout}}

I tried this against a local single-node dockerized Cassandra instance without load. Sometimes you need to run the Enum.each(1..3, fn x -> Task.start(fn -> Xandra.Cluster.execute(...) ... end) snippet twice, but otherwise reproduction is very reliable. I find this pretty worrisome!

I also tested this against a three-node staging cluster (non-dockerized) with load, with the same result (for the three-node cluster I used target_pools: 1 to force a single connection).

I went back through the versions; the most recent version that does not have the problem is Xandra v0.17.0:

~c"0.17.0"
iex(2)> {:ok, conn} = Xandra.Cluster.start_link(nodes: ["localhost"], authentication: {Xandra.Authenticator.Password, username: "cassandra", password: "cassandra"})
{:ok, #PID<0.223.0>}
iex(3)> Enum.each(1..3, fn x -> Task.start(fn -> Xandra.Cluster.execute(conn, "SELECT cluster_name FROM system.local", [], timeout: 1000) |> IO.inspect(width: 200, label: x) end) end)
:ok
1: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}
3: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}
2: {:ok, #Xandra.Page<[rows: [%{"cluster_name" => "VPP"}], tracing_id: nil, more_pages?: false]>}

Given that the regression was introduced with v0.18, and that we are talking about concurrent requests on a single connection here, I think this is a problem with the implementation of multiplexing requests on a single connection (the async protocol with stream IDs).

@jvf
Contributor

jvf commented Jun 7, 2024

@harunzengin implemented a timeout counter that, well, counts the number of timeouts. In our staging system (three-node cluster, two applications with 1000 insertions/s each) we see around 450 timeouts/minute (0.375 %).

In the experiment with the single connection process we see around 1-2 timeouts per 3 requests. Given the number of requests (2000/s), this raises the question of why we are not seeing more timeouts in the staging system.
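
To put those numbers side by side: 2 × 1000 insertions/s is 120,000 requests/minute, so 450 timeouts/minute is the 0.375 % quoted above, whereas the single-connection experiment loses on the order of 1-2 requests out of every 3.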

My theory is this: with multiple nodes and a pool size > 1 we have a lot of connection processes. Requests are load-balanced across the connection pools according to the configured strategy, in our case Xandra.Cluster.LoadBalancingPolicy.DCAwareRoundRobin. Within a connection pool, the connection is chosen randomly. I think we only see timeouts when we randomly hit the same connection within a pool multiple times before the previous request has finished.

To test the theory, I experimented with the pool_size setting in our staging system. A small pool size should increase the likelihood of hitting a single connection with multiple concurrent requests, and I would thus expect to see an increase in timeouts. A large pool size should in turn lead to fewer timeouts.
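
For reference, the knob varied here is the :pool_size option of Xandra.Cluster.start_link (sketch only; nodes and authentication are placeholders):

{:ok, cluster} =
  Xandra.Cluster.start_link(
    nodes: ["localhost"],
    # connections opened per node: 1 forces every request onto a single
    # connection, larger values spread concurrent requests across connections
    pool_size: 10
  )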

Using the counter implemented by @harunzengin we can generate the following graph in Prometheus:

[Screenshot: Prometheus graph of the timeout counter during the pool_size experiment, 2024-06-07]

The yellow line is application A, with 1000 insertions/s. For this application I adjusted the pool size.

  • before the experiment we had pool_size: 10 with three nodes, i.e. 30 connections
  • around 14:30 I decreased to pool_size: 1 (Application A only)
  • around 14:40 I increased to pool_size: 50 (Application A only)
  • around 14:50 I switched back to the default pool_size: 10

Application B is the blue line, also with 1000 insertions/s. It had pool_size: 10 the whole time.

As you can see, the number of connections has quite a significant effect on the number of timeouts. This is in line with what I would expect from my theory, and I think it supports the idea that the timeouts we are seeing in staging and production are related to the multiplexing of requests on a single connection.

@jvf
Contributor

jvf commented Jul 1, 2024

This seems to be a problem with the Cassandra native protocol v5; see #368 (comment). This means you can set the protocol version to v4 as a workaround. On the staging system I tested this on, it completely eliminated the timeouts for me.
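
Concretely, the workaround is to pin the native protocol version when starting the connection or cluster (sketch; the node list is a placeholder):

{:ok, cluster} =
  Xandra.Cluster.start_link(
    nodes: ["localhost"],
    # force native protocol v4 instead of negotiating v5
    protocol_version: :v4
  )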

@whatyouhide
Owner

Closed in #368 🎉
