
Differences in # of connections between MySql.Data and MySqlConnector #305

Closed
nvivo opened this issue Aug 4, 2017 · 46 comments

nvivo commented Aug 4, 2017

This is not an actual issue, but more for information purposes. I migrated 30 VMs to MySqlConnector, and I noticed a slight increase in the number of open connections to the database, even though all connection strings and services are the same.

image

The stable part is the services running without intervention. The middle gap is the upgrade window, where servers were up and down all the time, so additional connections are expected.

Interestingly, when doing the math, the number of servers multiplied by maxPoolSize should give around 1400 open connections, which matches the MySqlConnector behavior. So MySql.Data might be either smarter or more conservative with connections, even though it uses queues instead of stacks to manage the pool.

This is not really an issue, and it seems to be obeying the limits. But since there is a difference in behavior, I thought it would be nice to share it, in case it is something worth investigating.


nvivo commented Aug 4, 2017

One caveat here is the difference in default command timeout: the lack of one could be causing MySqlConnector to wait longer for some queries to finish, while previously they might have been timing out and releasing the connection. That might be the explanation.

caleblloyd (Contributor) commented:

Does your app make use of async? If so, you are probably getting much higher concurrency with MySqlConnector. Higher concurrency means a need for more parallel connections.


nvivo commented Aug 4, 2017

Can you elaborate on that, @caleblloyd? Why would that be the case?

I do use async, but I was already using the same number of threads before. So, with MySql.Data, a thread opens a connection, does its thing, and returns it to the pool. With MySqlConnector supporting async correctly, what would change?

caleblloyd (Contributor) commented:

MySql.Data uses Sync I/O under the covers, even for async calls. So each MySql.Data connection is serviced by one dedicated thread pool thread.

MySqlConnector uses Async I/O for async calls. This means that many MySqlConnector connections can be serviced by one thread pool thread.

We published a Concurrency Benchmark that shows the difference in practice.

If a flood of requests for async connections comes in at the same time, MySqlConnector will be able to service them all very quickly in parallel, even if there are few thread pool threads available, with the side effect of maxing out the connection pool. MySql.Data, however, will be limited by the thread pool and may have to wait for thread pool threads to become available, which has the side effect of "spreading out" the flood of requests and never hitting the max in the connection pool.
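To make the effect concrete, here is a minimal sketch (not from the issue; the connection string, query, and request count are placeholders) of the kind of async burst that can drive the pool to MaximumPoolSize almost immediately when the I/O is truly asynchronous:

using System.Linq;
using System.Threading.Tasks;
using MySql.Data.MySqlClient; // the namespace MySqlConnector exposed at the time

static class ConcurrencyDemo
{
    // 500 awaited queries started at once: with true async I/O only a handful of
    // thread pool threads are needed, so all 500 race for pooled connections and
    // the pool is pushed to its maximum almost instantly. With sync-over-async I/O,
    // each in-flight query holds a thread, so the thread pool throttles the burst.
    public static async Task FloodAsync(string connectionString)
    {
        var tasks = Enumerable.Range(0, 500).Select(async _ =>
        {
            using (var conn = new MySqlConnection(connectionString))
            {
                await conn.OpenAsync();
                using (var cmd = new MySqlCommand("SELECT SLEEP(1);", conn))
                    await cmd.ExecuteScalarAsync();
            }
        });
        await Task.WhenAll(tasks);
    }
}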

bgrainger (Member) commented:

> So MySql.Data might be either smarter or more conservative with connections, even though it uses queues instead of stacks to manage the pool.
>
> This is not really an issue, and it seems to be obeying the limits. But since there is a difference in behavior, I thought it would be nice to share it, in case it is something worth investigating.

Thanks for sharing this; it's a really interesting finding. Along with #302, this suggests that there may be some kind of leak of open connections over time that is worth digging into.

Is it easy for you to estimate how many open connections "should" be needed based on the number of requests per second to the service(s), the average response time, and the average number of MySQL connections made for each request? Should the number of connections be holding at a steady state instead of slowly increasing over time?
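For reference, a rough back-of-the-envelope way to make that estimate (the numbers below are made up, purely for illustration):

// Little's law style estimate: concurrent connections ≈
// requests per second × average response time (seconds) × MySQL connections per request.
double requestsPerSecond = 200;       // hypothetical
double avgResponseSeconds = 0.05;     // hypothetical
double connectionsPerRequest = 1;     // hypothetical

double expectedConcurrentConnections =
    requestsPerSecond * avgResponseSeconds * connectionsPerRequest;   // ≈ 10

System.Console.WriteLine(expectedConcurrentConnections);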

bgrainger (Member) commented:

Are you using any ORM layer, or directly using the MySqlConnector ADO.NET objects? If the latter, is your code disposing (either via .Dispose or using) every MySqlCommand, MySqlDataReader, MySqlConnection? (The performance of the library when objects aren't disposed is not a scenario that I've tested heavily yet.)
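For anyone following along, the disposal pattern being asked about looks something like this (a generic sketch with a placeholder query, not nvivo's code):

using System.Threading.Tasks;
using MySql.Data.MySqlClient;

static class DisposalExample
{
    // Every ADO.NET object that owns server resources is wrapped in using, so it is
    // disposed even if an exception is thrown mid-query.
    public static async Task<long> CountRowsAsync(string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            await conn.OpenAsync();
            using (var cmd = new MySqlCommand("SELECT COUNT(*) FROM some_table;", conn))
            using (var reader = await cmd.ExecuteReaderAsync())
            {
                await reader.ReadAsync();
                return reader.GetInt64(0);
            }
        }
    }
}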

bgrainger (Member) commented:

> One caveat here is the difference in default command timeout: the lack of one could be causing MySqlConnector to wait longer for some queries to finish, while previously they might have been timing out and releasing the connection. That might be the explanation.

@nvivo A command timeout throws a MySqlException (usually with a TimeoutException inner exception). Was your code explicitly catching that and recovering/retrying the statement? If so, and if that was happening frequently, that might indeed explain why the number of connections is increasing with MySqlConnector. The timeout exceptions would have been artificially limiting the number of concurrent statements being executed.


nvivo commented Aug 4, 2017

@caleblloyd Now that I think about it, it makes sense. I was thinking about having the same number of threads starting the query, but not about what happens after that. I use Akka, and with real async, once the query starts awaiting, Akka schedules the continuation as a message to the actor. At that point it considers the message processed and, as you said, that frees the thread in the pool, allowing it to start processing another message in another actor. Given a lot of actors, it may be that more actors get to the point of starting queries than they would with MySql.Data. It's a real possibility.

@bgrainger It's a mix of dapper and EF Core with Pomelo's driver. It's always either
using (var ctx = contextFactory()){ } for EF or
using (var cn = await openConnectionAsync()) { } for dapper.

I noticed this morning that although I have a max pool size of 45 for most services, some of them had more than 70 connections open (counting through the MariaDB processlist table).

I don't have an estimate of the ideal, but I have a history. Looking at the last 2 weeks:

image

There is definitely a difference. My guess now is it's a mix of being able to better use the thread pool and something allowing more connections to be open than the max pool (although I'm not sure all services have a max pool set, most of them have - that specific one with 70 connections was set to 45). But it doesn't look like a permanent leak so far, otherwise I would have seen it increasing constantly over the last 2 days, and it decreases from time to time.


nvivo commented Aug 4, 2017

Okay, I may have some evidence that indeed something strange is happening and there may be a leak here. I added some reflection to log some internals, specifically the counts of m_sessions and m_leasedSessions in ConnectionPool. This is what I get (note these are all mixed hosts):

image

This is what the ProcessList table shows me:

image

The Max Pool Size is 45. It started at 45; after a few seconds it started to grow by 1 every second or so per host.

When I look at one specific host at some point, this is what I see:

image

But mysql shows me:

image

If the watchdog runs every 3 minutes or so, there shouldn't be sleeping connections open for longer than that. Before adding the log, there were open connections idle for 20000 seconds. If the pattern from the log means what I think it means, it seems I'm using 40 connections when the service starts, but then it drops to only 1-3 per server after that. Still, I had dozens of open connections sitting in the sleeping state for 20000+ seconds.

Does that make sense?

bgrainger (Member) commented:

Thanks; this is fantastic debugging info. I'm still not quite sure exactly what it means yet, but it definitely provides a direction for further investigation.


nvivo commented Aug 5, 2017

Hi, some additional evidence to help with this. This is the processlist since yesterday:

https://docs.google.com/spreadsheets/d/1zoCpy0JAneE3ecdZEo6FYJQpH7hMS80XaUjCuBpXkpE/edit?usp=sharing

These processes come from a cluster of 25 servers with the same service, same connection string and same access pattern. Aside from basic settings, I override these settings:

Maximum Pool Size=40; Connect Timeout=120; Connection Lifetime=240;

Some interesting things here:

  1. Queries are happening on most recent connections, old ones are sleeping
  2. Lots of connections are sleeping for more than ConnectionIdleTimeout
  3. There is almost a pattern in the time interval between old connections, in increments of 12.5 minutes. The interval may be because I run some process that peaks connection use at those intervals, or because something in the driver is forgetting to close them at those intervals. In any case, it looks like every now and then the driver is leaving one connection open for some reason and not counting it in the m_sessions list.

Edit: The increments seem to have a pattern, but it's actually different for each server.

bgrainger (Member) commented:

  1. This is by design. MySqlConnector uses a stack so that the most recently used connection keeps being reused, letting extra ones become idle and be cleaned up (see the sketch after this list). Compare https://bugs.mysql.com/bug.php?id=66476.
  2. I can't explain this yet (and haven't been able to reproduce it with a simple test app); will keep investigating.
  3. It definitely does seem like a pattern but I can't explain it yet.
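A minimal sketch (not MySqlConnector's actual implementation, just an illustration of the design point in item 1) of why a LIFO pool lets surplus connections age out, while a FIFO pool would keep every connection warm:

using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a pooled session/connection.
class PooledSession
{
    public DateTime ReturnedUtc;
}

class LifoPool
{
    private readonly ConcurrentStack<PooledSession> m_idle = new ConcurrentStack<PooledSession>();
    private static readonly TimeSpan IdleTimeout = TimeSpan.FromSeconds(180);

    // Rent always pops the most recently returned session, so under light load the
    // same few sessions get reused and the rest sink to the bottom of the stack.
    public bool TryRent(out PooledSession session) => m_idle.TryPop(out session);

    public void Return(PooledSession session)
    {
        session.ReturnedUtc = DateTime.UtcNow;
        m_idle.Push(session);
    }

    // A periodic reaper can then close anything that has sat idle past the timeout.
    // A FIFO queue would hand out sessions in rotation, so none of them would ever
    // look idle long enough to be reclaimed.
    public int CountReapable()
    {
        var count = 0;
        foreach (var session in m_idle)
            if (DateTime.UtcNow - session.ReturnedUtc > IdleTimeout)
                count++;
        return count;
    }
}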


nvivo commented Aug 5, 2017

Number 1 makes perfect sense; I just didn't connect the dots.

Just a note: this morning I had to restart all servers because no new connections could be opened. From the logs, this was gradual, from 2am to 6am.

I'm not ruling out a bug in my code, since the migration to EF Core required some additional changes, but since the driver is not seeing these connections in the leased or the idle sessions, my bet is on a bug in the driver. For now I'm adding some periodic housekeeping to a background job to kill long-standing connections.

bgrainger (Member) commented:

> Just a note: this morning I had to restart all servers because no new connections could be opened.

This makes me wonder if this and #302 share the same underlying issue.

> For now I'm adding some periodic housekeeping to a background job to kill long-standing connections.

Thanks for your help working through this, and sorry you have to employ a workaround; I want to figure this out.

Are you running the client on Linux or Windows?

bgrainger (Member) commented:

OK, I can repro long-running idle connections if, in my test app, I periodically omit calling MySqlConnection.Dispose. So it seems like the code to recover "leaked" sessions isn't working properly. (However, I still don't ever see more than MaximumPoolSize concurrent connections, even in this scenario.) But there's definitely a connection recovery bug to fix first.

bgrainger (Member) commented:

There's a serious bug in the code to recover leaked sessions that's triggered if the last MySqlDataReader on that connection is not Disposed. I think it's likely that this is happening in your code (but this library needs to be resilient to callers not disposing MySqlConnection, MySqlCommand, MySqlDataReader, etc.)
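A sketch of the kind of usage that would trigger it (hypothetical code with a placeholder query; the point is only that the last reader is never disposed):

using System.Threading.Tasks;
using MySql.Data.MySqlClient;

static class LeakedReaderRepro
{
    public static async Task RunAsync(string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            await conn.OpenAsync();
            var cmd = new MySqlCommand("SELECT 1;", conn);
            var reader = await cmd.ExecuteReaderAsync();
            await reader.ReadAsync();
            // The reader (and command) are deliberately never disposed; the connection's
            // Dispose below is the only cleanup, which is the scenario described above.
        }
    }
}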


nvivo commented Aug 6, 2017

I double-checked every use of a connection or a context in the service. I did a check yesterday and one this morning, to avoid missing something, and absolutely all of them are enclosed in a using block for the connection or the EF context. I checked the dependencies as well. There is no direct use of data readers or MySQL commands anywhere; it's always Dapper or an EF context.

What I do have is a lot of lock wait timeouts and occasional MySqlExceptions due to an increase in load. Is it possible that an exception or a connection in a bad state might be causing some step to not execute?

I find it curious that at some point the connection is being removed from the sessions list, but remains open. I'll try to debug this to see if I find something.


nvivo commented Aug 7, 2017

Just an update. I did some debugging today in production. I added the MySqlConnector source to my project, added an m_allSessions collection to ConnectionPool to track everything independently of whether it is leased or not, and stored a string LastSqlStatement in MySqlSession. I pushed this to the entire cluster and left it running for an hour or so, checking by connection id which connections were open for a long time and where they came from.

I saw that most connections were cleaned up after 180 seconds as expected, and after that these "lost" connections didn't go back to m_sessions or m_leasedSessions. Also, they came from everywhere, not a single place, and from correct code, like this:

using (var cn = await dbFactory.OpenConnectionAsync())
{
    await cn.ExecuteAsync(cmd);
}

I haven't been able to find out why this happens yet. For now, setting wait_timeout to 1 hour (the default is 8 hours) is good enough as a workaround. It causes the server to close idle connections after that period.
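For reference, a sketch of that server-side workaround (hypothetical; wait_timeout can also be set in the server configuration, and changing it globally requires the appropriate privilege):

using System.Threading.Tasks;
using MySql.Data.MySqlClient;

static class WaitTimeoutWorkaround
{
    public static async Task SetOneHourAsync(string adminConnectionString)
    {
        using (var conn = new MySqlConnection(adminConnectionString))
        {
            await conn.OpenAsync();
            // MySQL/MariaDB close a connection that has been idle longer than
            // wait_timeout (default 28800 seconds, i.e. 8 hours).
            using (var cmd = new MySqlCommand("SET GLOBAL wait_timeout = 3600;", conn))
                await cmd.ExecuteNonQueryAsync();
        }
    }
}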

Note: I did notice, however, that some connections opened by each server had the following settings appended to the connection string: BufferResultSets=True;Use Affected Rows=False;Pooling=False, and they weren't coming from my code. I couldn't find out why; I'm not sure if it's the Pomelo EF Core driver, and not sure if this is even related.


nvivo commented Aug 7, 2017

I'll put your minds at ease in exchange for some public embarrassment. There is a problem, but it is not as crazy as I made it appear.

It turns out I have a single connection string in my config file, but I forgot that I turn it into two at runtime in order to set up an EF context for another database. This creates 2 connection pools. The way I was tracking the sessions before was looking at one pool only. When taking all the pools into account, all connections are accounted for, and the processlist matches what I see in the ConnectionPool, at least over the last 2 hours.

Now, the chart with increased connections is still real, the spreadsheet is still real, and I still see these idle connections piling up over time that should have been cleaned up. From my logs, I can see all pools are periodically cleaned up, but these connections with long sleeping time were never disposed. Any ideas?

bgrainger (Member) commented:

> When taking all the pools into account, all connections are accounted for, and the processlist matches what I see in the ConnectionPool, at least over the last 2 hours.

That does put my mind at rest because I could not figure out how there would be more connections open to the server than sessions in the pool. Thanks for the update.

(I assume this explains the second connection string with BufferResultSets=True;Use Affected Rows=False;Pooling=False?)

Does the second pool still have ConnectionIdleTimeout set to a non-zero value? (I.e., idle connections will be automatically cleaned up?)

I still can't rule out the session leaking bug I discovered and mentioned earlier. If I were to try to write a fix for that (or add some additional diagnostics), what would be your preferred way of accessing it? Beta version on Nuget? Private build of the DLL sent directly to you? Something else?

bgrainger (Member) commented:

I just discovered that my test code that leaks readers also causes a connection timeout with MySql.Data. So this does not appear to be a new bug in MySqlConnector. (I've filed that as #306.)


nvivo commented Aug 7, 2017

> (I assume this explains the second connection string with BufferResultSets=True;Use Affected Rows=False;Pooling=False?)

Not really. It seems this comes from Pomelo's EF driver. The first 2 settings are set globally; Pooling=False seems to apply to the first connection only, which looks like some infrastructure call for discovery.

As for the issue, maybe you're on the right track with the two bugs you found. They explain part of what I'm seeing.

As for the idle connections, I added some logs to this part here, so I see when any connection is disposed and also the connections that are kept.

What I see is that these connections with long sleep times were never cleaned (never pass through the if block), but they stop being seen in m_sessions (stop passing through the else block). And the time they stop appearing there matches the sleep time in mysql. For example, it's 2:32pm here and I see this:

image

5822 / 60 ≈ 97 minutes, so I should have seen that connection for the last time in the logs about 1 hour 37 minutes ago (~12:55).

image

That data is a dump of info on m_sessions inside that else block.

Any ideas what could be going on here?

Edit: BTW, I know it wasn't disposed because I would have seen it logged, as I do for other connections:

image

And here is the dump from the last event before disposal, since it looks confusing in the screenshot:

{
    "Count": 4,
    "data": [
        { "ConnectionId": 135675, "SecondsSinceReturnToPool": 0, "LastSqlStatement": "..." },
        { "ConnectionId": 135676, "SecondsSinceReturnToPool": 1, "LastSqlStatement": "..." },
        { "ConnectionId": 135532, "SecondsSinceReturnToPool": 179, "LastSqlStatement": "..." },
        { "ConnectionId": 135674, "SecondsSinceReturnToPool": 179, "LastSqlStatement": "..." }
    ]
}

bgrainger (Member) commented:

> What I see is that these connections with long sleep times were never cleaned (never pass through the if block), but they stop being seen in m_sessions (stop passing through the else block).

I'm thinking "unhandled exception".

It sounds like you're building a local copy of the library? Can you add exception logging here and see if it catches anything?

I'm suspecting that an exception in DisposeAsync will cause the session to be permanently lost, but never cleaned up.

I can add some simple exception handling there, but I'd like to know what the underlying problem is to know if there's a better way to handle it.


nvivo commented Aug 8, 2017

Okay, looks like I got to the bottom of this and I know what is happening, but I don't know why. =)

DisposeAsync had an exception, but not during reap. It was during Return.

System.InvalidOperationException: Expected state to be (Connected|Failed) but was Querying.

This is the pattern for all these lost connections:

image

Every time the command executes, the session goes to Querying, then back to Connected, but not that last time. It returns to the pool as Querying; the pool checks whether it's connected, and since the state is not Connected it tries to dispose it, which throws. MySQL sees the connection as idle. It never got cleaned up because it never went back to the pool, and it wasn't disposed because it was in the wrong state.
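A minimal sketch (a paraphrase of the flow described above, not the library's actual code) of how a session stuck in Querying slips through both paths:

using System;

enum SessionState { Connected, Querying, Failed }

class SessionSketch
{
    // The bad case: an exception during the query means the state never went back to Connected.
    public SessionState State = SessionState.Querying;

    // Called when MySqlConnection.Dispose returns the connection.
    public void Return()
    {
        if (State == SessionState.Connected)
        {
            // normal path: the session goes back onto the pool's idle list
        }
        else
        {
            Dispose(); // throws below, so the session is neither pooled nor closed
        }
    }

    public void Dispose()
    {
        if (State != SessionState.Connected && State != SessionState.Failed)
            throw new InvalidOperationException(
                $"Expected state to be (Connected|Failed) but was {State}.");
        // ... close the underlying socket ...
    }
}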

There are some stacktraces in this log, in this case, this is the code that triggered this last query:

image

The query is something like this:

insert ignore ...; set @inserted = row_count();
update ...; set @updated = row_count();
select @inserted as Inserted; @updated as Updated;

The service logs then show me this for the moment of the query:

MySql.Data.MySqlClient.MySqlException (0x80004005): Parameter '@updated' must be defined. To use this as a variable, set 'Allow User Variables=true' in the connection string.

Note: I've checked and I'm setting this via ConnectionStringBuilder, but it doesn't seem to be taken into account. Or I may be tired, will check again tomorrow.
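For reference, a sketch of setting that option through the builder (assuming the MySql.Data.MySqlClient-compatible API MySqlConnector exposed at the time; whether this matches the actual code in question is an assumption):

using MySql.Data.MySqlClient;

static class UserVariablesExample
{
    public static MySqlConnection Create(string baseConnectionString)
    {
        var csb = new MySqlConnectionStringBuilder(baseConnectionString)
        {
            // Required before statements such as "set @inserted = row_count();" are
            // accepted; equivalent to "Allow User Variables=true" in the raw string.
            AllowUserVariables = true
        };
        return new MySqlConnection(csb.ConnectionString);
    }
}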

After looking at other connections in the same state, it seems there was always an exception being thrown at some point, and even though there is a using block, the state is not being cleared after the exception.

bgrainger (Member) commented:

Unfortunately, user variables in SQL statements are not permitted by default; see #194 for the full rationale. Doubly unfortunately, that leads to unexpected failures for code like yours that assumes they should just work by default.

Our assumption (see #194) was that this exception would bubble up and be presented to the programmer immediately, letting them know what was wrong. I'd love to know exactly where it's being swallowed: is it in MySqlConnector itself, Dapper, EF, or somewhere else? (Or is it actually making its way to a log, as expected, but wasn't seen until now?)

Ideally, this fatal exception would be presented to the user so it could be fixed. But even if it's not, it shouldn't put this library into a bad state. If you're able to give a self-contained repro, that'd be great; otherwise, I can try to work on figuring one out.


bgrainger commented Aug 8, 2017

I think I have a repro, which is:

using (var conn = new MySqlConnection(csb.ConnectionString))
{
	await conn.OpenAsync();
	try
	{
		await conn.ExecuteScalarAsync("select @test;");
	}
	catch (MySqlException ex)
	{
		// ignore
	}
} // connection is not returned to the pool when 'using' block ends but is left open

I can reproduce this problem in MySqlConnector, but not in MySql.Data, which makes it seem very likely to be the explanation for the original reported problem.


nvivo commented Aug 8, 2017

I was able to reproduce this too: https://gist.github.com/nvivo/3fc457baa29870c3395d46ecae4da2d2

image

Sleeping connections go over the idle connection timeout. Also, the open connections go over the Max Pool Size, which explains a lot.

bgrainger (Member) commented:

0.24.2 should fix this problem and will be available on NuGet very soon; can you retest with it to confirm?

Thanks again so much for your help in tracking this down.


nvivo commented Aug 8, 2017

90% solved! But about an hour later I saw some connections sleeping there, fewer than before, only one or two per server.

I tracked where they're coming from, and all of them are coming from RecoverLeakedSessions now:

image

Maybe you could ping these connections and reset them to Connected before returning them to the pool?

bgrainger (Member) commented:

Is there an exception being thrown from DisposeAsync and/or earlier from the command when it's being executed? I'm guessing that that session might be in a bad state (from MySqlConnector's perspective; not necessarily from the server's point of view) so it's not being cleaned up properly.


nvivo commented Aug 8, 2017

DisposeAsync gives the same:

Expected state to be (Connected|Failed) but was Querying.

I looked at some of these connections and couldn't see any exception thrown between the query and the recovery. All servers are NTP-synced, and I checked a few minutes before and after, but there was nothing for about half a dozen of them.


nvivo commented Aug 8, 2017

Just found something interesting here. There is one connection that was returned correctly to the pool by MySqlConnection.Dispose() and kept the Querying state; it had the same issue.

bgrainger (Member) commented:

I can put a try/catch around DisposeAsync in RecoverLeakedConnections (and probably forcibly close it), but that feels like working around the issue instead of solving the root cause. I'd love to know how it ended up in a Querying state when it was recovered. Still trying to repro that here.


nvivo commented Aug 8, 2017

image

If it helps, the ReturnToPool was called from here:

   at MySql.Data.Serialization.MySqlSession.ReturnToPool()
   at MySql.Data.MySqlClient.MySqlConnection.DoClose()
   at MySql.Data.MySqlClient.MySqlConnection.Dispose(Boolean disposing)
   at System.ComponentModel.Component.Dispose()

Querying state was set from here:

   at MySql.Data.Serialization.MySqlSession.StartQuerying(MySqlCommand command)
   at MySql.Data.MySqlClient.CommandExecutors.TextCommandExecutor.<ExecuteReaderAsync>d__3.MoveNext()
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.Start[TStateMachine](TStateMachine& stateMachine)
   at MySql.Data.MySqlClient.CommandExecutors.TextCommandExecutor.ExecuteReaderAsync(String commandText, MySqlParameterCollection parameterCollection, CommandBehavior behavior, IOBehavior ioBehavior, CancellationToken cancellationToken)
   at MySql.Data.MySqlClient.CommandExecutors.TextCommandExecutor.<ExecuteNonQueryAsync>d__1.MoveNext()
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.Start[TStateMachine](TStateMachine& stateMachine)
   at MySql.Data.MySqlClient.CommandExecutors.TextCommandExecutor.ExecuteNonQueryAsync(String commandText, MySqlParameterCollection parameterCollection, IOBehavior ioBehavior, CancellationToken cancellationToken)
   at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQueryAsync(IOBehavior ioBehavior, CancellationToken cancellationToken)
   at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQueryAsync(CancellationToken cancellationToken)

Everything points to the flow working as expected and just ending in that state.

Note: These are 2 different problems, but it seems that if a session can end up in a Querying state in a normal flow, there might not be a bug in the recovery code.

bgrainger (Member) commented:

Are you able to tell me what the particular command is?

Also, is it possible that some object is keeping the MySqlConnection (that's throwing from Dispose) strongly rooted? The only way I can figure out that the connection to the server is still kept alive is:
(something)→MySqlConnection→MySqlSession m_session→Socket m_socket

Otherwise, because the MySqlSession isn't actually returned to the pool, it and the Socket it owns would be GCed, the Socket would be finalized, and the TCP socket would be rudely shut down by the OS, leading to an increase of "Aborted Clients" in MySQL Server.

So I think what we have is a failure in MySqlConnection.Dispose ultimately causing a connection leak if the MySqlConnection object can't be GCed.

bgrainger reopened this Aug 8, 2017

nvivo commented Aug 8, 2017

Wow, in this last case this may be exactly the bug that's causing it: an incorrectly placed async call in a void method.

public void ReleaseAccountServiceId(int accountServiceId, Guid instanceGuid)
{
    using (var cn = _dataFactory.OpenConnection())
    {
        // not awaited: the using block can dispose the connection while the query is still running
        cn.ExecuteAsync(ReleaseAccountServiceIdSql, new { accountServiceId, instanceGuid });
    }
}

I'll need to check whether there are more of these in the code.


nvivo commented Aug 8, 2017

In the previous case of RecoverLeakedSessions, it's a mix of EF code that I'm not sure is allowed. It didn't cause issues with EF6, but maybe EF Core works differently. It's something like:

public async Task Dequeue(SomeObj msg)
{
    using (var context = CreateEFContext())
    using (var trans = context.Database.BeginTransaction())
    {
         context.Entity.Add(new Entity { ... });
         await context.Database.ExecuteSqlCommandAsync(queryA);
         await context.Database.ExecuteSqlCommandAsync(queryB);
         await context.SaveChangesAsync();
         trans.Commit();
    }
}

The problem seems to come from there frequently. It's a legacy thing. I'll move everything into a single SQL statement executed with Dapper and see if this goes away.

You have been very patient with me, @bgrainger. Thank you a lot for your support with this. I'll let you know if changing this code fixes the issue.

bgrainger (Member) commented:

> Wow, in this last case this may be exactly the bug that's causing it: an incorrectly placed async call in a void method.

I think this one should be OK. Calling MySqlConnection.Dispose will synchronously Dispose the active data reader associated with the connection, and then return the connection to the pool.

> In the previous case of RecoverLeakedSessions, it's a mix of EF code that I'm not sure is allowed.

I don't really know EF, but I don't see anything obviously wrong in that code. CC @caleblloyd in case you can spot a problem there.

I think there still is a problem that the library can get in a bad state if DisposeAsync throws (which probably means it's already in a bad state), so I still need to look into when/how that can happen and handle it appropriately.


nvivo commented Aug 10, 2017

An update here. I fixed some stuff in the code, changed others and tried to avoid as many exceptions as possible.

Things seem MUCH more stable since 0.24.2. Some connections are still leaking, but in lower numbers. The cluster has been running for more than 12 hours since my last update and there are between 1 and 6 of these lost connections per server, versus up to 30 before. Total open connections haven't passed 600 since yesterday (using wait_timeout = 1 hour), so it's manageable, and I'm not getting any connect timeout exceptions as far as I can tell.


nvivo commented Aug 10, 2017

I'd be interested to know whether anyone else running an app for a long time is getting these lost connections, or if it's something affecting only me.

bgrainger (Member) commented:

> Some connections are still leaking, but in lower numbers.

I'm assuming that this is a bug in MySqlConnector, but I don't think I've been able to reproduce it yet. Any further information you can provide about any exceptions that are being thrown would probably be extremely helpful.


nvivo commented Aug 10, 2017

I've been able to fix most errors, and exceptions are now close to zero, but there are 2 exceptions that are constantly being thrown due to the high load:

  1. Lock wait timeout exceeded; try restarting transaction
  2. Deadlock found when trying to get lock; try restarting transaction

I handle these exceptions and retry. But I guess if this were causing the leak, there would be many more connections open; these errors happen hundreds of times per hour.
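For context, the retry handling is along these lines (a generic sketch, not the actual code; 1205 and 1213 are MySQL's error numbers for lock wait timeout and deadlock):

using System;
using System.Threading.Tasks;
using MySql.Data.MySqlClient;

static class RetryHelper
{
    public static async Task WithRetryAsync(Func<Task> action, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action();
                return;
            }
            catch (MySqlException ex) when (
                (ex.Number == 1205 || ex.Number == 1213) && attempt < maxAttempts)
            {
                // 1205: lock wait timeout exceeded; 1213: deadlock found.
                // Both are transient, so back off briefly and retry the whole transaction.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
            }
        }
    }
}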

I know AWS sometimes has issues even between VMs on the local network. Sometimes I have connections dropping between nodes, and that has caused some strange behavior in Akka before.


nvivo commented Aug 13, 2017

Just going to post this here in the hope you may have some insight... I have been adding more and more logs, trying to pinpoint the exact location where the issue happens. It's still quite obscure, so please excuse my verbosity.

It seems to always happen in this code:

using (var context = _contextFactory())
{
    using (var trans = context.Database.BeginTransaction())
    {
        await context.Database.ExecuteSqlCommandAsync(sql, parameters: new object[] { msg.AccountId });
        trans.Commit();
    }
}

Note, the issue happens on ExecuteSqlCommandAsync, not on Commit.

_contextFactory being an EF context factory, registered in DI as:

services.AddSingleton(sp => new Func<EntityContext>(() => new EntityContext(options)));

The only thing special there is that the SQL is dynamic, with a fixed DELETE + INSERT that may generate 1 or 2 additional INSERT statements. But it doesn't make sense that an error in the SQL could cause this; also, I can see the same transaction runs fine in most cases - about 1.5 million times a day, in fact. I don't know the exact rate per parameter, but I know these transactions happen often enough that they don't come close to matching the lost-connection rate of tens over many hours for the entire cluster.

So, I found that in these cases, it reaches

await m_command.Connection.Session.SendAsync(payload, ioBehavior, CancellationToken.None).ConfigureAwait(false);
but never continues.

Then, looking inside, SendAsync calls SendReplyAsync:

return new ValueTask<int>(task.AsTask().ContinueWith(TryAsyncContinuation, cancellationToken, TaskContinuationOptions.LazyCancellation | TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default));

The continuation for this task just never fires in these cases, and I couldn't figure out why. There are no exceptions being thrown, the CancellationToken is not cancelled, and TryAsyncContinuation is never called. The same method appears to keep working for other connections; it looks quite random, but there must be some logic to it, maybe a race condition, I don't know.

I added a continuation as:

var tmp = task.AsTask();
tmp.ContinueWith(t => { Log("...") });
return new ValueTask<int>(tmp.ContinueWith ...);

It just never fires the continuation. I could see the handler is always the StandardPayloadHandler, behavior is Asynchronous. Payload length varies from 1 byte to 400 bytes. tmp.Status reports WaitingForActivation.

I also tried scheduling a continuation directly in the ValueTask but didn't seem to fire for these cases either.

If you have any idea on what could be or what I could do to dig deeper, let me know.

bgrainger (Member) commented:

My guess is that the actual async I/O (e.g., started by SocketByteHandler here, if you're not using SSL: await m_socket.SendAsync(m_socketAwaitable);) is never completing, so the whole chain of continuations is never scheduled.

If so, this may be something to be addressed as part of #67 by ensuring that all I/O operations have some timeout.

It'd also be worthwhile for me to do some local testing that injects failures, to make sure exceptions are bubbled up correctly.

Since you were successfully using this code with MySql.Data, which is not asynchronous, it might be useful to try adding ForceSynchronous=true to your connection string to see if that changes the behaviour. (This connection string setting will make all Async method calls use synchronous I/O under the hood, which will go through a slightly different code path and may be easier to debug; taking a process dump will show if any method is "hanging" in socket I/O.)
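A sketch of trying that (hypothetical helper; it simply appends the option to the existing connection string, as suggested):

using MySql.Data.MySqlClient;

static class ForceSynchronousDebug
{
    public static MySqlConnection CreateForDebugging(string connectionString)
    {
        // MySqlConnector-specific debugging aid: async API calls run on synchronous I/O,
        // so a hang shows up as a thread blocked in socket I/O in a process dump.
        return new MySqlConnection(connectionString.TrimEnd(';') + ";ForceSynchronous=true");
    }
}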

Thanks again for persisting in debugging this issue!

bgrainger (Member) commented:

I've added code to DisposeAsync to stop it throwing an exception; this will ship in 0.25.1. This should cause the socket to be closed (rudely), which may be logged as an aborted connection but should prevent it leaking. If you are able to upgrade (or to pull the changes into your local build), I'd be interested to hear whether it does fix the leaked connections.

(I still would like to know how they're ending up in an unexpected state, though.)


nvivo commented Aug 18, 2017

Sorry, it has been a busy week. I updated to 0.25.1, and after 3 hours of running I didn't see any connections over the idle timeout limit. I haven't seen any exceptions so far. I'm curious to see what will happen with these transactions. I'll keep a close eye on them and let you know if I find anything.

Thanks for your support on this!
