
Performance Issues due to Thread Locking #333

Closed
zromano opened this issue Jan 6, 2023 · 14 comments
Assignees
Labels
bug Something isn't working

Comments

@zromano

zromano commented Jan 6, 2023

Describe the bug

We were interested in switching our JDBC driver to the AWS MySQL JDBC driver to utilize its fast failover capabilities. However, when we did performance testing with this new driver, we noticed it had a substantial performance impact on our service.

We do not experience this issue when using the mysql-connector-j JDBC driver.

Expected Behavior

I'd expect this to have similar performance to the mysql-connector-j JDBC driver, since it is advertised as "drop-in compatible".

Current Behavior

The JDBC driver was causing poor performance due to thread locking from synchronized code in the following classes:

[screenshot: profiler output showing the classes with heavy lock contention]

Reproduction Steps

We are using a Spring Boot 2.7.5 application on Java 17, backed by an Aurora MySQL DB, and performing a Locust load test against it.

Sadly, I can't post code, but I'd think load testing a simple Spring Boot app that can communicate with Aurora would be sufficient to repro.

Possible Solution

No response

Additional Information/Context

No response

The AWS JDBC Driver for MySQL version used

1.1.2

JDK version used

17.0.5 (corretto)

Operating System and version

amazoncorretto:17 Docker Image

@zromano
Author

zromano commented Jan 6, 2023

I see these questions from the other thread, I can add more details:

What connection string and configuration parameters do you use?

We are using a Hikari Connection Pool.
url: jdbc:mysql:aws://${SERVER}:${PORT}/${SCHEMA}?verifyServerCertificate=true&useSSL=true&requireSSL=true
connectionTimeout: 10000
minimumIdle: 25
maximumPoolSize: 72
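
For reference, those pool settings map onto a Spring Boot + HikariCP configuration roughly like the following sketch (the property names come from the Spring Boot Hikari namespace; the host/port/schema placeholders and the driver class name are assumptions on my part, not copied from our actual config):

```yaml
spring:
  datasource:
    url: jdbc:mysql:aws://${SERVER}:${PORT}/${SCHEMA}?verifyServerCertificate=true&useSSL=true&requireSSL=true
    driver-class-name: software.aws.rds.jdbc.mysql.Driver
    hikari:
      connection-timeout: 10000   # ms to wait for a connection from the pool
      minimum-idle: 25
      maximum-pool-size: 72
```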

What are usual SQL statements that your service executes?

Very simple query along the lines of SELECT * FROM table WHERE column IN (x, y, z). It doesn't matter how many values we provide for the IN clause; this always happens.

Did you have a chance to try MySQL JDBC Connector/J Driver instead of MariaDb driver? Any observations about performance? https://github.com/mysql/mysql-connector-j

Neither had this issue.

Does your application use any connection pool?

See above.

What database access frameworks your application uses?

We use Spring Data JPA, but the same issue happens even if we use plain JDBC queries.
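
For reference, a minimal plain-JDBC sketch of the query shape described above (the table and column names are made up for illustration; the connection code is commented out since it needs a live endpoint and credentials):

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class InQueryExample {
    // Build a parameterized "IN (?, ?, ...)" query for n values;
    // "orders" and "customer_id" are illustrative names only.
    static String buildInQuery(int n) {
        String placeholders = IntStream.range(0, n)
                .mapToObj(i -> "?")
                .collect(Collectors.joining(", "));
        return "SELECT * FROM orders WHERE customer_id IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        String sql = buildInQuery(3);
        System.out.println(sql);
        // Against a live cluster (hypothetical endpoint/credentials) it would run as:
        // try (java.sql.Connection c = java.sql.DriverManager.getConnection(
        //         "jdbc:mysql:aws://HOST:3306/SCHEMA", "user", "pass");
        //      java.sql.PreparedStatement ps = c.prepareStatement(sql)) {
        //     ps.setLong(1, 1L); ps.setLong(2, 2L); ps.setLong(3, 3L);
        //     try (java.sql.ResultSet rs = ps.executeQuery()) { /* read rows */ }
        // }
    }
}
```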

@congoamz
Contributor

congoamz commented Jan 6, 2023

Hi @zromano,

Thank you for reporting this issue. We will look into the problem and will share more info as we investigate. Thank you for your patience!

@congoamz
Contributor

congoamz commented Jan 6, 2023

@zromano I had a couple of follow-up questions:

  1. You mention you are using version 2.18.33 of our driver; however, the latest driver release is 1.1.3. Can you clarify which version of our driver you are experiencing the problem on?
  2. Is the long lock wait time shown in your screenshot reported every time you run your performance test, or only occasionally? If the answer is occasionally, do you know roughly how often it occurs (e.g. every X runs)?

@zromano
Author

zromano commented Jan 6, 2023

Whoops, copied the wrong version.

  1. We are on version 1.1.2. (updated above as well).

  2. We only notice the locking issue when our pod is under significant load. We don't register anything abnormal when our pod is receiving low traffic (or at least nothing that I can see).
    During performance testing, this caused our pod to handle only around 12-15% of the requests per second that it handles with mysql-connector-j. This happens every time we run a performance test with the AWS MySQL connector.

@zromano
Author

zromano commented Jan 11, 2023

Okay, so we did a little more testing: we actually see a 20-30% performance boost when switching from this driver to mysql-connector-j, even when our pods aren't under much load.

It's also worth noting that this 20-30% is on our overall response time, not just on the time we are waiting on JDBC. I wasn't sure how to measure that.

Please let me know if there is any other data we can provide that might be helpful for your testing 👍

@hsuamz hsuamz assigned sergiyvamz and unassigned congoamz Feb 6, 2023
@sergiyvamz
Contributor

Hello @zromano

Thread lock improvements have been merged. Can you please test out our 1.1.4-SNAPSHOT build here and let us know if the issue persists?
#356

Thank you!

@zromano
Author

zromano commented Feb 15, 2023

Hey @sergiyvamz,

I just load tested our application with the following two JDBC drivers:

  1. implementation(files("aws-mysql-jdbc-1.1.5-20230214.000541-4.jar"))
  2. implementation "com.mysql:mysql-connector-j:8.0.32"

Summary:
The standard MySQL connector had about 4x more throughput than the AWS one.

With the configuration I tested, I was getting ~450 requests per second over a 20-minute test period with the standard MySQL driver.

When I tested the SNAPSHOT version, it peaked at around 150 RPS, then exhausted our connection pool and started throwing errors. The profile of the application still shows a large number of locks from the AuroraTopologyService.

[profiler screenshots showing lock contention in AuroraTopologyService]

@sergiyvamz
Contributor

Hello @zromano

Thank you for the prompt check of the new snapshot build. It's sad to hear that the problem is still there. I'm afraid I need to ask you to provide a sample app that reproduces the issue. I'd also ask you to describe the tools and strategy you use to measure locks and request throughput.

Thank you!

@zromano
Author

zromano commented Feb 15, 2023

Unfortunately I can't provide a sample app. Our team has decided not to invest more time into this topic. 😞

However, I can provide as much info as possible.

For our tech stack:

  • Hosting app in a Kubernetes cluster, but only using a single pod for this performance testing
  • Obviously using an AWS Aurora MySQL cluster for the DB 🙂
  • Using DataDog agent to measure performance and gather analytics
  • Using Spring Data JPA and Hikari connection pool
  • Using Locust to run a load test against a simple endpoint that just fetches and returns data from the DB

I am willing to set up a sample Spring app that I assume will repro this, but unfortunately I don't have the resources to test it. I don't want to personally pay for the AWS resources.

Please let me know.

@sergiyvamz
Contributor

Hello @zromano

Thank you for providing details about your app. We understand that there may be challenges in providing a sample app, including the time and resources necessary to test it. Our team would appreciate getting a sample app even without proper testing. It's important to us to investigate the issue and find the root cause of the dramatic (4x) performance degradation you reported.

Thank you

@zromano
Author

zromano commented Mar 11, 2023

I did my best to create a sample application for you. It can be seen at: https://github.com/zromano/AWS-JDBC-Performance

I tested that the app works locally, but I didn't hook it up to a real AWS Aurora MySQL DB, and it doesn't offer any help in terms of deploying the app on AWS.

Hope this helps, please let me know if there is anything else I can do to help

@karenc-bq
Contributor

Hi @zromano, thank you for the sample application! We will take a look and keep you posted with our progress.
Thanks again!

@karenc-bq karenc-bq self-assigned this Mar 13, 2023
@karenc-bq
Contributor

Hi @zromano,

I ran the sample application you provided with different versions of the AWS MySQL JDBC Driver:

  1. v1.1.4
  2. v1.1.5-20230214.000541-4 (Snapshot)
  3. latest main (77a1cdf)

In summary, we were able to reproduce the performance issues you raised with version 1.1.4 and the snapshot build. However, these issues are addressed in the latest main.

More details below.


v1.1.4

We saw a large amount of time spent waiting for locks in the ExpiringCache class, specifically in the synchronized get method.

To resolve this issue we introduced the CacheMap class, which uses concurrent hashmaps instead of locks.
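
A rough sketch of the difference (the class shapes below are illustrative, not the driver's actual code): a synchronized getter serializes every reader on a single monitor, while a ConcurrentHashMap allows lock-free reads:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only -- not the driver's actual classes.
class SynchronizedCacheSketch<K, V> {
    private final Map<K, V> map = new HashMap<>();
    // Every get() from every thread contends on the same monitor,
    // which shows up as long lock wait times under load.
    public synchronized V get(K key) { return map.get(key); }
    public synchronized void put(K key, V value) { map.put(key, value); }
}

class ConcurrentCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    // Reads are lock-free; writers only contend within a hash bin.
    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
}
```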

snapshot build

However, as you mentioned, the issue persists: instead of calling get from ExpiringCache, we were just calling a different method at the same place in the code.

[screenshot: profiler output showing the new lock hotspot]

main

To resolve another issue, we decided to only update the topology for mission-critical method calls. This change significantly reduced the number of calls to the CacheMap. While profiling, we also noticed another area of improvement; we will be looking into that.
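
As I understand the change, the idea resembles the following sketch (all names here are hypothetical, not the driver's actual API): refresh the topology only on connection-lifecycle events and serve a cached copy on the hot path:

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of gating topology refreshes; names are made up.
class GatedTopologySketch {
    private final Supplier<List<String>> fetchFromDb; // the expensive refresh
    private volatile List<String> cached = List.of();

    GatedTopologySketch(Supplier<List<String>> fetchFromDb) {
        this.fetchFromDb = fetchFromDb;
    }

    // Called only on "mission critical" events such as connect or failover.
    List<String> refresh() {
        cached = fetchFromDb.get();
        return cached;
    }

    // Called on the hot path (e.g. per statement): no locking, no DB round trip.
    List<String> hosts() {
        return cached;
    }
}
```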

[screenshot: profiler output after the fix]

We will be closing this ticket now. We appreciate the feedback and the sample application, which aided in root-causing this. Please let us know if there is anything else that we can provide support for while your team evaluates the driver.

@karenc-bq karenc-bq removed the Investigating Under investigation label Mar 15, 2023
@zromano
Author

zromano commented Mar 15, 2023

Awesome, glad to hear that and glad I could help!

Thanks for fixing this 😄
