Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for AWS RDS MySQL Multi-AZ Cluster auto-discovery #4406

Merged
merged 6 commits into from
Apr 1, 2024

Conversation

anphucbui
Copy link
Contributor

Adding support for AWS RDS MySQL Multi-AZ Cluster auto-discovery

@mirostauder
Copy link
Collaborator

Automated message: PR pending admin approval for build testing

@renecannao
Copy link
Contributor

add to whitelist

anphucbui and others added 4 commits February 2, 2024 16:46
Copy link
Collaborator

@JavierJF JavierJF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several points that requires addressing / discussing before merging.


// Add the new servers if any
if (!new_servers.empty()) {
int successfully_added_all_servers = MyHGM->add_discovered_servers_to_mysql_servers_and_replication_hostgroups(new_servers);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we could simplify this error handling, there is no relevant exceptions that can be thrown in the inner code. So, better handling and logging any potential errors inside, and remove the exception safeguards. Only standard containers are used, and in the event of an exception there, it's better to threat that exception as an assert as it would most likely be a programmatic error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the outer exception handling

uint16_t port = std::get<1>(s);
long int hostgroup_id = std::get<2>(s);

// Add the discovered server with default values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a newer abstraction that can be used for server placement MySQL_HostGroups_Manager::create_new_server_in_hg. Is there any motivation not to use it? If not, would be preferred using it for this case. If there is a limitation in the interface it may be simpler to fix it that duplicating code. For reference, here it's an usage of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will make use of the new abstraction. For the third field of srv_info_t struct, the kind field, what should I use? I will default to AWS RDS string, but let me know if it should be something else.

purge_mysql_servers_table();
mydb->execute("DELETE FROM mysql_servers");
proxy_debug(PROXY_DEBUG_MYSQL_CONNPOOL, 4, "DELETE FROM mysql_servers\n");
generate_mysql_servers_table();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block, starting at previous if is missing the update on the global checksums after this line. This section can be taken as an example update_group_replication_add_autodiscovered. Here we can see that the following part is missing:

		// Update the global checksums after 'mysql_servers' regeneration
		{
			unique_ptr<SQLite3_result> resultset { get_admin_runtime_mysql_servers(mydb) };
			string mysrvs_checksum { get_checksum_from_hash(resultset ? resultset->raw_checksum() : 0) };
			save_runtime_mysql_servers(resultset.release());
			proxy_info("Checksum for table %s is %s\n", "mysql_servers", mysrvs_checksum.c_str());

			pthread_mutex_lock(&GloVars.checksum_mutex);
			update_glovars_mysql_servers_checksum(mysrvs_checksum);
			pthread_mutex_unlock(&GloVars.checksum_mutex);
		}

Checksum re-computation is mandatory in this case after re-generating mysql_servers_table. But in this case there are a couple of extra operations that are required:

			// Update the global checksums after 'mysql_servers' regeneration
			{
				unique_ptr<SQLite3_result> resultset { get_admin_runtime_mysql_servers(mydb) };
				string mysrvs_checksum { get_checksum_from_hash(resultset ? resultset->raw_checksum() : 0) };
				save_runtime_mysql_servers(resultset.release());

				// PROPOSAL: Required to update the runtime_mysql_servers checksum with the new checksum, in a
				// similar way it's done during `commit` via `commit_update_checksum_from_mysql_servers`
	    	    uint64_t raw_checksum = this->runtime_mysql_servers ? this->runtime_mysql_servers->raw_checksum() : 0;
	    	    table_resultset_checksum[HGM_TABLES::MYSQL_SERVERS] = raw_checksum;

				// PROPOSAL: Required to update with the new 'mysql_group_replication_hostgroups'; this is
				// only required for preserving coherence in the checksums, otherwise they would be
				// inconsistent with `commit` generated checksums. This should be reworked into a function.
				{
					SpookyHash rep_hgs_hash {};
					bool init = false;
					uint64_t servers_v2_hash = table_resultset_checksum[HGM_TABLES::MYSQL_SERVERS_V2];

					if (servers_v2_hash) {
						if (init == false) {
							init = true;
							rep_hgs_hash.Init(19, 3);
						}
				
						rep_hgs_hash.Update(&servers_v2_hash, sizeof(servers_v2_hash));
					}

					CUCFT1(
						rep_hgs_hash, init, "mysql_replication_hostgroups", "writer_hostgroup",
						table_resultset_checksum[HGM_TABLES::MYSQL_REPLICATION_HOSTGROUPS]
					);
				}

				proxy_info("Checksum for table %s is %s\n", "mysql_servers", mysrvs_checksum.c_str());

				pthread_mutex_lock(&GloVars.checksum_mutex);
				update_glovars_mysql_servers_checksum(mysrvs_checksum);
				pthread_mutex_unlock(&GloVars.checksum_mutex);
			}

The comments with PROPOSAL mark the proposed changes to the previously mentioned required block. All this code is a proposal, and has not being tested.

proxy_debug(PROXY_DEBUG_MYSQL_CONNPOOL, 4, "DELETE FROM mysql_servers\n");
generate_mysql_servers_table();
update_table_mysql_servers_for_monitor(false);
rebuild_hostname_hostgroup_mapping();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the previous suggested changes, this function should be replaced by update_hostgroup_manager_mappings. The two functions are identical, only differing in the missing check on hgsm_mysql_servers_checksum in the new rebuild_hostname_hostgroup_mapping function. Updating this checksums is addressed in the previous comment, as it's better for keeping general consistency. If I have not missed anything, I suggest deleting this new function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the new rebuild_hostname_hostgroup_mapping and using update_hostgroup_manager_mappings instead

add(mysrvc, hostgroup_id);

proxy_info(
"Adding new discovered server %s:%d with: hostgroup=%d, weight=%ld, max_connections=%ld, use_ssl=%d\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following hostgroup=%d should be replaced with hostgroup=%ld as hostgroup_id is a long int, or remove the unnecessary casting into long int from the original int that is received inside the tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this logging statement here as create_new_server_in_hg already has the same logging statement

@@ -1092,6 +1092,9 @@ class MySQL_HostGroups_Manager {
void set_server_current_latency_us(char *hostname, int port, unsigned int _current_latency_us);
unsigned long long Get_Memory_Stats();

int add_discovered_servers_to_mysql_servers_and_replication_hostgroups(vector<tuple<string, int, int>> new_servers);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to pass the vector by value, also, because it's never modified, could be easily converted into const vector<tuple<string, int, int>>&.

@@ -442,6 +447,7 @@ class MySQL_Monitor {
static bool update_dns_cache_from_mysql_conn(const MYSQL* mysql);
static void trigger_dns_cache_update();

void process_discovered_topology(const std::string& originating_server_hostname, vector<MYSQL_ROW> discovered_servers, int reader_hostgroup);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if MYSQL_ROW are just pointers, it's a good practice to make the second argument const vector<MYSQL_ROW>&.

long val = strtol(s, &temp, 0);

if (temp == s || *temp != '\0' || ((val == LONG_MIN || val == LONG_MAX) && errno == ERANGE)) {
throw std::runtime_error("Could not parse long.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to throw an exception, error should be handled immediately, since this function is only used when parsing data received by the backend server being monitored. See the comment where this function is called.

Copy link
Contributor Author

@anphucbui anphucbui Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this method entirely and changed the caller to throw a proxy_error message and return early if an exception is detected during string conversion to int

string current_discovered_port = s[3];

if (find(saved_hostnames.begin(), saved_hostnames.end(), current_discovered_hostname) == saved_hostnames.end()) {
tuple<string, int, int> new_server(current_discovered_hostname, parseLong(current_discovered_port.c_str()), reader_hostgroup);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of parseLong here together with the exception model doesn't improve the handling of a malformed response. There are other places in which the response from the backend is assumed not to be malformed, for example, in the number of rows returned above. If handling a invalid value for that response is desired, is much better to handle the case here, parsing the value into a local, and in case of failure, place a proxy_error with the details of the invalid row received. For safety, would be best to either clear the new_servers or return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the exception handling here as we can rely on the other places where backend is assumed not to be malformed.


// Process the discovered servers and add them to 'runtime_mysql_servers'
if (!discovered_servers.empty()) {
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the above comments about exceptions generated inside the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the outer exception handling

…putation after adding a new server, and other small changes based on feedback
@JavierJF JavierJF changed the base branch from v2.x to 2.x-AWS_RDS_autodiscovery April 1, 2024 13:53
Copy link
Collaborator

@JavierJF JavierJF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @anphucbui,

thanks for the PR and for addressing the previous feedback. There are a few extra changes required, but since they are minor in term of code changes, I will be merging the PR into a dedicated branch and addressing them. I'm leaving the feedback here only for documenting purposes.

Thanks, Javier.

* @param new_servers A vector of tuples where each tuple contains the values needed to add each new server.
*/
void MySQL_HostGroups_Manager::add_discovered_servers_to_mysql_servers_and_replication_hostgroups(vector<tuple<string, int, int>>& new_servers) {
int added_new_server;
Copy link
Collaborator

@JavierJF JavierJF Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uninitialized variable, fine right now if the function isn't called with an empty list, could misbehave otherwise, better to initialize with -1.

@@ -1092,6 +1092,8 @@ class MySQL_HostGroups_Manager {
void set_server_current_latency_us(char *hostname, int port, unsigned int _current_latency_us);
unsigned long long Get_Memory_Stats();

void add_discovered_servers_to_mysql_servers_and_replication_hostgroups(vector<tuple<string, int, int>>& new_servers);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous comment not addressed, should be const.

*/
void MySQL_HostGroups_Manager::add_discovered_servers_to_mysql_servers_and_replication_hostgroups(vector<tuple<string, int, int>>& new_servers) {
int added_new_server;
wrlock();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing lock on Admin GloAdmin->mysql_servers_wrlock(), since get_admin_runtime_mysql_servers is used, Admin should be locked, in other places were both locks needs to be taken, we take them together for simplicity. See MySQL_HostGroups_Manager::update_aws_aurora_set_reader.

try {
current_discovered_port_int = stoi(s[3]);
} catch (...) {
proxy_error("Unable to parse the port value during topology discovery: [%s]. Terminating discovery early.", current_discovered_port_string);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline character \n in error message. Also, C string not being passed (char*) instead, a std::string is feed, this is undefined behavior at best, potential crash at worst.

@JavierJF JavierJF merged commit 5d7e5d9 into sysown:2.x-AWS_RDS_autodiscovery Apr 1, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants