Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: master coordinator with aiokafka #880

Merged
merged 2 commits into from
Jun 3, 2024

Conversation

jjaakola-aiven
Copy link
Contributor

@jjaakola-aiven jjaakola-aiven commented May 17, 2024

About this change - What it does

Change the coordinator to use aiokafka.

Why this way

In cases when Kafka brokers are replaced with new brokers on new servers the kafka-python group coordinator can get stuck in a perpetual DNS failure loop. The exact thread getting stuck is the heartbeat thread of Kafka client. This condition requires Karapace restart. See also wbarnha/kafka-python-ng#36.

Work to remove the kafka-python as a dependency is also progressing and this is one more step and provides async capable functionality. The rdkafka has group coordinator but it is not exposed from confluent-kafka Python binding and cannot be used for Karapace primary selection coordinator at this time. The work required to have the group coordination exposed from rdkafka is a future item to investigate.

The primary coordinator is adapted from aiokafka group coordinator for Karapace. Required changes include removing of subscription handling and partition assignors, adding Karapace specific metadata to group and selecting the primary instance. Note that Karapace can join as follower to the group but selected as primary instance.

The implementation removes the primary coordinator thread and primary coordinator is run in the application event loop.

@jjaakola-aiven jjaakola-aiven force-pushed the jjaakola-aiven-master-coordinator-with-aiokafka branch 8 times, most recently from 63af623 to cbe5020 Compare May 17, 2024 18:49
@jjaakola-aiven jjaakola-aiven changed the title Jjaakola aiven master coordinator with aiokafka feature: master coordinator with aiokafka May 18, 2024
@jjaakola-aiven jjaakola-aiven force-pushed the jjaakola-aiven-master-coordinator-with-aiokafka branch 6 times, most recently from d492cc8 to 7b3e299 Compare May 20, 2024 12:40
@jjaakola-aiven jjaakola-aiven marked this pull request as ready for review May 21, 2024 05:14
@jjaakola-aiven jjaakola-aiven requested review from a team as code owners May 21, 2024 05:14
Copy link
Contributor

@eliax1996 eliax1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big and complex pr :) need to take a look again

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: copyright?

group_generation_id=generation if generation is not None else -1,
)

def get_master_info(self) -> tuple[bool | None, str | None]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/optional: maybe a class like this?

@dataclasses.dataclass(frozen=True, kwonly=True)
class MasterInfo:
   this_node_is_master: bool
   master_url: str

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note also the @default_dataclass helper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gives you frozen + kw_only, and slots on Python 3.10+.

master_identity: MemberIdentity


@dataclass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frozen?

member_data: bytes


@dataclass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frozen?

}


@dataclass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frozen?

karapace/coordinator/schema_coordinator.py Show resolved Hide resolved
get_member_url(member_identity["scheme"], member_identity["host"], member_identity["port"])
] = (member.member_id, member.member_data)
if len(urls) > 0:
chosen_url = sorted(urls, reverse=self.election_strategy.lower() == "highest")[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we validate the election_strategy to an enum? And have the default to the "lowest"?
If we select the highest this means that each kafka node up and running (with the highest broker id) could become running?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a next refactor

chosen_url = sorted(urls, reverse=self.election_strategy.lower() == "highest")[0]
schema_master_id, member_data = urls[chosen_url]
else:
# Protocol guarantees there is at least one member thus if urls is empty, fallback_urls cannot be
Copy link
Contributor

@eliax1996 eliax1996 May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the error of an assert, otherwise if for any weird reason this assumption its break we have no clue of why something weird its happening:
assert len(urls)+len(fallback_urls)>0, "Protocol guarantees there is at least one member thus if urls is empty, fallback_urls cannot be"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjaakola-aiven what about this?

karapace/coordinator/schema_coordinator.py Show resolved Hide resolved
karapace/coordinator/schema_coordinator.py Show resolved Hide resolved
@jjaakola-aiven jjaakola-aiven dismissed eliax1996’s stale review May 23, 2024 10:54

Addressed comments.

@jjaakola-aiven jjaakola-aiven force-pushed the jjaakola-aiven-master-coordinator-with-aiokafka branch 2 times, most recently from 94bc884 to c0c07fa Compare June 3, 2024 05:50
@jjaakola-aiven jjaakola-aiven force-pushed the jjaakola-aiven-master-coordinator-with-aiokafka branch from c0c07fa to f7bd79c Compare June 3, 2024 08:39
Copy link
Contributor

@eliax1996 eliax1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lets merge it

@eliax1996 eliax1996 merged commit fc011ba into main Jun 3, 2024
8 checks passed
@eliax1996 eliax1996 deleted the jjaakola-aiven-master-coordinator-with-aiokafka branch June 3, 2024 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants