In several places, MySQL represents binlog offsets within a file
as a 32-bit unsigned integer. Most notable for our purposes are
the `log_pos` field of the binlog event header and the offset
argument to the `COM_BINLOG_DUMP` command.

The upshot of this is that we can't necessarily trust the offset
to be correct when a file grows past 4GB, and even if we tracked
the "full offset" ourselves we wouldn't be able to resume from
there after a connector restart.
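
To make the truncation concrete, here is a minimal sketch of how a true file position would appear once squeezed into a 32-bit field. The `wireOffset` helper is hypothetical and exists only for illustration; it is not part of MySQL or of this connector.

```go
package main

import "fmt"

// wireOffset is a hypothetical helper modelling how a true 64-bit file
// position appears in a 32-bit field such as `log_pos`: only the low
// 32 bits survive the conversion.
func wireOffset(trueOffset uint64) uint32 {
	return uint32(trueOffset)
}

func main() {
	const fourGiB = uint64(1) << 32
	for _, off := range []uint64{fourGiB - 1, fourGiB, fourGiB + 4096} {
		fmt.Printf("true=%d wire=%d\n", off, wireOffset(off))
	}
	// Prints:
	// true=4294967295 wire=4294967295
	// true=4294967296 wire=0
	// true=4294971392 wire=4096
}
```
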
This normally isn't an issue because binlog files are never
supposed to grow that large. The system setting `max_binlog_size`,
which governs the point after which the file is rotated, has a
maximum possible value of just 1GB. The problem is that this is a
soft limit: MySQL never splits a transaction across binlog files,
so a large enough transaction can force arbitrarily large amounts
of data into a single file. So we need to handle that situation
as gracefully as possible.

This commit implements that handling. It detects binlog offset
overflow whenever an event's `log_pos` header value is smaller
than the prior cursor position. This is reliable because there's
also a 1GB cap on the size of any single event, and unlike the
binlog size setting this one's actually a hard maximum, so a
single event can never wrap the 32-bit offset all the way past
the prior position. Once overflow is detected, an "offset
overflow" state flag is set which prevents us from emitting any
further checkpoints until after the next binlog rotation.
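
The detection rule can be sketched roughly as follows. The `offsetTracker` type and its method names are hypothetical, chosen for illustration rather than taken from the connector. The sketch also assumes every event carries a meaningful `log_pos`, which a real implementation would need to verify.

```go
package main

import "fmt"

// offsetTracker is a hypothetical illustration of the detection rule,
// not the connector's actual type.
type offsetTracker struct {
	cursor     uint32 // position taken from the most recent event's log_pos
	overflowed bool   // set once log_pos is seen to go backwards in this file
}

// observe records an event's log_pos and reports whether it is still safe
// to emit a checkpoint at that position.
func (t *offsetTracker) observe(logPos uint32) bool {
	if logPos < t.cursor {
		// A position smaller than the previous one means the 32-bit
		// offset wrapped around within the current file.
		t.overflowed = true
	}
	t.cursor = logPos
	return !t.overflowed
}

// rotate resets the state when the server moves on to a new binlog file.
func (t *offsetTracker) rotate(startPos uint32) {
	t.cursor = startPos
	t.overflowed = false
}

func main() {
	t := &offsetTracker{cursor: 4}
	fmt.Println(t.observe(4294966000)) // true: offsets still increasing
	fmt.Println(t.observe(1000))       // false: log_pos went backwards
	fmt.Println(t.observe(2000))       // false: flag persists until rotation
	t.rotate(4)
	fmt.Println(t.observe(500)) // true: new file, checkpoints resume
}
```
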
However there is one other place where we use binlog offsets,
and that's as part of the `/_meta/source/cursor` field. This
field is used as the fallback collection key for keyless tables,
so it matters that it be at least roughly correct, though strictly
speaking it only needs to be properly ordered and unique. We
handle this by maintaining a u64 "estimated offset"
which is advanced based on event sizes instead of `log_pos`
values after offset overflow occurs within the current file.
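
A rough sketch of the estimated-offset idea, under the assumption that each event's size is available from its header. The `estimator` type and `advance` method are hypothetical names used only for illustration, not the connector's actual code.

```go
package main

import "fmt"

// estimator is a hypothetical illustration of the "estimated offset" idea.
type estimator struct {
	estimated  uint64 // 64-bit position used for ordering within the file
	overflowed bool   // set once log_pos has wrapped within the current file
}

// advance consumes one event's log_pos and size and returns the updated
// estimated offset. Before overflow the estimate simply mirrors log_pos;
// afterwards it is advanced by event size instead.
func (e *estimator) advance(logPos, eventSize uint32) uint64 {
	if !e.overflowed && uint64(logPos) < e.estimated {
		e.overflowed = true
	}
	if e.overflowed {
		e.estimated += uint64(eventSize)
	} else {
		e.estimated = uint64(logPos)
	}
	return e.estimated
}

func main() {
	e := &estimator{estimated: 4}
	fmt.Println(e.advance(4294967000, 1000)) // 4294967000: log_pos still trusted
	fmt.Println(e.advance(204, 500))         // 4294967500: wrapped, add event size
	fmt.Println(e.advance(304, 100))         // 4294967600: stays ordered and unique
}
```

Because the estimate only ever increases within a file, values derived from it stay ordered and unique even though `log_pos` itself has wrapped.
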
It's not exactly feasible to reproduce the edge case this fixes
on demand within the confines of a CI build, so there is no new
test case accompanying these changes. We'll have to content
ourselves with CI tests showing this doesn't break anything
when overflow doesn't occur, and the real test will come when
this happens again in production. We'll be able to tell, because
a warning message is logged whenever overflow is detected.