Bulk unique insertion, uniqueness with subset of args #590

Merged 6 commits into master on Sep 22, 2024

Conversation

@bgentry (Contributor) commented on Sep 15, 2024

Background

Our first two attempts at unique insertions have been limited:

  1. The initial version relied on advisory locks, which were slow enough and prone enough to conflicts that we restricted them to the single-job insertion path. They were fairly flexible, allowing full customization of which states should be considered in the uniqueness check.

  2. The second version of unique jobs brought significantly improved performance by using a database unique index on the hash of a computed unique key. However, it was less flexible: that code path was only used when the user did not customize the list of unique states beyond our default set (everything but cancelled and discarded).

    River's queries were hardcoded to clear out the unique_key value when a job transitioned into one of those excluded states. If a user wanted to ensure that a job is never duplicated even when there's already an identical job in one of those unsupported states, River would fall back to the slower/older advisory lock uniqueness path.

    This feature was also only available when inserting one job at a time, and because its unique index was prefixed with kind, it couldn't be used across job kinds.

An improved approach

In this PR, I've proved out an improved design for unique jobs that I believe addresses all of the shortcomings of the previous designs, restoring full flexibility while keeping the performance benefits of the database-level unique constraint. The new design avoids design 2's need for a hardcoded list of allowed states by moving the list of unique states into the database on a per-row basis.

I initially looked at doing this with a river_job_state[] column, but that was both verbose and not particularly space-efficient. Instead I landed on a bit(8) bitmask column, which is included as part of the compound partial index for uniqueness. A custom SQL function allows that unique index to apply only to jobs whose state is within the allowed states reflected in the bitmask, so when a job moves to a non-unique state it is no longer considered in uniqueness guarantees. (Side note: GPT-o1 was actually quite helpful in ironing out some aspects of this design and helping me to reason about whether I had considered everything. It's super impressive ✨)
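
To illustrate the encoding, here's a minimal Go sketch (the helper name is hypothetical, not the actual implementation) that packs a list of unique states into a byte suitable for the bit(8) column. The bit positions mirror the river_job_state_in_bitmask SQL function quoted further down, where Postgres numbers bits left-to-right:

package main

import (
	"fmt"

	"github.com/riverqueue/river/rivertype"
)

// Postgres' get_bit numbers positions left-to-right, so 'scheduled' occupies
// the leftmost bit (position 0) and 'available' the rightmost (position 7).
var bitPositions = map[rivertype.JobState]uint{
	rivertype.JobStateAvailable: 7,
	rivertype.JobStateCancelled: 6,
	rivertype.JobStateCompleted: 5,
	rivertype.JobStateDiscarded: 4,
	rivertype.JobStatePending:   3,
	rivertype.JobStateRetryable: 2,
	rivertype.JobStateRunning:   1,
	rivertype.JobStateScheduled: 0,
}

// uniqueStatesToBitmask is a hypothetical helper that packs the unique states
// into a single byte for storage in a bit(8) unique_states column.
func uniqueStatesToBitmask(states []rivertype.JobState) byte {
	var mask byte
	for _, state := range states {
		if pos, ok := bitPositions[state]; ok {
			// Convert the left-to-right Postgres position into a shift amount.
			mask |= 1 << (7 - pos)
		}
	}
	return mask
}

func main() {
	defaults := []rivertype.JobState{
		rivertype.JobStatePending, rivertype.JobStateAvailable,
		rivertype.JobStateScheduled, rivertype.JobStateRunning,
		rivertype.JobStateRetryable, rivertype.JobStateCompleted,
	}
	fmt.Printf("%08b\n", uniqueStatesToBitmask(defaults)) // prints 11110101
}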

Importantly, this design does not rely on advisory locks in any way, and in this PR the new path has been expanded to also work on the bulk insertion paths. Additionally, if for some reason users wish to ensure uniqueness across different job kinds, they can do that here using the ExcludeKind field on UniqueOpts. kind is no longer part of the new unique index to allow for this flexibility.

Another nicety: this gets us very close to having a unified code path for all inserts without any caveats depending on how the job was inserted. 🔜 We should be able to get there as soon as we remove the old deprecated unique code paths.

Restrictions on state

This approach can be made to work for a totally custom list of states. However, there are some complexities and undefined edge cases in doing that, particularly around allowing people to remove pending, scheduled, available, or running from the list. To reason this through, I prepared a mermaid state transition diagram (included in the PR) which I believe illustrates all possible job state transitions. On a printed version of this I was able to categorize all the state changes and note which queries are responsible for them.

The analysis of this is verbose, but if you're intrigued feel free to read it through.

Detailed analysis of states list and reasons for customizing

Default unique states

Since we first introduced unique jobs, this has been the default list of states that are checked for uniqueness:

  • Pending
  • Available
  • Scheduled
  • Running
  • Retryable
  • Completed

These states ensure that uniqueness is enforced unless a job has been explicitly cancelled or has exhausted its errors.

The omission of discarded and cancelled makes sense as a default (those states indicate something has gone wrong and you may want to be able to retry with a new job). However, that's not the only scenario users care about.
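
For reference, a job could spell this default set out explicitly via UniqueOpts.ByState (a sketch assuming the river and rivertype packages are imported; leaving ByState unset gives the same behavior):

uniqueOpts := river.UniqueOpts{
	ByState: []rivertype.JobState{
		rivertype.JobStatePending,
		rivertype.JobStateAvailable,
		rivertype.JobStateScheduled,
		rivertype.JobStateRunning,
		rivertype.JobStateRetryable,
		rivertype.JobStateCompleted,
	},
}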

Reasons for customizing this list

For the various states, there are reasons someone might want to add or remove a state from the list. I’ll go through each state here and talk about why someone might want to add/remove it from the list, or what complications that would introduce.

Pending

Included by default. I can't think of a use case that would benefit from removing this from the list.

If you remove it, we could have trouble moving jobs out of pending into available or scheduled (if those features are enabled). The workflow/sequence queries would need to be able to skip rows that have conflicts, but those rows would just keep conflicting and being skipped on every run.

Scheduled

Included by default. I can't think of a use case that would benefit from removing this from the list.

Complicated to allow removal. If you allow multiple dupes to be in scheduled state, but block them from proceeding to available, then our scheduler maintenance process (or workflow scheduler) wouldn't be able to advance those jobs when it's time for them to run.

Available

Included by default. The main use case I can think of to remove this and earlier states from the unique list is to facilitate a poor man's version of global concurrency limiting (limited to 1).

Dangerous to remove, because we can get multiple jobs in the available state but then only one of them can move to running without a conflict. Allowing this would put a lot of stress on the job fetch query, which is the absolute last place we want to deal with repeated conflicts and row rewrites.

Running

Included by default. I can't think of a use case that would benefit from removing this from the list.

It's dangerous to remove, because then you can insert and start working duplicate jobs, only to be unable to complete/retry/discard them (unless those other states are removed also).

Retryable

Included by default. A user may wish to allow a job to try only once, enforcing uniqueness up until the point at which it has errored, after which they may want to allow dupes.

If removed, the scheduler may have trouble moving a retryable job back to available. That could be dealt with by having the scheduler query mark the row as discarded if it can't be moved back to available due to a unique conflict.

Completed

Included by default. Similar to cancelled and discarded it may be desirable to remove this if you want to ensure there is only one “live” job at a time in some non-finalized state.

No major downsides to removing it.

Cancelled (opt in)

Use case: I want to ensure that there is only one job of a given kind for a timeframe or set of args, even if it gets cancelled.

If you allow earlier states to be removed from the list, the cancellation and completer queries would need to handle this situation somehow and I'm not sure there's a "right" answer for what to do in that case.

Discarded (opt in)

Use case: I want to ensure that there is only one job of a given kind for a timeframe or set of args, even if it runs out of errors. This makes sure that job is never ever duplicated.

If you allow earlier states to be removed from the list, the cancellation and completer queries would need to handle this situation somehow and I'm not sure there's a "right" answer for what to do in that case.

A solution?

Most problems are resolved with one rule: don't allow removing the pending, available, scheduled, and running states from the list. Here's an analysis of the other states given this rule:

If one removes retryable, there can be dupes in the retryable state that can’t all be moved back to another state. Manual JobRetry can just error in this case, whereas automated scheduling of retries can move the job to discarded if a conflict occurs.

If one adds cancelled to the default list, that does not have any adverse impact, because there can be no dupes in any of the default states plus cancelled. You will always still be able to retry that cancelled job, because none of the states it will go into can be occupied by a duplicate job.

Likewise, adding discarded is not an issue for the same reasons.

Finally, there’s no issue with removing completed, other than somebody being blocked from manually retrying an already-completed job. In that case we can just bubble up the error to the user.

In short, I think it would make sense to restrict users from removing these 4 initial states from their UniqueOpts.ByState. I'm not sure of the best UX for that. We could simply error when you try to insert a job that breaks that rule (at least when using the multi insert / non-advisory code path). Or we could use a different way of customizing states besides a list.
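
As a sketch of the first option (erroring at insert time), validation could look something like this (all names here are hypothetical, and fmt/rivertype imports are assumed):

// requiredUniqueStates are the states proposed above as non-removable.
var requiredUniqueStates = []rivertype.JobState{
	rivertype.JobStatePending,
	rivertype.JobStateScheduled,
	rivertype.JobStateAvailable,
	rivertype.JobStateRunning,
}

// validateUniqueStates would run at insert time on a customized ByState list.
func validateUniqueStates(byState []rivertype.JobState) error {
	if len(byState) == 0 {
		return nil // empty means "use the defaults", which already include the required states
	}
	present := make(map[rivertype.JobState]struct{}, len(byState))
	for _, state := range byState {
		present[state] = struct{}{}
	}
	for _, required := range requiredUniqueStates {
		if _, ok := present[required]; !ok {
			return fmt.Errorf("unique states must include %q", required)
		}
	}
	return nil
}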

Bonus: opt in only specific args to uniqueness

Since unique jobs were first introduced, we've heard from users who want to consider only a subset of their job args when determining uniqueness. For example, I might want to make sure there is exactly one job of a particular type for a given customer_id, even though there may also be other args on the job. Once I had proved out the new design, I scratched an itch to spike out this feature.

As implemented here, when a job has been opted into uniqueness via UniqueOpts with ByArgs, the user can optionally add a river:"unique" struct tag to the specific fields they want considered in uniqueness. For example, consider this struct:

type EmailJobArgs struct {
	EmailID   string `json:"email_id" river:"unique"`
	Recipient string `json:"recipient" river:"unique"`
	Subject   string `json:"subject"`
	Body      string `json:"body"`
}
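
To opt such a job into args-based uniqueness, it still needs UniqueOpts with ByArgs set, e.g. via InsertOpts (a sketch; the "email" kind here is just an example value):

func (EmailJobArgs) Kind() string { return "email" }

func (EmailJobArgs) InsertOpts() river.InsertOpts {
	return river.InsertOpts{
		UniqueOpts: river.UniqueOpts{ByArgs: true},
	}
}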

For unique jobs of this type, the uniqueness constraint will be applied to the following JSON:

{"email_id":"12345","recipient":"[email protected]"}

JSON field names are respected as well if customized in the struct tags, and fields that are omitted from the encoded JSON will also be omitted from the uniqueness JSON. These arg keys are sorted alphabetically before being built into the uniqueness JSON to ensure a consistent result.
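
As a rough sketch of how that subset payload could be produced with gjson/sjson (which this PR pulls in), here's a hypothetical helper that receives the already-sorted list of tagged field names (not the actual dbunique implementation):

func buildUniqueArgsJSON(encodedArgs []byte, sortedUniqueFields []string) ([]byte, error) {
	out := []byte(`{}`)
	for _, field := range sortedUniqueFields {
		value := gjson.GetBytes(encodedArgs, field)
		if !value.Exists() {
			// Fields omitted from the encoded args are also omitted from the
			// uniqueness JSON.
			continue
		}
		var err error
		out, err = sjson.SetBytes(out, field, value.Value())
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}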

Transition Plan

While proving out this design, I decided to avoid getting slowed down with legacy compatibility and stripped out the old uniqueness systems entirely. However, I don't think we can just merge it without considering how existing apps will transition between the systems, ideally without any downtime.

Also, the unique key calculation is necessarily different here than in the v2 design, because kind is now a part of the unique key instead of part of the index. So we can't easily migrate in-flight jobs from the v2 design to the v3 design. And v1 unique jobs don't even have a unique_key populated at all, so they can't be transitioned cleanly either.

I have some ideas:

  • Maintain both the v1 and v2 code paths temporarily (say, for one final v0.x.y release series).
  • Retain the bits of SQL queries which clear the unique_key value when cancelling or discarding jobs, but only do so for rows with unique_states IS NULL. This will allow v2 unique jobs to continue being processed in the same way.
  • For new jobs which specify a list of unique states that don't include all the required ones shown above (pending, scheduled, available, running), continue to fall back to the v1 design but log a deprecation warning. Make sure this legacy mode only works for the single-job insert paths and is explicitly blocked/errored when using InsertMany.
  • New jobs with a compatible set of states will use the v3 path and have both unique_key and unique_states populated. No new jobs will be inserted with v2 mode because v3 is strictly a superset of it.

Remaining issues

  • Decide how to implement the restricted list of unique states and refine its UX

  • Query changes to fix scheduler conflicts if retryable is removed from unique states (job won't be able to retry and must be discarded). Need tests for this too.

  • Implementation changes to account for migration plan (restore deleted code)

  • Rename unique:"true" struct tag to river:"unique" for cleaner, more forward-compatible API design

  • Tests to ensure that jobs using the old unique designs continue to function, for now.

  • Go documentation

  • Website documentation

  • Consistent alphabetical sort order for all encoded args (or else reordering struct fields causes uniqueness changes)

  • Migration framework and tests don't play nicely with this latest migration. The reason is that migration 4 adds a pending state to the river_job_state enum, and this latest migration 6 attempts to use that value within an immutable SQL function. This results in the following error:

    ERROR: unsafe use of new value "pending" of enum type river_job_state (SQLSTATE 55P04)
    

    The cause of this is that our current migration system and its tests will attempt to run all available migrations in a single transaction without committing between each one. I believe this might be part of why other migration systems (like Rails/ActiveRecord) use a separate transaction for each individual migration rather than running them all in one.

    We may be able to solve this for compatible databases by opening a DDL subtransaction for each individual migration (as I've done in this PR to get tests passing and testdbman to work), but we should resolve that in a standalone PR before this one gets merged. In particular, the one test still failing in this PR is one that tries to roll back and then re-run all migrations within a single test transaction. Maybe you want to take a crack at a standalone PR for migration framework changes @brandur? Fixed in "Migrations: use a txn + commit for each migration, deprecate MigrateTx" (#600).

rivermigrate/river_migrate.go (outdated)
Comment on lines +20 to +27
// JobArgs is an interface that should be implemented by the arguments to a job.
// This definition duplicates the JobArgs interface in the river package so that
// it can be used in other packages without creating a circular dependency.
type JobArgs interface {
	// Kind returns a unique string that identifies the type of job. It's used to
	// determine which worker should work the job.
	Kind() string
}
@bgentry (author) commented on Sep 15, 2024:

Needed to use this from dbunique so I copied the interface into here. But then I ended up adding an unexported jobArgs within that package anyway—maybe that's better than exposing a duplicate in a user-facing package like this?

Edit: Actually, I did need this from other places as well. It has to be available as part of riverdriver.JobInsertFastParams as well as internal/dbunique.

Can keep it here in rivertype, or move to rivershared. Lmk what seems better.

internal/dbunique/unique_fields.go (outdated)
Comment on lines +18 to +19
github.com/tidwall/gjson v1.17.3
github.com/tidwall/sjson v1.2.5
@bgentry (author) commented:

gjson and sjson were super handy for being able to efficiently extract a subset of keys from a JSON payload and to reconstruct them into a new sorted subset payload. We could figure out a way to do this without them, but it would likely be slower (if using encoding/json) or a lot more verbose (if building up our own JSON bytes).

They're also super small dependencies with minimal indirects, so I feel like it's not a big deal to pull them in. Plus they're also used in riverpro.

// Advisory locks are currently only used for the fallback/slow path of
// unique job insertion where finalized states are included in a ByState
// configuration.
AdvisoryLockPrefix int32
@bgentry (author) commented:

Once we complete the deprecation and eventual removal of the advisory lock based unique jobs, would you want to just nuke this config altogether? It's possible we could want to use advisory locks again at some point but it'd be easy enough to re-add.

Comment on lines +118 to +119
JobInsertFast(ctx context.Context, params *JobInsertFastParams) (*JobInsertFastResult, error)
JobInsertFastMany(ctx context.Context, params []*JobInsertFastParams) ([]*JobInsertFastResult, error)
@bgentry (author) commented:

This relates to the API changes from #584 so we may want to align on the right change to make here.

Comment on lines +251 to +254
// Args contains the raw underlying job arguments struct. It has already been
// encoded into EncodedArgs, but the original is kept here to leverage its
// struct tags and interfaces, such as for use in unique key generation.
Args rivertype.JobArgs
@bgentry (author) commented:

Related to discussion from #584, this was added to facilitate tag reflection on the JobArgs struct, even though it's otherwise unused (EncodedArgs are already encoded at this point).

@@ -5,6 +5,7 @@ go 1.21
toolchain go1.23.0

require (
github.com/jackc/pgx/v5 v5.7.1
@bgentry (author) commented:

Hmm, I guess this is what I get for using pgtype.Bits for the bitmask column. I'd be shocked if pq properly supports that; thoughts on what to do?

Comment on lines +230 to +238
ON CONFLICT (unique_key)
WHERE unique_key IS NOT NULL
AND unique_states IS NOT NULL
AND river_job_state_in_bitmask(unique_states, state)
@bgentry (author) commented on Sep 15, 2024:

I wish there were a better way but AFAICT this is the only way to indicate the conflict given the specific partial index in use here. It can't be a named CONSTRAINT because it's a partial index.

Comment on lines +8 to +15
WHEN 'available' THEN get_bit(bitmask, 7)
WHEN 'cancelled' THEN get_bit(bitmask, 6)
WHEN 'completed' THEN get_bit(bitmask, 5)
WHEN 'discarded' THEN get_bit(bitmask, 4)
WHEN 'pending' THEN get_bit(bitmask, 3)
WHEN 'retryable' THEN get_bit(bitmask, 2)
WHEN 'running' THEN get_bit(bitmask, 1)
WHEN 'scheduled' THEN get_bit(bitmask, 0)
@bgentry (author) commented:

Per the note on the Go bitmask code, Postgres bit positions are numbered left-to-right instead of a more conventional right-to-left.

bgentry added a commit that referenced this pull request Sep 18, 2024
As detailed in #600, there are certain combinations of schema changes
which are not allowed to be run within the same transaction. The example
we encountered with #590 is adding a new enum value, then using it in an
immutable function during a subsequent migration. In Postgres, these
must be separated by a commit.

There are other examples of things which cannot be run in a transaction,
such as `CREATE INDEX CONCURRENTLY`. While that specific one isn't
solved here, moving away from a migrator that bundles migrations into a
single transaction will also allow us to update our migration system to
exclude certain migrations from transactions and, for example, add indexes
concurrently.
@bgentry bgentry changed the base branch from master to bg-migration-framework-ddl-tx-per-migration September 19, 2024 15:31
@bgentry bgentry force-pushed the bg-migration-framework-ddl-tx-per-migration branch from 3c932f6 to c0de0ae on September 20, 2024 18:27
@bgentry bgentry force-pushed the bg-bulk-unique-inserts branch 2 times, most recently from 825d9ea to c034b7c on September 20, 2024 20:03
client_test.go Outdated
Comment on lines 5301 to 5442
Included bool `json:"included" unique:"true"`
Excluded bool `json:"excluded"`
@bgentry (author) commented:

I had another design idea on this. Instead of unique:"true", what if this was something like river:"unique"? There would be room to apply additional comma-separated options in that list, such as for batching or sequence keys.

I think I like that more than having a bunch of separate tags with vague names like unique, batch, etc. Those could be prefixed with river_ of course but that'd be long and redundant, plus not a particularly efficient way to declare boolean attrs.

@bgentry (author) commented:

Changed to river:"unique" to better and more cleanly support future options. Seems less vague/generic too, which is nice.

@bgentry bgentry force-pushed the bg-bulk-unique-inserts branch 6 times, most recently from 1d857e1 to 0b2268d on September 21, 2024 18:49
@bgentry bgentry marked this pull request as ready for review September 21, 2024 18:55
@@ -139,6 +148,50 @@ func TestJobScheduler(t *testing.T) {
requireJobStateUnchanged(t, bundle.exec, retryableJob3) // still retryable
})

t.Run("MovesUniqueKeyConflictingJobsToDiscarded", func(t *testing.T) {
@bgentry (author) commented:

I realized I don't have corresponding test changes for the underlying query in the riverdrivertest, lmk if you think these details should be duplicated over there.

Comment on lines 423 to 454
+conflicting_jobs AS (
+    SELECT DISTINCT unique_key
+    FROM river_job
+    WHERE unique_key IN (
+        SELECT unique_key
+        FROM jobs_to_schedule
+        WHERE unique_key IS NOT NULL
+            AND unique_states IS NOT NULL
+    )
+        AND unique_states IS NOT NULL
+        AND river_job_state_in_bitmask(unique_states, state)
+),
 updated_jobs AS (
     UPDATE river_job
-    SET state = 'available'
-    FROM jobs_to_schedule
-    WHERE river_job.id = jobs_to_schedule.id
-    RETURNING river_job.id
+    SET
+        state = CASE WHEN cj.unique_key IS NULL THEN 'available'::river_job_state
+                ELSE 'discarded'::river_job_state END,
+        finalized_at = CASE WHEN cj.unique_key IS NOT NULL THEN @now::timestamptz
+                ELSE finalized_at END
+    FROM jobs_to_schedule jts
+    LEFT JOIN conflicting_jobs cj ON jts.unique_key = cj.unique_key
+    WHERE river_job.id = jts.id
+    RETURNING river_job.id, state = 'discarded'::river_job_state AS conflict_discarded
 )
-SELECT *
+SELECT
+    sqlc.embed(river_job),
+    updated_jobs.conflict_discarded
 FROM river_job
-WHERE id IN (SELECT id FROM river_job_scheduled);
+JOIN updated_jobs ON river_job.id = updated_jobs.id;
@bgentry (author) commented:

Ok, so this is one way of handling the scenario where somebody removes retryable from their ByStates list, and then inserts a new job while there's still a would-be dupe in the retryable state. Essentially I tried to make this one query able to detect and handle that case, moving the conflicting jobs to discarded rather than available.

Two questions:

  1. Instead of doing this within a larger query like this, we could attempt to detect this conflict in application code and make follow-up queries to (a) clear out/update/discard the conflicting jobs separately, and (b) try the JobSchedule query again.
  2. Should this set a metadata value on the discarded job so the transition is detectable?

Also open to other ways to handle this. If we wanted to allow other states to be removed from that list, it would involve similar handling and needing to define what the behavior would be when such a conflict arises.

@bgentry (author) commented:

Added a metadata value of "unique_key_conflict": "scheduler_discarded" so we can at least see when this case has been triggered.

bgentry added a commit that referenced this pull request Sep 22, 2024: Migrations: use a txn + commit for each migration, deprecate MigrateTx (#600)

This is partly extracted from #590. As detailed in that PR, there are certain
combinations of schema changes which are not allowed to be run within the same
transaction. The example we encountered with #590 is adding a new enum value,
then using it in an immutable function during a subsequent migration. In
Postgres, these must be separated by a commit.

The migration in that PR exposes an issue with our migration framework, which is
that there are certain schema changes which must be committed before they can be
referenced by subsequent schema changes. The example of this I found is that if
you add a new value to an enum, you cannot later create a function that relies
on that new enum value unless the new value has first been committed. Committing
it within a DDL transaction does _not_ count—it must be a full commit.

IMO this might mean that the entire idea of a `MigrateTx` API is not workable
with certain schema change combinations. At worst, it can result in
unpredictable failures depending on the exact sequence of changes and how many
migrations are being run at once.

As such, in this PR I've deprecated `MigrateTx` and adjusted the migrator so
that it opens a new transaction for each individual migration with a commit
between them. Migrator tests were changed to move away from `MigrateTx` and to a
setup where they get a new clean schema for each test that's disposed at the
end. This makes it possible to test the full sequence of database migrations
with a clean slate each time.

I believe this is the right long-term direction because it's the approach that
other migration libraries use (Rails/ActiveRecord, [Sequel][1], Goose, etc). It
also enables the potential for us to add the ability for individual migrations
to opt _out_ of having a wrapping transaction, which is essential if we ever
want our default to be `CREATE INDEX CONCURRENTLY` rather than synchronous
indexing (as it really should be for any at-scale system).

[1]: https://sequel.jeremyevans.net/rdoc/files/doc/migration_rdoc.html#label-Transactions
Base automatically changed from bg-migration-framework-ddl-tx-per-migration to master September 22, 2024 19:33
@bgentry (author) commented on Sep 22, 2024:

I believe this is all set. I've updated the changelog with notes on how this PR impacts users and what to watch out for during migration. Going to merge it and aim for a v0.12.0-rc.1 prerelease to shake out any issues.

@bgentry bgentry merged commit d977e10 into master Sep 22, 2024
14 checks passed
@bgentry bgentry deleted the bg-bulk-unique-inserts branch September 22, 2024 20:11
bgentry added a commit that referenced this pull request Sep 26, 2024
PR #590 introduced a more flexible unique job mechanism that allowed for the
unique states to be customized, all while still benefiting from the performance
improvements of using a unique index rather than an advisory lock. The retryable
state is included in the default list of states, but can be removed if the user
doesn't want to prevent an erroring job from blocking duplicate inserts.

However this created an issue: when trying to schedule a retryable job (move it
to available) it could potentially have a conflict with a duplicate unique job.
To handle this, special logic was added to try to deal with this scenario for
unique jobs, moving the conflicting row to discarded rather than available.
Unfortunately this logic had issues and was insufficiently tested. There were a
couple specific scenarios that caused issues:

1. A unique job that was being scheduled (because it was either inserted as
   scheduled or had errored and become retryable) would actually be considered a
   conflict with itself, because the query didn't properly exclude the row being
   scheduled.
2. Attempting to schedule two duplicate retryable unique jobs at the same time
   would lead to a unique conflict, because there was no mechanism to prevent
   this.

The query changes in this PR address both of the above cases, along with test
coverage.

The increased complexity is unfortunate, and we're probably nearing the limit of
what should be dealt with in a single SQL query. If this still isn't complete
I'm more inclined to fix the issues by catching these conflicts at the
application level, explicitly moving the conflicting row(s) to discarded, and
trying again. This can be looped with a backoff or recursed to ensure that
progress keeps being made as individual conflicts get resolved. But hopefully
that won't be necessary.

Fixes #618.