Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

There may be a problem with how we do DB migrations concurrently? #1491

Closed
nicksnyder opened this issue Dec 18, 2018 · 15 comments
Closed

There may be a problem with how we do DB migrations concurrently? #1491

nicksnyder opened this issue Dec 18, 2018 · 15 comments
Assignees
Labels
bug An error, flaw or fault that produces an incorrect or unexpected result, or behavior. release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases
Milestone

Comments

@nicksnyder
Copy link
Contributor

nicksnyder commented Dec 18, 2018

Note from Stephen: The issue here is caused by the fact that we now have two services in local dev trying to migrate the DB at the same time. This is in fact already happening in prod/dogfood, because multiple frontend instances are running. This could indicate something wrong with the way we do DB migrations concurrently but I don't know for sure.

Management console migration issue.

This happens first (when running ./dev/launch.sh)

09:34:13     management-console | 2018/12/18 09:34:13 pq: duplicate key value violates unique constraint "pg_type_typname_nsp_index" in line 0: CREATE TABLE "schema_migrations" (version bigint not null primary key, dirty boolean not null)

Then next time I run ./dev/launch.sh I see

11:28:16     management-console | 2018/12/18 11:28:16 Fatal error connecting to Postgres DB: Failed to migrate the DB. Please contact [email protected] for further assistance: Dirty database version 1503574972. Fix and force version.

Dropping the database fixes the issue. This has happened to me twice this week but I don't know what steps reproduce it.

Full log

$ ./dev/launch.sh
Concurrently installing Yarn and Go dependencies...

Running go install from /Users/nick/dev/sourcegraph
Go race detector disabled. You can enable it for specific commands by setting GORACED (e.g. GORACED=frontend,searcher or GORACED=all for all commands)
yarn install v1.12.3
warning ../../package.json: No license field
[1/5] 🔍 Validating package.json...
[2/5] 🔍 Resolving packages...
⠂ sourcegraph@link:packages/sourcegraph-extension-apigithub.com/sourcegraph/sourcegraph/cmd/frontend/graphqlbackend
[3/5] 🚚 Fetching packages...
github.com/sourcegraph/sourcegraph/cmd/frontend/internal/app/jscontext
github.com/sourcegraph/sourcegraph/cmd/frontend/docsite
github.com/sourcegraph/sourcegraph/cmd/frontend/registry
github.com/sourcegraph/sourcegraph/cmd/frontend/internal/httpapi
warning [email protected]: The engine "code" appears to be invalid.
[4/5] 🔗 Linking dependencies...
warning " > [email protected]" has unmet peer dependency "[email protected] - 3".
warning " > [email protected]" has unmet peer dependency "popper.js@^1.14.3".
warning " > [email protected]" has incorrect peer dependency "graphql@^0.6.0 || ^0.7.0 || ^0.8.0-b || ^0.9.0 || ^0.10.0 || ^0.11.0 || ^0.12.0".
warning " > [email protected]" has unmet peer dependency "prop-types@>=15.5.0".
warning "graphiql > [email protected]" has incorrect peer dependency "graphql@^0.10.0 || ^0.11.0".
warning "graphiql > codemirror-graphql > [email protected]" has incorrect peer dependency "graphql@^0.10.0 || ^0.11.0 || ^0.12.0 || ^0.13.0".
warning "graphiql > codemirror-graphql > graphql-language-service-interface > [email protected]" has incorrect peer dependency "graphql@^0.11.0 || ^0.12.0 || ^0.13.0".
warning "graphiql > codemirror-graphql > graphql-language-service-interface > [email protected]" has incorrect peer dependency "graphql@^0.10.0 || ^0.11.0 || ^0.12.0 || ^0.13.0".
warning "graphiql > codemirror-graphql > graphql-language-service-interface > [email protected]" has incorrect peer dependency "graphql@^0.10.0 || ^0.11.0 || ^0.12.0 || ^0.13.0".
warning "graphiql > codemirror-graphql > graphql-language-service-interface > [email protected]" has incorrect peer dependency "graphql@^0.10.0 || ^0.11.0 || ^0.12.0 || ^0.13.0".
warning "graphiql > codemirror-graphql > graphql-language-service-parser > [email protected]" has incorrect peer dependency "graphql@^0.10.0 || ^0.11.0".
warning "graphiql > codemirror-graphql > graphql-language-service-interface > graphql-config > [email protected]" has incorrect peer dependency "graphql@^0.11.0 || ^0.12.0 || ^0.13.0".
warning " > [email protected]" has unmet peer dependency "prop-types@*".
warning " > [email protected]" has incorrect peer dependency "react@^16.6.3".
warning " > [email protected]" has incorrect peer dependency "typescript@>= 2.3.1 < 3".
[5/5] 📃 Building fresh packages...
$ gulp generate && yarn --cwd shared -s run tslint:build-rules
[09:33:54] Requiring external module ts-node/register
Using mode development
[09:33:56] Using gulpfile ~/dev/sourcegraph/gulpfile.ts
[09:33:56] Starting 'generate'...
[09:33:56] Starting 'schema'...
[09:33:56] Starting 'graphQLTypes'...
[09:33:56] Finished 'schema' after 556 ms
[09:33:57] Finished 'graphQLTypes' after 1.16 s
[09:33:57] Finished 'generate' after 1.16 s
warning ../../../package.json: No license field
✨ Done in 13.27s.
~/dev/sourcegraph/cmd/management-console/web ~/dev/sourcegraph

[email protected] postinstall /Users/nick/dev/sourcegraph/cmd/management-console/web/node_modules/parcel-bundler
node -e "console.log('\u001b[35m\u001b[1mLove Parcel? You can now donate to our open collective:\u001b[22m\u001b[39m\n > \u001b[34mhttps://opencollective.com/parcel/donate\u001b[0m')"

Love Parcel? You can now donate to our open collective:

https://opencollective.com/parcel/donate
npm WARN management-console@ No description
npm WARN management-console@ No repository field.
npm WARN management-console@ No license field.

added 20 packages from 43 contributors, removed 4 packages, updated 134 packages and audited 9799 packages in 11.444s
found 0 vulnerabilities

~/dev/sourcegraph

Starting all binaries...

09:34:12 repo-updater | 2018/12/18 09:34:12 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
09:34:12 searcher | 2018/12/18 09:34:12 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
09:34:12 symbols | 2018/12/18 09:34:12 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
09:34:12 query-runner | 2018/12/18 09:34:12 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
09:34:12 github-proxy | 2018/12/18 09:34:12 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
09:34:12 gitserver | 2018/12/18 09:34:12 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
09:34:12 watch | Using chokidar.
09:34:12 web | [09:34:12] Requiring external module ts-node/register
09:34:13 management-console | 2018/12/18 09:34:13 pq: duplicate key value violates unique constraint "pg_type_typname_nsp_index" in line 0: CREATE TABLE "schema_migrations" (version bigint not null primary key, dirty boolean not null)
09:34:13 management-console | Terminating management-console
09:34:13 web | Terminating web
09:34:13 zoekt-indexserver | Terminating zoekt-indexserver
09:34:13 management-console-web | Terminating management-console-web
09:34:13 searcher | Terminating searcher
09:34:13 github-proxy | Terminating github-proxy
09:34:13 frontend | Terminating frontend
09:34:13 watch | Terminating watch
09:34:13 zoekt-webserver | Terminating zoekt-webserver
09:34:13 symbols | Terminating symbols
09:34:13 gitserver | Terminating gitserver
09:34:13 query-runner | Terminating query-runner
09:34:13 repo-updater | Terminating repo-updater
09:34:13 syntect_server | Terminating syntect_server
goreman: exit status 1
nick@Nicks-MacBook-Pro:sourcegraph $ ./dev/launch.sh
Concurrently installing Yarn and Go dependencies...

Running go install from /Users/nick/dev/sourcegraph
Go race detector disabled. You can enable it for specific commands by setting GORACED (e.g. GORACED=frontend,searcher or GORACED=all for all commands)
yarn install v1.12.3
warning ../../package.json: No license field
[1/5] 🔍 Validating package.json...
[2/5] 🔍 Resolving packages...
success Already up-to-date.
$ gulp generate && yarn --cwd shared -s run tslint:build-rules
[11:28:05] Requiring external module ts-node/register
Using mode development
[11:28:07] Using gulpfile ~/dev/sourcegraph/gulpfile.ts
[11:28:07] Starting 'generate'...
[11:28:07] Starting 'schema'...
[11:28:07] Starting 'graphQLTypes'...
[11:28:08] Finished 'schema' after 584 ms
[11:28:08] Finished 'graphQLTypes' after 1.24 s
[11:28:08] Finished 'generate' after 1.24 s
warning ../../../package.json: No license field
✨ Done in 7.19s.
~/dev/sourcegraph/cmd/management-console/web ~/dev/sourcegraph
npm WARN management-console@ No description
npm WARN management-console@ No repository field.
npm WARN management-console@ No license field.

audited 9799 packages in 4.127s
found 0 vulnerabilities

╭───────────────────────────────────────────────────────────────╮
│ │
│ New minor version of npm available! 6.4.1 → 6.5.0 │
│ Changelog: https://github.com/npm/cli/releases/tag/v6.5.0
│ Run npm install -g npm to update! │
│ │
╰───────────────────────────────────────────────────────────────╯

~/dev/sourcegraph

Starting all binaries...

11:28:16 repo-updater | 2018/12/18 11:28:16 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
11:28:16 symbols | 2018/12/18 11:28:16 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
11:28:16 query-runner | 2018/12/18 11:28:16 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
11:28:16 github-proxy | 2018/12/18 11:28:16 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
11:28:16 gitserver | 2018/12/18 11:28:16 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
11:28:16 searcher | 2018/12/18 11:28:16 received error during background config update, err: unable to fetch new configuration: Post http://localhost:3090/.internal/configuration: dial tcp [::1]:3090: connect: connection refused
11:28:16 watch | Using chokidar.
11:28:16 management-console | 2018/12/18 11:28:16 Fatal error connecting to Postgres DB: Failed to migrate the DB. Please contact [email protected] for further assistance: Dirty database version 1503574972. Fix and force version.
11:28:16 management-console | Terminating management-console
11:28:16 management-console-web | Terminating management-console-web
11:28:16 query-runner | Terminating query-runner
11:28:16 searcher | Terminating searcher
11:28:16 frontend | Terminating frontend
11:28:16 syntect_server | Terminating syntect_server
11:28:16 zoekt-indexserver | Terminating zoekt-indexserver
11:28:16 zoekt-webserver | Terminating zoekt-webserver
11:28:16 repo-updater | Terminating repo-updater
11:28:16 github-proxy | Terminating github-proxy
11:28:16 web | Terminating web
11:28:16 gitserver | Terminating gitserver
11:28:16 symbols | Terminating symbols
11:28:16 watch | Terminating watch
goreman: exit status 1

@nicksnyder nicksnyder added the bug An error, flaw or fault that produces an incorrect or unexpected result, or behavior. label Dec 18, 2018
@nicksnyder nicksnyder added this to the 3.0 milestone Dec 18, 2018
@slimsag
Copy link
Member

slimsag commented Jan 7, 2019

I don't know what causes this, but I have seen this once and also could not reproduce it.

However, I suspect this is a more general issue: there is something wrong with the way we handle migrations running concurrently. The management console doesn't do anything fancy or special here, it just acts as another frontend instance, effectively. The only difference is that we used to run one frontend instance in dev mode, but we're now running "two" (i.e., the frontend + the management console, each of which try to migrate the DB).

This could be a larger issue / something we just haven't caught in prod / dogfood, or it could be something specific to local dev. I'm not sure, and also unsure how I would reproduce.

I am rewording title + removing my direct assignment because anyone can look into this (I don't have any special knowledge here).

@slimsag slimsag changed the title management console causes db error in local dev There may be a problem with how we do DB migrations concurrently? Jan 7, 2019
@slimsag slimsag removed their assignment Jan 7, 2019
@slimsag slimsag added the release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases label Jan 7, 2019
@slimsag
Copy link
Member

slimsag commented Jan 7, 2019

I am tagging as release blocker because of my own uncertainty about what is causing this. If we can identify the cause or likely cause, we can remove release-blocker.

@nicksnyder
Copy link
Contributor Author

I was just able to reproduce a few times (unreliably) so I do think it is a race condition somewhere. All the times I reproduced it were when I was starting with a clean DB (as expected since I don't have any migrations running locally otherwise). I do wonder if this is specifically a race condition with the first migration, which might explain why we are seeing this on dev now but have never seen it on prod before. It seems like the migrate tool should correctly handle race conditions in a steady state migration, but maybe it doesn't handle the first one well.

cc @beyang @keegancsmith as potential owners to investigate this

@nicksnyder
Copy link
Contributor Author

@keegancsmith since we dropped Postgres 11 upgrade, I am assigning this to you.

@keegancsmith
Copy link
Member

Ok, I'll take a look. FYI I just commented on the postgres issue with motivation of why we should do it.

@nicksnyder
Copy link
Contributor Author

I suspect that this is just a race condition with the initial creation of the schema migrations table on a fresh DB. The schema migrations table is exactly what prevents race conditions in the steady state, so maybe the two services are racing to create it? I am kind of surprised that this doesn't appear to be an issue when testing docker images. If this turns out to only impact dev and not customers then it is not release blocking.

@nicksnyder
Copy link
Contributor Author

I am going to remove the release-blocker tag since nobody has seen this issue while testing our docker images. I can't explain why it would be an issue in dev and not in the docker image.

@nicksnyder nicksnyder removed the release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases label Jan 10, 2019
@keegancsmith
Copy link
Member

It happened to me yesterday when testing the docker image. Maybe people aren't clearing state. Its a pretty bad issue.

@keegancsmith keegancsmith added the release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases label Jan 10, 2019
@ryan-blunden
Copy link
Contributor

Happened to me just now after running ./dev/run-server-image.sh, even though I answered Y to Do you want to delete /tmp/sourcegraph and start clean? [Y/n].

@nicksnyder
Copy link
Contributor Author

Thanks for confirming. This is a bad release blocker then.

@ryan-blunden
Copy link
Contributor

ryan-blunden commented Jan 10, 2019

I had a ~/.sourcegraph/data directory from a previous install. I removed that, and ./dev/run-server-image.sh worked, although I don't know if that means anything because script uses the /tmp/sourcegraph by default.

I then ran DATA=/Users/rb/.sourcegraph ./dev/run-server-image.sh but this did not reproduce the issue either.

@tsenart
Copy link
Contributor

tsenart commented Jan 11, 2019

@keegancsmith: Can I take this off your plate?

@keegancsmith keegancsmith assigned tsenart and unassigned keegancsmith Jan 11, 2019
@tsenart
Copy link
Contributor

tsenart commented Jan 11, 2019

Related issue with Concourse who used the same migration library we do: concourse/concourse#2074

@tsenart
Copy link
Contributor

tsenart commented Jan 11, 2019

This issue seems like an exact match of what we're seeing: golang-migrate/migrate#55

@tsenart
Copy link
Contributor

tsenart commented Jan 11, 2019

OK, I opened a PR with a test case upstream. It works with a high number of concurrent processes attempting to create the schema_migrations table.

While we wait for the upstream contribution to be finalised, I'll change sourcegraph to use my branch.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug An error, flaw or fault that produces an incorrect or unexpected result, or behavior. release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases
Projects
None yet
5 participants