
CREATE CLUSTER REPLICA takes several seconds to return #13854

Closed
umanwizard opened this issue Jul 25, 2022 · 12 comments
Labels
A-ADAPTER  Topics related to the ADAPTER layer
A-storage  Area: storage
C-bug  Category: something is broken
P2  medium priority; in the case of a bug, should be fixed in the next release cycle

Comments

@umanwizard
Contributor

What version of Materialize are you using?

unstable-d5cddec91fdd0088783aa2231bb53cd5c78b2efa

How did you install Materialize?

Materialize Cloud

What is the issue?

CREATE CLUSTER REPLICA takes several seconds to return on latest main. This didn't use to be the case (in fact, I think it works fine on alpha.3)

It seems to get slower the more replicas there are.

Relevant log output

No response

@umanwizard umanwizard added C-bug Category: something is broken C-triage A-compute Area: compute labels Jul 25, 2022
@umanwizard
Contributor Author

This is reproducible locally -- each CREATE CLUSTER REPLICA command is faster than in cloud, but commands still take longer the more replicas have been created.

After creating a bunch of replicas, and even after destroying them all, Materialize still seems to be in "slow mode" (new replicas take several seconds), so maybe there is some kind of resource leak...

I am also seeing significant CPU usage in the Postgres binary locally (~20% of one core). Not sure if that's related.

@umanwizard
Contributor Author

@lluki, @teskje, @antiguru: since this reproduces locally I went ahead and bisected it. It bisects to Persisting Introspection Dataflows Part 1 (#13340).
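
For reference, a rough sketch of the bisect workflow (the good/bad endpoints below are illustrative placeholders, not the exact commits used):

git bisect start
git bisect bad                              # current main: CREATE CLUSTER REPLICA is slow
git bisect good <last-known-fast-commit>    # e.g. somewhere around the alpha.3 release
# At each step, build, run the repro, and mark the commit:
#   git bisect good   -> replica creation stays fast
#   git bisect bad    -> replica creation slows down with replica count
git bisect reset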

@antiguru
Member

I can reproduce the problem locally. Start Materialize with bin/environmentd --release --reset and then run the following script:

for i in {0..100}; do
    psql -p 6875 -h localhost -U materialize -q -c "create cluster replica default.r$i size '1';"
done
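
The Time: lines below look like psql's \timing output; a minimal shell-level sketch for capturing the same per-statement latency (same port and user as above, the millisecond arithmetic is just illustrative):

for i in {0..100}; do
    start=$(date +%s%N)
    psql -p 6875 -h localhost -U materialize -q \
        -c "create cluster replica default.r$i size '1';"
    end=$(date +%s%N)
    # Convert the nanosecond timestamps to milliseconds per statement.
    echo "replica r$i: $(( (end - start) / 1000000 )) ms"
done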

Timing output:

Time: 607.946 ms
Time: 1012.080 ms (00:01.012)
Time: 905.382 ms
Time: 997.988 ms
Time: 1180.775 ms (00:01.181)
Time: 1286.084 ms (00:01.286)
Time: 1479.588 ms (00:01.480)
Time: 1522.134 ms (00:01.522)
Time: 2637.546 ms (00:02.638)
Time: 1689.508 ms (00:01.690)
Time: 2150.487 ms (00:02.150)
Time: 2658.386 ms (00:02.658)
Time: 2210.388 ms (00:02.210)
Time: 3284.387 ms (00:03.284)
Time: 4917.503 ms (00:04.918)
Time: 2517.355 ms (00:02.517)
Time: 4448.932 ms (00:04.449)
Time: 2916.827 ms (00:02.917)
Time: 3245.870 ms (00:03.246)
Time: 4319.137 ms (00:04.319)
Time: 5641.637 ms (00:05.642)
Time: 4552.645 ms (00:04.553)
Time: 5293.479 ms (00:05.293)
Time: 5106.743 ms (00:05.107)
Time: 5110.252 ms (00:05.110)
Time: 10454.629 ms (00:10.455)
Time: 4944.696 ms (00:04.945)
Time: 5749.217 ms (00:05.749)
Time: 6089.325 ms (00:06.089)
Time: 7287.410 ms (00:07.287)
Time: 8786.597 ms (00:08.787)
Time: 6731.052 ms (00:06.731)
Time: 6677.547 ms (00:06.678)
Time: 7149.822 ms (00:07.150)
Time: 6954.551 ms (00:06.955)
Time: 9477.785 ms (00:09.478)
Time: 10040.200 ms (00:10.040)
Time: 23227.249 ms (00:23.227)
Time: 8027.271 ms (00:08.027)
Time: 11878.500 ms (00:11.878)
Time: 11588.999 ms (00:11.589)
Time: 9635.383 ms (00:09.635)
Time: 28717.953 ms (00:28.718)
Time: 9934.817 ms (00:09.935)
Time: 10681.579 ms (00:10.682)
Time: 29893.904 ms (00:29.894)
Time: 12169.085 ms (00:12.169)
Time: 13500.272 ms (00:13.500)
Time: 34620.150 ms (00:34.620)
Time: 13096.751 ms (00:13.097)
Time: 15218.211 ms (00:15.218)

The compute processes hover at around 1-2% CPU utilization, postgres and environmentd both at ~20%, but my SSD is 100% busy with writes.
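
A quick way to confirm those observations on Linux (a sketch, assuming sysstat is installed; the process-name pattern is just the obvious candidates here):

# Per-device utilization; the %util and w/s columns show the write pressure.
iostat -x 1
# Per-process CPU for the interesting binaries.
top -p "$(pgrep -d, -f 'environmentd|postgres')"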

@antiguru
Member

This seems to come from create_collections:
[profiling screenshot]

@aljoscha aljoscha added A-storage Area: storage and removed A-compute Area: compute labels Jul 27, 2022
@aljoscha aljoscha added P1 High priority; in the case of a bug, should be fixed within the current release cycle and removed C-triage labels Jul 27, 2022
@antiguru
Member

antiguru commented Jul 27, 2022

It looks much better after 550670b, and I'm now able to create 100 replicas. My terminal now gets spammed with:

compute-cluster-1-replica-89: 2022-07-27T15:43:00.152978Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-101: 2022-07-27T15:43:00.704956Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-95: 2022-07-27T15:43:00.824713Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-82: 2022-07-27T15:43:00.948746Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-67: 2022-07-27T15:43:00.960950Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-83: 2022-07-27T15:43:01.109067Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-49: 2022-07-27T15:43:01.844835Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-27: 2022-07-27T15:43:01.996869Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-60: 2022-07-27T15:43:02.405145Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-94: 2022-07-27T15:43:02.445192Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-38: 2022-07-27T15:43:02.588122Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections

EDIT: The error message appears because by default Postgres only allows 100 concurrent connections. Increasing the limit (max_connections in postgresql.conf) mitigates this error.
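
For reference, a sketch of one way to raise the limit on a local Postgres instance (the value 300 and the data directory path are placeholders; max_connections only takes effect after a restart):

# Write the new limit to postgresql.auto.conf via ALTER SYSTEM ...
psql -U postgres -c "ALTER SYSTEM SET max_connections = 300;"
# ... then restart the server so it takes effect.
pg_ctl restart -D /path/to/data    # or: sudo systemctl restart postgresql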

@umanwizard
Contributor Author

@antiguru what's the timing like when creating the 100 replicas? Is it low? (0-500ms per replica, no increase over time)

I think investigating whether we should be worried about this fatal postgres error is a separate issue.

@antiguru
Member

It still slows down over time:

Time: 500.888 ms
Time: 542.429 ms
Time: 562.142 ms
Time: 537.065 ms
Time: 665.806 ms
Time: 705.150 ms
Time: 771.750 ms
Time: 699.427 ms
Time: 695.268 ms
Time: 740.623 ms
Time: 729.938 ms
Time: 720.785 ms
Time: 852.664 ms
Time: 825.562 ms
Time: 828.229 ms
Time: 785.232 ms
Time: 885.296 ms
Time: 924.643 ms
Time: 1085.448 ms (00:01.085)
Time: 980.323 ms
Time: 923.585 ms
Time: 950.148 ms
Time: 1070.332 ms (00:01.070)
Time: 1134.411 ms (00:01.134)
Time: 991.338 ms
Time: 1052.620 ms (00:01.053)
Time: 1216.839 ms (00:01.217)
Time: 1071.293 ms (00:01.071)
Time: 1228.627 ms (00:01.229)
Time: 1163.459 ms (00:01.163)
Time: 1209.672 ms (00:01.210)
Time: 1320.592 ms (00:01.321)
Time: 1164.606 ms (00:01.165)
Time: 1363.558 ms (00:01.364)
Time: 1171.036 ms (00:01.171)
Time: 1497.137 ms (00:01.497)
Time: 1325.466 ms (00:01.325)
Time: 1599.157 ms (00:01.599)
Time: 1388.790 ms (00:01.389)
Time: 1823.367 ms (00:01.823)
Time: 1383.450 ms (00:01.383)
Time: 1584.217 ms (00:01.584)
Time: 1461.446 ms (00:01.461)
Time: 1460.055 ms (00:01.460)
Time: 1947.955 ms (00:01.948)
Time: 1355.090 ms (00:01.355)
Time: 1688.830 ms (00:01.689)
Time: 1485.054 ms (00:01.485)
Time: 1550.849 ms (00:01.551)
Time: 1822.625 ms (00:01.823)
Time: 1447.812 ms (00:01.448)
Time: 1765.060 ms (00:01.765)
Time: 1500.069 ms (00:01.500)
Time: 1641.999 ms (00:01.642)
Time: 1654.623 ms (00:01.655)
Time: 1911.064 ms (00:01.911)
Time: 1769.129 ms (00:01.769)
Time: 1835.594 ms (00:01.836)
Time: 1939.223 ms (00:01.939)
Time: 1811.547 ms (00:01.812)
Time: 2027.295 ms (00:02.027)
Time: 2105.927 ms (00:02.106)
Time: 1878.129 ms (00:01.878)
Time: 2243.429 ms (00:02.243)
Time: 2142.153 ms (00:02.142)
Time: 2233.226 ms (00:02.233)
Time: 1929.929 ms (00:01.930)
Time: 2231.187 ms (00:02.231)
Time: 2303.271 ms (00:02.303)
Time: 2266.665 ms (00:02.267)
Time: 2372.720 ms (00:02.373)
Time: 3217.584 ms (00:03.218)
Time: 2362.328 ms (00:02.362)
Time: 2470.648 ms (00:02.471)
Time: 4011.303 ms (00:04.011)
Time: 2816.202 ms (00:02.816)
Time: 3365.484 ms (00:03.365)
Time: 3530.801 ms (00:03.531)
Time: 5595.412 ms (00:05.595)
Time: 4762.874 ms (00:04.763)
Time: 4810.499 ms (00:04.810)
Time: 4091.567 ms (00:04.092)
Time: 5873.711 ms (00:05.874)
Time: 4443.925 ms (00:04.444)
Time: 4658.274 ms (00:04.658)
Time: 6820.321 ms (00:06.820)
Time: 3846.666 ms (00:03.847)
Time: 5782.172 ms (00:05.782)
Time: 4477.517 ms (00:04.478)
Time: 4489.902 ms (00:04.490)
Time: 5023.622 ms (00:05.024)
Time: 5164.970 ms (00:05.165)
Time: 5786.001 ms (00:05.786)
Time: 5106.370 ms (00:05.106)

@antiguru
Member

My SSD is back at 100% utilization around the 100 replica mark.

@umanwizard
Contributor Author

I would describe this as still being the bad behavior (I would have typed git bisect bad on that result)

@lluki lluki added the A-ADAPTER Topics related to the ADAPTER layer label Aug 10, 2022
@JLDLaughlin JLDLaughlin removed the P1 High priority; in the case of a bug, should be fixed within the current release cycle label Aug 18, 2022
@nmeagan11 nmeagan11 added the P2 medium priority; in the case of a bug, should be fixed in the next release cycle label Aug 31, 2022
@ggnall
Contributor

ggnall commented Nov 15, 2022

@chaas @mjibson still relevant?

@maddyblue
Contributor

We can probably close this and reopen if observed again.

@chaas
Contributor

chaas commented Nov 15, 2022

Closing this as it's outdated. We can open a new issue if acute slowness is observed again.

@chaas chaas closed this as completed Nov 15, 2022