
CREATE CLUSTER REPLICA takes several seconds to return #13854

Closed
umanwizard opened this issue Jul 25, 2022 · 12 comments
Labels
A-ADAPTER  Topics related to the ADAPTER layer
A-storage  Area: storage
C-bug  Category: something is broken
P2  medium priority; in the case of a bug, should be fixed in the next release cycle

Comments

@umanwizard
Contributor

What version of Materialize are you using?

unstable-d5cddec91fdd0088783aa2231bb53cd5c78b2efa

How did you install Materialize?

Materialize Cloud

What is the issue?

CREATE CLUSTER REPLICA takes several seconds to return on latest main. This didn't use to be the case (in fact, I think it works fine on alpha.3)

It seems to get slower the more replicas there are.

Relevant log output

No response

@umanwizard umanwizard added C-bug Category: something is broken C-triage A-compute Area: compute labels Jul 25, 2022
@umanwizard
Contributor Author

This is reproducible locally -- each CREATE CLUSTER REPLICA command is faster than in cloud, but commands still take longer the more replicas have been created.

After creating a bunch of replicas, and even after destroying them all, Materialize still seems to be in "slow mode" (new replicas take several seconds), so maybe there is some kind of resource leak...

I am also seeing significant CPU usage in the Postgres binary locally (~20% of one core). Not sure if that's related.

@umanwizard
Contributor Author

@lluki, @teskje, @antiguru: since this reproduces locally I went ahead and bisected it. It bisects to Persisting Introspection Dataflows Part 1 (#13340).
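
For reference, a rough sketch of the bisect workflow (the good/bad endpoints below are illustrative placeholders, not the exact commits used):

git bisect start
git bisect bad                              # current main: CREATE CLUSTER REPLICA is slow
git bisect good <last-known-fast-commit>    # e.g. somewhere around the alpha.3 release
# At each step, build, run the repro, and mark the commit:
#   git bisect good   -> replica creation stays fast
#   git bisect bad    -> replica creation slows down with replica count
git bisect reset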

@antiguru
Member

I can reproduce the problem locally. Start Materialize with bin/environmentd --release --reset and then run the following script:

for i in {0..100}; do
    psql -p 6875 -h localhost -U materialize -q -c "create cluster replica default.r$i size '1';"
done
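
The Time: lines below look like psql's \timing output; a minimal shell-level sketch for capturing the same per-statement latency (same port and user as above, the millisecond arithmetic is just illustrative):

for i in {0..100}; do
    start=$(date +%s%N)
    psql -p 6875 -h localhost -U materialize -q \
        -c "create cluster replica default.r$i size '1';"
    end=$(date +%s%N)
    # Convert the nanosecond timestamps to milliseconds per statement.
    echo "replica r$i: $(( (end - start) / 1000000 )) ms"
done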

Timing output:

Time: 607.946 ms
Time: 1012.080 ms (00:01.012)
Time: 905.382 ms
Time: 997.988 ms
Time: 1180.775 ms (00:01.181)
Time: 1286.084 ms (00:01.286)
Time: 1479.588 ms (00:01.480)
Time: 1522.134 ms (00:01.522)
Time: 2637.546 ms (00:02.638)
Time: 1689.508 ms (00:01.690)
Time: 2150.487 ms (00:02.150)
Time: 2658.386 ms (00:02.658)
Time: 2210.388 ms (00:02.210)
Time: 3284.387 ms (00:03.284)
Time: 4917.503 ms (00:04.918)
Time: 2517.355 ms (00:02.517)
Time: 4448.932 ms (00:04.449)
Time: 2916.827 ms (00:02.917)
Time: 3245.870 ms (00:03.246)
Time: 4319.137 ms (00:04.319)
Time: 5641.637 ms (00:05.642)
Time: 4552.645 ms (00:04.553)
Time: 5293.479 ms (00:05.293)
Time: 5106.743 ms (00:05.107)
Time: 5110.252 ms (00:05.110)
Time: 10454.629 ms (00:10.455)
Time: 4944.696 ms (00:04.945)
Time: 5749.217 ms (00:05.749)
Time: 6089.325 ms (00:06.089)
Time: 7287.410 ms (00:07.287)
Time: 8786.597 ms (00:08.787)
Time: 6731.052 ms (00:06.731)
Time: 6677.547 ms (00:06.678)
Time: 7149.822 ms (00:07.150)
Time: 6954.551 ms (00:06.955)
Time: 9477.785 ms (00:09.478)
Time: 10040.200 ms (00:10.040)
Time: 23227.249 ms (00:23.227)
Time: 8027.271 ms (00:08.027)
Time: 11878.500 ms (00:11.878)
Time: 11588.999 ms (00:11.589)
Time: 9635.383 ms (00:09.635)
Time: 28717.953 ms (00:28.718)
Time: 9934.817 ms (00:09.935)
Time: 10681.579 ms (00:10.682)
Time: 29893.904 ms (00:29.894)
Time: 12169.085 ms (00:12.169)
Time: 13500.272 ms (00:13.500)
Time: 34620.150 ms (00:34.620)
Time: 13096.751 ms (00:13.097)
Time: 15218.211 ms (00:15.218)

The compute processes hover at around 1-2% CPU utilization, postgres and environmentd both at ~20%, but my SSD is 100% busy with writes.
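
A quick way to confirm those observations on Linux (a sketch, assuming sysstat is installed; the process-name pattern is just the obvious candidates here):

# Per-device utilization; the %util and w/s columns show the write pressure.
iostat -x 1
# Per-process CPU for the interesting binaries.
top -p "$(pgrep -d, -f 'environmentd|postgres')"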

@antiguru
Member

This seems to come from create_collections:
[profiling screenshot]

@aljoscha aljoscha added A-storage Area: storage and removed A-compute Area: compute labels Jul 27, 2022
@aljoscha aljoscha added P1 High priority; in the case of a bug, should be fixed within the current release cycle and removed C-triage labels Jul 27, 2022
@antiguru
Member

antiguru commented Jul 27, 2022

It looks much better after 550670b, and I'm now able to create 100 replicas. My terminal now gets spammed with:

compute-cluster-1-replica-89: 2022-07-27T15:43:00.152978Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-101: 2022-07-27T15:43:00.704956Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-95: 2022-07-27T15:43:00.824713Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-82: 2022-07-27T15:43:00.948746Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-67: 2022-07-27T15:43:00.960950Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-83: 2022-07-27T15:43:01.109067Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-49: 2022-07-27T15:43:01.844835Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-27: 2022-07-27T15:43:01.996869Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-60: 2022-07-27T15:43:02.405145Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-94: 2022-07-27T15:43:02.445192Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections
compute-cluster-1-replica-38: 2022-07-27T15:43:02.588122Z  INFO mz_persist_client::r#impl::machine: external operation consensus::open failed, retrying in 16s: indeterminate: db error: FATAL: remaining connection slots are reserved for non-replication superuser connections: FATAL: remaining connection slots are reserved for non-replication superuser connections

EDIT: The error message appears because by default Postgres only allows 100 concurrent connections. Increasing the limit (max_connections in postgresql.conf) mitigates this error.
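
For reference, a sketch of one way to raise the limit on a local Postgres instance (the value 300 and the data directory path are placeholders; max_connections only takes effect after a restart):

# Write the new limit to postgresql.auto.conf via ALTER SYSTEM ...
psql -U postgres -c "ALTER SYSTEM SET max_connections = 300;"
# ... then restart the server so it takes effect.
pg_ctl restart -D /path/to/data    # or: sudo systemctl restart postgresql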

@umanwizard
Contributor Author

@antiguru what's the timing like when creating the 100 replicas? Is it low? (0-500ms per replica, no increase over time)

I think investigating whether we should be worried about this fatal postgres error is a separate issue.

@antiguru
Member

It still slows down over time:

Time: 500.888 ms
Time: 542.429 ms
Time: 562.142 ms
Time: 537.065 ms
Time: 665.806 ms
Time: 705.150 ms
Time: 771.750 ms
Time: 699.427 ms
Time: 695.268 ms
Time: 740.623 ms
Time: 729.938 ms
Time: 720.785 ms
Time: 852.664 ms
Time: 825.562 ms
Time: 828.229 ms
Time: 785.232 ms
Time: 885.296 ms
Time: 924.643 ms
Time: 1085.448 ms (00:01.085)
Time: 980.323 ms
Time: 923.585 ms
Time: 950.148 ms
Time: 1070.332 ms (00:01.070)
Time: 1134.411 ms (00:01.134)
Time: 991.338 ms
Time: 1052.620 ms (00:01.053)
Time: 1216.839 ms (00:01.217)
Time: 1071.293 ms (00:01.071)
Time: 1228.627 ms (00:01.229)
Time: 1163.459 ms (00:01.163)
Time: 1209.672 ms (00:01.210)
Time: 1320.592 ms (00:01.321)
Time: 1164.606 ms (00:01.165)
Time: 1363.558 ms (00:01.364)
Time: 1171.036 ms (00:01.171)
Time: 1497.137 ms (00:01.497)
Time: 1325.466 ms (00:01.325)
Time: 1599.157 ms (00:01.599)
Time: 1388.790 ms (00:01.389)
Time: 1823.367 ms (00:01.823)
Time: 1383.450 ms (00:01.383)
Time: 1584.217 ms (00:01.584)
Time: 1461.446 ms (00:01.461)
Time: 1460.055 ms (00:01.460)
Time: 1947.955 ms (00:01.948)
Time: 1355.090 ms (00:01.355)
Time: 1688.830 ms (00:01.689)
Time: 1485.054 ms (00:01.485)
Time: 1550.849 ms (00:01.551)
Time: 1822.625 ms (00:01.823)
Time: 1447.812 ms (00:01.448)
Time: 1765.060 ms (00:01.765)
Time: 1500.069 ms (00:01.500)
Time: 1641.999 ms (00:01.642)
Time: 1654.623 ms (00:01.655)
Time: 1911.064 ms (00:01.911)
Time: 1769.129 ms (00:01.769)
Time: 1835.594 ms (00:01.836)
Time: 1939.223 ms (00:01.939)
Time: 1811.547 ms (00:01.812)
Time: 2027.295 ms (00:02.027)
Time: 2105.927 ms (00:02.106)
Time: 1878.129 ms (00:01.878)
Time: 2243.429 ms (00:02.243)
Time: 2142.153 ms (00:02.142)
Time: 2233.226 ms (00:02.233)
Time: 1929.929 ms (00:01.930)
Time: 2231.187 ms (00:02.231)
Time: 2303.271 ms (00:02.303)
Time: 2266.665 ms (00:02.267)
Time: 2372.720 ms (00:02.373)
Time: 3217.584 ms (00:03.218)
Time: 2362.328 ms (00:02.362)
Time: 2470.648 ms (00:02.471)
Time: 4011.303 ms (00:04.011)
Time: 2816.202 ms (00:02.816)
Time: 3365.484 ms (00:03.365)
Time: 3530.801 ms (00:03.531)
Time: 5595.412 ms (00:05.595)
Time: 4762.874 ms (00:04.763)
Time: 4810.499 ms (00:04.810)
Time: 4091.567 ms (00:04.092)
Time: 5873.711 ms (00:05.874)
Time: 4443.925 ms (00:04.444)
Time: 4658.274 ms (00:04.658)
Time: 6820.321 ms (00:06.820)
Time: 3846.666 ms (00:03.847)
Time: 5782.172 ms (00:05.782)
Time: 4477.517 ms (00:04.478)
Time: 4489.902 ms (00:04.490)
Time: 5023.622 ms (00:05.024)
Time: 5164.970 ms (00:05.165)
Time: 5786.001 ms (00:05.786)
Time: 5106.370 ms (00:05.106)

@antiguru
Member

My SSD is back at 100% utilization around the 100 replica mark.

@umanwizard
Contributor Author

I would describe this as still being the bad behavior (I would have typed git bisect bad on that result)

@lluki lluki added the A-ADAPTER Topics related to the ADAPTER layer label Aug 10, 2022
@JLDLaughlin JLDLaughlin removed the P1 High priority; in the case of a bug, should be fixed within the current release cycle label Aug 18, 2022
@nmeagan11 nmeagan11 added the P2 medium priority; in the case of a bug, should be fixed in the next release cycle label Aug 31, 2022
@ggnall
Contributor

ggnall commented Nov 15, 2022

@chaas @mjibson still relevant?

@maddyblue
Contributor

We can probably close this and reopen if observed again.

@chaas
Contributor

chaas commented Nov 15, 2022

Closing this as it's outdated. We can open a new issue if acute slowness is observed again.

@chaas chaas closed this as completed Nov 15, 2022