This repository has been archived by the owner on Dec 2, 2020. It is now read-only.

Frequent downtime on pulls.web-platform-tests.org #47

Open
foolip opened this issue Dec 18, 2017 · 7 comments

Comments

@foolip
Member

foolip commented Dec 18, 2017

https://pulls.web-platform-tests.org/ is now returning 504 Gateway Time-out.

@mdittmer set up https://bit.ly/ecosystem-infra-status a while ago, and from that it's clear that downtime is pretty frequent, with some downtime almost every day. This matches what I've experienced: every so often when I take a look, it's slow or down. Recent reports of the same kind: #39, #42, #46.

I'm calling this a roadmap issue, because apparently there's something not quite right about the setup causing it to frequently go down. Let's call this resolved when we've seen a week with no downtime.

@mdittmer, can you increase the checking rate to once every 5 minutes for these checks?
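
The "week with no downtime" criterion can be made mechanical. Below is a minimal sketch, assuming the status checks can be exported as `(timestamp, ok)` pairs; that shape and the function name `week_without_downtime` are hypothetical and not something the status page is known to provide.

```python
from datetime import datetime, timedelta


def week_without_downtime(checks, now, window=timedelta(days=7)):
    """Return True if every check inside the trailing window succeeded.

    `checks` is an iterable of (timestamp, ok) pairs. This shape is an
    assumption for illustration, not the status page's real export format.
    """
    recent = [ok for ts, ok in checks if now - window <= ts <= now]
    # An empty window is not evidence of uptime, so require at least one check.
    return bool(recent) and all(recent)


now = datetime(2018, 1, 8)
checks = [
    (datetime(2017, 12, 30), False),  # an outage, but older than 7 days
    (datetime(2018, 1, 2), True),
    (datetime(2018, 1, 7), True),
]
print(week_without_downtime(checks, now))  # True
```

At one check every 5 minutes, a full week corresponds to 7 × 24 × 12 = 2016 checks, so a single failed probe in that set would reset the clock.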

@foolip
Member Author

foolip commented Dec 19, 2017

@lukebjerring FYI

@boazsender
Collaborator

boazsender commented Jan 5, 2018

This appears to be caused by long SELECT times in postgres.

This usually causes the web server to hang, which makes the application appear down, but results still get aggregated.

In the case of #56, this may have caused the results to never be populated.

Two possible solutions:

  1. Increase CPU resources on the server (postgres selects appear to be CPU-bound according to htop).
  2. Separate the web server from the database server, and consider using a managed database product, like Amazon's RDS, in production. If/when this second option is taken, we should also consider how the pullresults services will share resources, data models, and programs with the [w3c/wptdashboard] and http://wpt.fyi constellation of services.

@foolip
Member Author

foolip commented Jan 8, 2018

It's surprising that there are selects that take anything more than milliseconds, given the small amount of data still in the system. What are those queries?

@boazsender
Collaborator

I'm not sure; I'll have to chase this a bit more through the flask ORM. I'll likely do so when we're closer to implementing a solution, though a better-tuned machine is probably what is actually in order.

For what it's worth, I observed multi-second postgres selects in htop for each load of the home page. When I did several, the server became non-responsive.
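
One way to find out which statements are slow, without digging through the ORM, would be to time each query at the DB-API level and log anything over a threshold. The following is only a sketch under stated assumptions: it uses an in-memory `sqlite3` database as a stand-in for postgres so that it runs anywhere, and the `timed_execute` helper, the `job` table, and the 0.5 second threshold are all hypothetical rather than taken from the application.

```python
import sqlite3
import time

SLOW_THRESHOLD_S = 0.5  # hypothetical cutoff for what counts as "slow"


def timed_execute(cursor, sql, params=(), log=None, threshold=SLOW_THRESHOLD_S):
    """Run a statement, fetch its rows, and record it in `log` if it was slow."""
    start = time.perf_counter()
    cursor.execute(sql, params)
    rows = cursor.fetchall()  # include fetch time, since that is part of the cost
    elapsed = time.perf_counter() - start
    if log is not None and elapsed >= threshold:
        log.append((elapsed, sql))
    return rows


# Stand-in database; the real application talks to postgres.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)")
cur.executemany("INSERT INTO job (state) VALUES (?)", [("passed",), ("failed",)])

slow = []
# threshold=0.0 forces the statement to be logged, just to show the mechanism.
rows = timed_execute(cur, "SELECT id, state FROM job ORDER BY id", log=slow, threshold=0.0)
for elapsed, sql in slow:
    print(f"{elapsed:.6f}s  {sql}")
```

The same wrapper shape should work with any DB-API cursor, though the parameter placeholder style differs between drivers. On the postgres side, the `pg_stat_activity` view and the `log_min_duration_statement` setting give similar information without touching application code.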

@foolip
Member Author

foolip commented Feb 1, 2018

This continues to be a serious problem. I am getting 504 Gateway Time-out on https://pulls.web-platform-tests.org/job/23710.13 and other URLs today, and https://bit.ly/ecosystem-infra-status shows very frequent downtime.

@jgraham
Collaborator

jgraham commented Feb 1, 2018

I recommend hooking it up to New Relic to understand which queries are slow and what the downtime is like.

@foolip
Member Author

foolip commented Feb 25, 2018

A problem today as well: I need to look at https://pulls.web-platform-tests.org/job/24794.11 to understand what's wrong with web-platform-tests/wpt#9641, but it's returning 504 Gateway Time-out.


3 participants