
Disaster Recovery & Immediate Response


The Emergency Response Guide for OpenLibrary.org first-responders.

Responding to an Outage

  1. Report the outage on #openlibrary and #ops on Slack, and follow the escalation guide
  2. ❗ Search previous post mortem reports for insights and solutions to common issues
  3. Check the public and internal monitoring dashboards
  4. If the bare-metal machine is hanging, contact #ops on Slack or manually restart baremetal
  5. If there's a fiber outage and openlibrary.org's servers don't resolve (even to the Sorry service), ask in the internal Slack channels #openlibrary or #ops for openlibrary.org to be temporarily pointed to an active "Sorry Server"
  6. Create a new postmortem issue and proceed to the Diagnostics Guide below

Diagnostics Guide

Before continuing, you may want to check our post-mortems to see if this is a known / already solved problem.

  1. Is CPU load high on the web nodes and/or is there a spike in the number of transactions?
  2. Are the ol-mem* nodes slow to ssh into? We may want to run /etc/init.d/memcached restart, or even manually restart baremetal if ssh hangs for more than 3 minutes (see the sketch after this list).
  3. Does the homepage cache look weird?
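
A minimal sketch for step 2, assuming the memcached nodes are named ol-mem0, ol-mem1, ... (adjust to the current inventory):

# Time how long each memcached node takes to answer over ssh
for host in ol-mem0 ol-mem1 ol-mem2; do
  echo "== $host"
  time ssh -o ConnectTimeout=15 "$host" uptime || echo "$host is not responding"
done

# If a node is reachable but memcached is misbehaving, restart it, e.g.:
# ssh ol-mem0 sudo /etc/init.d/memcached restart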

Spam

  1. There is an admin dashboard for blocking certain terms from appearing on Open Library: https://openlibrary.org/admin/spamword
  2. You can also block & revert changes per specific accounts via https://openlibrary.org/admin/people
  • If the edit to a page contains any of the spam words, or the user's email is from a blacklisted domain, the edit won't be accepted. New registrations with emails from those domains are also not accepted.

Power Outages at Data Center

Once services return, please make sure all services are running and that the VMs are ssh'able (this can probably be a script; see the sketch below).

If a machine is up but not reachable, manually restart baremetal. If a machine is up and reachable but services are not running, check docker ps on the host.
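
A minimal sketch of such a script, assuming key-based ssh access; the host list is a placeholder for the current inventory:

#!/bin/bash
# Check that each VM accepts ssh and report its running containers.
HOSTS="ol-web1 ol-web2 ol-home0 ol-covers0 ol-solr1"   # placeholder list
for host in $HOSTS; do
  if ssh -o ConnectTimeout=10 "$host" true 2>/dev/null; then
    echo "== $host is reachable; running containers:"
    ssh "$host" docker ps --format '{{.Names}}'   # prefix with sudo if needed
  else
    echo "== $host is NOT reachable over ssh"
  fi
done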

Handling Abuse & DDOS (Denial of Service Attack)

We have a few graphs on our main dashboard to monitor the traffic of the top requesting IPs to observe changes in pattern/behaviour:

(Image: main dashboard graphs of requests from the top requesting IPs)

These graphs show how the last ~20k requests are distributed across the top IPs hitting our website. E.g. the green section in the first graph shows what fraction of requests came from the top IP, the yellow from the second top-most IP, and so on.

In the graph above, you can see several anomalies where the top IP has been making significantly more requests. This is an indicator that there might be abuse happening from a certain IP.

Treatment: Investigate and block the IP as necessary.

  1. Investigate the traffic to verify. On the server in question (e.g. ssh -A ol-www0):
# Observe the recent requests hitting the server. Note 150000 is largely arbitrary.
$ sudo tail -n 150000 /1/var/log/nginx/access.log | grep -oE '^[0-9.()]+' | sort | uniq -c | sort -rn | head -n 25
# 154598 0.32.37.207
# 125985 0.3.111.38
# 124110 0.45.210.121
# 123793 0.249.158.151
# 122969 0.79.152.249
# 122872 0.244.113.216
# 122526 0.30.143.17
# 121269 0.145.106.249
# 120520 0.85.80.58
# 117442 0.141.6.36
# 109663 0.204.1.42
#  90027 0.109.144.144
#  81801 0.218.22.254
#  ...

You can see the top IP (note the IPs are anonymized) is causing a considerable amount of traffic. Let's investigate it.

$ sudo tail -f /1/var/log/nginx/access.log | grep -F '0.32.37.207'

This will let you see the traffic from that IP and determine whether it should be blocked. Use your discretion: check whether the pattern looks abusive / spammy. For example, the Internet Archive makes many requests to /api/books.json and we still don't want to ban it. If you determine the IP should be blocked, we need to get the de-anonymized IP and add it to our deny.conf.

On ol-www0, in /opt/openlibrary/scripts, you can run decode_ip.sh with the offending anonymized IP 0.32.37.207 as follows:

cd /opt/openlibrary/scripts
sudo -E SEED_PATH=http://seedserver.us.archive.org/key/seed.txt ./decode_ip.sh 0.32.37.207

Note: if you run decode_ip.sh and get a file-not-found error, run it again. This is a workaround until a fix is merged for a race condition around the creation of the IP map.
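
Once you have the de-anonymized IP, it goes into nginx's deny.conf (path per the Notes section below). A hedged sketch, with 203.0.113.45 as a placeholder IP; confirm the olsystem change/deploy process with the team before editing config directly on the box:

# Append a deny rule for the offending (de-anonymized) IP
echo 'deny 203.0.113.45;' | sudo tee -a /opt/openlibrary/olsystem/etc/nginx/deny.conf
# Validate the config and reload nginx (run inside the nginx container if nginx is containerized)
sudo nginx -t && sudo service nginx reload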

Solr Search Issues

You can restart solr via docker as:

ssh -A ol-solr1
docker restart solr_builder_solr_1 solr_builder_haproxy_1
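
Before or after restarting, it can help to confirm whether Solr itself is answering. A hedged check, assuming Solr's port 8983 is reachable from the host (traffic may instead go through the haproxy container):

# Are the Solr containers up?
docker ps | grep -i solr
# Does Solr respond? Query the standard core-status admin API
curl -s 'http://localhost:8983/solr/admin/cores?action=STATUS&wt=json' | head -c 500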

Out of Space

Cleanup Deploys

There are a few servers which we expect to fill up. ol-db1/2 and ol-covers0/1 are candidates because their job is to store temporary or long-term data. ol-home0 is another; it generates data dumps, aggregates partner data, and generates sitemaps. These servers likely need manual investigation when Nagios reports their space is low.
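
A quick sketch of the first checks to run on the affected host when Nagios complains:

# Which filesystems are nearly full?
df -h
# How much space is Docker using (images, containers, volumes, build cache)?
docker system df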

The following will prune the Docker build cache and remove unused images created more than 1 week ago (168h):

# to prune the build cache
docker builder prune

# prune unused images created more than 1 week ago
docker image prune -a --filter "until=168h"

Caution

When docker prune is running, the rest of Docker unfortunately becomes unresponsive; see this issue. When this happens, the wrong intervention is to try to restart the server with ganeti: not only will Docker still be unresponsive until the prune finishes, but all Docker containers that were running will also stop and be unreachable.

Docker images

Even so, a very common cause of disks filling up is old Docker images which have not been pruned during our deploy process; these can add up to many GB over time. Run docker image ls for a listing of images registered in Docker to see if any of them can be pruned or deleted.
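
For example, to eyeball the largest images, a sketch (sorting on human-readable sizes is approximate):

docker image ls --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -rh | head -n 15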

Docker Logs

Docker logs can take up a ton of space. @cdrini mentions one solution is truncating the Docker logs for the container in question (here, the one with ID d12b518475e1):

# See the sizes of a bunch of things on the VM
sudo df -h
# Truncate the log file of the container with ID d12b518475e1
sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' d12b518475e1)
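
To find which container's log is the largest before truncating, a sketch assuming the default json-file logging driver:

sudo du -sh /var/lib/docker/containers/*/*-json.log | sort -rh | head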

ol-dev1 out of storage

Symptom: sudo df -h shows a number of filesystems at 99% or 100% use. Test deploys might fail on occasion.

Containers and images can stick around on our dev server causing it to fill up. To free up space:

  1. Confirm with folks on Slack (#team-abc) that there are no stopped containers that people care about. There shouldn't be, but there is some risk of data loss if someone has made modifications to the filesystem inside a now-stopped container. That is why we confirm!
  2. Run docker container prune
  3. Run docker image prune -a. This will remove unused images; all images should have Dockerfiles somewhere, so there's little risk of data loss. But it might be annoying, because someone may have to find the Dockerfile and rebuild an image they care about. (See the sketch after this list.)
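
Put together, the cleanup sequence looks like the following; both prune commands ask for confirmation before deleting anything:

# 1. Remove all stopped containers (after confirming on Slack)
docker container prune
# 2. Remove unused images; they can be rebuilt from their Dockerfiles if needed
docker image prune -a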

upstart.log

Supervisor can sometimes get confused (perhaps related to permissions/chown) and, instead of rotating logs, will keep writing to /var/log/openlibrary/upstart.log until /dev/vda1 (or wherever root / is mounted) runs out of space. The solution is to restart supervisor itself (not openlibrary via supervisorctl) on the afflicted node (e.g. ol-web4 in this example):

sudo service supervisor restart
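
To verify that log rotation resumed, a quick look at the log directory (newest files first):

ls -lt /var/log/openlibrary/ | head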

If successful, you should see a new openlibrary.log with an update time more recent than upstart.log's. Once you've confirmed this, you can truncate the erroneously inflated upstart.log to free up disk space:

sudo truncate --size 0 /var/log/openlibrary/upstart.log

After truncating, you'll want to restart openlibrary, e.g.

ssh ol-web4 sudo supervisorctl restart openlibrary

Homepage Errors

Sometimes an error occurs while compiling the homepage and an empty body is cached: https://github.com/internetarchive/openlibrary/issues/6646

Solution: You can hit this URL to clear the homepage memcache entry: https://openlibrary.org/admin/inspect/memcache?keys=home.homepage.en.pd-&action=delete . Note the .pd: remove it if you want to clear the cache for non-print-disabled users.

Notes

  • If there's an issue with solr-updater, import-bot, deploys, or infobase (the API), check ol-home
  • If lending information is wrong, e.g. books appear as available on OL when they are waitlisted on IA, this is a freak incident with memcached and we'll need to ssh into each memcached node (ol-mem*) and run sudo service memcached restart
  • If there's an issue with SSL termination, static assets, or connecting to the website, check ol-www1 (which is where all traffic enters and goes into haproxy, which also lives on this machine). Another case is abuse, which is documented in the troubleshooting guide (usually haproxy limits or banning via nginx's /opt/openlibrary/olsystem/etc/nginx/deny.conf)
  • If there's a database problem, sorry: check ol-db0 (primary), ol-db1 (replica), and ol-backup1
  • If we're seeing ol-web1 and ol-web2 offline, it may be the network, an upstream service, DNS, or a breaking dependency. CHECK NAGIOS and alert #ops + #openlibrary. Check the logs in /var/log/openlibrary/ (esp. upstart.log)
  • If you notice a disk filling up rapidly or almost out of space... CREATE A BASILISK FILE (a 2GB placeholder file written with dd that we can delete later to regain enough room to run ls, etc.); see the sketch below
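
A sketch of creating such a placeholder; the filename and location are illustrative:

# Reserve 2GB now so it can be deleted later to regain working room
sudo dd if=/dev/zero of=/var/tmp/basilisk bs=1M count=2048
# In an emergency, free the space immediately:
# sudo rm /var/tmp/basilisk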