
tapiriik.com infrastructure

Collin edited this page Jun 28, 2014 · 2 revisions

In case you ever wondered what's going on behind the scenes to bring you https://tapiriik.com

General Layout

Under the hood, tapiriik.com runs on servers with one of three roles: web, sync, and database. Once upon a time, these were all the same VPS (with 512MB of RAM!); now: not so much.

Thankfully, this means that components of the system can go down without everything grinding to a halt - in theory. In practice, the extra complexity has led to a whole new world of mistakes to make and problems to encounter.

Web

The web servers deal purely with the tapiriik web front-end. Activities never pass through them, but they do make direct calls to remote sites (for authentication) and receive web-hook callbacks (for new activity notifications). There's no direct interaction between the web and sync servers; everything is effected via the shared database.

No real heavy lifting happens on them, except for the diagnostics dashboard (which performs too many queries so it can include too much information in a page that's too long and takes too long to load).

They're behind a load-balancer to make sure things stay up and running as much as possible, even if synchronization is broken or backed up.

Boring tech trivia: load balancing, HTTPS termination, and static file handling are done by nginx, while uwsgi runs the Django webapp. Supervisor keeps everyone else up and running.

Currently, the web servers run local redis instances to store stateful data that I don't want to put into the cookie-based sessions (e.g. OAuth request credentials). Eventually I'll move to a central instance to make the load balancing more flexible.
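For illustration, here's a rough stand-in for that pattern. The real thing is redis (think SETEX-style expiring keys); the class, key names, and TTL below are invented for this sketch, not pulled from the actual codebase:

```python
import time

class EphemeralStore:
    """Stand-in for a per-server redis instance: values expire after a TTL,
    mirroring redis SETEX semantics for short-lived OAuth request credentials."""
    def __init__(self):
        self._data = {}

    def setex(self, key, ttl_seconds, value):
        # Store the value alongside its absolute expiry time.
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._data[key]  # lazily evict expired entries
            return None
        return value

# During the OAuth dance, the request credentials never touch the
# cookie-based session - they're parked here until the callback arrives.
store = EphemeralStore()
store.setex("oauth_request:abc123", 600, "request-token-secret")
print(store.get("oauth_request:abc123"))  # → request-token-secret
```

The catch, of course, is that a per-server store only works while the load balancer keeps a user on the same web server for the whole OAuth round-trip - hence the plan to centralize it.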

Sync

The most boring part of the system to look at. A fleet of servers each runs a handful of synchronization processes (workers) that continuously drain the queue of users waiting for synchronization. Once a user is available, a single worker locks the record, then handles the entirety of the synchronization task. On completion, all the remaining data (errors, new activity records) is written back to the database, and the user is unlocked. It's at this point that automatic synchronization is implemented: users are immediately re-scheduled an hour into the future.

Occasionally, there's some mishap in my code that causes the worker to crash, so a watchdog process runs frequently to check for stranded users and unlock them automatically. Users are processed in original queue order, so those hapless users will be retried immediately.
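The watchdog's job boils down to something like the following sketch. The 30-minute timeout, field names, and record shape are all assumptions for illustration, not the real values:

```python
from datetime import datetime, timedelta

STRANDED_AFTER = timedelta(minutes=30)  # assumed timeout, not from the wiki

def release_stranded(users, now):
    """Unlock users whose worker crashed mid-sync. They keep their original
    queue position, so they'll be retried immediately."""
    released = []
    for user in users:
        if user["locked_by"] and now - user["locked_at"] > STRANDED_AFTER:
            user["locked_by"] = None
            user["locked_at"] = None
            released.append(user["_id"])
    return released

users = [
    {"_id": 7, "locked_by": "worker-3", "locked_at": datetime(2014, 6, 28, 8, 0)},
    {"_id": 8, "locked_by": None, "locked_at": None},
]
print(release_stranded(users, datetime(2014, 6, 28, 9, 0)))  # → [7]
```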

In order to diagnose issues like these (and really, any other problem with a specific account), extensive logs are kept on the individual synchronization servers. When I want to retrieve a user's synchronization history, I use a fabric command (see below) to scan all the sync servers, organize the files in the appropriate order, and pull them down incrementally for further inspection. I briefly considered automatically pushing logs to a central location, but that'd be a whole lot of bandwidth for log files that, generally speaking, don't need to be looked at.
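The fabric task itself isn't public (it lives in the credentials-laden fabfile mentioned below), but the "organize the files in the appropriate order" step is simple enough to sketch. The per-server filename convention here is invented - assume one timestamped log file per sync run:

```python
# Hypothetical filenames: <user_id>-<ISO timestamp>.log, one per sync run,
# scattered across however many sync servers handled that user.
pulled = {
    "sync1": ["u42-2014-06-27T10:00:00.log", "u42-2014-06-28T09:00:00.log"],
    "sync2": ["u42-2014-06-27T22:30:00.log"],
}

def ordered_history(pulled_files):
    """Flatten per-server file lists into one chronological history."""
    everything = [(name, server)
                  for server, names in pulled_files.items()
                  for name in names]
    # The ISO timestamp is embedded in the name, so a plain sort is chronological.
    return sorted(everything)

for name, server in ordered_history(pulled):
    print(server, name)
```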

Sync servers also run the sync-trigger polling workers. These allow a major optimization in checking for new activities on certain sites (Garmin Connect): instead of checking each individual account, I can check the "friend activity feed" of a single account which has "friended" all those individual accounts. Each user with a new activity in the feed is marked as such, and at their next synchronization their individual service account is checked and the new activity is synchronized. Service accounts not marked as having a new activity are skipped entirely.
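Stripped of all the scraping details, the marking logic is just a feed scan against a lookup table. The function names, field names, and UIDs below are made up for this sketch:

```python
def mark_triggered(feed_entries, users_by_service_uid):
    """Scan one friend-activity feed and flag each matching user, so the
    next sync only polls accounts known to have something new."""
    for entry in feed_entries:
        user = users_by_service_uid.get(entry["owner_uid"])
        if user is not None:
            user["trigger_new_activity"] = True

def should_poll_service(user):
    # Accounts not flagged get skipped entirely at sync time.
    return user.get("trigger_new_activity", False)

users_by_service_uid = {"garmin-9001": {"_id": 1}, "garmin-9002": {"_id": 2}}
feed = [{"owner_uid": "garmin-9001", "activity": "Morning Run"}]
mark_triggered(feed, users_by_service_uid)
print(should_poll_service(users_by_service_uid["garmin-9001"]))  # → True
print(should_poll_service(users_by_service_uid["garmin-9002"]))  # → False
```

One feed request thus stands in for hundreds of per-account polls, which is the whole point of the optimization.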

Like the web servers, all these processes are kept alive by supervisor.

Database

This is actually the most boring part. Primarily, the database servers run the MongoDB replica set containing all user, connection, and activity data. There are three databases, tapiriik (with all the data I really care about), tapiriik_cache (which can get dropped and rebuilt whenever I accidentally let the primary run out of disk space), and tapiriik_tz (a world timezone map).

Additionally, they run the message queue (RabbitMQ) that drives the sync-trigger polling workers. A cron script regularly enqueues a bunch of tasks which are picked up by the workers on each synchronization server.

And, finally, a single server runs the statistics cron job to calculate synchronization rate, load factor, queue times, and most importantly, the list of most-frequent synchronization exceptions.
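The "most-frequent exceptions" part of that job is essentially a tally over the error records the sync workers write back. The record shape here is assumed for illustration:

```python
from collections import Counter

# Hypothetical error records, as written back by sync workers.
sync_errors = [
    {"user": 1, "message": "HTTP 500 from Garmin Connect"},
    {"user": 2, "message": "HTTP 500 from Garmin Connect"},
    {"user": 3, "message": "No such activity"},
]

def top_exceptions(errors, n=10):
    """The part of the stats job I care about most: which failure modes
    are hitting the most users right now."""
    return Counter(e["message"] for e in errors).most_common(n)

print(top_exceptions(sync_errors))
# → [('HTTP 500 from Garmin Connect', 2), ('No such activity', 1)]
```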

Management

Again, once upon a time this was all on one server, and all my management took place inside an SSH session with nano. Nowadays I have an ever-growing fabfile that handles nearly every task: deploying new servers, rolling out upgrades, reconfiguring workers, updating DNS records, retrieving user logs, bulk user actions, etc. Unfortunately, it's also the file that contains all the credentials to everything, so it won't be seeing the light of day any time soon.

Also, MMS rocks for MongoDB monitoring. Occasionally I wish that someone would write a tapiriik monitoring service like it for me to use, but then I realize the person would need to be me, and the idea dies quickly thereafter.

Hosting

Everything runs on top of Linode and DigitalOcean VPSes of various dimensions. If I were made of cash (or, rather, if I charged a lot more than $2/year) I might use Heroku. If I had enough time on my hands to deal with their complexity, I'd consider places like AWS or Azure or OpenShift.

Lessons

These are all things I learned while growing tapiriik from a service used by me alone, running on a single tiny server, to what you see above. I wouldn't follow all of these tips for every service I write, but they're good to have in mind.

  • Don't assume your web front-end will always run on a single server inside a single process.
  • Don't assume anything will remain in the same memory space, file system, machine, LAN, or datacentre as anything else. It'll be a pain to fix these assumptions after the fact.
  • Don't use lesser-known hosting providers. Whatever money you save will not be worth waking up to an email stating that all your services have been suspended because they were "using too resources [sic]."
  • Do use a configuration/deployment tool like fabric, so you can quickly get back on your feet after the above situation.
  • If you start making money, don't be afraid to spend it on things like monitoring and backup services. Yeah, you could write those yourself, but your time is better spent working on your own project instead of reinventing the wheel to save a few dollars (only to later discover that the wheel you came up with was, in fact, only suitable for travelling forward, and would set your car on fire at the first attempt to reverse).
  • Send monitoring alerts to an email address that does not push notifications to your phone. Maybe it's just me, but there are a lot more important - and rewarding - things to do in this world than worry about how "Queues: Total has gone above 30 (avg/sec)."