
Remove symlinks and serve static files with Python #4550

Closed
ericholscher opened this issue Aug 21, 2018 · 4 comments
Labels
Feature New feature Needed: design decision A core team decision is required Operations Operations or server issue
ericholscher commented Aug 21, 2018

Currently all of our documentation pages are served by Nginx. We maintain this with a web of symlinks that is pretty error prone and hard to maintain. We also throw a large number of errors in these code paths, trying to manipulate the filesystem in wonky ways.

We already have a pattern that solves this problem: Sendfile. It allows us to handle processing of the request in Python, but still have Nginx serve the file.

This logic would be a combination of our current redirect logic, which doesn't hit the DB, and existing Sendfile support. These nginx docs cover the usage: https://www.nginx.com/resources/wiki/start/topics/examples/x-accel/ -- we already do this in a couple of places.

The primary difference is that we need to be able to do it without hitting the database.
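The X-Accel pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual Read the Docs code: the `serve_docs` function, the URL layout, and the `/proxied/` internal location name are all assumptions.

```python
# Minimal sketch of the X-Accel-Redirect ("Sendfile") pattern: the Python
# application decides WHICH file to serve, Nginx does the actual serving.
# The "/proxied/" prefix is a hypothetical internal Nginx location, e.g.:
#
#     location /proxied/ {
#         internal;
#         alias /home/docs/user_builds/;
#     }

def serve_docs(request_path):
    """Return response headers telling Nginx which file to serve internally."""
    return {
        # Nginx intercepts this header and serves the file itself.
        "X-Accel-Redirect": "/proxied" + request_path,
        # Empty Content-Type lets Nginx pick the MIME type from the extension.
        "Content-Type": "",
    }

headers = serve_docs("/pip/en/latest/index.html")
print(headers["X-Accel-Redirect"])  # /proxied/pip/en/latest/index.html
```

The key property for the requirement below is that nothing in this path needs a database query: the mapping from URL to header is pure string manipulation.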

Benefits

  • We remove the symlink code which is quite complex and not valuable
  • We are able to do "real" redirects, not just on 404 pages
  • We move logic from nginx into Python, and give ourselves a lot more flexibility.

Considerations

  • Should we just be moving all our static file serving to cloud files/S3, instead of managing them on disk?
  • Is it worth all the work if we don't get additional user benefits beyond redirects?

Requirements

  • All static files must continue being served without hitting the database

Implementation

  • Write more data into the metadata.json for each project, allowing us to make more decisions without hitting the database (existing code: https://github.com/rtfd/readthedocs.org/blob/master/readthedocs/projects/tasks.py#L1125)
  • Write a small Python proxy that reads metadata.json and then serves the correct file off disk. We would only need to keep the user_builds directory around, and the Python app would be in charge of translating the URL to the filepath to serve, accounting for subprojects, translations, etc.
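The URL-to-filepath translation step could look something like this. The metadata.json schema shown here is hypothetical (the real file is written by the build task linked above and may differ), and `resolve_filepath` only handles the plain and translated cases; subprojects would follow the same idea.

```python
import json
import posixpath

# Hypothetical metadata.json contents -- the real schema may differ.
METADATA = json.loads("""
{
  "slug": "pip",
  "default_version": "latest",
  "subprojects": {},
  "translations": {"es": "pip-es"}
}
""")

def resolve_filepath(metadata, url_path, builds_root="/home/docs/user_builds"):
    """Translate a docs URL into a path under user_builds, with no DB query.

    Sketch only: handles "/<lang>/<version>/<page>" URLs, mapping translated
    languages to their own project slug.
    """
    lang, version, *page = url_path.strip("/").split("/")
    # A translation maps to a different project slug on disk.
    project = metadata["translations"].get(lang, metadata["slug"])
    page_path = "/".join(page) or "index.html"
    return posixpath.join(builds_root, project, "rtd-builds", version, page_path)

print(resolve_filepath(METADATA, "/en/latest/install.html"))
# /home/docs/user_builds/pip/rtd-builds/latest/install.html
print(resolve_filepath(METADATA, "/es/latest/"))
# /home/docs/user_builds/pip-es/rtd-builds/latest/index.html
```

The resolved path would then go into the X-Accel-Redirect header for Nginx to serve.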
agjohnson (Contributor) commented:
So I'm:

  • +0 on dropping symlinks. I'm 👍 on the idea, but weighing the work required to remove symlinks and develop features for our redirect application, I feel like this will be another distraction from building product features.
  • +0 on moving files to s3/azure. We can sendfile to an external azure storage URL, but perhaps we should first explore serving docs directly from blob storage. Serving from storage blobs is a difficult problem, and it might not even be possible with the amount of additional logic we need (application redirects, etc). We don't get all the benefits if we sendfile to blob storage. But if this doesn't work, sendfile to Azure blob could be a great option to reduce storage duplication.

Instead of reimplementing serve_docs, could we serve docs through Django, but add operations pieces like caching or a CDN in front? This would only be acceptable if we can ensure cache serving is seamless when the database goes down or latency increases. The benefit here is that we don't have additional work on our application.

humitos commented Jan 18, 2019

My position here is to make this move in two phases:

  1. Remove all symlinks and serve files from our disks using the NGINX header: this is probably still a good amount of work, but we will clean up the code a lot by removing hacky decisions, and enable other features such as better redirects.

  2. Serve files from blob storage: once phase 1 is completed, we could work on all the infrastructure needed to upload the files to blob storage and start exploring that path (without breaking our existing serving), and when we have something testable, we can just switch where the NGINX header points to.
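The "switch where the header points" idea from phase 2 could be sketched as two variants of the same internal Nginx location. This is illustrative config only; the location name, paths, and storage hostname are made up.

```nginx
# Phase 1: the application's X-Accel-Redirect header resolves to local disk.
location /proxied/ {
    internal;
    alias /home/docs/user_builds/;
}

# Phase 2: same header from the application, but the internal location now
# proxies to blob storage instead of aliasing local disk (hostname is
# illustrative):
#
# location /proxied/ {
#     internal;
#     proxy_pass https://example.blob.core.windows.net/builds/;
# }
```

The application code would be unchanged between the two phases; only the Nginx side moves.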

ericholscher (Member, Author) commented:
we can just switch where the NGINX header points to.

This only works on internal files. We could proxy to an external file host, but that would add a decent bit of latency. Probably <30ms, but worth thinking about. This is what packages.python.org is doing currently w/ S3, so might be worth asking them how it's working.

humitos commented Apr 28, 2019

We could proxy to an external file host, but that would add a decent bit of latency. Probably <30ms, but worth thinking about

I think we decided to go in this direction, all together with "El Proxito" (an app that will receive all the requests, translate a URL into the path of that file in blob storage, and proxy that file).

I'm closing this issue here. We can revisit it if we need it when implementing El Proxito.

@humitos humitos closed this as completed Apr 28, 2019