
Partition the version_downloads table #2203

Closed · wants to merge 4 commits

Conversation

@sgrif (Contributor) commented Feb 19, 2020

⚠️ This pull request contains destructive commits, which need to be deployed separately and by themselves. In addition, there is a binary which needs to be run to backfill data between two of the commits. Please do not merge this unless you intend to deploy it yourself, which I expect will take between 30 minutes and an hour total. Each commit contains deployment instructions. Feel free to review without merging and I can deploy it myself. ⚠️

Please review the individual commits of this pull request: it includes a binary which will be run once in production and is then deleted in a later commit.

This attempts to address some performance issues related to the version_downloads table. This table has some interesting characteristics:

  • Our highest-traffic endpoint does nothing but write to this table. Keeping that endpoint as fast as possible is priority 1.
  • Almost all slow queries against this table occur in a background job, where we don't care about the runtime of its queries.
  • Because this table is written far more frequently than it's read, and we generally don't care about the speed of irregular reads, we avoid indexes on it: they create a massive amount of bloat and slow down writes.

While we don't care about the runtime of the queries in update_downloads, those queries are falling back to a seq scan of the entire table, and I believe they evict the entire table cache when they do. This is really hard to measure directly without production write traffic, so all we have to go on is the runtime of the problem queries; stats on those are at the bottom of this PR description. I do believe this will improve our table cache hit rate by reducing cache churn, since we will be loading much less data at once.

Partitioning a table is the act of splitting it into multiple smaller tables along some known axis (in this case, date). We can generally pretend it's still a single table (with some exceptions), but under the hood the query planner is able to prune any partitions which couldn't contain results for the query. When partitioning is applicable, it can be cheaper than traversing an index, since it changes the physical layout of the data rather than relying on a btree to prune it. In particular, indexes stop being helpful above a certain data size, while partitioning does not.
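For illustration, a minimal sketch of this layout, assuming the existing table's columns and the quarterly ranges that show up in the query plans below (names and defaults are assumptions, not the literal migration):

```sql
CREATE TABLE version_downloads_part (
    version_id INTEGER NOT NULL,
    downloads  INTEGER NOT NULL DEFAULT 1,
    counted    INTEGER NOT NULL DEFAULT 0,
    date       DATE    NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (version_id, date)
) PARTITION BY RANGE (date);

-- Recent data gets one partition per quarter (matching the
-- version_downloads_2019_q4 / _2020_q1 partitions in the plans below):
CREATE TABLE version_downloads_2020_q1
    PARTITION OF version_downloads_part
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

-- The planner prunes every partition whose range cannot overlap the
-- WHERE clause, so this touches at most two quarters:
SELECT version_id, SUM(downloads)
FROM version_downloads_part
WHERE date > CURRENT_DATE - INTERVAL '90 days'
GROUP BY version_id;
```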

I had hoped that a BRIN index would bring similar improvements, since our write patterns should be perfect for it. However, this just hasn't been true in practice, and the index is barely pruning any data. I believe a full vacuum would correct this, but that would require unacceptable levels of downtime.
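For context, a BRIN index on date is about as cheap as an index gets (a sketch; I'm assuming index_version_downloads_by_date, which appears in the plans below, is the BRIN index in question):

```sql
-- BRIN stores only per-block-range summaries (e.g. min/max date), so it
-- is tiny and cheap to maintain, but it only prunes well while the heap
-- stays physically ordered by date; bloat and out-of-order writes
-- destroy that correlation.
CREATE INDEX index_version_downloads_by_date
    ON version_downloads USING brin (date);
```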

Typically when partitioning you'd target a partition size much larger than what we're using here. That's because I've chosen our partition range based on what our slowest query (refreshing recent downloads) will access, rather than targeting a row count. For older data I grouped things into larger partitions because the row count was just so low.

As of PG 12, partitions still need to be created manually. I've handled this by adding a new background job which checks whether a partition for 1 year in the future exists, and creates it if not. To ensure we don't start losing download data if this somehow goes a full year without running, I've also created a default partition. If any data ever ends up in that partition, it will page us.
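A sketch of both pieces, with assumed names (the paging hook itself isn't shown):

```sql
-- Safety net: rows outside every explicit range land here instead of
-- failing the insert. Any row in this table means the creation job has
-- been dead for a year, and should trip an alert.
CREATE TABLE version_downloads_default
    PARTITION OF version_downloads_part DEFAULT;

-- What the background job effectively runs for a quarter one year out:
CREATE TABLE IF NOT EXISTS version_downloads_2021_q1
    PARTITION OF version_downloads_part
    FOR VALUES FROM ('2021-01-01') TO ('2021-04-01');
```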

Since this changes the physical layout of the data, we can't just apply it to the existing table in place. Instead we follow the same process we would for any destructive action that would lock the table for an unreasonable amount of time:

  • Create the new table, add a trigger to the old so any writes are applied to the new table
  • Separately backfill old data (being careful not to overwhelm the server)
  • Swap the tables

We can't partition on a boolean, which means we need to add a date range to any queries filtering on the processed column to ensure we read a sufficiently small data set. The only reason this column seemed to exist was so update-downloads could go an unlimited amount of time without running and then catch up however far back it needed to. That just isn't realistic in practice anymore, so I've set it to only look 1 week back. This made the processed column pointless, so I removed it from the new table.
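Roughly, the query change looks like this (a sketch; the real queries appear in the plans at the bottom):

```sql
-- Before: scan driven by the boolean flag, unbounded in time.
SELECT * FROM version_downloads
WHERE NOT processed AND downloads != counted;

-- After: a one-week window bounds the scan to the newest partition(s),
-- which makes the flag redundant.
SELECT * FROM version_downloads
WHERE downloads != counted
  AND date BETWEEN CURRENT_DATE - INTERVAL '7 days' AND CURRENT_DATE;
```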

All of these partitions do break some workflows, however. For one, they're included in schema.rs by default. I've manually excluded the ones created in this migration, but that puts us in a situation where the production schema won't be fully reflected by schema.rs (and we probably don't want it to be). I'm not sure what we want to do here, since some folks might run the partition-creation job locally to test it. I could change the migration to only create the default partition in any env other than production, and have the job exit early as well; folks who want to test those locally would be responsible for cleaning up after themselves, and we only need the default partition in dev. I could also add functionality to Diesel to exclude tables matching a regex, or to exclude all partitions (though the latter would require switching to raw SQL in the one place we directly access a partition).

Additionally, the db dump is currently broken. It wants permissions on all the partitions individually (which we don't want to dump), and the method it uses for copying doesn't work with partitioned tables. I haven't looked into either of these yet; they felt tangential to what is already a beefy PR. Since we did say the dumps are experimental, and we were willing to disable them if they broke, technically we could move forward with this before fixing them, but I don't think this is urgent enough to justify that.

⚠️ This commit is part of a PR containing destructive commits, which
must be deployed separately with steps in between. Do not deploy this
commit with other commits, and make sure you read the instructions
below. ⚠️

This adds the partitioned version of the table, a trigger that applies
new data to the new table, and a binary responsible for backfilling the
old data.

The `processed` column is notably missing. This is for two reasons. The
first is that we can't partition on a boolean column, so any queries
that were focused on `processed` will need some date filter to be
performant on the new table.

The second is that while the query planner prunes unused partitions for
selects, it does not appear to do so for updates. This caused the query
which set `processed = 't'` to perform unreasonably slowly against the
partitioned table. This might be fixed in PG 12, which I didn't test
against. We could also update the most recent partition directly, which
would fix the issue. But ultimately the `processed` column only really
existed so `update-downloads` could go an unlimited amount of time
without running and still recover. Realistically that's just not going
to happen, and with a sufficiently large margin (say 1 week), if we
really go that long without realizing that this is broken, we have much
bigger failures to worry about. Either way we need some date filter on
these queries to be performant. So while I think that's a good move even
in a vacuum, `processed` just stops having a purpose.

The trigger is pretty standard. I added the `IF NEW IS DISTINCT FROM
OLD` guard because I think we might keep the non-partitioned table
around for a bit after the transition, which means we'll want a trigger
updating that table as well. The reason we'd ever need both is a bit
dense for this commit (I'm happy to go into detail about the caveats of
atomic swaps in PG if anyone has strong concerns here), but the short
version is that during the swap there is a very short instant where
writes can happen to both tables.
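For the curious, a minimal sketch of a mirroring trigger in this style,
assuming the old table's columns and a `(version_id, date)` primary key
(illustrative, not the PR's exact function):

```sql
CREATE OR REPLACE FUNCTION mirror_version_downloads() RETURNS trigger AS $$
BEGIN
    INSERT INTO version_downloads_part (version_id, downloads, counted, date)
    VALUES (NEW.version_id, NEW.downloads, NEW.counted, NEW.date)
    ON CONFLICT (version_id, date) DO UPDATE
        SET downloads = EXCLUDED.downloads,
            counted   = EXCLUDED.counted;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mirror_inserts
    AFTER INSERT ON version_downloads
    FOR EACH ROW EXECUTE PROCEDURE mirror_version_downloads();

-- The "NEW IS DISTINCT FROM OLD" guard expressed as a trigger
-- condition: no-op updates are skipped, which matters once a second
-- trigger mirrors writes in the other direction.
CREATE TRIGGER mirror_updates
    AFTER UPDATE ON version_downloads
    FOR EACH ROW
    WHEN (NEW.* IS DISTINCT FROM OLD.*)
    EXECUTE PROCEDURE mirror_version_downloads();
```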

More likely we'll just move forward with the transition, and accept that
we'll have to manually reconcile the old table or lose some download
data if we need to revert.

We don't want the backfilling query to overwhelm the database to the
point where production traffic is affected. If we were using something
purpose-built for this, like https://github.com/soundcloud/lhm, we would
have a query that operates in rows per unit of time. However, writing
that query for a table without an incrementing ID is painful, and going
by date will touch at most 200,000 rows at once. While LHM's default of
40,000 rows per 0.1s is a very good default in my personal experience,
spiking to 5x that should be perfectly fine here.
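In SQL terms, each batch of the backfill is roughly the following (a
sketch of the binary's behavior, not its actual code):

```sql
-- One day's worth of rows per batch; ON CONFLICT makes the copy
-- idempotent against rows the trigger has already mirrored.
INSERT INTO version_downloads_part (version_id, downloads, counted, date)
SELECT version_id, downloads, counted, date
FROM version_downloads
WHERE date = '2019-06-01'
ON CONFLICT (version_id, date) DO NOTHING;
-- ...sleep briefly, advance to the next date, repeat...
```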

This commit does not include any changes that operate on the partitioned
table (and in fact no commit will be doing that until the final switch,
since the intention is for this new table to essentially operate
everywhere the old did, with the only changes being adding some date
filters).

Deployment instructions
=======================

After this commit is deployed, run `backfill_version_downloads`, and
ensure it exits without errors. A zero exit status will indicate that
the `version_downloads_part` table contains the same data as
`version_downloads`. The trigger on `version_downloads` will cause that
to continue to be the case.

Since the partitioned `version_downloads` table doesn't contain the
`processed` column, we need to remove any references to it before
actually doing a migration which would remove it.

Deployment instructions
=======================

This commit does not do anything destructive, and can technically be
deployed with the previous commit. It must be deployed *before* the next
commit, however. This commit is safe to roll back.

Note that as of this commit, the DB dump is broken, as it can't handle
partitioned tables. We're keeping the old table around in case we need
to revert this, though for the time being I'm just assuming we'll either
manually reconcile data or accept some loss of download counts if we
need to revert.

For some reason Diesel CLI isn't picking up the foreign key on the
partitioned table, so I regenerated the patch file.

Deployment instructions
=======================

This commit contains a destructive migration, and should be deployed on
its own.

This adds a new background job which is responsible for creating new
partitions for `version_downloads` 1 year in advance. The intent here
is to give an extremely large window to notice if for some reason this
job stops running. In the extremely unlikely event that a full year
passes, data will start going into the default partition and we will get
paged. Recovering from this will be a massive pain, but I've tried to
ensure it'll never happen.
@rust-highfive commented:

r? @jtgeibel

(rust_highfive has picked a reviewer for you, use r? to override)

@sgrif (Contributor Author) commented Feb 19, 2020

Query Performance

These tests were performed on a fork of the production database, so the
dataset reflects reality. However, since the fork does not receive
production traffic, neither the cache state nor the query planner
statistics mirror production, so take these with a grain of salt. I did
run the queries multiple times to at least get a lukewarm cache.

Refresh Recent Crate Downloads

You cannot get a meaningful EXPLAIN ANALYZE for refreshing a
materialized view, so we're only measuring the SELECT query used in
the view here. There's an additional 3-5s that occurs when actually
updating the view, which isn't affected by this change.

Before

Execution Time: 24107.687 ms

Query Plan
                                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=260222.67..262425.35 rows=19844 width=12) (actual time=24054.157..24083.842 rows=36080 loops=1)
   Group Key: versions.crate_id
   ->  Gather Merge  (cost=260222.67..262345.97 rows=19844 width=12) (actual time=24054.139..24093.944 rows=72082 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=259222.66..259232.59 rows=19844 width=12) (actual time=24038.073..24043.749 rows=36041 loops=2)
               Sort Key: versions.crate_id
               Sort Method: quicksort  Memory: 3226kB
               Worker 0:  Sort Method: quicksort  Memory: 3226kB
               ->  Partial HashAggregate  (cost=258879.83..258939.36 rows=19844 width=12) (actual time=24019.372..24027.653 rows=36041 loops=2)
                     Group Key: versions.crate_id
                     ->  Parallel Hash Join  (cost=5747.40..256699.28 rows=2180547 width=8) (actual time=592.725..23441.185 rows=1817862 loops=2)
                           Hash Cond: (version_downloads.version_id = versions.id)
                           ->  Parallel Bitmap Heap Scan on version_downloads  (cost=201.75..250008.81 rows=2180547 width=8) (actual time=545.116..22563.885 rows=1817862 loops=2)
                                 Recheck Cond: ((date > (CURRENT_DATE - '90 days'::interval)) AND (date <= CURRENT_DATE))
                                 Rows Removed by Index Recheck: 7448850
                                 Heap Blocks: lossy=74303
                                 ->  Bitmap Index Scan on index_version_downloads_by_date  (cost=0.00..16.40 rows=3769074 width=0) (actual time=7.901..7.901 rows=1480960 loops=1)
                                       Index Cond: ((date > (CURRENT_DATE - '90 days'::interval)) AND (date <= CURRENT_DATE))
                           ->  Parallel Hash  (cost=5115.07..5115.07 rows=123024 width=8) (actual time=46.829..46.829 rows=104572 loops=2)
                                 Buckets: 262144  Batches: 1  Memory Usage: 10240kB
                                 ->  Parallel Seq Scan on versions  (cost=0.00..5115.07 rows=123024 width=8) (actual time=0.014..22.338 rows=104572 loops=2)
 Planning Time: 13.081 ms
 Execution Time: 24107.687 ms

After

Execution Time: 2859.205 ms (~10x improvement)

Query Plan
                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=290423.69..292626.38 rows=19844 width=12) (actual time=2821.338..2848.636 rows=36080 loops=1)
   Group Key: versions.crate_id
   ->  Gather Merge  (cost=290423.69..292547.00 rows=19844 width=12) (actual time=2821.327..2845.286 rows=69848 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=289423.69..289433.61 rows=19844 width=12) (actual time=2813.357..2817.727 rows=34924 loops=2)
               Sort Key: versions.crate_id
               Sort Method: quicksort  Memory: 3228kB
               Worker 0:  Sort Method: quicksort  Memory: 3119kB
               ->  Partial HashAggregate  (cost=289080.86..289140.39 rows=19844 width=12) (actual time=2795.301..2803.307 rows=34924 loops=2)
                     Group Key: versions.crate_id
                     ->  Parallel Hash Join  (cost=5551.39..286954.39 rows=2126468 width=8) (actual time=118.070..2323.046 rows=1817862 loops=2)
                           Hash Cond: (version_downloads_2019_q4.version_id = versions.id)
                           ->  Parallel Append  (cost=5.73..280292.30 rows=2126471 width=8) (actual time=0.033..1483.926 rows=1817862 loops=2)
                                 Subplans Removed: 14
                                 ->  Parallel Seq Scan on version_downloads_2019_q4  (cost=0.00..38084.79 rows=1164031 width=8) (actual time=0.029..983.035 rows=999817 loops=2)
                                       Filter: ((date <= CURRENT_DATE) AND (date > (CURRENT_DATE - '90 days'::interval)))
                                       Rows Removed by Filter: 1203968
                                 ->  Parallel Seq Scan on version_downloads_2020_q1  (cost=0.00..14137.23 rows=962406 width=8) (actual time=0.023..768.265 rows=1636090 loops=1)
                                       Filter: ((date <= CURRENT_DATE) AND (date > (CURRENT_DATE - '90 days'::interval)))
                           ->  Parallel Hash  (cost=5115.07..5115.07 rows=123024 width=8) (actual time=117.250..117.250 rows=104572 loops=2)
                                 Buckets: 262144  Batches: 1  Memory Usage: 10272kB
                                 ->  Parallel Seq Scan on versions  (cost=0.00..5115.07 rows=123024 width=8) (actual time=0.013..90.574 rows=104572 loops=2)
 Planning Time: 13.999 ms
 Execution Time: 2859.205 ms

Load Uncounted Versions

We cannot partition on a boolean, so we need to add a date restriction
for this query to work well on the partitioned database. Since the old
query performed better with the date restriction as well (sometimes,
depending on whether the BRIN index was cooperating at that exact
moment), this comparison includes it on both.

Before

Execution Time: 5188.726 ms

Query Plan
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on version_downloads  (cost=15.14..223283.53 rows=81 width=17) (actual time=4309.323..5188.516 rows=3148 loops=1)
   Recheck Cond: ((date >= (CURRENT_DATE - '7 days'::interval)) AND (date <= CURRENT_DATE))
   Rows Removed by Index Recheck: 14677651
   Filter: ((NOT processed) AND (downloads <> counted))
   Rows Removed by Filter: 350539
   Heap Blocks: lossy=113837
   ->  Bitmap Index Scan on index_version_downloads_by_date  (cost=0.00..15.14 rows=368142 width=0) (actual time=7.165..7.165 rows=1139200 loops=1)
         Index Cond: ((date >= (CURRENT_DATE - '7 days'::interval)) AND (date <= CURRENT_DATE))
 Planning Time: 0.126 ms
 Execution Time: 5188.726 ms

After

Execution Time: 141.453 ms (~40x improvement)

Query Plan
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.09..298606.03 rows=349025 width=16) (actual time=13.171..141.314 rows=3148 loops=1)
   Subplans Removed: 15
   ->  Seq Scan on version_downloads_2020_q1  (cost=0.00..18660.54 rows=348970 width=16) (actual time=13.170..141.103 rows=3148 loops=1)
         Filter: ((downloads <> counted) AND (date <= CURRENT_DATE) AND (date >= (CURRENT_DATE - '7 days'::interval)))
         Rows Removed by Filter: 1632942
 Planning Time: 0.663 ms
 Execution Time: 141.453 ms

Set processed = 't' on old rows

Before

~6s on average in production, based on logs. (This cannot be reliably
measured in a repeatable way, since it requires rows to be affected and
write traffic at production levels to show speeds comparable to what our
logs show.)

After

No longer executed (∞x improvement)

Insert or update (e.g. the /download endpoint)

Before

Execution Time: 0.092 ms

Query Plan
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Insert on version_downloads  (cost=0.00..0.00 rows=1 width=21) (actual time=0.063..0.063 rows=0 loops=1)
   Conflict Resolution: UPDATE
   Conflict Arbiter Indexes: version_downloads_pkey
   Tuples Inserted: 0
   Conflicting Tuples: 1
   ->  Result  (cost=0.00..0.00 rows=1 width=21) (actual time=0.007..0.007 rows=1 loops=1)
 Planning Time: 0.045 ms
 Execution Time: 0.092 ms

After

Execution Time: 0.157 ms (~2x slower)

This one will grow linearly with the number of partitions. The
underlying performance regression is fixed in PG 12. In the short term,
this query is still well under a millisecond, so I'm not worried.
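For reference, the endpoint's upsert is roughly shaped like this (the
column list is an assumption; the conflict target matches the arbiter
index shown in the plans):

```sql
INSERT INTO version_downloads (version_id, date)
VALUES ($1, CURRENT_DATE)
ON CONFLICT (version_id, date) DO UPDATE
SET downloads = version_downloads.downloads + 1;
```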

Query Plan
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Insert on version_downloads_part  (cost=0.00..0.00 rows=1 width=16) (actual time=0.077..0.077 rows=0 loops=1)
   Conflict Resolution: UPDATE
   Conflict Arbiter Indexes: version_downloads_part_pkey
   Tuples Inserted: 0
   Conflicting Tuples: 1
   ->  Result  (cost=0.00..0.00 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=1)
 Planning Time: 0.031 ms
 Execution Time: 0.157 ms

@sgrif (Contributor Author) commented Feb 19, 2020

@smarnach I could use your help getting the db dump fixed

@smarnach (Contributor) commented:

I'm looking into fixing the database dump now.

@smarnach (Contributor) commented:

For the time being, the easiest way to fix the database dump is to completely exclude the version_downloads table from the dump. I think the dumps still provide value even without that data, and they will only be one third of the size. (See also #2078 for discussion about the future of the dumps.)

I plan to implement an ignore_tables setting in the configuration file of the dumps. Maybe I'll manage to do this on a long-distance flight tomorrow.

@sgrif (Contributor Author) commented Feb 23, 2020 via email

@smarnach (Contributor) commented:

I filed a proposed fix as a pull request against this branch.

@jtgeibel (Member) commented:

Pinging @kornelski on this PR with respect to changes (temporary and long term) to how we dump the version_downloads table. In the short term, it would be nice if we were at least exporting recent downloads (maybe the last 7 days).

@kornelski (Contributor) commented Feb 25, 2020

I think it's a good idea to remove the full version_downloads from the data dump: it's a huge CSV, and redownloading the full history every day makes cumulative bandwidth usage grow quadratically.

For me an ideal solution would be to export downloads in separate archives, one per day, containing only downloads from that day:

  • https://static.crates.io/downloads-2020-02-22.tar.gz
  • https://static.crates.io/downloads-2020-02-23.tar.gz
  • https://static.crates.io/downloads-2020-02-24.tar.gz

This way I could get downloads incrementally and catch up if necessary.

@smarnach (Contributor) commented:

@kornelski That approach would be relatively straightforward to implement, so I may do that. Give it a few weeks, though. :-/

For the time being, would you prefer to have version_downloads included in the single tarball, or would you prefer to have it not at all?

@kornelski (Contributor) commented:

For now I'd prefer to have it included. It's OK if it's truncated.

@smarnach (Contributor) commented:

@kornelski Since you are the only user of the database dumps I'm in contact with, I updated sgrif#1 to include version_downloads again. I hope to implement a better solution in the next few weeks.

@jtgeibel (Member) commented Mar 3, 2020

@sgrif I've gone through the commits and the approach (and results) look great to me! I'd just like to take another look at the SQL function to make sure I understand it.

Other than that, there's just the database export. From what I understand @smarnach has changes in sgrif#1 that should continue to export all download history for now. If you're fine with that PR then I think we can coordinate a time window to merge this and sequence through the deploy.

@sgrif (Contributor Author) commented Mar 10, 2020

I didn't get a chance to look today, but I will on Thursday.

@bluetech commented:

> I had hoped that a BRIN index would have similar improvements, since our write patterns should be perfect for it. However this just hasn't been true in practice, and the index is barely pruning any data. I believe a full vacuum would correct this, but that would require unacceptable levels of downtime.

Have you considered pg_repack?

@bors (Contributor) commented Jul 5, 2020

☔ The latest upstream changes (presumably #2539) made this pull request unmergeable. Please resolve the merge conflicts.

@Turbo87 (Member) commented Nov 26, 2020

CI is failing, we seem to have some conflicts, and there generally hasn't been much movement here, so I'll close this PR for now. Feel free to reopen if you want to pick this back up :)
