
Partition the version_downloads table #2203

Closed · wants to merge 4 commits

Conversation

@sgrif (Contributor) commented Feb 19, 2020

⚠️ This pull request contains destructive commits, which need to be deployed separately and by themselves. In addition, there is a binary which needs to be run to backfill data between two of the commits. Please do not merge this unless you intend to deploy it yourself, which I expect will take between 30 minutes and an hour total. Each commit contains deployment instructions. Feel free to review without merging and I can deploy it myself. ⚠️

Please review the individual commits of this pull request: it includes a binary which will be run once in production and is then deleted in a later commit.

This attempts to address some performance issues related to the version_downloads table. This table has some interesting characteristics:

  • Our highest-traffic endpoint does nothing but write to this table. Keeping that endpoint as fast as possible is priority 1.
  • Almost all slow queries against this table occur in a background job, where we don't care about the runtime of its queries.
  • Because this table is written far more frequently than it's read, and we generally don't care about the speed of irregular reads, we avoid indexes on it: they create a massive amount of bloat and slow down writes.

While we don't care about the runtime of the queries in update_downloads, those queries are falling back to a seq scan of the entire table, and I believe they evict the entire table cache when they do. This is really hard to measure directly without production write traffic, so all we have to go on is the runtime of the problem queries; stats on those are at the bottom of this PR description. I do believe this will improve our table cache hit rate by reducing cache churn, since we will be loading much less data at once.

Partitioning a table is the act of splitting it into multiple smaller tables along some known axis (in this case, date). We can generally pretend it's still a single table (with some exceptions), but under the hood the query planner is able to prune any partitions which couldn't contain results for the query. When partitioning is applicable, it can be cheaper than traversing an index, since it changes the physical layout of the data rather than relying on a btree to prune it. In particular, indexes stop being helpful above a certain data size, while partitioning does not.
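For illustration, a minimal sketch of this layout, assuming the existing table's columns and the quarterly ranges that show up in the query plans below (names and defaults are assumptions, not the literal migration):

```sql
CREATE TABLE version_downloads_part (
    version_id INTEGER NOT NULL,
    downloads  INTEGER NOT NULL DEFAULT 1,
    counted    INTEGER NOT NULL DEFAULT 0,
    date       DATE    NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (version_id, date)
) PARTITION BY RANGE (date);

-- Recent data gets one partition per quarter (matching the
-- version_downloads_2019_q4 / _2020_q1 partitions in the plans below):
CREATE TABLE version_downloads_2020_q1
    PARTITION OF version_downloads_part
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

-- The planner prunes every partition whose range cannot overlap the
-- WHERE clause, so this touches at most two quarters:
SELECT version_id, SUM(downloads)
FROM version_downloads_part
WHERE date > CURRENT_DATE - INTERVAL '90 days'
GROUP BY version_id;
```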

I had hoped that a BRIN index would bring similar improvements, since our write patterns should be perfect for it. However, this just hasn't been true in practice, and the index is barely pruning any data. I believe a full vacuum would correct this, but that would require unacceptable levels of downtime.
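For context, a BRIN index on date is about as cheap as an index gets (a sketch; I'm assuming index_version_downloads_by_date, which appears in the plans below, is the BRIN index in question):

```sql
-- BRIN stores only per-block-range summaries (e.g. min/max date), so it
-- is tiny and cheap to maintain, but it only prunes well while the heap
-- stays physically ordered by date; bloat and out-of-order writes
-- destroy that correlation.
CREATE INDEX index_version_downloads_by_date
    ON version_downloads USING brin (date);
```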

Typically when partitioning you'd target a partition size much larger than what we're using here. That's because I've chosen our partition range based on what our slowest query (refreshing recent downloads) will access, rather than targeting a row count. For older data I grouped things into larger partitions because the row count was just so low.

As of PG 12, partitions still need to be created manually. I've handled this by adding a new background job which checks whether a partition for 1 year in the future exists, and creates it if not. To ensure we don't start losing download data if this somehow goes a full year without running, I've also created a default partition. If any data ever ends up in that partition, it will page us.
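A sketch of both pieces, with assumed names (the paging hook itself isn't shown):

```sql
-- Safety net: rows outside every explicit range land here instead of
-- failing the insert. Any row in this table means the creation job has
-- been dead for a year, and should trip an alert.
CREATE TABLE version_downloads_default
    PARTITION OF version_downloads_part DEFAULT;

-- What the background job effectively runs for a quarter one year out:
CREATE TABLE IF NOT EXISTS version_downloads_2021_q1
    PARTITION OF version_downloads_part
    FOR VALUES FROM ('2021-01-01') TO ('2021-04-01');
```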

Since this changes the physical layout of the data, we can't just apply it to the existing table in place. Instead we follow the same process we would for any destructive action that would lock the table for an unreasonable amount of time:

  • Create the new table, add a trigger to the old so any writes are applied to the new table
  • Separately backfill old data (being careful not to overwhelm the server)
  • Swap the tables

We can't partition on a boolean, which means we need to add a date range to any queries filtering on the processed column to ensure we read a sufficiently small data set. The only reason this column seemed to exist was so update-downloads could go an unlimited amount of time without running and then catch up however far back it needed to. That just isn't realistic in practice anymore, so I've set it to only look 1 week back. This made the processed column pointless, so I removed it from the new table.
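Roughly, the query change looks like this (a sketch; the real queries appear in the plans at the bottom):

```sql
-- Before: scan driven by the boolean flag, unbounded in time.
SELECT * FROM version_downloads
WHERE NOT processed AND downloads != counted;

-- After: a one-week window bounds the scan to the newest partition(s),
-- which makes the flag redundant.
SELECT * FROM version_downloads
WHERE downloads != counted
  AND date BETWEEN CURRENT_DATE - INTERVAL '7 days' AND CURRENT_DATE;
```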

All of these partitions do break some workflows, however. For one, they're included in schema.rs by default. I've manually excluded the ones created in this migration, but that puts us in a situation where the production schema won't be fully reflected by schema.rs (and we probably don't want it to be). I'm not sure what we want to do here, since some folks might run the partition-creation job locally to test it. I could change the migration to only create the default partition in any env other than production, and have the job exit early as well; folks who want to test those locally would be responsible for cleaning up after themselves, and we only need the default partition in dev. I could also add functionality to Diesel to exclude tables matching a regex, or to exclude all partitions (though the latter would require switching to raw SQL in the one place we directly access a partition).

Additionally, the db dump is currently broken. It wants permissions on all the partitions individually (which we don't want to dump), and the method it uses for copying doesn't work with partitioned tables. I haven't looked into either of these yet; they felt tangential to what is already a beefy PR. Since we did say the dumps are experimental, and we were willing to disable them if they broke, technically we could move forward with this before fixing them, but I don't think this is urgent enough to justify that.

⚠️ This commit is part of a PR containing destructive commits, which
must be deployed separately with steps in between. Do not deploy this
commit with other commits, and make sure you read the instructions
below. ⚠️

This adds the partitioned version of the table, a trigger that applies
new data to the new table, and a binary responsible for backfilling the
old data.

The `processed` column is notably missing. This is for two reasons. The
first is that we can't partition on a boolean column, so any queries
that were focused on `processed` will need some date filter to be
performant on the new table.

The second is that while the query planner prunes unused partitions for
selects, it does not appear to do so for updates. This caused the query
which set `processed = 't'` to perform unreasonably slowly against the
partitioned table. This might be fixed in PG 12, which I didn't test
against. We could also update the most recent partition directly, which
would fix the issue. But ultimately the `processed` column only really
existed so `update-downloads` could go an unlimited amount of time
without running and still recover. Realistically that's just not going
to happen, and with a sufficiently large margin (say 1 week), if we
really go that long without realizing that this is broken, we have much
bigger failures to worry about. Either way we need some date filter on
these queries to be performant. So while I think that's a good move even
in a vacuum, `processed` just stops having a purpose.

The trigger is pretty standard. I added the `IF NEW IS DISTINCT FROM
OLD` guard because I think we might keep the non-partitioned table
around for a bit after the transition, which means we'll want a trigger
updating that table as well. The reason we'd ever need both is a bit
dense for this commit (I'm happy to go into detail about the caveats of
atomic swaps in PG if anyone has strong concerns here), but the short
version is that during the swap there is a very short instant where
writes can happen to both tables.
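For the curious, a minimal sketch of a mirroring trigger in this style,
assuming the old table's columns and a `(version_id, date)` primary key
(illustrative, not the PR's exact function):

```sql
CREATE OR REPLACE FUNCTION mirror_version_downloads() RETURNS trigger AS $$
BEGIN
    INSERT INTO version_downloads_part (version_id, downloads, counted, date)
    VALUES (NEW.version_id, NEW.downloads, NEW.counted, NEW.date)
    ON CONFLICT (version_id, date) DO UPDATE
        SET downloads = EXCLUDED.downloads,
            counted   = EXCLUDED.counted;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mirror_inserts
    AFTER INSERT ON version_downloads
    FOR EACH ROW EXECUTE PROCEDURE mirror_version_downloads();

-- The "NEW IS DISTINCT FROM OLD" guard expressed as a trigger
-- condition: no-op updates are skipped, which matters once a second
-- trigger mirrors writes in the other direction.
CREATE TRIGGER mirror_updates
    AFTER UPDATE ON version_downloads
    FOR EACH ROW
    WHEN (NEW.* IS DISTINCT FROM OLD.*)
    EXECUTE PROCEDURE mirror_version_downloads();
```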

More likely we'll just move forward with the transition, and accept that
we'll have to manually reconcile the old table or lose some download
data if we need to revert.

We don't want the backfilling query to overwhelm the database to the
point where production traffic is affected. If we were using something
purpose-built for this, like https://github.com/soundcloud/lhm, we would
have a query that operates in rows per unit of time. However, writing
that query for a table without an incrementing ID is painful, and going
by date will touch at most 200,000 rows at once. While LHM's default of
40,000 rows per 0.1s is a very good default in my personal experience,
spiking to 5x that should be perfectly fine here.
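In SQL terms, each batch of the backfill is roughly the following (a
sketch of the binary's behavior, not its actual code):

```sql
-- One day's worth of rows per batch; ON CONFLICT makes the copy
-- idempotent against rows the trigger has already mirrored.
INSERT INTO version_downloads_part (version_id, downloads, counted, date)
SELECT version_id, downloads, counted, date
FROM version_downloads
WHERE date = '2019-06-01'
ON CONFLICT (version_id, date) DO NOTHING;
-- ...sleep briefly, advance to the next date, repeat...
```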

This commit does not include any changes that operate on the partitioned
table (and in fact no commit will be doing that until the final switch,
since the intention is for this new table to essentially operate
everywhere the old did, with the only changes being adding some date
filters).

Deployment instructions
=======================

After this commit is deployed, run `backfill_version_downloads`, and
ensure it exits without errors. A zero exit status will indicate that
the `version_downloads_part` table contains the same data as
`version_downloads`. The trigger on `version_downloads` will cause that
to continue to be the case.

Since the partitioned `version_downloads` table doesn't contain the
`processed` column, we need to remove any references to it before
actually doing a migration which would remove it.

Deployment instructions
=======================

This commit does not do anything destructive, and can technically be
deployed with the previous commit. It must be deployed *before* the next
commit, however. This commit is safe to roll back.

Note that as of this commit, the DB dump is broken, as it can't handle
partitioned tables. We're keeping the old table around in case we need
to revert this, though for the time being I'm just assuming we'll either
manually reconcile data or accept some loss of download counts if we
need to revert.

For some reason Diesel CLI isn't picking up the foreign key on the
partitioned table, so I regenerated the patch file.

Deployment instructions
=======================

This commit contains a destructive migration, and should be deployed on
its own.

This adds a new background job which is responsible for creating new
partitions for `version_downloads` 1 year in advance. The intent here
is to give an extremely large window to notice if for some reason this
job stops running. In the extremely unlikely event that a full year
passes, data will start going into the default partition and we will get
paged. Recovering from this will be a massive pain, but I've tried to
ensure it'll never happen.
@rust-highfive commented:

r? @jtgeibel

(rust_highfive has picked a reviewer for you, use r? to override)

@sgrif (Contributor Author) commented Feb 19, 2020

Query Performance

These tests were performed on a fork of the production database, so the
dataset reflects reality. However, since the fork does not receive
production traffic, neither the cache state nor the query planner
statistics mirror production, so take these with a grain of salt. I did
run the queries multiple times to at least get a lukewarm cache.

Refresh Recent Crate Downloads

You cannot get a meaningful EXPLAIN ANALYZE for refreshing a
materialized view, so we're only measuring the SELECT query used in
the view here. There's an additional 3-5s that occurs when actually
updating the view, which isn't affected by this change.

Before

Execution Time: 24107.687 ms

Query Plan
                                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=260222.67..262425.35 rows=19844 width=12) (actual time=24054.157..24083.842 rows=36080 loops=1)
   Group Key: versions.crate_id
   ->  Gather Merge  (cost=260222.67..262345.97 rows=19844 width=12) (actual time=24054.139..24093.944 rows=72082 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=259222.66..259232.59 rows=19844 width=12) (actual time=24038.073..24043.749 rows=36041 loops=2)
               Sort Key: versions.crate_id
               Sort Method: quicksort  Memory: 3226kB
               Worker 0:  Sort Method: quicksort  Memory: 3226kB
               ->  Partial HashAggregate  (cost=258879.83..258939.36 rows=19844 width=12) (actual time=24019.372..24027.653 rows=36041 loops=2)
                     Group Key: versions.crate_id
                     ->  Parallel Hash Join  (cost=5747.40..256699.28 rows=2180547 width=8) (actual time=592.725..23441.185 rows=1817862 loops=2)
                           Hash Cond: (version_downloads.version_id = versions.id)
                           ->  Parallel Bitmap Heap Scan on version_downloads  (cost=201.75..250008.81 rows=2180547 width=8) (actual time=545.116..22563.885 rows=1817862 loops=2)
                                 Recheck Cond: ((date > (CURRENT_DATE - '90 days'::interval)) AND (date <= CURRENT_DATE))
                                 Rows Removed by Index Recheck: 7448850
                                 Heap Blocks: lossy=74303
                                 ->  Bitmap Index Scan on index_version_downloads_by_date  (cost=0.00..16.40 rows=3769074 width=0) (actual time=7.901..7.901 rows=1480960 loops=1)
                                       Index Cond: ((date > (CURRENT_DATE - '90 days'::interval)) AND (date <= CURRENT_DATE))
                           ->  Parallel Hash  (cost=5115.07..5115.07 rows=123024 width=8) (actual time=46.829..46.829 rows=104572 loops=2)
                                 Buckets: 262144  Batches: 1  Memory Usage: 10240kB
                                 ->  Parallel Seq Scan on versions  (cost=0.00..5115.07 rows=123024 width=8) (actual time=0.014..22.338 rows=104572 loops=2)
 Planning Time: 13.081 ms
 Execution Time: 24107.687 ms

After

Execution Time: 2859.205 ms (~10x improvement)

Query Plan
                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=290423.69..292626.38 rows=19844 width=12) (actual time=2821.338..2848.636 rows=36080 loops=1)
   Group Key: versions.crate_id
   ->  Gather Merge  (cost=290423.69..292547.00 rows=19844 width=12) (actual time=2821.327..2845.286 rows=69848 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=289423.69..289433.61 rows=19844 width=12) (actual time=2813.357..2817.727 rows=34924 loops=2)
               Sort Key: versions.crate_id
               Sort Method: quicksort  Memory: 3228kB
               Worker 0:  Sort Method: quicksort  Memory: 3119kB
               ->  Partial HashAggregate  (cost=289080.86..289140.39 rows=19844 width=12) (actual time=2795.301..2803.307 rows=34924 loops=2)
                     Group Key: versions.crate_id
                     ->  Parallel Hash Join  (cost=5551.39..286954.39 rows=2126468 width=8) (actual time=118.070..2323.046 rows=1817862 loops=2)
                           Hash Cond: (version_downloads_2019_q4.version_id = versions.id)
                           ->  Parallel Append  (cost=5.73..280292.30 rows=2126471 width=8) (actual time=0.033..1483.926 rows=1817862 loops=2)
                                 Subplans Removed: 14
                                 ->  Parallel Seq Scan on version_downloads_2019_q4  (cost=0.00..38084.79 rows=1164031 width=8) (actual time=0.029..983.035 rows=999817 loops=2)
                                       Filter: ((date <= CURRENT_DATE) AND (date > (CURRENT_DATE - '90 days'::interval)))
                                       Rows Removed by Filter: 1203968
                                 ->  Parallel Seq Scan on version_downloads_2020_q1  (cost=0.00..14137.23 rows=962406 width=8) (actual time=0.023..768.265 rows=1636090 loops=1)
                                       Filter: ((date <= CURRENT_DATE) AND (date > (CURRENT_DATE - '90 days'::interval)))
                           ->  Parallel Hash  (cost=5115.07..5115.07 rows=123024 width=8) (actual time=117.250..117.250 rows=104572 loops=2)
                                 Buckets: 262144  Batches: 1  Memory Usage: 10272kB
                                 ->  Parallel Seq Scan on versions  (cost=0.00..5115.07 rows=123024 width=8) (actual time=0.013..90.574 rows=104572 loops=2)
 Planning Time: 13.999 ms
 Execution Time: 2859.205 ms

Load Uncounted Versions

We cannot partition on a boolean, so we need to add a date restriction
for this query to work well on the partitioned database. Since the old
query performed better with the date restriction as well (sometimes,
depending on whether the BRIN index was cooperating at that exact
moment), this comparison includes it on both.

Before

Execution Time: 5188.726 ms

Query Plan
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on version_downloads  (cost=15.14..223283.53 rows=81 width=17) (actual time=4309.323..5188.516 rows=3148 loops=1)
   Recheck Cond: ((date >= (CURRENT_DATE - '7 days'::interval)) AND (date <= CURRENT_DATE))
   Rows Removed by Index Recheck: 14677651
   Filter: ((NOT processed) AND (downloads <> counted))
   Rows Removed by Filter: 350539
   Heap Blocks: lossy=113837
   ->  Bitmap Index Scan on index_version_downloads_by_date  (cost=0.00..15.14 rows=368142 width=0) (actual time=7.165..7.165 rows=1139200 loops=1)
         Index Cond: ((date >= (CURRENT_DATE - '7 days'::interval)) AND (date <= CURRENT_DATE))
 Planning Time: 0.126 ms
 Execution Time: 5188.726 ms

After

Execution Time: 141.453 ms (~40x improvement)

Query Plan
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.09..298606.03 rows=349025 width=16) (actual time=13.171..141.314 rows=3148 loops=1)
   Subplans Removed: 15
   ->  Seq Scan on version_downloads_2020_q1  (cost=0.00..18660.54 rows=348970 width=16) (actual time=13.170..141.103 rows=3148 loops=1)
         Filter: ((downloads <> counted) AND (date <= CURRENT_DATE) AND (date >= (CURRENT_DATE - '7 days'::interval)))
         Rows Removed by Filter: 1632942
 Planning Time: 0.663 ms
 Execution Time: 141.453 ms

Set processed = 't' on old rows

Before

~6s on average in production, based on logs. (This cannot be reliably
measured in a repeatable way, since it requires rows to be affected and
write traffic at production levels to show speeds comparable to what our
logs show.)

After

No longer executed (∞x improvement)

Insert or update (e.g. the /download endpoint)

Before

Execution Time: 0.092 ms

Query Plan
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Insert on version_downloads  (cost=0.00..0.00 rows=1 width=21) (actual time=0.063..0.063 rows=0 loops=1)
   Conflict Resolution: UPDATE
   Conflict Arbiter Indexes: version_downloads_pkey
   Tuples Inserted: 0
   Conflicting Tuples: 1
   ->  Result  (cost=0.00..0.00 rows=1 width=21) (actual time=0.007..0.007 rows=1 loops=1)
 Planning Time: 0.045 ms
 Execution Time: 0.092 ms

After

Execution Time: 0.157 ms (~2x slower)

This one will grow linearly with the number of partitions. The
underlying performance regression is fixed in PG 12. In the short term,
this query is still well under a millisecond, so I'm not worried.
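For reference, the endpoint's upsert is roughly shaped like this (the
column list is an assumption; the conflict target matches the arbiter
index shown in the plans):

```sql
INSERT INTO version_downloads (version_id, date)
VALUES ($1, CURRENT_DATE)
ON CONFLICT (version_id, date) DO UPDATE
SET downloads = version_downloads.downloads + 1;
```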

Query Plan
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Insert on version_downloads_part  (cost=0.00..0.00 rows=1 width=16) (actual time=0.077..0.077 rows=0 loops=1)
   Conflict Resolution: UPDATE
   Conflict Arbiter Indexes: version_downloads_part_pkey
   Tuples Inserted: 0
   Conflicting Tuples: 1
   ->  Result  (cost=0.00..0.00 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=1)
 Planning Time: 0.031 ms
 Execution Time: 0.157 ms

@sgrif (Contributor Author) commented Feb 19, 2020

@smarnach I could use your help getting the db dump fixed

@smarnach (Contributor) commented:

I'm looking into fixing the database dump now.

@smarnach (Contributor) commented:

For the time being, the easiest way to fix the database dump is to completely exclude the version_downloads table from the dump. I think the dumps still provide value even without that data, and they will only be one third of the size. (See also #2078 for discussion about the future of the dumps.)

I plan to implement an ignore_tables setting in the configuration file of the dumps. Maybe I'll manage to do this on a long-distance flight tomorrow.

@sgrif (Contributor Author) commented Feb 23, 2020 via email

@smarnach (Contributor) commented:

I filed a proposed fix as a pull request against this branch.

@jtgeibel (Member) commented:

Pinging @kornelski on this PR with respect to changes (temporary and long term) to how we dump the version_downloads table. In the short term, it would be nice if we were at least exporting recent downloads (maybe the last 7 days).

@kornelski (Contributor) commented Feb 25, 2020

I think it's a good idea to remove the full version_downloads from the data dump: it's a huge CSV, and redownloading the full history every day makes cumulative bandwidth usage grow quadratically.

For me an ideal solution would be to export downloads in separate archives, one per day, containing only downloads from that day:

  • https://static.crates.io/downloads-2020-02-22.tar.gz
  • https://static.crates.io/downloads-2020-02-23.tar.gz
  • https://static.crates.io/downloads-2020-02-24.tar.gz

This way I could get downloads incrementally and catch up if necessary.

@smarnach (Contributor) commented:

@kornelski That approach would be relatively straightforward to implement, so I may do that. Give it a few weeks, though. :-/

For the time being, would you prefer to have version_downloads included in the single tarball, or would you prefer to have it not at all?

@kornelski (Contributor) commented:

For now I'd prefer to have it included. It's OK if it's truncated.

@smarnach (Contributor) commented:

@kornelski Since you are the only user of the database dumps I'm in contact with, I updated sgrif#1 to include version_downloads again. I hope to implement a better solution in the next few weeks.

@jtgeibel (Member) commented Mar 3, 2020

@sgrif I've gone through the commits and the approach (and results) look great to me! I'd just like to take another look at the SQL function to make sure I understand it.

Other than that, there's just the database export. From what I understand @smarnach has changes in sgrif#1 that should continue to export all download history for now. If you're fine with that PR then I think we can coordinate a time window to merge this and sequence through the deploy.

@sgrif (Contributor Author) commented Mar 10, 2020

I didn't get a chance to look today, but I will on Thursday.

@bluetech commented:

> I had hoped that a BRIN index would have similar improvements, since our write patterns should be perfect for it. However this just hasn't been true in practice, and the index is barely pruning any data. I believe a full vacuum would correct this, but that would require unacceptable levels of downtime.

Have you considered pg_repack?

@bors (Contributor) commented Jul 5, 2020

☔ The latest upstream changes (presumably #2539) made this pull request unmergeable. Please resolve the merge conflicts.

@Turbo87 (Member) commented Nov 26, 2020

CI is failing, we seem to have some conflicts, and there generally hasn't been much movement here, so I'll close this PR for now. Feel free to reopen if you want to pick this back up :)
