
feat: Adding the ability to prioritize tasks #519

Open · wants to merge 27 commits into master

Conversation

simeq

@simeq simeq commented Jul 27, 2024

This change introduces an opt-in task prioritization mechanism. It is disabled by default and can be enabled using the enablePrioritization() method.
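
For illustration, a minimal sketch of what opting in might look like (method names beyond the existing Scheduler.create(..)/threads(..)/build()/schedule(..) API follow this PR's description and should be treated as assumptions):

    import java.time.Instant;
    import javax.sql.DataSource;
    import com.github.kagkarlsson.scheduler.Scheduler;
    import com.github.kagkarlsson.scheduler.task.helper.OneTimeTask;
    import com.github.kagkarlsson.scheduler.task.helper.Tasks;

    class PrioritizationSketch {
      static void start(DataSource dataSource) {
        OneTimeTask<Void> myTask =
            Tasks.oneTime("my-task").execute((taskInstance, executionContext) -> {
              // do the work
            });

        Scheduler scheduler =
            Scheduler.create(dataSource, myTask)
                .enablePrioritization() // the opt-in toggle added by this PR; off by default
                .threads(10)
                .build();
        scheduler.start();

        // Priority itself is set per task instance via the new instance builder
        // (the exact builder method is an assumption at this point).
        scheduler.schedule(myTask.instance("instance-1"), Instant.now());
      }
    }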

Reminders

  • Added/ran automated tests
  • Update README and/or examples
  • Ran mvn spotless:apply

cc @kagkarlsson open to any suggestions and comments regarding tests that would be worth adding

@kagkarlsson
Owner

Thanks for updating the PR 👍. Will try and find some time to have a look at it soon!

@kagkarlsson
Owner

Did a quick look through. I can see most lines come from the opt-in toggle 😅. Good job realigning the PR with master. I need to go through it more thoroughly, but a couple of reactions:

  • Ideally the compatibility test should test both variants of the priority-toggle
  • I like the instance-builder. Probably prefer dropping the set-prefix for the method-names
  • Might need to drop "not null" from the schema definitions, considering people upgrading will have existing data?

Another thing: it would be great to see numbers for how polling performs, specifically for postgres, i.e. how many buffers are read to satisfy the query, with and without priority ordering.

A test like:

  • 10M executions not due, random priority
  • 10M executions due, random priority

Run the due-query in postgres with explain (analyze on, buffers on), possibly with index variations as well: (priority,execution_time) or (execution_time,priority).
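
For concreteness, the comparison might look roughly like this (index names are illustrative; the select mirrors the lock-and-fetch due-query):

CREATE INDEX priority_execution_time_idx ON scheduled_tasks (priority DESC, execution_time ASC);
-- or, alternatively:
CREATE INDEX execution_time_priority_idx ON scheduled_tasks (execution_time ASC, priority DESC);

EXPLAIN (ANALYZE, BUFFERS)
SELECT task_name, task_instance
FROM scheduled_tasks
WHERE picked = false AND execution_time <= now()
ORDER BY priority DESC, execution_time ASC
LIMIT 100
FOR UPDATE SKIP LOCKED;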

Not sure if this is something you or someone else is up for; otherwise I'll need to run it myself.

@simeq
Author

simeq commented Aug 8, 2024

Thanks for looking into it!

In the coming days I will update the PR according to your suggestions, but I have a question regarding the not null part.
Nowhere in the schema is priority set to not null, and I added:

Upgrading to 15.x

  • Add column priority and the priority_execution_time_idx index to the database schema. See table definitions for postgresql, oracle or mysql. Note that when enablePrioritization() is used, null priority values are ordered differently depending on the database used.

So the schema allows null values, but maybe I should make it clearer that this upgrade note is about existing rows?
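
As an example of the database-specific behaviour (PostgreSQL only, and illustrative rather than the exact query generated): PostgreSQL treats NULL as larger than any non-null value, so with DESC ordering, NULL priorities would sort first unless NULLS LAST is added:

SELECT task_name, task_instance
FROM scheduled_tasks
WHERE picked = false AND execution_time <= now()
ORDER BY priority DESC NULLS LAST, execution_time ASC
LIMIT 100;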

About testing the changes, I didn't do it earlier, but I'm happy to try. If I run into any problems, I'll let you know.

Overall, diving into this code has been a good exercise; it recently helped me understand exactly how to improve performance 😄

@simeq simeq changed the title Adding the ability to prioritize tasks feat: Adding the ability to prioritize tasks Aug 8, 2024
@kagkarlsson
Owner

In the coming days I will update the PR according to your suggestions, but I have a question regarding the not null part.
Nowhere in the schema is priority set to not null, and I added:

Ah, I didn't check all the schemas, I just assumed after seeing one. I've checked now, and it looks like mssql has not null, but as you say, none of the others does.

@GeorgEchterling
Contributor

I think I read a discussion about the index usage with prioritization somewhere on this repo, but I can't find it. In case it's still relevant:

Have you considered splitting the "due task detection" from the picking step? I.e. something like this:

UPDATE scheduled_tasks
SET due = TRUE
WHERE NOT due
AND NOT picked
AND execution_time < NOW();

SELECT * FROM scheduled_tasks
WHERE due
AND NOT picked
ORDER BY priority, execution_time;

Both queries could be optimized (even for arbitrary priority cardinality) using indices over (due, picked, execution_time) and (due, picked, priority, execution_time).
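
As a sketch of what that could look like (the due column and the index names are hypothetical, taken from the proposal above):

CREATE INDEX due_detection_idx ON scheduled_tasks (due, picked, execution_time);
CREATE INDEX due_pick_idx ON scheduled_tasks (due, picked, priority, execution_time);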

Also, this PR uses descending priorities. Older versions of MySQL/MariaDB don't support direction on index columns, which would prevent them from using the index when sorting by priority DESC, execution_time ASC. I'm not sure if that affects any other DBs.

@simeq
Author

simeq commented Aug 16, 2024

I ran some tests @kagkarlsson for the 10M due executions. I wasn't certain what you meant by "not due" executions, so I'm happy to add them later 😄

TL;DR: priority desc, execution_time asc is the correct index; enabling prioritization causes a performance reduction of about 15 percent.

I conducted tests on:

  • GCP PostgreSQL 12 (4 vCPUs, 25 GB memory, SSD storage)
  • 4x GCP VMs (4 vCPUs, 15 GB memory, SSD storage)

Scheduler was configured with lock-and-fetch (see the sketch after this list):

  • lowerLimitFractionOfThreads: 0.5
  • upperLimitFractionOfThreads: 4.0
  • threads: 50
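
Roughly, that corresponds to a builder configuration like the following (a sketch; the exact pollUsingLockAndFetch(lower, upper) signature is assumed):

    Scheduler scheduler =
        Scheduler.create(dataSource, knownTasks)
            .pollUsingLockAndFetch(0.5, 4.0) // lowerLimitFractionOfThreads, upperLimitFractionOfThreads
            .threads(50)
            .build();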

I have scheduled 10M executions that were due, with random priority

And tested two types of indexes:

  • priority desc, execution_time asc
  • execution_time asc, priority desc

Results

Results for scheduler without prioritization

|       | count    | mean    | 1m rate | 5m rate | 15m rate |
|-------|----------|---------|---------|---------|----------|
| vm1   | 2521123  | 2194.22 | 2341.86 | 2267.74 | 2153.32  |
| vm2   | 2483663  | 2161.62 | 2289.32 | 2228.68 | 2101.48  |
| vm3   | 2500572  | 2176.34 | 2316.60 | 2246.62 | 2121.06  |
| vm2   | 2494642  | 2171.17 | 2307.60 | 2242.47 | 2128.78  |
| total | 10000000 | 8703.35 | 9255.38 | 8985.51 | 8504.64  |

Results for prioritization with index priority desc, execution_time asc

|       | count    | mean    | 1m rate | 5m rate | 15m rate |
|-------|----------|---------|---------|---------|----------|
| vm1   | 2516239  | 1890.51 | 1972.18 | 1950.99 | 1923.13  |
| vm2   | 2481189  | 1864.18 | 1951.05 | 1924.37 | 1898.35  |
| vm3   | 2495400  | 1874.86 | 1951.23 | 1934.60 | 1896.35  |
| vm2   | 2507172  | 1883.70 | 1968.52 | 1946.73 | 1921.82  |
| total | 10000000 | 7513.25 | 7842.98 | 7756.69 | 7639.65  |

Results for prioritization with index execution_time asc, priority desc
I just gave up, it was too slow...

|       | count | mean  | 1m rate | 5m rate | 15m rate |
|-------|-------|-------|---------|---------|----------|
| total | 1600  | 19.76 | 13.23   | 4.62    | 1.68     |

Query plans - explain (analyze, buffers)

EXPLAIN (ANALYZE, BUFFERS)
SELECT task_name, task_instance
FROM scheduled_tasks WHERE picked = false and execution_time <= now()
ORDER BY priority desc, execution_time ASC FOR UPDATE SKIP LOCKED
LIMIT 100

Query plan without index on priority

Limit  (cost=645821.29..645822.54 rows=100 width=39) (actual time=11979.371..11979.485 rows=100 loops=1)
  Buffers: shared hit=113737, temp read=138357 written=206674
  ->  LockRows  (cost=645821.29..770819.29 rows=9999840 width=39) (actual time=11979.370..11979.476 rows=100 loops=1)
        Buffers: shared hit=113737, temp read=138357 written=206674
        ->  Sort  (cost=645821.29..670820.89 rows=9999840 width=39) (actual time=11979.345..11979.372 rows=100 loops=1)
              Sort Key: priority DESC, execution_time
              Sort Method: external merge  Disk: 547248kB
              Buffers: shared hit=113637, temp read=138357 written=206674
              ->  Seq Scan on scheduled_tasks  (cost=0.00..263634.60 rows=9999840 width=39) (actual time=0.014..2431.470 rows=10000000 loops=1)
                    Filter: ((NOT picked) AND (execution_time <= now()))
                    Buffers: shared hit=113637
Planning Time: 0.107 ms
Execution Time: 12091.458 ms

Query plan with index priority desc, execution_time asc

Limit  (cost=0.56..9.29 rows=100 width=39) (actual time=0.028..0.144 rows=100 loops=1)
  Buffers: shared hit=117 dirtied=9
  ->  LockRows  (cost=0.56..872617.26 rows=9999840 width=39) (actual time=0.027..0.135 rows=100 loops=1)
        Buffers: shared hit=117 dirtied=9
        ->  Index Scan using priority_execution_time_idx on scheduled_tasks  (cost=0.56..772618.86 rows=9999840 width=39) (actual time=0.021..0.084 rows=100 loops=1)
              Index Cond: (execution_time <= now())
              Filter: (NOT picked)
              Buffers: shared hit=17 dirtied=9
Planning Time: 0.283 ms
Execution Time: 0.220 ms

Query plan with index execution_time asc, priority desc

Limit  (cost=469587.40..469588.65 rows=100 width=40) (actual time=14596.903..14597.121 rows=100 loops=1)
  Buffers: shared hit=113737 dirtied=103971, temp read=138357 written=206674
  ->  LockRows  (cost=469587.40..553112.54 rows=6682011 width=40) (actual time=14596.902..14597.111 rows=100 loops=1)
        Buffers: shared hit=113737 dirtied=103971, temp read=138357 written=206674
        ->  Sort  (cost=469587.40..486292.43 rows=6682011 width=40) (actual time=14596.873..14596.898 rows=100 loops=1)
              Sort Key: priority DESC, execution_time
              Sort Method: external merge  Disk: 547248kB
              Buffers: shared hit=113637 dirtied=103971, temp read=138357 written=206674
              ->  Seq Scan on scheduled_tasks  (cost=0.00..214205.74 rows=6682011 width=40) (actual time=0.026..5012.498 rows=10000000 loops=1)
                    Filter: ((NOT picked) AND (execution_time <= now()))
                    Buffers: shared hit=113637 dirtied=103971
Planning Time: 0.178 ms
Execution Time: 14794.232 ms

Query plan when prioritization is disabled (ORDER BY execution_time ASC)

Limit  (cost=0.44..5.85 rows=100 width=35) (actual time=0.024..0.101 rows=100 loops=1)
  Buffers: shared hit=105 dirtied=2
  ->  LockRows  (cost=0.44..541357.98 rows=10000056 width=35) (actual time=0.024..0.092 rows=100 loops=1)
        Buffers: shared hit=105 dirtied=2
        ->  Index Scan using execution_time_idx on scheduled_tasks  (cost=0.44..441357.42 rows=10000056 width=35) (actual time=0.015..0.038 rows=100 loops=1)
              Index Cond: (execution_time <= now())
              Filter: (NOT picked)
              Buffers: shared hit=5
Planning Time: 0.258 ms
Execution Time: 0.125 ms

@simeq simeq requested a review from kagkarlsson August 17, 2024 17:11
@kagkarlsson
Owner

Sorry I haven't followed up earlier. Good job on the testing! Excellent to see a full test using concurrent schedulers and detailed statistics 👏.

I wasn't certain what you meant by "not due" executions, so I'm happy to add them later

To make the testing more realistic we have to assume that there are a large number of executions which are not due (also high priority executions that are not due yet).

Index (priority,execution_time) will be better when most executions are due, i.e. their execution time has passed.
Index (execution_time,priority) will be better when most executions are not due, i.e. their execution time has not passed yet (future executions).

My assumption is that the realistic scenario has more future executions than due ones. If there are throughput problems, however, a significant number of due executions will eventually accumulate, and that is the scenario where priority is useful.

So for the testing, I think we should add at least as many future executions to the table as there are due ones (maybe even a factor higher).
Due: 1M, future: 10M might be a better distribution? 🤔

@kagkarlsson
Owner

kagkarlsson commented Aug 23, 2024

Have you considered splitting the "due task detection" from the picking step?

@GeorgEchterling not really. That would require an additional update and roundtrip to the database 🤔

(on the other hand, the performance will likely be more predictable)

@simeq
Author

simeq commented Aug 23, 2024

Thanks for your explanation @kagkarlsson.

I ran the tests again, started the same instances and filled the scheduler with 1M due tasks (random -60 minutes) and 10M in the future (random +60 minutes), with random priorities from 1 to 10.
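
For reference, a rough sketch of how such a population could be seeded in PostgreSQL (not the exact script used; the column list is assumed from the postgres table definition):

INSERT INTO scheduled_tasks (task_name, task_instance, execution_time, picked, version, priority)
SELECT 'load-test', 'due-' || g, now() - random() * interval '60 minutes', false, 1, 1 + floor(random() * 10)::int
FROM generate_series(1, 1000000) AS g;

INSERT INTO scheduled_tasks (task_name, task_instance, execution_time, picked, version, priority)
SELECT 'load-test', 'future-' || g, now() + random() * interval '60 minutes', false, 1, 1 + floor(random() * 10)::int
FROM generate_series(1, 10000000) AS g;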

priority desc, execution_time asc is still the correct index, because PostgreSQL keeps doing a seq scan for execution_time asc, priority desc even though most tasks are not due. But there is a significant drop in the execution rate when prioritization is enabled.

Results

Results for scheduler without prioritization

|       | count   | mean    | 1m rate | 5m rate | 15m rate |
|-------|---------|---------|---------|---------|----------|
| vm1   | 252366  | 2274.00 | 2033.30 | 2186.72 | 2174.81  |
| vm2   | 247850  | 2233.27 | 1989.09 | 2122.85 | 2104.44  |
| vm3   | 251391  | 2265.19 | 2026.89 | 2173.77 | 2160.37  |
| vm2   | 248393  | 2238.11 | 2004.79 | 2170.64 | 2164.12  |
| total | 1000000 | 9010.57 | 8054.07 | 8653.98 | 8603.74  |

Results for prioritization with index priority desc, execution_time asc

|       | count   | mean    | 1m rate | 5m rate | 15m rate |
|-------|---------|---------|---------|---------|----------|
| vm1   | 250200  | 297.86  | 275.29  | 287.74  | 293.46   |
| vm2   | 249400  | 296.91  | 274.74  | 286.37  | 288.67   |
| vm3   | 250000  | 297.63  | 277.64  | 287.47  | 288.88   |
| vm2   | 250400  | 298.10  | 278.52  | 288.00  | 289.24   |
| total | 1000000 | 1190.5  | 1106.19 | 1149.58 | 1160.25  |

Results for prioritization with index execution_time asc, priority desc

|       | count   | mean   | 1m rate | 5m rate | 15m rate |
|-------|---------|--------|---------|---------|----------|
| vm1   | 252324  | 58.16  | 55.20   | 44.27   | 41.51    |
| vm2   | 247834  | 58.18  | 55.19   | 44.27   | 41.51    |
| vm3   | 252391  | 56.19  | 55.42   | 44.28   | 41.51    |
| vm2   | 247451  | 58.18  | 55.20   | 44.27   | 41.51    |
| total | 1000000 | 230.71 | 221.01  | 177.09  | 166.04   |

Query plans

Query plan with index priority desc, execution_time asc was:

Limit  (cost=0.56..15.45 rows=100 width=38) (actual time=148.539..148.643 rows=100 loops=1)
  Buffers: shared hit=31277
  ->  LockRows  (cost=0.56..813440.53 rows=5464003 width=38) (actual time=148.538..148.633 rows=100 loops=1)
        Buffers: shared hit=31277
        ->  Index Scan using priority_execution_time_idx on scheduled_tasks  (cost=0.56..758800.50 rows=5464003 width=38) (actual time=148.519..148.562 rows=100 loops=1)
              Index Cond: (execution_time <= now())
              Filter: (NOT picked)
              Buffers: shared hit=31177
Planning Time: 0.113 ms
Execution Time: 148.676 ms

Query plan with index execution_time asc, priority desc was:

Limit  (cost=336782.19..336783.44 rows=100 width=38) (actual time=7200.015..7200.163 rows=100 loops=1)
  Buffers: shared hit=125100 dirtied=110046, temp read=3466 written=9883
  ->  LockRows  (cost=336782.19..386525.02 rows=3979426 width=38) (actual time=7200.014..7200.153 rows=100 loops=1)
        Buffers: shared hit=125100 dirtied=110046, temp read=3466 written=9883
        ->  Sort  (cost=336782.19..346730.76 rows=3979426 width=38) (actual time=7199.988..7200.013 rows=100 loops=1)
              Sort Key: priority DESC, execution_time
              Sort Method: external merge  Disk: 54112kB
              Buffers: shared hit=125000 dirtied=110023, temp read=3466 written=9883
              ->  Seq Scan on scheduled_tasks  (cost=0.00..184691.39 rows=3979426 width=38) (actual time=0.015..6648.807 rows=1000000 loops=1)
                    Filter: ((NOT picked) AND (execution_time <= now()))
                    Rows Removed by Filter: 10000001
                    Buffers: shared hit=125000 dirtied=110023
Planning Time: 0.369 ms
Execution Time: 7212.177 ms

@simeq
Author

simeq commented Aug 23, 2024

Basically, I would say that whether prioritization is worth using depends on the usage scenario of the scheduler.

I have a scheduler instance with millions of recurring tasks on a persistent schedule and a few million one-time tasks added with execution time now() once a day. For that case, I'm guessing separate schedulers would be a better fit than prioritization.

But for instances that operate only on one-time tasks that are always added with execution time now(), this type of prioritization would be a suitable solution.

@simeq
Author

simeq commented Sep 17, 2024

@kagkarlsson What should we do next with this PR? :)

@kagkarlsson
Owner

I did some testing on my own and I think we probably need to add both indexes (or at least supply them), i.e. both (priority,execution_time) and (execution_time,priority).

With some luck I can review your changes (and possibly contribute some) next week 🤞

@kagkarlsson
Owner

How do you feel about enablePrioritization() vs enablePriority()? Isn't "priority" more common to use?

@simeq
Author

simeq commented Sep 21, 2024

Thanks for looking into this :)

I'm good with changing the name to enablePriority()

@kagkarlsson
Owner

I pushed some changes, addressing prioritization -> priority among other things. It would be great if you could have a look @simeq

One big question I have is:

Should a high int-value for priority mean higher priority or the reverse? 🤯
A lot of schedulers seem to use a low value for higher priority (and for some reason it was my initial inclination as well)

@simeq
Author

simeq commented Oct 11, 2024

The changes are good for me @kagkarlsson, anything else I could help with?

@kagkarlsson
Owner

kagkarlsson commented Oct 11, 2024

I think it is very close. I started a refactoring that I feel is a bit unfinished still. Slightly unrelated 😬

See 26e0492

e.g. (.instanceWithId is a new builder as well)

    scheduler.schedule(
        MY_TASK
            .instanceWithId("1045")
            .data(new MyTaskData(1001L))
            .scheduledTo(Instant.now().plusSeconds(5)));

i.e. stop using task.instance(..) in examples and instead use a static TaskDescriptor reference.

And also, if it makes sense, deprecate/reduce the use of TaskWithDataDescriptor and TaskWithoutDataDescriptor (use plain TaskDescriptor instead, plus the instance builder).
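
A sketch of what the static reference could look like (treat TaskDescriptor.of(..) as an assumed factory method; the instance builder used for scheduling is the one shown above):

    public static final TaskDescriptor<MyTaskData> MY_TASK =
        TaskDescriptor.of("my-task", MyTaskData.class);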

Update: I think I have completed this refactoring now.

@kagkarlsson
Owner

How do we feel about these "defaults"? Users are free to use whatever values suit them, as long as they fit in the column.

public class Priority {
  public static final int HIGH = 90;
  public static final int MEDIUM = 50;
  public static final int LOW = 10;
}
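
Used together with the instance builder, that might look like this (the priority(..) builder method is an assumption here):

    scheduler.schedule(
        MY_TASK
            .instanceWithId("1045")
            .priority(Priority.HIGH) // assumed builder method for per-instance priority
            .scheduledTo(Instant.now()));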

@simeq
Author

simeq commented Oct 14, 2024

The changes for TaskDescriptor are clear to me and look good.

Those predefined priorities are great for default usage. There is also a good description of how the value of the field translates into priority.

For my part, it's important to keep the ability to set the priority value dynamically.

@kagkarlsson
Owner

Hopefully I will get some time on Friday to go through this PR one last time. If everything checks out, I will try to release it.

Are you happy with the current version of the PR, @simeq?

@simeq
Author

simeq commented Oct 22, 2024

I confirm my happiness with the current version @kagkarlsson :)

@kagkarlsson
Owner

I am quickly going to check how hard it is to avoid touching the priority column when priority is disabled. Any thoughts on that @simeq? It is to avoid forcing existing users to update the schema. (I have another feature planned that might also require schema changes.)

@kagkarlsson
Owner

I am adding some missing tests for result ordering for the different cases

@simeq
Author

simeq commented Oct 28, 2024

On one hand, I think it's a good idea to be able to upgrade db-scheduler without database changes. On the other hand, there would need to be explicit instructions on how to ALTER the tables to make them work with priority, and should we then have separate postgresql_tables.sql files with and without priority?

@kagkarlsson
Owner

We will still update all the schema files with the priority column, but moderate the upgrading instructions to say that adding the column is recommended, though not necessary until priority is enabled.
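
In practice the upgrade step would then be something like this (postgres flavour; the column type and index follow this PR's table definitions, so treat them as assumptions):

ALTER TABLE scheduled_tasks ADD COLUMN priority SMALLINT;
CREATE INDEX priority_execution_time_idx ON scheduled_tasks (priority DESC, execution_time ASC);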

@kagkarlsson
Owner

(I am doing this as a service to long-time users that do not need priority. This way they can simply bump major and keep going.)

@kagkarlsson
Owner

I think this is good now 👍

@simeq
Author

simeq commented Oct 29, 2024

Thanks for the changes and the comment, it's clear to me now and looks good @kagkarlsson 🙂

@kagkarlsson
Owner

Will release soon!
