feat: Celery worker concurrency setting #1010

arbrandes · 2024-02-29T20:30:43Z

This allows the user to configure how many Celery workers are spawned independently of how many CPUs there are in the system. The default is to spawn as many workers as there are CPUs, which in some cases can consume too many resources.

The setting should be particularly useful to people running Tutor for development on Linux machines, where reducing the concurrency to "1" can reduce RAM usage significantly.

Testing

Before running this branch, launch a Tutor environment and count how many Celery processes there are, with something like:
```
pgrep celery | wc -l
```
You should get twice the number of CPUs on the system - one set for each of LMS and CMS - plus two parent processes. (On my machine, which has 12 real cores + 12 virtual ones, the number comes out to 26.)

Stop the environment, install this branch, and set:

```
tutor config save --set OPENEDX_LMS_CELERY_WORKERS=1 --set OPENEDX_CMS_CELERY_WORKERS=1
```

Relaunch the environment. The worker containers should be recreated.
Check the number of celery processes. There should now be just 4.
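The counts quoted in these steps follow a simple pattern: one pool of worker processes per service, plus one parent process each. Here is a minimal Python sketch of that arithmetic, assuming exactly two worker services (LMS and CMS) and one parent per service. The function name is hypothetical and not part of Tutor.

```python
def expected_celery_processes(concurrency: int, services: int = 2) -> int:
    """Estimate how many processes `pgrep celery | wc -l` should report.

    Assumes each service runs one parent process plus `concurrency`
    pool processes, as described in the testing steps.
    """
    return services * (concurrency + 1)


# A pool of 12 per service matches the "before" count reported above.
print(expected_celery_processes(12))
# With both *_CELERY_WORKERS settings at 1, only four processes remain.
print(expected_celery_processes(1))
```

A pool size of 12 gives 26 and a pool size of 1 gives 4, which are the two figures mentioned in the description.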

It's worth checking RAM usage, too. Before, my dev environment would take up 8 or more gigabytes of RAM. After, it takes less than 3.

arbrandes · 2024-03-05T12:32:19Z

@regisb, mind taking a look?

dkaliberda · 2024-03-07T17:11:03Z

tutor/templates/k8s/deployments.yml

@@ -141,7 +141,7 @@ spec:
      containers:
        - name: cms-worker
          image: {{ DOCKER_IMAGE_OPENEDX }}
-          args: ["celery", "--app=cms.celery", "worker", "--loglevel=info", "--hostname=edx.cms.core.default.%%h", "--max-tasks-per-child", "100", "--exclude-queues=edx.lms.core.default"]
+          args: ["celery", "--app=cms.celery", "worker", "--loglevel=info", "--hostname=edx.cms.core.default.%%h", "--concurrency={{ OPENEDX_CMS_CELERY_WORKERS }}", "--max-tasks-per-child", "100", "--exclude-queues=edx.lms.core.default"]


The --concurrency argument specifies the number of worker processes. This is not an ideal practice in Kubernetes environments because:

Kubernetes prefers to manage scalability and replication at the container orchestration level, using the replicas field in a Deployment to manage the number of pod instances.

Setting --concurrency inside a container limits the scalability to the process level inside the pod, rather than allowing Kubernetes to manage multiple pods across nodes for better fault tolerance and load distribution.

It violates the "one process per container" principle. This matters because, with multiple processes in the same container, logs from different processes get mixed together, which makes the container harder to troubleshoot, and the processes' lifecycle becomes harder to manage.

So it's better to just hardcode --concurrency=1.

arbrandes · 2024-03-07

That all makes sense, but it might be counter-intuitive to have a configuration item that works for one deployment scenario but not another.

I mean, we have OPENEDX_CMS_UWSGI_WORKERS, and that's also configurable for Kubernetes. 🤷🏼

dkaliberda · 2024-03-11

I think it would be appropriate to mention in the documentation that setting --concurrency=1 on K8s is recommended not to save resources, but for proper resource management. What do you think? It would be useful for DevOps to pay attention to this.

igobranco · 2024-09-30

My opinion is that we should let the operator decide the size of their services/pods. For the LMS/Studio this is done by setting OPENEDX_CMS_UWSGI_WORKERS, and adding the OPENEDX_CMS_CELERY_WORKERS variable applies the same principle to the Celery workers. Bigger pods could allow some installations to optimize for their case.
Nevertheless, my Kubernetes deployment uses --concurrency=1, with a Horizontal Pod Autoscaling configuration.
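To make the trade-off in this thread concrete, here is a small Python sketch comparing "many small pods" with "one big pod". The class and the numbers are purely illustrative assumptions, not taken from Tutor or from any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerLayout:
    replicas: int     # number of worker pods (Deployment replicas)
    concurrency: int  # value passed to `celery worker --concurrency`

    @property
    def task_slots(self) -> int:
        """Tasks that can run at the same time across all pods."""
        return self.replicas * self.concurrency


# Scaling handled by the orchestrator: several single-process pods.
many_small = WorkerLayout(replicas=4, concurrency=1)
# Scaling handled inside the container: one pod with a larger pool.
one_big = WorkerLayout(replicas=1, concurrency=4)

print(many_small.task_slots, one_big.task_slots)
```

Both layouts offer the same number of concurrent task slots. What differs is where scaling decisions are made, which is the point of disagreement above.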

dkaliberda · 2024-03-07T17:11:49Z

tutor/templates/k8s/deployments.yml

@@ -250,7 +250,7 @@ spec:
      containers:
        - name: lms-worker
          image: {{ DOCKER_IMAGE_OPENEDX }}
-          args: ["celery", "--app=lms.celery", "worker", "--loglevel=info", "--hostname=edx.lms.core.default.%%h", "--max-tasks-per-child=100", "--exclude-queues=edx.cms.core.default"]
+          args: ["celery", "--app=lms.celery", "worker", "--loglevel=info", "--hostname=edx.lms.core.default.%%h", "--concurrency={{ OPENEDX_LMS_CELERY_WORKERS }}", "--max-tasks-per-child=100", "--exclude-queues=edx.cms.core.default"]


The same problem

dkaliberda · 2024-03-07T17:19:26Z

tutor/templates/local/docker-compose.yml

@@ -158,7 +158,7 @@ services:
    environment:
      SERVICE_VARIANT: lms
      DJANGO_SETTINGS_MODULE: lms.envs.tutor.production
-    command: celery --app=lms.celery worker --loglevel=info --hostname=edx.lms.core.default.%%h --max-tasks-per-child=100 --exclude-queues=edx.cms.core.default
+    command: celery --app=lms.celery worker --loglevel=info --hostname=edx.lms.core.default.%%h --concurrency={{ OPENEDX_LMS_CELERY_WORKERS }} --max-tasks-per-child=100 --exclude-queues=edx.cms.core.default


Docker Compose also provides mechanisms for managing replicas, so it is better to set --concurrency=1 here as well.

arbrandes · 2024-03-07

In theory, yes. I'd be glad to review a PR that does that instead. I just need a way to reduce RAM usage for development. ;)

dkaliberda · 2024-03-07T17:19:28Z

tutor/templates/local/docker-compose.yml

@@ -177,7 +177,7 @@ services:
    environment:
      SERVICE_VARIANT: cms
      DJANGO_SETTINGS_MODULE: cms.envs.tutor.production
-    command: celery --app=cms.celery worker --loglevel=info --hostname=edx.cms.core.default.%%h --max-tasks-per-child 100 --exclude-queues=edx.lms.core.default
+    command: celery --app=cms.celery worker --loglevel=info --hostname=edx.cms.core.default.%%h --concurrency={{ OPENEDX_CMS_CELERY_WORKERS }} --max-tasks-per-child 100 --exclude-queues=edx.lms.core.default


Docker Compose also provides mechanisms for managing replicas, so it is better to set --concurrency=1 here as well.

DawoudSheraz · 2024-03-13T07:18:45Z

Not sure if it's a Mac thing, but pgrep celery returns nothing in the LMS/CMS worker containers when run from bash (both dev and local). However, celery --app=lms.celery inspect stats shows that max_concurrency is set to the number of allocated CPUs for both the LMS and CMS workers.
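For anyone who prefers checking the pool size programmatically rather than with pgrep, here is a minimal Python sketch. It assumes the stats reply has the shape Celery documents, where each worker's entry carries a `pool` dictionary with a `max-concurrency` key. The helper name and the sample hostname are hypothetical.

```python
from typing import Any, Dict


def max_concurrency_by_worker(stats: Dict[str, Dict[str, Any]]) -> Dict[str, int]:
    """Map each worker hostname to its pool's max-concurrency value."""
    return {
        hostname: worker_stats["pool"]["max-concurrency"]
        for hostname, worker_stats in stats.items()
    }


# Hypothetical, trimmed-down reply; real replies carry many more fields.
sample = {
    "edx.lms.core.default.abc123": {"pool": {"max-concurrency": 8}},
}
print(max_concurrency_by_worker(sample))
```

In a running environment the same mapping could be obtained from `app.control.inspect().stats()`, which may return `None` when no worker replies, so a real script would need to handle that case.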

regisb · 2024-03-21T08:29:14Z

docs/dev.rst

+    --set OPENEDX_CMS_CELERY_WORKERS=1 \
+    --set OPENEDX_LMS_CELERY_WORKERS=1 \
+    --set OPENEDX_CMS_UWSGI_WORKERS=1 \
+    --set OPENEDX_LMS_UWSGI_WORKERS=1 \


I'd rather avoid asking users to manually set these values. Instead, we should automatically default to workers=1 in development. Can we do that? For instance by overriding the celery config in development?

regisb · 2024-03-21T08:29:38Z

docs/dev.rst

+    --set OPENEDX_LMS_CELERY_WORKERS=1 \
+    --set OPENEDX_CMS_UWSGI_WORKERS=1 \
+    --set OPENEDX_LMS_UWSGI_WORKERS=1 \
+    --set ELASTICSEARCH_HEAP_SIZE=100m


Same here: can we automatically set this value in development?

regisb · 2024-03-21T08:53:53Z

docs/configuration.rst

@@ -149,6 +149,11 @@ This defines the version that will be pulled from just the Open edX platform git

 By default, there are 2 `uwsgi worker processes <https://uwsgi-docs.readthedocs.io/en/latest/Options.html#processes>`__ to serve requests for the LMS and the CMS. However, each worker requires upwards of 500 Mb of RAM. You should reduce this value to 1 if your computer/server does not have enough memory.

+- ``OPENEDX_LMS_CELERY_WORKERS`` (default: ``"0"``)
+- ``OPENEDX_CMS_CELERY_WORKERS`` (default: ``"0"``)


Adding new configuration settings to Tutor core is a personal trigger of mine 🧨 Do we really want to make changes to the default production values? If yes, can we:

propose better defaults?

make these custom changes possible via a patch instead of two new configuration settings?

arbrandes · 2024-04-03

Yeah, I understand the reluctance to add new config items. It's just that in this case, it was the most straightforward way to achieve what I was after. There's precedent, too: OPENEDX_LMS_UWSGI_WORKERS is there for very similar reasons.

Regarding the defaults, I'm not actually changing them: I'm just making them explicit, where before they were implicit. (The implicit default is to scale the workers to however many CPUs you have, and that's what "0" means.)
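The meaning of "0" described here can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment, not Celery's or Tutor's actual code. The function name and the `cpu_count` override parameter are assumptions added to make the example testable.

```python
import os
from typing import Optional


def effective_concurrency(setting: int, cpu_count: Optional[int] = None) -> int:
    """Resolve a *_CELERY_WORKERS value to an actual pool size.

    0 means "one process per CPU"; any positive value is used as is.
    """
    if setting < 0:
        raise ValueError("concurrency cannot be negative")
    if setting == 0:
        # Fall back to the machine's CPU count unless one was supplied.
        return cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return setting


print(effective_concurrency(0, cpu_count=12))
print(effective_concurrency(1, cpu_count=12))
```

With 12 CPUs, a setting of 0 resolves to 12 processes, while a setting of 1 stays at 1 regardless of the hardware.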

As for using patches, I wouldn't mind except for the fact that, as mentioned above, this is just doing what OPENEDX_LMS_UWSGI_WORKERS does, except for Celery workers. If we have that configuration, I don't see why we shouldn't have this one.

All of this said, I really like the idea of changing certain things automatically for development environments, whether they have corresponding config items or not. For instance, after I issued this PR it came to my attention that Tutor's importing * from devstack.py for the development settings, and that means that we aren't using Celery workers at all! (See https://github.com/openedx/edx-platform/blob/master/lms/envs/devstack.py#L35.) So why is tutor dev even firing up workers?

In any case, the latter sounds like it warrants a separate PR. My question regarding this PR, though, is whether we do or do not want OPENEDX_LMS_CELERY_WORKERS. It might not make sense to change this in a Kubernetes setting, but I'm willing to defend that it does on any Docker deployment where you have more CPUs than you have RAM (so to speak).

arbrandes · 2024-04-16

How about this for not starting workers at all in dev mode? #1041

The question remains whether we still want to let people configure the number of Celery workers manually. (I say we let them.)

regisb · 2024-04-17

I love the fact that we can disable workers in dev. I commented on #1041.

Let's now focus on the possibility to customize the number of celery runners. I agree that this would be a useful feature. If we really have to, we'll introduce new configuration values, but I'd like to see if we can avoid it. For instance, could we avoid that by creating a celery config file? This file would include a {{ patch("edx-platform-celery-config") }} statement. That way, we wouldn't have to create new configuration settings for every celery parameter.

igobranco · 2024-10-01

@regisb
Agreed! There should be an option to disable workers in dev, and it could be disabled by default in dev mode. Personally, I like the idea of having a config file with a patch. By default, the config file should contain the minimum configuration needed to start.
Nevertheless, I feel that won't cover every additional configuration need:

Missing config options for --without-mingle and --without-gossip celery/celery#2566

https://docs.celeryq.dev/en/stable/reference/cli.html

https://docs.celeryq.dev/en/stable/userguide/configuration.html: this offers no way to disable gossip, mingle, or heartbeat, or to configure (vertical) autoscaling.
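To illustrate the point about command-line-only switches, here is a hedged Python sketch of a helper that assembles a worker command line. The helper is hypothetical and is not Tutor code. The flags it emits are the ones listed in the Celery CLI reference linked above.

```python
from typing import List


def build_worker_args(
    app: str,
    concurrency: int = 1,
    gossip: bool = True,
    mingle: bool = True,
    heartbeat: bool = True,
) -> List[str]:
    """Assemble a `celery worker` command line as a list of arguments."""
    args = ["celery", f"--app={app}", "worker", "--loglevel=info",
            f"--concurrency={concurrency}"]
    # Per the comment above, these switches are only reachable as flags,
    # so a configuration file alone would not be able to set them.
    if not gossip:
        args.append("--without-gossip")
    if not mingle:
        args.append("--without-mingle")
    if not heartbeat:
        args.append("--without-heartbeat")
    return args


print(" ".join(build_worker_args("lms.celery", gossip=False, mingle=False)))
```

Whatever mechanism is chosen (settings, a patch, or a config file), it would need a way to reach the command line itself for options like these.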

regisb · 2024-10-01

Please see my answer here: #1126 (comment)
I propose that all remaining comments be made on issue #1126.

DawoudSheraz · 2024-04-03T14:17:43Z

@arbrandes Hi, there are a few to-be-addressed comments added by Régis. Please take a look when you get a chance. Thanks.

DawoudSheraz · 2024-07-01T11:38:47Z

@arbrandes Hi, what's the plan for this PR? Thanks

arbrandes · 2024-07-01T12:17:34Z

I can look into adding a Celery conf file patch, but since I'm not using this PR (as opposed to the one that disables workers in dev mode), it'll probably take a while to get to.

arbrandes · 2024-10-08T15:18:54Z

I'm closing this because it seems we're gonna go with a patch/config file solution. The conversation should continue on #1126.

arbrandes requested a review from regisb February 29, 2024 20:30

arbrandes force-pushed the celery-concurrency branch from b44aee8 to d02b5ed Compare February 29, 2024 20:43

arbrandes changed the base branch from nightly to master February 29, 2024 20:43

arbrandes force-pushed the celery-concurrency branch 2 times, most recently from 1dd4b5a to 0382877 Compare March 1, 2024 14:34

dkaliberda reviewed Mar 7, 2024

DawoudSheraz self-requested a review March 12, 2024 09:17

DawoudSheraz approved these changes Mar 13, 2024

regisb reviewed Mar 21, 2024

DawoudSheraz requested a review from regisb April 9, 2024 06:50

arbrandes force-pushed the celery-concurrency branch from 0382877 to a77c999 Compare April 11, 2024 20:18

DawoudSheraz mentioned this pull request May 7, 2024

feat!: don't run Celery workers in dev mode #1041

Closed

This was referenced Sep 30, 2024

Celery lms/cms-worker consumes too much RAM #1126

Open

Add support for running multiple Celery queues #1130

Open

arbrandes closed this Oct 8, 2024
