Manage rq services with systemd instead of supervisor #4855
Conversation
Force-pushed from 27602c7 to 32b4137.
Codecov Report

@@            Coverage Diff             @@
##           develop     #4855    +/-  ##
=========================================
  Coverage    81.74%    81.74%
=========================================
  Files           49        49
  Lines         3418      3418
  Branches       392       392
=========================================
  Hits          2794      2794
  Misses         533       533
  Partials        91        91
=========================================

Continue to review full report at Codecov.
Blocked pending reply from supervisor upstream in https://bugs.launchpad.net/ubuntu/+source/supervisor/+bug/1844802 (we'll make additional efforts to contact upstream in the week of 9/30 if warranted), as our preference is to not replace the dependency if we don't have to.

At standup today we discussed that, if there is no progress getting a Python 3 version included in Xenial, it's too late for this PR to make it into 1.1.0. We will then need to address this in a point release (alongside the grsec update - #4843). Moving back to the near-term backlog for now.
Force-pushed from 3799431 to 5e00c90.
Rebased and marked ready for review. The log question was answered; @conorsch correctly pointed out that the services' messages end up in the systemd journal (and thus /var/log/syslog).
Force-pushed from dac568f to 7bb2d0f.
This is looking good! I've run through the test plan and all works as advertised. One thing we should also do is ensure that we preserve the behavior added in #4713 regarding automatic requeuing of interrupted deletions - if, with the systemd change, these interrupted jobs are added to the failed job registry instead of the started job registry, we could modify rqrequeue to requeue jobs from FailedJobRegistry in addition to StartedJobRegistry (see the sketch after this comment) - what do you think?
this is an important change so we should:
- have at least another person review (cc @kushaldas or @emkll maybe?)
- prior to final merge run the upgrade scenario
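For illustration only, here is a minimal sketch of what requeuing from both registries could look like, assuming rq 1.x's registry API. The real rqrequeue also checks whether a job's worker is actually gone before re-enqueuing; this sketch skips that, and it cannot distinguish interruption from a genuine failure, which is the open question above.

# Sketch only, not SecureDrop's actual rqrequeue: periodically re-enqueue
# jobs left in the started registry (worker killed outright) and requeue
# jobs that landed in the failed registry (worker shut down cleanly).
import time

from redis import Redis
from rq import Queue
from rq.registry import FailedJobRegistry, StartedJobRegistry

def requeue_interrupted_jobs(queue_name="default"):
    queue = Queue(queue_name, connection=Redis())

    for job_id in StartedJobRegistry(queue=queue).get_job_ids():
        job = queue.fetch_job(job_id)
        if job is not None:
            queue.enqueue_job(job)  # put the orphaned job back on the queue

    failed = FailedJobRegistry(queue=queue)
    for job_id in failed.get_job_ids():
        # requeue() moves the job from the failed registry back onto its
        # queue; it cannot tell interruption apart from a real failure.
        failed.requeue(job_id)

if __name__ == "__main__":
    while True:
        requeue_interrupted_jobs()
        time.sleep(60)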
@@ -9,7 +9,6 @@ cd "${REPOROOT}/securedrop"
source "${BASH_SOURCE%/*}/dev-deps"

run_redis &
run_supervisor &
we'll also need to use this systemd service in the dev env else we'll reopen #4733
Good point. I'll take a look.
I think running systemd in a Docker container is going to be more trouble than it's worth just to run rqworker and rqrequeue in the dev container.
I don't know about trying to restart jobs in the failed registry. We'd need to somehow distinguish interruptions from real failures, and ensure that we're not rerunning jobs that aren't idempotent -- what we do now (hashing and deletion) is, but that could change, and this service would have to be made less generic at that point. Hashing will already be done as needed when the files are downloaded, and failed deletions will pop up in daily OSSEC alerts. It's not ideal to be bothering admins, but it is less complicated and error-prone. Since interruptions should be happening much less frequently now that deletion is so much faster, I'd be inclined to surface what should be very few failures caused by interruption.
hmm yeah good point - actually do we know why those jobs are not in the started job registry and end up in the failed job registry? I don't get why this systemd change produces that behavior
I think what's changed is that systemd is stopping the workers properly, so the rq worker has time to stop its jobs and move them to the failed registry. Under supervisor it didn't: those processes were just killed, and the record of them stayed in the started registry. Whether that was due to a problem in supervisor or just a timing problem from having another control layer in the stack, I don't know.
ah gotcha, that makes sense. I'm a bit concerned about us pushing this cleanup operation onto admins for two reasons:
Instead of submitting jobs to rq that invoke rm.secure_delete, move files to be deleted to a designated directory in the store, and use a systemd service, securedrop_shredder, to monitor that directory and clear its contents securely.

Moving files is atomic and quick, so it should never be interrupted. If the secure deletion is interrupted, the securedrop_shredder service will just find the file on its next pass.

We had discussed adding columns to the database tables involved, and having the service use those to discover items to delete, but:
- It would have been a bigger, more complicated change.
- It would have introduced the possibility of source information being present in the database for quite a while after the admin "deleted" it, which could be problematic in the event of raid/seizure. This way only the encrypted files might linger; the database records will still be deleted immediately.
- It would have required that we always remember to filter using those columns to avoid working with deleted content. (SQLAlchemy currently has only partial support for setting up this filtering globally: https://github.com/sqlalchemy/sqlalchemy/wiki/FilteredQuery)
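For illustration only, a rough sketch of the move-then-shred flow. The SHREDDER_DIR path and the use of shred(1) are assumptions standing in for SecureDrop's actual store layout and secure deletion helper.

# Sketch of the move-then-shred design, not the actual SecureDrop code.
import os
import subprocess
import time
import uuid

SHREDDER_DIR = "/var/lib/securedrop/shredder"  # assumed location in the store

def move_to_shredder(path):
    # An atomic, near-instant rename within the same filesystem, so the
    # "delete" seen by the journalist app effectively cannot be interrupted.
    os.rename(path, os.path.join(SHREDDER_DIR, uuid.uuid4().hex))

def shredder_pass():
    # One pass of the securedrop_shredder service: securely delete whatever
    # is in the directory; anything interrupted is found again on the next pass.
    for name in os.listdir(SHREDDER_DIR):
        subprocess.check_call(["shred", "-u", os.path.join(SHREDDER_DIR, name)])

if __name__ == "__main__":
    while True:
        shredder_pass()
        time.sleep(60)  # the real service sleeps on a similar interval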
Force-pushed from 96e171f to 1a09c92.
This pull request fixes 1 alert when merging 1a09c92 into ed014be - view on LGTM.com fixed alerts:

Deletion is now being handled by a systemd service. I've updated the PR description with implementation details and new testing instructions.

This pull request fixes 1 alert when merging fbbef3d into ed014be - view on LGTM.com fixed alerts:
testinfra_hosts = ["app-staging"]

def test_securedrop_rqrequeue_service(host):
I'm still testing at the moment, but I'm realizing that rqrequeue will also no longer re-enqueue hashing jobs that aren't completed, i.e. it's not going to work as advertised, as the StartedJobRegistry assumption it's based upon is no longer valid (we know those jobs will be in the failed job registry, but we can't tell whether they failed due to interruption during a restart or for some other reason).

My thinking is that we'll want to address that by just removing the rqrequeue service entirely - this one is easier to handle from the application's perspective, as on app start we can simply resubmit jobs for any files that don't have a hash stored in the database. Anyway, since this is an even larger change and we want to get this change in for 1.2.0, let's do it in a follow-up. Let me know if you think that approach doesn't make sense.
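For illustration, a minimal sketch of the "resubmit on app start" idea. The (filename, checksum) record shape and the enqueue_hash_job callable are assumptions, not the actual SecureDrop schema or helpers.

# Sketch only: at app start, find stored files whose checksum was never
# recorded (e.g. the hashing job was interrupted) and resubmit them.
import os

def paths_missing_checksums(records, store_dir):
    # records: iterable of (filename, checksum) pairs from the database
    for filename, checksum in records:
        if checksum is None:
            yield os.path.join(store_dir, filename)

def resubmit_missing_hashes(records, store_dir, enqueue_hash_job):
    # enqueue_hash_job is whatever submits the hashing work,
    # for example an rq enqueue call
    for path in paths_missing_checksums(records, store_dir):
        enqueue_hash_job(path)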
It's affected the same way by the switch to systemd, yes. But it's less of a problem, as we're just using the checksums for ETags, and if they're missing they're calculated as needed.

My vote is to move hashing to another simple systemd service and remove rq entirely. It's too late to do that here, but we'd eliminate a lot of code and config, drop rq as a dependency we need to keep updated and reviewed, and possibly also Redis?
Oh yeah, they're calculated on the fly, great point. Since we need to serve those in the headers as API responses, we should ensure they are hashed asynchronously before the user requests them, since at least for large files it takes too long (file.out is 500 MB here):
www-data@app-staging:/var/www/securedrop$ time shasum -a 256 file.out
38f7c0648553d81ad9402ebdd1b275a0029644c5b7eef7c963dfa7db9ef0ba23 file.out
real 0m3.278s
user 0m2.557s
sys 0m0.452s
We could continue to use rq or replace rqrequeue with another systemd service just for hashing (open to either, tbh); something to discuss further and then incorporate as part of 1.3.0 🚀
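For illustration, a sketch of hashing at save time via rq so the checksum is ready before any download. The queue name and the "sha256:<hex>" format are assumptions, not SecureDrop's actual code.

# Sketch: compute the SHA-256 asynchronously right after a submission is
# saved, so the value is available for the ETag header later.
import hashlib

from redis import Redis
from rq import Queue

def compute_checksum(path):
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return "sha256:" + sha.hexdigest()

def enqueue_checksum_job(path):
    # Called in the web process at save time; the hashing itself runs later
    # in the rq worker (compute_checksum must be importable by the worker).
    Queue("default", connection=Redis()).enqueue(compute_checksum, path)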
.../ansible-base/roles/build-securedrop-app-code-deb-pkg/templates/securedrop_rqrequeue.service
Changed store.move_to_shredder to accept a path, pushing the responsibility for obtaining said path to the caller, since it's sometimes necessary to delete in a context where you don't have filesystem_id.
Remove hard-coded directories from systemd service templates where we have Ansible variables for them. In the securedrop-app-code rules, add shredder to the dh-systemd calls to ensure the service is handled properly during upgrades.
This pull request fixes 1 alert when merging b7e4131 into ed014be - view on LGTM.com fixed alerts:

This pull request fixes 1 alert when merging e932d3a into ed014be - view on LGTM.com fixed alerts:

hmm I'm seeing permissions issues when deleting files, trying to figure out why this is happening.. (permissions issue is due to
OK, my bad, this was a testing error (I had some changes locally when I saw CI failing earlier, so that must have been it). I reprovisioned everything in staging, retested following the test plan, and all is working as advertised now.
We've tested that the creation in store.py works, but it also can't hurt to do it right off the bat in postinst.
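For illustration, a sketch of that directory creation. The path, mode, and www-data ownership are assumptions about the layout, not the actual store.py or postinst code.

# Sketch only: ensure the shredder directory exists with restrictive
# permissions before the service needs it.
import os
import shutil

SHREDDER_DIR = "/var/lib/securedrop/shredder"  # assumed path

def ensure_shredder_dir():
    os.makedirs(SHREDDER_DIR, mode=0o700, exist_ok=True)
    os.chmod(SHREDDER_DIR, 0o700)  # makedirs mode is subject to umask
    # runs as root (as postinst does) to hand the directory to the app user
    shutil.chown(SHREDDER_DIR, user="www-data", group="www-data")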
This pull request fixes 1 alert when merging fdcd989 into ed014be - view on LGTM.com fixed alerts:
Status
Ready for review
Description of Changes
This change makes securedrop-app-code Conflict/Replace supervisor, to ensure that supervisor is purged from the system. The rq worker and requeuing processes are now managed as systemd services, with configuration files in /lib/systemd/system. We're only using rq for submission hashing now. Deletion is handled by a new systemd service, discussed in detail below.
All the services now log to the systemd journal, which in the current SecureDrop server configuration means that their log messages also wind up in /var/log/syslog.
Submission hashing works as before. Requeuing a job interrupted by a reboot is less likely to work, as the worker seems to be shut down such that the running jobs are terminated and moved by rq to the failed queue, but hashing will be done automatically when submissions are downloaded.
Instead of submitting jobs to rq to delete submissions or replies, we'll move files to be deleted to a designated directory in the store, and use a systemd service, securedrop_shredder, to monitor that directory and clear its contents securely.
Moving files is atomic and quick, so should never be interrupted. If the secure deletion is interrupted, the securedrop_shredder service will just find the file on its next pass.
We had discussed adding columns to the database tables involved, and having the service use those to discover items to delete, but:
- It would have been a bigger, more complicated change.
- It would have introduced the possibility of source information being present in the database for quite a while after the admin "deleted" it, which could be problematic in the event of raid/seizure. This way only the encrypted files might linger; the database records will still be deleted immediately.
- It would have required that we always remember to filter using those columns to avoid working with deleted content. (SQLAlchemy currently has only partial support for setting up this filtering globally: https://github.com/sqlalchemy/sqlalchemy/wiki/FilteredQuery)
Fixes #4783.
Testing
Use the staging environment to confirm the asynchronous services still work:
make build-debs
make staging
Check the securedrop_rqworker service log by running sudo journalctl -u securedrop_rqworker --no-pager -- the output should look like:

You can watch the securedrop_shredder service with journalctl -f -u securedrop_shredder --no-pager. I find it helpful to sudo apt install tree on the app server and use that to watch /var/lib/securedrop. If you don't want to wait 60 seconds between shredder runs, you can edit /etc/systemd/system/multi-user.target.wants/securedrop_shredder.service to change the interval, or just stop the service with systemctl stop securedrop_shredder, and then invoke /var/www/securedrop/scripts/shredder manually as you delete submissions and replies in the UI. (And yes, you can test the basic functionality in the dev container that way too, if you want to check it out before going to the trouble of building the staging environment.)

Deployment
With this change, SecureDrop's asynchronous worker processes (the rq worker for hashing submissions, the monitor that requeues those jobs if they're interrupted, and the new service that securely deletes submissions and replies) will be managed by systemd instead of supervisor, and the supervisor package will be purged.
Checklist
If you made changes to the server application code:
Linter (make lint) and tests (make -C securedrop test) pass in the development container

If you made changes to the system configuration:
If you made non-trivial code changes: