
Manage rq services with systemd instead of supervisor #4855

Merged
merged 9 commits into freedomofpress:develop from 4783-supervisor-to-systemd
Nov 20, 2019

Conversation

rmol
Contributor

@rmol rmol commented Sep 20, 2019

Status

Ready for review

Description of Changes

This change makes securedrop-app-code declare Conflicts/Replaces on supervisor, to ensure that supervisor is purged from the system. The rq worker and requeuing processes are now managed as systemd services, with configuration files in /lib/systemd/system. We're only using rq for submission hashing now. Deletion is handled by a new systemd service, discussed in detail below.

All the services now log to the systemd journal, which in the current SecureDrop server configuration means that their log messages also wind up in /var/log/syslog.

Submission hashing works as before. Requeuing a job interrupted by a reboot is less likely to work, as the worker now seems to be shut down in a way that terminates the running jobs and moves them to rq's failed queue, but hashing will still be done automatically when submissions are downloaded.

Instead of submitting jobs to rq to delete submissions or replies, we'll move files to be deleted to a designated directory in the store, and use a systemd service, securedrop_shredder, to monitor that directory and clear its contents securely.

Moving files is atomic and quick, so it should never be interrupted. If the secure deletion is interrupted, the securedrop_shredder service will just find the file on its next pass.
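
For readers who want to see the shape of this, here is a minimal sketch, in Python, of the two halves of the design: the atomic move done during the delete request, and the periodic pass done by the shredder service. Only move_to_shredder, rm-style secure deletion, and the store/shredder paths come from this PR; the function signatures and the use of shutil.rmtree as a stand-in for secure deletion are illustrative assumptions.

# Sketch only: the quick, atomic half, done in the web app when something is deleted.
import os
import shutil
import tempfile

STORE_DIR = "/var/lib/securedrop/store"
SHREDDER_DIR = "/var/lib/securedrop/shredder"

def move_to_shredder(path):
    # rename(2) within the same filesystem is atomic, so the file either
    # fully leaves the store or stays where it was.
    relpath = os.path.relpath(path, start=STORE_DIR)
    dest = os.path.join(tempfile.mkdtemp(dir=SHREDDER_DIR), relpath)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.rename(path, dest)

# Sketch only: the slow half, run periodically by the securedrop_shredder service.
def shredder_pass():
    for entry in os.listdir(SHREDDER_DIR):
        # The real service erases securely (rm.secure_delete); plain removal
        # here only stands in for that step.
        shutil.rmtree(os.path.join(SHREDDER_DIR, entry), ignore_errors=True)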

We had discussed adding columns to the database tables involved, and having the service use those to discover items to delete, but:

  • It would have been a bigger, more complicated change.
  • It would have introduced the possibility of source information being present in the database for quite a while after the admin "deleted" it, which could be problematic in the event of raid/seizure. This way only the encrypted files might linger; the database records will still be deleted immediately.
  • It would have required that we always remember to filter using those columns to avoid working with deleted content. (SQLAlchemy currently has only partial support for setting up this filtering globally.)

Fixes #4783.

Testing

Use the staging environment to confirm the asynchronous services still work:

  • make build-debs
  • make staging
  • Visit your staging source interface and submit a message.
  • Confirm that it was hashed by checking the submissions table in the database.
  • Verify that the hashing operation appears in the securedrop_rqworker service log by running sudo journalctl -u securedrop_rqworker --no-pager; the output should look like:
-- Logs begin at Fri 2019-09-20 15:20:20 UTC, end at Fri 2019-09-20 18:38:13 UTC. --
Sep 20 18:25:35 app-staging rqworker[1513]: 18:25:35 default: store.queued_add_checksum_for_file(<class 'models.Submission'>, 1, '/var/lib/securedrop/store/LB4IY3SOHQLOZ6FNU4UPALZXRLUOEJKNNLZCNAJXQW2JQZTEQV2TVQLUMBND3Y4VX3QC4K6KDWFCNH2R2YSRZSP7XRPCYO2MSEWOUBY=/1-culinary_fracture-msg.gpg', 'sqlite:////var/lib/securedrop/db.sqlite') (c221907f-5a8b-4676-a835-888a9d321c0b)
Sep 20 18:25:35 app-staging rqworker[1513]: 18:25:35 default: Job OK (c221907f-5a8b-4676-a835-888a9d321c0b)
Sep 20 18:25:35 app-staging rqworker[1513]: 18:25:35 Result is kept for 500 seconds
Sep 20 18:25:35 app-staging rqworker[1513]: 18:25:35 Cleaning registries for queue: default
  • Add an admin user.
  • Visit the journalist interface as that user and:
    • add some replies to the source.
    • delete one.
    • delete several.
    • delete an entire source collection.

You can watch the securedrop_shredder service with journalctl -f -u securedrop_shredder --no-pager. I find it helpful to sudo apt install tree on the app server and use that to watch /var/lib/securedrop. If you don't want to wait 60 seconds between shredder runs, you can edit /etc/systemd/system/multi-user.target.wants/securedrop_shredder.service to change the interval, or just stop the service with systemctl stop securedrop_shredder, and then invoke /var/www/securedrop/scripts/shredder manually as you delete submissions and replies in the UI. (And yes, you can test the basic functionality in the dev container that way too, if you want to check it out before going to the trouble of building the staging environment.)
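
For orientation, the securedrop_shredder unit you would be editing might look roughly like the fragment below. Only the script path and the 60-second cadence come from this PR; the --interval flag, the User setting, and the rest are assumptions about how such a unit is typically written, not the actual file contents.

[Unit]
Description=SecureDrop shredder (illustrative sketch, not the shipped unit)

[Service]
# Hypothetical invocation; the real ExecStart line may differ.
ExecStart=/var/www/securedrop/scripts/shredder --interval 60
User=www-data
Restart=always

[Install]
WantedBy=multi-user.target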

Deployment

With this change, SecureDrop's asynchronous worker processes (the rq worker for hashing submissions, the monitor that requeues those jobs if they're interrupted, and the new service that securely deletes submissions and replies) will be managed by systemd instead of supervisor, and the supervisor package will be purged.
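
For reference, the Conflicts/Replaces relationship is declared in the Debian packaging metadata; a debian/control fragment along these lines is what makes dpkg remove supervisor when securedrop-app-code is installed. This is an illustrative sketch, not a copy of the actual file; dh-systemd appears because it is the build dependency needed to install systemd units, as the staging failure later in this thread shows.

# Illustrative fragment only, not the actual packaging.
Source: securedrop-app-code
Build-Depends: debhelper, dh-systemd

Package: securedrop-app-code
Conflicts: supervisor
Replaces: supervisor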

Checklist

If you made changes to the server application code:

  • Linting (make lint) and tests (make -C securedrop test) pass in the development container

If you made changes to the system configuration:

If you made non-trivial code changes:

  • I have written a test plan and validated it for this PR

@codecov-io

codecov-io commented Sep 23, 2019

Codecov Report

Merging #4855 into develop will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #4855   +/-   ##
========================================
  Coverage    81.74%   81.74%           
========================================
  Files           49       49           
  Lines         3418     3418           
  Branches       392      392           
========================================
  Hits          2794     2794           
  Misses         533      533           
  Partials        91       91

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4c9784...96e171f. Read the comment docs.

@eloquence
Member

Blocked pending reply from supervisor upstream in https://bugs.launchpad.net/ubuntu/+source/supervisor/+bug/1844802 (we'll make additional efforts to contact upstream in the week of 9/30 if warranted), as our preference is to not replace the dependency if we don't have to.

@eloquence
Member

At standup today we discussed that, if there is no progress on getting a Python 3 version included in Xenial, it's too late for this PR to make it into 1.1.0. We will then need to address this in a point release (alongside the grsec update - #4843). Moving back to near-term backlog for now.

@rmol
Contributor Author

rmol commented Oct 23, 2019

Rebased and marked ready for review. The log question was answered; @conorsch correctly pointed out that the services' messages end up in /var/log/syslog.

@eloquence
Member

staging-test-with-rebase failure looks related (cleaned up for readability):

fatal: [xenial-sd-app]: FAILED! => 
{
   "changed":true,
   "cmd":[
      "dpkg-buildpackage",
      "-us",
      "-uc"
   ],
   "delta":"0:00:01.430765",
   "end":"2019-10-23 22:16:45.896067",
   "msg":"non-zero return code",
   "rc":3,
   "start":"2019-10-23 22:16:44.465302",
   "stderr":" dpkg-source --before-build securedrop-app-code_1.2.0~rc1+xenial_amd64\ndpkg-checkbuilddeps: error: Unmet build dependencies: dh-systemd\ndpkg-buildpackage: warning: build dependencies/conflicts unsatisfied; aborting\ndpkg-buildpackage: warning: (Use -d flag to override.)",
   "stderr_lines":[
      " dpkg-source --before-build securedrop-app-code_1.2.0~rc1+xenial_amd64",
      "dpkg-checkbuilddeps: error: Unmet build dependencies: dh-systemd",
      "dpkg-buildpackage: warning: build dependencies/conflicts unsatisfied; aborting",
      "dpkg-buildpackage: warning: (Use -d flag to override.)"
   ],
   "stdout":"dpkg-buildpackage: source package securedrop-app-code\ndpkg-buildpackage: source version 1.2.0~rc1+xenial\ndpkg-buildpackage: source distribution xenial\ndpkg-buildpackage: source changed by SecureDrop Team <[email protected]>\ndpkg-buildpackage: host architecture amd64",
   "stdout_lines":[
      "dpkg-buildpackage: source package securedrop-app-code",
      "dpkg-buildpackage: source version 1.2.0~rc1+xenial",
      "dpkg-buildpackage: source distribution xenial",
      "dpkg-buildpackage: source changed by SecureDrop Team <[email protected]>",
      "dpkg-buildpackage: host architecture amd64"
   ]
}

@rmol rmol force-pushed the 4783-supervisor-to-systemd branch from dac568f to 7bb2d0f on October 25, 2019 13:28
Contributor

@redshiftzero redshiftzero left a comment


this is looking good! I've run through the test plan and all works as advertised. One thing we should also do is ensure that we preserve the behavior added in #4713 regarding automatic requeuing of interrupted deletions: if, with the systemd change, these interrupted jobs are added to the failed job registry instead of the started job registry, we could modify rqrequeue to requeue jobs from FailedJobRegistry in addition to StartedJobRegistry (a rough sketch follows this comment). What do you think?

this is an important change so we should:

  • have at least another person review (cc @kushaldas or @emkll maybe?)
  • prior to final merge run the upgrade scenario
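
For concreteness, a minimal sketch of what requeuing from both registries could look like with rq's API (assuming rq 1.x, which provides FailedJobRegistry; the queue name comes from the worker log above, everything else is illustrative). Note that it blindly requeues everything in the failed registry, which is exactly the objection raised below.

from redis import Redis
from rq import Queue
from rq.exceptions import NoSuchJobError
from rq.job import Job
from rq.registry import FailedJobRegistry, StartedJobRegistry

conn = Redis()
queue = Queue("default", connection=conn)

def requeue_interrupted_jobs():
    # Jobs left behind in the started registry (worker killed outright) plus
    # jobs the worker moved to the failed registry during a clean stop.
    job_ids = set(StartedJobRegistry("default", connection=conn).get_job_ids())
    job_ids.update(FailedJobRegistry("default", connection=conn).get_job_ids())
    for job_id in job_ids:
        try:
            job = Job.fetch(job_id, connection=conn)
        except NoSuchJobError:
            continue
        queue.enqueue_job(job)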

@@ -9,7 +9,6 @@ cd "${REPOROOT}/securedrop"
source "${BASH_SOURCE%/*}/dev-deps"

run_redis &
run_supervisor &
Contributor

we'll also need to use this systemd service in the dev env, or else we'll reopen #4733

Contributor Author

Good point. I'll take a look.

Contributor Author

I think running systemd in a Docker container is going to be more trouble than it's worth just to run rqworker and rqrequeue in the dev container.

@rmol
Contributor Author

rmol commented Oct 25, 2019

I don't know about trying to restart jobs in the failed registry. We'd need to somehow distinguish interruptions from real failures, and ensure that we're not rerunning jobs that aren't idempotent. What we do now (hashing and deletion) is idempotent, but that could change, and this service would have to be made less generic at that point.

Hashing will already be done as needed when the files are downloaded, and failed deletions will pop up in daily OSSEC alerts. It's not ideal to be bothering admins, but it is less complicated and error-prone. Since interruptions should be happening much less frequently now that deletion is so much faster, I'd be inclined to surface what should be very few failures caused by interruption.

@redshiftzero
Contributor

hmm yeah, good point. Actually, do we know why those jobs are not in the started job registry and instead end up in the failed job registry? I don’t get why this systemd change produces that behavior.

@rmol
Contributor Author

rmol commented Oct 25, 2019

I think what's changed is that systemd stops the workers properly, so the rq worker has time to stop its jobs and move them to the failed registry. Under supervisor it didn't: those processes were just killed, and the record of them stayed in the started registry. Whether that was due to a problem in supervisor, or just a timing problem from having another control layer in the stack, I don't know.
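
For context, this matches systemd's default stop behaviour. With no special settings, a stop looks roughly like the fragment below; the values shown are systemd's documented defaults, not settings added by this PR, and that SIGTERM-then-wait window is what gives the rq worker time to record the interrupted jobs.

[Service]
# systemd defaults spelled out: "systemctl stop" sends SIGTERM to the main
# process and waits up to TimeoutStopSec before escalating to SIGKILL.
KillSignal=SIGTERM
TimeoutStopSec=90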

@redshiftzero
Contributor

ah gotcha, that makes sense. I'm a bit concerned about us pushing this cleanup operation onto admins for two reasons:

  • we're notifying admins via OSSEC alerts, which they may ignore or not check, meaning that "deleted" files (albeit encrypted) will be sitting around on the hard disk.
  • our application should be able to reliably delete files (since the server is rebooting every night). If we're not able to do this via rq's job registries after this change, we might need to go back to considering something like a to_delete flag in the database: only delete the row in the database after the file itself is deleted, and then on application start resubmit all the to_delete rows to the job queue (in a similar way to how we clean up expired tokens from the expired token table on application start); see the sketch after this list.
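
Purely to illustrate that alternative, a rough sketch with 2019-era SQLAlchemy 1.x; the Submission model name is real, but the columns, helpers, and queue hook here are hypothetical, not code from this PR.

from sqlalchemy import Boolean, Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Submission(Base):
    __tablename__ = "submissions"
    id = Column(Integer, primary_key=True)
    filename = Column(String, nullable=False)
    to_delete = Column(Boolean, nullable=False, default=False)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def mark_deleted(submission_id):
    # Flag the row; it would only be removed once the file itself is gone.
    session.query(Submission).filter_by(id=submission_id).update({"to_delete": True})
    session.commit()

def resubmit_pending_deletions(enqueue):
    # On application start, re-enqueue every flagged row for secure deletion.
    for sub in session.query(Submission).filter_by(to_delete=True):
        enqueue(sub.id)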

Instead of submitting jobs to rq that invoke rm.secure_delete, move
files to be deleted to a designated directory in the store, and use a
systemd service, securedrop_shredder, to monitor that directory and
clear its contents securely.

Moving files is atomic and quick, so should never be interrupted. If
the secure deletion is interrupted, the securedrop_shredder service
will just find the file on its next pass.

We had discussed adding columns to the database tables involved, and
having the service use those to discover items to delete, but:

- It would have been a bigger, more complicated change.

- It would have introduced the possibility of source information being
  present in the database for quite a while after the admin "deleted"
  it, which could be problematic in the event of raid/seizure. This
  way only the encrypted files might linger; the database records will
  still be deleted immediately.

- It would have required that we always remember to filter using those
  columns to avoid working with deleted content. (SQLAlchemy currently
  has only partial support for setting up this filtering globally:
  https://github.com/sqlalchemy/sqlalchemy/wiki/FilteredQuery)
@lgtm-com

lgtm-com bot commented Nov 19, 2019

This pull request fixes 1 alert when merging 1a09c92 into ed014be - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@rmol
Contributor Author

rmol commented Nov 19, 2019

Deletion is now being handled by a systemd service. I've updated the PR description with implementation details and new testing instructions.

@lgtm-com

lgtm-com bot commented Nov 19, 2019

This pull request fixes 1 alert when merging fbbef3d into ed014be - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

testinfra_hosts = ["app-staging"]


def test_securedrop_rqrequeue_service(host):
Contributor

I'm still testing at the moment, but I'm realizing that rqrequeue will also no longer re-enqueue hashing jobs that aren't completed, i.e. it's not going to work as advertised, since the StartedJobRegistry assumption it's based upon is no longer valid (we know those jobs will be in the failed job registry, but we can't tell whether they failed due to interruption during restart or for some other reason).

My thinking is that we'll want to address that by just removing the rqrequeue service entirely. This one is easier to handle from the application's perspective, as on app start we can simply resubmit jobs for any files that don't have a hash stored in the database (a rough sketch follows). Anyway, since this is an even larger change and we want to get this change in for 1.2.0, let's do it in a followup. Let me know if you think that approach doesn't make sense.
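
A rough sketch of that app-start resubmission, reusing the job function visible in the worker log earlier in the thread; the "checksum" column name and the storage.path() helper used to rebuild the file path are assumptions, not verified against the current code.

from redis import Redis
from rq import Queue

import store
from models import Submission

def resubmit_missing_checksums(db_session, storage, db_uri):
    queue = Queue("default", connection=Redis())
    for submission in db_session.query(Submission).filter(Submission.checksum.is_(None)):
        # storage.path() as used here is an assumed helper for building the
        # absolute path under /var/lib/securedrop/store.
        path = storage.path(submission.source.filesystem_id, submission.filename)
        queue.enqueue(
            store.queued_add_checksum_for_file,
            Submission,
            submission.id,
            path,
            db_uri,
        )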

Contributor Author

It's affected the same way by the switch to systemd, yes. But it's less of a problem, as we're just using the checksums for ETags, and if they're missing they're calculated as needed.

My vote is to move hashing to another simple systemd service and remove rq entirely. It's too late to do it here, but we'd eliminate a lot of code and config, drop rq as a dependency we need to keep updated and reviewed, and possibly drop Redis as well.

Contributor

oh yeah, they're calculated on the fly, great point. Since we need to serve those in the headers as responses from the API, we should ensure they are hashed asynchronously before the user requests them, as, at least for large files, it takes too long (file.out is 500 MB here):

www-data@app-staging:/var/www/securedrop$ time shasum -a 256 file.out
38f7c0648553d81ad9402ebdd1b275a0029644c5b7eef7c963dfa7db9ef0ba23  file.out

real    0m3.278s
user    0m2.557s
sys     0m0.452s

We could continue to use rq or replace rqrequeue with another systemd service just for hashing (open to either tbh), to discuss further and then incorporate as part of 1.3.0 🚀

Changed store.move_to_shredder to accept a path, pushing the
responsibility for obtaining said path to the caller, since it's
sometimes necessary to delete in a context where you don't have
filesystem_id.
Remove hard-coded directories from systemd service templates where we
have Ansible variables for them.

In the securedrop-app-code rules, add shredder to the dh-systemd calls
to ensure the service is handled properly during upgrades.
@lgtm-com

lgtm-com bot commented Nov 19, 2019

This pull request fixes 1 alert when merging b7e4131 into ed014be - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@lgtm-com

lgtm-com bot commented Nov 19, 2019

This pull request fixes 1 alert when merging e932d3a into ed014be - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@redshiftzero
Contributor

hmm, I'm seeing permissions issues when deleting files, and trying to figure out why this is happening (the permissions issue is due to /var/lib/securedrop/shredder/ being owned by root):

[Tue Nov 19 23:23:31.701665 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/opt/venvs/securedrop-app-code/lib/python3.5/site-packages/flask/app.py", line 1799, in dispatch_request
[Tue Nov 19 23:23:31.701671 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     return self.view_functions[rule.endpoint](**req.view_args)
[Tue Nov 19 23:23:31.701679 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/var/www/securedrop/journalist_app/col.py", line 67, in process
[Tue Nov 19 23:23:31.701684 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     return method(cols_selected)
[Tue Nov 19 23:23:31.701691 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/var/www/securedrop/journalist_app/utils.py", line 237, in col_delete
[Tue Nov 19 23:23:31.701697 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     delete_collection(filesystem_id)
[Tue Nov 19 23:23:31.701704 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/var/www/securedrop/journalist_app/utils.py", line 262, in delete_collection
[Tue Nov 19 23:23:31.701709 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     current_app.storage.move_to_shredder(path)
[Tue Nov 19 23:23:31.701717 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/var/www/securedrop/store.py", line 231, in move_to_shredder
[Tue Nov 19 23:23:31.701722 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     dest = os.path.join(tempfile.mkdtemp(dir=self.__shredder_path), relpath)
[Tue Nov 19 23:23:31.701741 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/opt/venvs/securedrop-app-code/lib/python3.5/tempfile.py", line 507, in mkdtemp
[Tue Nov 19 23:23:31.701747 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     _os.mkdir(file, 0o700)
[Tue Nov 19 23:23:31.701766 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354] PermissionError: [Errno 13] Permission denied: '/var/lib/securedrop/shredder/tmps5ocd0xq'
[Tue Nov 19 23:23:31.701777 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]
[Tue Nov 19 23:23:31.701782 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354] During handling of the above exception, another exception occurred:
[Tue Nov 19 23:23:31.701786 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]
[Tue Nov 19 23:23:31.701794 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354] Traceback (most recent call last):
[Tue Nov 19 23:23:31.701815 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/opt/venvs/securedrop-app-code/lib/python3.5/site-packages/flask/app.py", line 2309, in __call__
[Tue Nov 19 23:23:31.701821 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     return self.wsgi_app(environ, start_response)
[Tue Nov 19 23:23:31.701828 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/opt/venvs/securedrop-app-code/lib/python3.5/site-packages/flask/app.py", line 2295, in wsgi_app
[Tue Nov 19 23:23:31.701834 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     response = self.handle_exception(e)
[Tue Nov 19 23:23:31.701841 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/opt/venvs/securedrop-app-code/lib/python3.5/site-packages/flask/app.py", line 1748, in handle_exception
[Tue Nov 19 23:23:31.701846 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     return self.finalize_request(handler(e), from_error_handler=True)
[Tue Nov 19 23:23:31.701854 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]   File "/var/www/securedrop/journalist_app/__init__.py", line 97, in _handle_http_exception
[Tue Nov 19 23:23:31.701859 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354]     handler = list(app.error_handler_spec['api'][error.code].values())[0]
[Tue Nov 19 23:23:31.701871 2019] [wsgi:error] [pid 1484:tid 121987389728512] [remote 127.0.0.1:48354] AttributeError: 'PermissionError' object has no attribute 'code'

@redshiftzero
Contributor

OK, my bad, this was a testing error (I had some changes applied locally when I saw CI failing earlier, so that must have been it). I reprovisioned everything in staging, retested following the test plan, and all is working as advertised now.

We've tested that the creation in store.py works, but it also can't
hurt to do it right off the bat in postinst.
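
For illustration, the postinst side of that could be as small as the fragment below; the ownership and mode shown are assumptions (the app runs as www-data, and the earlier traceback was caused by the directory being owned by root), not the actual maintainer script.

#!/bin/sh
# Hypothetical postinst fragment: make sure the shredder directory exists and
# is writable by the web application before the services first run.
set -e

install -d -m 0700 -o www-data -g www-data /var/lib/securedrop/shredder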

Contributor

@redshiftzero redshiftzero left a comment

all my comments are addressed, and this is working as advertised following the test plan. I've also made a followup for rqrequeue in #5006.

thanks @rmol, nice work!

@lgtm-com

lgtm-com bot commented Nov 20, 2019

This pull request fixes 1 alert when merging fdcd989 into ed014be - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@redshiftzero redshiftzero merged commit 4524ac6 into freedomofpress:develop Nov 20, 2019
@zenmonkeykstop zenmonkeykstop mentioned this pull request Nov 22, 2019
@rocodes rocodes mentioned this pull request Nov 27, 2019
Successfully merging this pull request may close these issues.

supervisor is using python 2