[SOLVED] Centos - Nginx - Gunicorn - Django - Kaleido on Digital Ocean Droplet not working because of selinux policy [EDITED] #37

Closed
irwanOyong opened this issue Aug 26, 2020 · 15 comments


@irwanOyong

irwanOyong commented Aug 26, 2020

Hi! Thank you for the hard work.

I have a question about running Kaleido in a CentOS - Nginx - Gunicorn - Django stack on a DigitalOcean Droplet behind Cloudflare SSL.

It works seamlessly in my local development environment (Ubuntu/Windows), but not in the environment described above (staging).

As the gunicorn status below shows, the workers exit and restart when I try to export plots with the fig.write_image function.

● gunicorn.service - Gunicorn Django Daemon
   Loaded: loaded (/etc/systemd/system/gunicorn.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-08-26 16:47:48 UTC; 56s ago
 Main PID: 288482 (gunicorn)
    Tasks: 37 (limit: 11328)
   Memory: 276.9M
   CGroup: /system.slice/gunicorn.service
           ├─288482 /home/lqophe/pheEnv/bin/python /home/lqophe/pheEnv/bin/gunicorn --access-logfile - --workers 3 --bind unix:/home/lqophe/pheLQO/pheLQO.sock pheLQO.wsgi:application
           ├─288485 /home/lqophe/pheEnv/bin/python /home/lqophe/pheEnv/bin/gunicorn --access-logfile - --workers 3 --bind unix:/home/lqophe/pheLQO/pheLQO.sock pheLQO.wsgi:application
           ├─288488 /home/lqophe/pheEnv/bin/python /home/lqophe/pheEnv/bin/gunicorn --access-logfile - --workers 3 --bind unix:/home/lqophe/pheLQO/pheLQO.sock pheLQO.wsgi:application
           ├─288506 /bin/bash /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/kaleido plotly --disable-gpu --plotlyjs='/home/lqophe/pheEnv/lib/python3.6/site-packages/plotly/package_data/plotly.min.js' --mathjax='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js'
           ├─288511 ./bin/kaleido --no-sandbox --allow-file-access-from-files --disable-breakpad plotly --disable-gpu --plotlyjs='/home/lqophe/pheEnv/lib/python3.6/site-packages/plotly/package_data/plotly.min.js' --mathjax='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js'
           ├─288513 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=zygote --no-zygote-sandbox --no-sandbox --headless --headless
           ├─288514 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=zygote --no-sandbox --headless --headless
           ├─288527 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=gpu-process --field-trial-handle=7383061888025097530,4985302504621741626,131072 --no-sandbox --disable-breakpad --headless --ozone-platform=headless --headless --gpu-preferences=OAAAAAAAAAAgAAAgAAAAAAAAAAAAAAAAAABgAAAAAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAA== --use-gl=swiftshader-webgl --override-use-software-gl-for-tests --shared-files
           ├─288529 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=utility --field-trial-handle=7383061888025097530,4985302504621741626,131072 --lang=en-US --service-sandbox-type=network --no-sandbox --use-gl=swiftshader-webgl --headless --shared-files
           └─288562 /home/lqophe/pheEnv/bin/python /home/lqophe/pheEnv/bin/gunicorn --access-logfile - --workers 3 --bind unix:/home/lqophe/pheLQO/pheLQO.sock pheLQO.wsgi:application

Agu 26 16:47:48 staging-centos-s-2vcpu-2gb-sgp1-01 systemd[1]: Started Gunicorn Django Daemon.
Agu 26 16:47:48 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:47:48 +0000] [288482] [INFO] Starting gunicorn 20.0.4
Agu 26 16:47:48 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:47:48 +0000] [288482] [INFO] Listening at: unix:/home/lqophe/phe>
Agu 26 16:47:48 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:47:48 +0000] [288482] [INFO] Using worker: sync
Agu 26 16:47:48 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:47:48 +0000] [288485] [INFO] Booting worker with pid: 288485
Agu 26 16:47:49 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:47:48 +0000] [288487] [INFO] Booting worker with pid: 288487
Agu 26 16:47:49 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:47:49 +0000] [288488] [INFO] Booting worker with pid: 288488
Agu 26 16:48:33 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:48:33 +0000] [288482] [CRITICAL] WORKER TIMEOUT (pid:288487)
Agu 26 16:48:33 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 23:48:33 +0700] [288487] [INFO] Worker exiting (pid: 288487)
Agu 26 16:48:33 staging-centos-s-2vcpu-2gb-sgp1-01 gunicorn[288482]: [2020-08-26 16:48:33 +0000] [288562] [INFO] Booting worker with pid: 288562
@jonmmease
Collaborator

Hi @irwanOyong,

Can you SSH into the droplet and try exporting an image from a python/ipython session? This will help narrow in on where things are going wrong.

Also, FWIW, these lines look odd

           ├─288511 ./bin/kaleido --no-sandbox --allow-file-access-from-files --disable-breakpad plotly --disable-gpu --plotlyjs='/home/lqophe/pheEnv/lib/python3.6/site-packages/plotly/package_data/plotly.min.js' --mathjax='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js'
           ├─288513 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=zygote --no-zygote-sandbox --no-sandbox --headless --headless
           ├─288514 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=zygote --no-sandbox --headless --headless
           ├─288527 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=gpu-process --field-trial-handle=7383061888025097530,4985302504621741626,131072 --no-sandbox --disable-breakpad --headless --ozone-platform=headless --headless --gpu-preferences=OAAAAAAAAAAgAAAgAAAAAAAAAAAAAAAAAABgAAAAAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAA== --use-gl=swiftshader-webgl --override-use-software-gl-for-tests --shared-files
           ├─288529 /home/lqophe/pheEnv/lib/python3.6/site-packages/kaleido/executable/bin/kaleido --type=utility --field-trial-handle=7383061888025097530,4985302504621741626,131072 --lang=en-US --service-sandbox-type=network --no-sandbox --use-gl=swiftshader-webgl --headless --shared-files

The top line lists the chromium command line arguments that kaleido uses. But the following lines list a bunch of command line arguments that aren't specified by kaleido (e.g. --type=zygote, --headless, etc.). Do you know where these arguments are coming from?

@irwanOyong
Author

irwanOyong commented Aug 27, 2020

Thank you for the response @jonmmease

And yes, I am able to SSH into the droplet and export an image from a plotly go.Figure in a CLI python session without any warnings or errors.

I think those lines also come from the Chromium, no?

@jonmmease
Collaborator

> I think those lines also come from the Chromium, no?

They are flags that are accepted by Chromium, but I'm not clear on how they are being set. Here are the flags that we set when calling the kaleido executable, which passes them on to Chromium: https://github.com/plotly/Kaleido/blob/master/repos/linux_scripts/launch_script. I guess these may be subprocesses that chromium launches internally.

Are you using the --preload gunicorn flag? If so, could you try calling gunicorn without it? And if not, try calling gunicorn with it?

@irwanOyong
Author

irwanOyong commented Aug 27, 2020

I was not using --preload. I tried it just now, but the problem persists.

Also, I noticed that the response to the request is 504: Gateway time-out. Is there any possibility that the problem comes from an unsuitable Nginx configuration?

Here is what I just found in the Nginx error log:
2020/08/27 16:21:27 [error] 92163#0: *621626 upstream timed out (110: Connection timed out) while reading response header from upstream, client: my_ip_address, server: my_staging_ip_address, request: "GET /api/bubble-map/?reservoir_id=1&base=reservoir-well HTTP/1.1", upstream: "http://unix:/home/lqophe/pheLQO/pheLQO.sock/api/bubble-map/?reservoir_id=1&base=reservoir-well", host: "my_staging_ip_address"

My client_body_timeout and client_header_timeout are set to 3 minutes btw, which should be more than enough to finish the process.
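
For context, the WORKER TIMEOUT entry in the gunicorn log earlier in this thread comes from gunicorn's own worker timeout (30 seconds by default), which is independent of the Nginx timeouts. Below is a minimal sketch of a gunicorn.conf.py that raises it and routes worker output to the journal. The bind path, worker count, and access log setting are copied from the systemd status output above; the values 120 and "debug" are illustrative assumptions, not recommendations.

```python
# gunicorn.conf.py -- illustrative sketch; adjust the values to your setup.

# Copied from the gunicorn command line shown in the systemd status output.
bind = "unix:/home/lqophe/pheLQO/pheLQO.sock"
workers = 3
accesslog = "-"  # same as --access-logfile -

# Seconds a worker may stay silent before the master kills and restarts it.
# Gunicorn's default is 30.
timeout = 120  # assumed value, for diagnosis only

# Write the error log to stderr and capture worker stdout/stderr into it,
# so tracebacks from the Django view show up in `journalctl -u gunicorn`.
errorlog = "-"
capture_output = True
loglevel = "debug"
```

It would be loaded with `gunicorn -c gunicorn.conf.py pheLQO.wsgi:application`. Raising the timeout only buys time to observe the failure; it does not explain why the request hangs.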

@jonmmease
Collaborator

Just to clarify, everything works when you comment out the write_image call?

@irwanOyong
Author

Yes, I can assure that it happens right on this line of code fig.write_image("repository/bubble-pie/well_{}.png".format(well), engine="kaleido").

I tried adding these to the location /api block of my nginx.conf, but nothing changed:
proxy_connect_timeout 600s;
proxy_send_timeout 600s;
proxy_read_timeout 600s;
fastcgi_send_timeout 600s;
fastcgi_read_timeout 600s;

Also to clarify, do we need GPU to finish the write_image with kaleido? I see --disable-gpu argument here but just to make sure.

@jonmmease
Collaborator

> Yes, I can assure that it happens right on this line of code fig.write_image("repository/bubble-pie/well_{}.png".format(well), engine="kaleido").

Ok, thanks for confirming.

> Also to clarify, do we need GPU to finish the write_image with kaleido? I see --disable-gpu argument here but just to make sure.

No, a GPU is not required.

I don't have any other ideas off hand. Do you have any other logging from your app itself? Can you tell if the write_image call is hanging indefinitely vs. raising an exception? Have you tried with a single gunicorn worker?
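
One way to tell a hang from an exception is to run the export in a helper thread with a deadline and log the outcome. This is a minimal sketch only: `run_with_timeout` is a hypothetical helper, not part of Kaleido or plotly, and the 30-second deadline in the usage comment is an arbitrary assumption.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread with a deadline.

    Returns ("ok", result), ("error", exception) or ("timeout", None).
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    name = getattr(func, "__name__", repr(func))
    if worker.is_alive():
        # The call has neither returned nor raised: it is hanging.
        logger.error("%s still running after %s s", name, timeout_s)
        return ("timeout", None)
    if "error" in outcome:
        logger.error("%s raised", name, exc_info=outcome["error"])
        return ("error", outcome["error"])
    return ("ok", outcome["value"])


# Hypothetical usage inside the Django view:
# status, detail = run_with_timeout(
#     fig.write_image, 30, "out.png", engine="kaleido")
```

The helper does not cancel the call; on a timeout the thread keeps running in the background, so this is meant for diagnosis rather than production use.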

@irwanOyong
Author

I tried using a single gunicorn worker; nothing changed.

But I found something: using the same code, I ran my Django server in development mode (python3 manage.py runserver 0.0.0.0:8000) and sent a request via Postman to my_staging_ip_address:8000/corresponding_api. It worked as expected and the images were exported successfully.

The latest response, after I updated some _timeout settings on my server, is 524: A timeout occurred. It comes from Cloudflare after 100 seconds without a response, which is odd because the same process in development mode (mentioned above) finished in less than 2 seconds.

Btw, thank you for continuing to respond even though the problem doesn't come directly from Kaleido. I really appreciate it.

@jonmmease
Collaborator

Ok, so it sounds like it's specifically a problem under gunicorn. And again, to double check, does running with gunicorn on your local development machine work properly?

If it were a problem with gunicorn everywhere, then I'd suspect it has something to do with gunicorn's process forking model, but if it's only a problem on a specific OS configuration, that doesn't make as much sense.

@irwanOyong
Author

irwanOyong commented Aug 29, 2020

Unfortunately I don't run the project with gunicorn on my local development machine, but I will try to set it up that way to narrow down the problem.

Btw, I found this error message when I restarted gunicorn while the request was still hanging, before the 100-second mark.

ValueError at /api/bubble-map/
Failed to start Kaleido subprocess. Error stream:

[0829/174013.805387:WARNING:resource_bundle.cc(435)] locale_file_path.empty() for locale 
[0829/174013.814742:WARNING:resource_bundle.cc(435)] locale_file_path.empty() for locale 
[0829/174013.814747:WARNING:resource_bundle.cc(435)] locale_file_path.empty() for locale 
[0829/174013.836980:WARNING:resource_bundle.cc(435)] locale_file_path.empty() for locale 

Fatal error in , line 0
Check failed: reservation_.SetPermissions(protect_start, protect_size, permission).
FailureMessage Object: 0x7ffde5961200#0 0x558d6f2c38a9 base::debug::CollectStackTrace()
1 0x558d6f232203 base::debug::StackTrace::StackTrace()
2 0x558d707738cd gin::(anonymous namespace)::PrintStackTrace()
3 0x558d70670d25 V8_Fatal()
4 0x558d6e59f3f5 v8::internal::MemoryChunk::DecrementWriteUnprotectCounterAndMaybeSetPermissions()
5 0x558d6e5a25bc v8::internal::PagedSpace::SetReadAndExecutable()
6 0x558d6e4d2ed0 v8::internal::Isolate::Init()
7 0x558d6e4d3449 v8::internal::Isolate::InitWithSnapshot()
8 0x558d6e871ddf v8::internal::Snapshot::Initialize()
9 0x558d6e3dce4b v8::Isolate::Initialize()
10 0x558d7076ff74 gin::IsolateHolder::IsolateHolder()
11 0x558d706a817e blink::V8PerIsolateData::V8PerIsolateData()
12 0x558d706a8a29 blink::V8PerIsolateData::Initialize()
13 0x558d70793334 blink::V8Initializer::InitializeMainThread()
14 0x558d71cec394 blink::(anonymous namespace)::InitializeCommon()
15 0x558d71ce4048 content::RenderThreadImpl::InitializeWebKit()
16 0x558d71ce29a0 content::RenderThreadImpl::Init()
17 0x558d71ce3e27 content::RenderThreadImpl::RenderThreadImpl()
18 0x558d72333d65 content::RendererMain()
19 0x558d6efc5d18 content::RunZygote()
20 0x558d6efc6c85 content::ContentMainRunnerImpl::Run()
21 0x558d704b5281 service_manager::Main()
22 0x558d6efc0081 content::ContentMain()
23 0x558d6f01b51d headless::(anonymous namespace)::RunContentMain()
24 0x558d6f01b36b headless::RunChildProcessIfNeeded()
25 0x558d6d2098a9 main
26 0x7f58b1c5f6a3 __libc_start_main
27 0x558d6d2034ea _start
Received signal 4 ILL_ILLOPN 558d706738d2
0 0x558d6f2c38a9 base::debug::CollectStackTrace()
1 0x558d6f232203 base::debug::StackTrace::StackTrace()
2 0x558d6f2c3445 base::debug::(anonymous namespace)::StackDumpSignalHandler()
3 0x7f58b318fdd0 (/usr/lib64/libpthread-2.28.so+0x12dcf)
4 0x558d706738d2 v8::base::OS::Abort()
5 0x558d70670d32 V8_Fatal()
6 0x558d6e59f3f5 v8::internal::MemoryChunk::DecrementWriteUnprotectCounterAndMaybeSetPermissions()
7 0x558d6e5a25bc v8::internal::PagedSpace::SetReadAndExecutable()
8 0x558d6e4d2ed0 v8::internal::Isolate::Init()
9 0x558d6e4d3449 v8::internal::Isolate::InitWithSnapshot()
10 0x558d6e871ddf v8::internal::Snapshot::Initialize()
11 0x558d6e3dce4b v8::Isolate::Initialize()
12 0x558d7076ff74 gin::IsolateHolder::IsolateHolder()
13 0x558d706a817e blink::V8PerIsolateData::V8PerIsolateData()
14 0x558d706a8a29 blink::V8PerIsolateData::Initialize()
15 0x558d70793334 blink::V8Initializer::InitializeMainThread()
16 0x558d71cec394 blink::(anonymous namespace)::InitializeCommon()
17 0x558d71ce4048 content::RenderThreadImpl::InitializeWebKit()
18 0x558d71ce29a0 content::RenderThreadImpl::Init()
19 0x558d71ce3e27 content::RenderThreadImpl::RenderThreadImpl()
20 0x558d72333d65 content::RendererMain()
21 0x558d6efc5d18 content::RunZygote()
22 0x558d6efc6c85 content::ContentMainRunnerImpl::Run()
23 0x558d704b5281 service_manager::Main()
24 0x558d6efc0081 content::ContentMain()
25 0x558d6f01b51d headless::(anonymous namespace)::RunContentMain()
26 0x558d6f01b36b headless::RunChildProcessIfNeeded()
27 0x558d6d2098a9 main
28 0x7f58b1c5f6a3 __libc_start_main
29 0x558d6d2034ea _start
  r8: 00007f58b1ffa810  r9: 00007f58b37b7800 r10: 0000000008ef1c8a r11: 0000000000000000
 r12: 0000003000000008 r13: 00007ffde5961510 r14: 0000558d6b340c5a r15: 00007ffde59614c0
  di: 00007f58b1ff95e0  si: 00007f58b1ffa810  bp: 00007ffde59611f0  bx: 00007f58b1ff97a0
  dx: 0000000000000000  ax: 0000000000000000  cx: 0000000000000b40  sp: 00007ffde59611f0
  ip: 0000558d706738d2 efl: 0000000000010202 cgf: 002b000000000033 erf: 0000000000000000
 trp: 0000000000000006 msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]
Calling _exit(1). Core file will not be generated.

@jonmmease
Collaborator

Thanks @irwanOyong, this error message is helpful.

After a little searching, I've seen a few references to this kind of error being caused by SELinux security policies, e.g. https://forums.gentoo.org/viewtopic-t-1114806-start-0.html.

I'm not very familiar with SELinux, but it would be worth checking whether it is enabled on the droplet and, if so, trying to disable it: https://www.tecmint.com/disable-selinux-in-centos-rhel-fedora/.
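
A quick way to check the current mode from Python, as a sketch that assumes the standard selinuxfs mount point /sys/fs/selinux (the `getenforce` command reports the same information). The function name `selinux_mode` is made up for this example.

```python
from pathlib import Path


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'enforcing', 'permissive', or 'disabled'.

    The kernel exposes the current mode as "1" (enforcing) or "0"
    (permissive) in this file. If the file is missing, SELinux is
    disabled or not installed.
    """
    path = Path(enforce_path)
    if not path.exists():
        return "disabled"
    return "enforcing" if path.read_text().strip() == "1" else "permissive"


# Example: call selinux_mode() on the droplet and log the result.
```

If this reports "enforcing", temporarily switching to permissive mode is enough to test whether SELinux is blocking Kaleido.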

@irwanOyong
Author

Oh my, you are right. I set SELinux to permissive mode (rather than disabling it entirely), and it works like a charm.

Last question: is there any way to keep using Kaleido without changing the default SELinux mode? It is set that way for security reasons (according to those who set it up), and many people write that disabling SELinux or setting it to permissive mode is not recommended. But only if you and the team already have something at hand; no need to dig too deep, because you have helped me a lot already. Thank you.

@jonmmease
Collaborator

Hi @irwanOyong, great! Yeah, I wasn't recommending disabling selinux permanently, just to test things out. I haven't worked with selinux much, but there must be some way to allow the execution of individual executables.

The native executable will be located in a directory under

/path/to/site-packages/kaleido/executable/

So the trick will be working out how to tell selinux to allow execution from this directory. If you work out a solution for allowing this, please add it to the issue here! Also, please let us know if you come across information on anything that we could do on our end to avoid getting flagged.

@irwanOyong
Author

irwanOyong commented Aug 31, 2020

Good news @jonmmease !

In short, after confirming that the problem comes from the SELinux policy, here is how I resolved the issue.

First, run grep denied /var/log/audit/audit.log to see the denied services/processes in the audit log. It gives something like this:

type=AVC msg=audit(1598895759.536:296665): avc:  denied  { execmem } for  pid=362414 comm="kaleido" scontext=system_u:system_r:init_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=process permissive=0
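
To make a denial like this easier to read, the fields can be pulled apart with a small parser. The following is a sketch that assumes the single-line AVC format shown above; `parse_avc` is a hypothetical helper, not part of any SELinux tooling.

```python
import re

# Matches the "avc: denied { perms } for key=value ..." part of an AVC record.
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}\s+for\s+(?P<rest>.*)'
)


def parse_avc(line):
    """Return a dict of fields from an AVC denial line, or None."""
    match = AVC_RE.search(line)
    if match is None:
        return None
    # key=value pairs; values may be double-quoted (e.g. comm="kaleido").
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', match.group("rest"))
    fields = {key: value.strip('"') for key, value in pairs}
    fields["perms"] = match.group("perms").split()
    return fields


sample = (
    'type=AVC msg=audit(1598895759.536:296665): avc:  denied  { execmem } '
    'for  pid=362414 comm="kaleido" scontext=system_u:system_r:init_t:s0 '
    'tcontext=system_u:system_r:init_t:s0 tclass=process permissive=0'
)
info = parse_avc(sample)
print(info["comm"], info["perms"], info["scontext"])
# prints: kaleido ['execmem'] system_u:system_r:init_t:s0
```

Read this way, the record says that a process named kaleido, running in the init_t domain, was denied the execmem permission on itself.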

For a more human-friendly view, run audit2allow -w -a to see what was denied and why.

type=AVC msg=audit(1598896225.774:296754): avc:  denied  { execmem } for  pid=363485 comm="kaleido" 
scontext=system_u:system_r:init_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=process permissive=1
Was caused by:
Missing type enforcement (TE) allow rule.
You can use audit2allow to generate a loadable module to allow this access.

Running audit2allow -a shows the rules that would allow them.

#============= httpd_t ==============
#!!!! This avc can be allowed using the boolean 'httpd_read_user_content'
allow httpd_t user_home_t:file { open read };
allow httpd_t user_home_t:sock_file write;

#============= init_t ==============
allow init_t home_cert_t:file { getattr lock open read write };
allow init_t self:process execmem;
allow init_t user_home_t:file append;

Then run audit2allow -a -M kaleido to generate a policy module named kaleido from those denials.

******************** IMPORTANT ***********************
To make this policy package active, execute:

semodule -i kaleido.pp
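
Note that audit2allow -a reads every denial in the audit log, so the generated module also includes the httpd_t rules listed above. One possible way to narrow it down is to filter the log to kaleido denials first and feed only those to audit2allow. This is a sketch under assumptions: the helper names and the output file name kaleido_denials.log are hypothetical.

```python
def filter_denials(lines, comm):
    """Keep only AVC denial lines emitted for the given command name."""
    needle = 'comm="{}"'.format(comm)
    return [
        line for line in lines
        if "avc:" in line and "denied" in line and needle in line
    ]


def write_filtered(src="/var/log/audit/audit.log",
                   dst="kaleido_denials.log",  # hypothetical file name
                   comm="kaleido"):
    """Copy matching denials from src to dst; return how many were kept."""
    with open(src, errors="replace") as source:
        kept = filter_denials(source, comm)
    with open(dst, "w") as target:
        target.writelines(kept)
    return len(kept)
```

The filtered file could then be passed to audit2allow with its -i option (audit2allow -i kaleido_denials.log -M kaleido). Even then, a rule such as allow init_t self:process execmem applies to every process in the init_t domain, not only to kaleido.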

The last step is installing the module with semodule -i kaleido.pp. Afterwards, audit2allow -a reports the rules as already allowed:

#============= httpd_t ==============

#!!!! This avc is allowed in the current policy
allow httpd_t user_home_t:file { open read };

#!!!! This avc is allowed in the current policy
allow httpd_t user_home_t:sock_file write;

#============= init_t ==============

#!!!! This avc is allowed in the current policy
allow init_t home_cert_t:file { getattr lock open read write };

#!!!! This avc is allowed in the current policy
allow init_t self:process execmem;

#!!!! This avc is allowed in the current policy
allow init_t user_home_t:file append;

I don't know if this helps, but from what I read, running your service under init_t is not recommended. I am no expert in sysadmin matters, though, so I am not really sure :(

Thank you so much for your help over these few days!

@irwanOyong irwanOyong changed the title Centos - Nginx - Gunicorn - Django - Kaleido on Digital Ocean Droplet not working as expected [SOLVED] Centos - Nginx - Gunicorn - Django - Kaleido on Digital Ocean Droplet not working as expected because of selinux policy [EDITED] Aug 31, 2020
@irwanOyong irwanOyong changed the title [SOLVED] Centos - Nginx - Gunicorn - Django - Kaleido on Digital Ocean Droplet not working as expected because of selinux policy [EDITED] [SOLVED] Centos - Nginx - Gunicorn - Django - Kaleido on Digital Ocean Droplet not working because of selinux policy [EDITED] Aug 31, 2020
@jonmmease
Collaborator

Thanks for sharing your solution @irwanOyong!
