
Troubleshoot connection issue between parsl and kubernetes cluster #36

Open · julietcohen opened this issue Mar 27, 2024 · 3 comments

julietcohen commented Mar 27, 2024

Progress

The parsl and kubernetes viz workflow has been progressing nicely in the following ways:

  • adjusted filepaths for the viz workflow output, based on the location of the mounted persistent volume (PV) within the container and the WORKDIR specified in the Dockerfile
  • assigned the filepaths and directory names that are commonly used, such as app/ for the WORKDIR and /mnt/data for the PV
  • rearranged the Dockerfile lines to make each new build faster (examples: moving parsl_config.py to a lower line because that script is often updated with a new version number for the published image, and moving the pip install line to right after copying over requirements.txt; see the Dockerfile sketch after this list)
  • the runinfo directory is created on each run (in the dir of the python script on Datateam, not in the container or PV), which is a sign parsl is working behind the scenes to some degree
  • when running the python script, output is written to a log file by adjusting the command to `python parsl_workflow.py > k8s_parsl.log 2>&1`
  • ensured that the default range of ports that parsl uses is open
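
A minimal sketch of that layer ordering, assuming a generic Python base image (the base image and the exact file list are illustrative, not the project's actual Dockerfile):

```dockerfile
# Hypothetical base image; the real one may differ
FROM python:3.9

WORKDIR /app

# Copy requirements.txt first so the pip install layer stays cached
# until requirements.txt itself changes
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy frequently edited scripts last so the cached layers above are reused
COPY parsl_config.py .
COPY parsl_workflow.py .
```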

Problems

  • viz output files are not writing to the specified persistent directory
  • the pods are not shutting down by themselves
  • when you check the pods with `kubectl get pods`, their status is CrashLoopBackOff
[screenshot: kubectl get pods output showing pods in CrashLoopBackOff]
  • checking the log of a specific pod with `kubectl logs {podname}` returns the print statement that we included in the workflow ("Worker started...") plus a vague syntax error:
[screenshot: pod log showing the syntax error]

A good sign

Print statements inserted into the script at all stages are being printed to the log output we specify when we run the python script (in the example command given above, that is `k8s_parsl.log`), including the final statement "script complete".

When running the parsl and kubernetes workflow with a parsl app that neither ingests data files nor writes output files, and instead executes a simple iterative mathematical operation with print statements and sleep periods inserted, the output implies the script worked as expected. However, the pods still do not shut down afterwards, and their status is still CrashLoopBackOff.
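
For reference, a minimal sketch of that kind of no-I/O test app, using parsl's local-threads config just to keep the sketch self-contained (the function, numbers, and sleep length are illustrative, not the exact script):

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # the real run uses the k8s config

parsl.load(config)

@python_app
def slow_square(x):
    import time  # imports must live inside the python_app
    time.sleep(30)  # sleep period so the workers stay busy for a while
    print(f"squared {x}")
    return x * x

# Launch a handful of tasks in parallel and wait on the results
futures = [slow_square(i) for i in range(4)]
print([f.result() for f in futures])
print("script complete")
```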

Useful Commands

`kubectl run -n pdgrun -i --tty --rm busybox --image=busybox -- sh` initiates a pod and drops you into a shell inside it so you can poke around. This pod is named busybox. One way to "poke around" is to check whether the IP address and port you specify in the parsl config are reachable. Example: `telnet 128.111.85.174 54001`
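
The same reachability check can be scripted from Datateam with only the Python standard library; a minimal sketch, using the host and port from the telnet example above:

```python
import socket

# Host and port from the telnet example above
HOST, PORT = "128.111.85.174", 54001

try:
    # Attempt a plain TCP connection with a short timeout
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"{HOST}:{PORT} is reachable")
except OSError as err:
    print(f"{HOST}:{PORT} is NOT reachable: {err}")
```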

Suggested next steps

  • look at the parsl documentation to figure out how it works behind the scenes, and determine why the log files indicate that the mathematical operation and the viz workflow script are indeed working while the pods are still not “talking back” or shutting down properly
  • make sure the ports are open (in addition to the IP address, which was already determined to be open). As mentioned in the progress section, Matt did open these ports during a debugging session, but perhaps they are not consistently open
  • look more into the source of the syntax error reported by the pod log
  • to get more info in the pod logs, comment out the line we have been using thus far, worker_init='echo "Worker started..."', and uncomment the line:

worker_init='echo "Worker started..."; lf=`find . -name \'manager.log\'`; tail -n+1 -f ${lf}',

  • in parsl_config.py, play around with using address=address_by_route() versus address='128.111.85.174' (see the config sketch after this list)
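
A minimal sketch combining the last two suggestions (the image name is hypothetical; the namespace and addresses come from this issue):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import KubernetesProvider
from parsl.addresses import address_by_route

config = Config(
    executors=[
        HighThroughputExecutor(
            label="kube-htex",
            # Option A: let parsl detect the routable address of this host
            address=address_by_route(),
            # Option B: hard-code the callback address
            # address="128.111.85.174",
            provider=KubernetesProvider(
                image="example.org/pdg/viz-workflow:0.1.4",  # hypothetical image name
                namespace="pdgrun",
                # verbose variant: stream manager.log into the pod log
                worker_init='echo "Worker started..."; lf=`find . -name \'manager.log\'`; tail -n+1 -f ${lf}',
                # quiet variant used so far:
                # worker_init='echo "Worker started..."',
            ),
        )
    ],
)
```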

Thank you to Matthew Brook and Matt Jones for all your help troubleshooting thus far!

@julietcohen

This issue is a sub-task towards the ultimate goal of #1

@julietcohen

I was able to resolve the CrashLoopBackOff issue with the published image version 0.1.4. The script `parsl_workflow.py` in this version was running a simple mathematical function in parallel instead of the viz workflow. Running `kubectl get pods` during the process showed that the pods were Running.

[screenshot: kubectl get pods output showing pods with status Running]

The job lasted a while, as it should, because there are sleep periods inserted into the function, and it was running in a tmux session. Throughout the process, the script successfully output the print messages we expect from the function. The print statements in the script itself, outside of the function, were also present, such as Script complete. An example snippet saved to the log:
[screenshot: log snippet showing the expected print output]

When I checked back on the process this morning, I saw something new at the end of the log:
[screenshot: end of the log showing a KeyboardInterrupt message]

I wonder why there is a KeyboardInterrupt message at the end, and I do not know how to interpret the error. I also checked the pods' status again, and they are still Running.

Overall, the workflow has certainly taken a big step forward. I think it is safe to say that the job is running successfully now, and there seems to be something funky going on with how the pods shut down after the script completes.

@julietcohen

A note on the last comment: the KeyboardInterrupt message at the end of the log was left over from a previous run, when I cancelled the command to run the script due to issues. I should have realized that the log is not replaced on each run, but rather appended to, as is the case with the normal viz run as well.

Since the other tickets regarding setting up a new user and an env in the container have been resolved, there may no longer be a "connection issue" between parsl and k8s, but rather just some adjustments to make so that the log and the output viz tilesets are written. The workflow is not running smoothly yet, but I will be able to pinpoint the smaller issues better now that we have the new user and env set up in the container.
