Test job cancel when scheduled on Kubernetes #8

Open
wvengen opened this issue Jan 25, 2024 · 4 comments
Labels
k8s Kubernetes

Comments

wvengen (Member) commented Jan 25, 2024

Stopping jobs mostly works, but it has a number of cases to test.

  1. Just created, but not running yet -> remove job/container without stopping it (not tested)
  2. Running -> send signal (tested in PR Integration tests #21)
  3. Finished -> do nothing (tested in PR Integration tests #21)

Can you think of more corner-cases?
Especially in the first case, there may be various stages (e.g. on Kubernetes, waiting for resources, pulling the image).

See also the documentation on scrapyd's cancel endpoint.

Note that tests have been added in PR #21, including some for job cancellation. The main thing still missing is testing that a job is removed when it is cancelled before it has started. This issue is now about implementing that, including finding a way to test it reliably.
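
Putting the three cases above together, a rough sketch of how the cancel handling could dispatch on job state. This is not the actual scrapyd-k8s implementation; `_get_job_state`, `_delete_job` and `_signal_job` are hypothetical helpers used only for illustration:

```python
from signal import Signals

def cancel(self, project, job_id, signal='TERM'):
    # hypothetical helper returning 'pending', 'running' or 'finished'
    state = self._get_job_state(project, job_id)
    if state == 'pending':
        # case 1: scheduled but not running yet -> remove the job/container without signalling
        self._delete_job(project, job_id)
    elif state == 'running':
        # case 2: running -> forward the signal to the spider process
        self._signal_job(project, job_id, Signals['SIG' + signal].value)
    # case 3: finished -> nothing to do
    return state
```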

@wvengen wvengen changed the title from "Make stopping jobs more robust" to "Make job cancel on Kubernetes more robust" Jan 25, 2024
@wvengen wvengen added the k8s Kubernetes label Jan 31, 2024
wvengen (Member, Author) commented Feb 15, 2024

As part of #7 it appears that sending a signal to a running container doesn't work. PR #21 contains a fix, but killing the spider still doesn't seem to do anything.

An important reason is probably that the spider runs as the init process (PID 1) and so cannot be killed.
Update: enabled shareProcessNamespace in the pod spec, which adds a separate init process; as we only have one container in the pod, this has no other real side-effects.
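
For reference, a minimal sketch of what enabling shareProcessNamespace could look like with the Kubernetes Python client; the container name and image are placeholders, not the actual scrapyd-k8s code:

```python
from kubernetes import client

# With shareProcessNamespace enabled, the pause container becomes PID 1,
# so the spider no longer runs as the init process and can be signalled.
pod_spec = client.V1PodSpec(
    share_process_namespace=True,
    restart_policy='Never',
    containers=[
        client.V1Container(name='spider', image='example/spider:latest'),
    ],
)
```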

wvengen (Member, Author) commented Feb 16, 2024

Fixed behaviour during run in PR #21.
Testing for pending and finished jobs is still missing, and there could perhaps be race conditions (e.g. is the container always running when scrapyd-k8s thinks the job is running?).

wvengen (Member, Author) commented Feb 16, 2024

At the moment, we look at job.status.ready, and if it is set we assume that we can exec into the container. There could be a race condition where the job and pod are running, but the container is not. A solution is described here:

            # kill pod (retry is disabled, so there should be only one pod)
            pod = self._get_pod(project, job_id)
            if pod:  # if a pod has just ended, we're good already, don't kill
                # make sure the container is running - https://stackoverflow.com/a/74833787
                if all(c.state.running for c in pod.status.container_statuses):
                    # Signals comes from Python's standard signal module
                    self._k8s_kill(pod.metadata.name, Signals['SIG' + signal].value)
                else:
                    # TODO: refactor code to fall through to delete the job instead
                    pass
Not encountered yet, so not including this check for now.

@wvengen wvengen changed the title from "Make job cancel on Kubernetes more robust" to "Test job cancel when scheduled on Kubernetes" Feb 27, 2024
wvengen (Member, Author) commented Mar 13, 2024

One approach is to add an option to create a suspended job, e.g. a special query parameter on the schedule endpoint (one that wouldn't be used as a setting), or perhaps a special header. The test can then use it to exercise this scenario.
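
A sketch of what that could look like on the Kubernetes side, assuming the (hypothetical) test-only parameter makes the launcher set suspend on the job spec; Job suspension is GA since Kubernetes 1.24, and the names and image below are placeholders:

```python
from kubernetes import client

# A Job created with suspend=True does not start a pod until it is resumed,
# giving the test a reliable "scheduled but not yet running" state to cancel.
job = client.V1Job(
    api_version='batch/v1',
    kind='Job',
    metadata=client.V1ObjectMeta(name='example-spider-job'),
    spec=client.V1JobSpec(
        suspend=True,       # keep the job pending until resumed
        backoff_limit=0,    # retry disabled, so at most one pod
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy='Never',
                containers=[
                    client.V1Container(name='spider', image='example/spider:latest'),
                ],
            ),
        ),
    ),
)
# client.BatchV1Api().create_namespaced_job(namespace='default', body=job)
```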
