Skip to content

Commit

Permalink
Fix SIGHUP on termination (Solaris patch 260-22964338)
Browse files Browse the repository at this point in the history
This fixes the following bug filed with Solaris: "22964338 ksh93
appears to send SIGHUP to unrelated processes on occasion". It is
fixed by applying this patch by Lijo George from the Solaris repo:
https://github.com/oracle/solaris-userland/blob/master/components/ksh93/patches/260-22964338.patch

The ksh2020 upstream rejected this, but if it's in production use
in Solaris, Solaris, it's probably good enough for 93u+m. If any
breakage is left, it can be fixed later.
att#1

src/cmd/ksh93/include/jobs.h,
src/cmd/ksh93/sh/fault.c,
src/cmd/ksh93/sh/jobs.c:
- Use a new job_hup() function instead of job_kill() to send SIGHUP
  to job processes on termination. The new function checks if a job
  is in fact still live before issuing SIGHUP to it.
  • Loading branch information
McDutchie committed Jan 8, 2021
1 parent ab98ec6 commit 62cf88d
Show file tree
Hide file tree
Showing 3 changed files with 50 additions and 1 deletion.
1 change: 1 addition & 0 deletions src/cmd/ksh93/include/jobs.h
Original file line number Diff line number Diff line change
Expand Up @@ -179,6 +179,7 @@ extern void job_chldtrap(Shell_t*, const char*,int);
extern int job_close(Shell_t*);
extern int job_list(struct process*,int);
extern int job_terminate(struct process*,int);
extern int job_hup(struct process *, int);
extern int job_switch(struct process*,int);
extern void job_fork(pid_t);
extern int job_reap(int);
Expand Down
2 changes: 1 addition & 1 deletion src/cmd/ksh93/sh/fault.c
Original file line number Diff line number Diff line change
Expand Up @@ -634,7 +634,7 @@ void sh_done(void *ptr, register int sig)
#endif
#ifdef JOBS
if((sh_isoption(SH_INTERACTIVE) && shp->login_sh) || (!sh_isoption(SH_INTERACTIVE) && (sig==SIGHUP)))
job_walk(sfstderr,job_terminate,SIGHUP,NIL(char**));
job_walk(sfstderr, job_hup, SIGHUP, NIL(char**));
#endif /* JOBS */
job_close(shp);
if(nv_search("VMTRACE", shp->var_tree,0))
Expand Down
48 changes: 48 additions & 0 deletions src/cmd/ksh93/sh/jobs.c
Original file line number Diff line number Diff line change
Expand Up @@ -1072,6 +1072,54 @@ int job_kill(register struct process *pw,register int sig)
return(r);
}

/*
* Similar to job_kill, but dedicated to SIGHUP handling when session is
* being disconnected.
*/
int job_hup(struct process *pw, int sig)
{
struct process *px;
pid_t pid;
int r;
NOT_USED(sig);
if(pw->p_pgrp == 0 || (pw->p_flag & P_DISOWN))
return(0);
job_lock();
if(pw->p_pgrp != 0)
{
int palive = 0;
for(px = pw; px != NULL; px = px->p_nxtproc)
{
if((px->p_flag & P_DONE) == 0)
{
palive = 1;
break;
}
}
/*
* If all the processes have died, there is no guarantee that
* p_pgrp is still the valid process group that we made, i.e.,
* the PID may have been recycled and the same p_pgrp may have
* been assigned to unrelated processes.
*/
if(palive)
{
if(killpg(pw->p_pgrp, SIGHUP) >= 0)
job_unstop(pw);
}
}
for(; pw != NULL && pw->p_pgrp == 0; pw = pw->p_nxtproc)
{
if(pw->p_flag & P_DONE)
continue;
if(kill(pw->p_pid, SIGHUP) >= 0)
(void)kill(pw->p_pid, SIGCONT);
pw = pw->p_nxtproc;
}
job_unlock();
return(0);
}

/*
* Get process structure from first letters of jobname
*
Expand Down

8 comments on commit 62cf88d

@JohnoKing
Copy link

@JohnoKing JohnoKing commented on 62cf88d Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the email on the old mailing list associated with this patch: https://www.mail-archive.com/[email protected]/msg01887.html
I tried using the C program provided in that email but could not replicate the bug on either Linux or Solaris. Is this patch actually necessary?

@McDutchie
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Thanks for the find. I will have to try to test this myself as well. I'm not comfortable reverting the patch until we have identified and understood the commit that made it unnecessary.

@JohnoKing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not comfortable reverting the patch until we have identified and understood the commit that made it unnecessary.

I have tested the reproducer for this patch again and couldn't reproduce the SIGHUP issue on any version of ksh at all, including ksh93u+. Additionally, the issues with the patch pointed out in att#1 are still valid. If we keep this patch then the now unused job_terminate function and the always true if(pw->p_pgrp != 0) check should both be removed.

IMO this patch should be reverted unless the SIGHUP bug can be successfully reproduced.

@McDutchie
Copy link
Author

@McDutchie McDutchie commented on 62cf88d Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just reproduced the bug on a Solaris 11.3 VM with a recent ksh 93u+m with this patch reverted. The steps in the mailing message aren't all that clearly explained. Here's what I did exactly. Note I did all this on the Solaris console, as I disabled the GUI to reduce the load on my host system. I don't know if that makes a difference.

  1. On a Solaris VM, get current master branch. Revert this patch. Compile with CC=gcc, CCFLAGS=-m64 -Os -std=c99 -D_AST_ksh_release -D__EXTENSIONS__
  2. Install this as the system sh and ksh, ensuring it is the default login shell. As root: cat arch/sol.i386/bin/ksh > /bin/ksh93 (this will also catch /bin/sh and others via symlinks and hardlinks; the output redirection method ensures none of those links are broken)
  3. Reboot. Observe Solaris works just fine with our ksh. :)
  4. Compile cpid: gcc -m64 -o cpid cpid.c
  5. We need an ssh session that we can terminate, so let's ssh localhost and enter your password.
  6. Within that ssh session, run a brief background job: sleep 1 & and let it finish. Note down the reported PID. That is the one we will reuse. Let's say 26650.
  7. Switch to a second console terminal (same way as Linux, except it makes you enter your password at each switch – how tedious). In that terminal, run: ./cpid 26650 (the PID from the previous step). Now wait until it says "pid 26650 is ready"; it has now succeeded at re-using that PID, and will just sit there. This process will never voluntarily terminate. If we have the bug, the termination of this process will be the symptom.
  8. Switch back to the first terminal. Forcibly terminate the ssh session by typing, on a new prompt, ~..
  9. Switch over to the second terminal again. Note that cpid has been terminated, reporting: waitpid return 26650, status 0x0001. This is the bug reproduced. (Note that status 0x0001 refers to being killed by signal 1 which is SIGHUP.)

Then I undid the reversion of this patch, redid the whole procedure, and the bug was gone; cpid keeps sitting there in step 9.

I have not tried this on other systems, but it's probably reproducible on others as well.

@McDutchie
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of Siteshwar's review comments on att#1 made sense, though. There is quite a bit of no-op code in that function, presumably because it started out as a copy-paste of another function. I'm fixing that now, which is making sense of why this is in fact needed.

One comment is wrong, though: "We can remove the sig argument if it's dedicated to SIGHUP handling". No, we cannot; this function is called by job_walk() via a pointer so it must accept that argument even though it doesn't use it.

Also, yes, you're right, job_terminate() is now unused and should be removed.

@JohnoKing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the better reproducer instructions. I finally got the bug to occur in my Solaris VM.

@McDutchie
Copy link
Author

@McDutchie McDutchie commented on 62cf88d Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I've now reproduced it on the stock /bin/ksh 93u+ 2012-08-01 on macOS as well. It should be a system-agnostic bug; this POSIXy signal handling stuff works the same everywhere. (And yes, the patch fixes it on that system as well.)

@McDutchie
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and documented in 6d3796c

Please sign in to comment.