
fix: write log shipper in python [MLG-993] #7974

Merged: 1 commit into main on Oct 10, 2023
Conversation

@rb-determined-ai (Member) commented Sep 22, 2023

Instead of orchestrating multiple processes with bash and
confusing/complex process substitution, with all the trapping of signals
and `printf x > ...` nonsense, just write a wrapper process in a real
programming language. This reduces the complexity of our logging
solution, and prevents the out-of-order bugs inherent in having two
separate log shippers, one for stdout and one for stderr.

The new ship_logs.py has the following features:

  • it launches and monitors a child command
  • it shares a log buffer for shipping stdout and stderr
  • it has the same log parsing regexes as enrich_logging.py
  • it converts carriage returns to newlines
  • it forwards signals to its child process
  • it exits after a maximum of DET_LOG_WAIT_TIME or 30 seconds
  • it depends on the python interpreter, but only the interpreter (all
    imports are from the standard library)
  • in the special case that the child process can't be started, it ships
    an explanation of what happened to the master and exits with standard
    bash exit codes
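The core wrapper idea described above (launch a child, forward signals to it, report its exit code) can be reduced to a few lines. This is an illustrative sketch only, not the actual ship_logs.py: `run_wrapped` is a name invented here, and the real log collection, parsing, shipping, and exit-deadline logic are omitted.

```python
import signal
import subprocess


def run_wrapped(cmd):
    """Launch cmd, relay SIGTERM/SIGINT to it, and return its exit code.

    A minimal sketch of the wrapper-process idea; the real ship_logs.py
    additionally collects stdout/stderr into a shared buffer and ships
    them to the master.
    """
    p = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Forward the signal to the child instead of dying ourselves,
        # so the child gets a chance to clean up and set its exit code.
        p.send_signal(signum)

    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)
    return p.wait()
```

Exiting with the child's own exit code is what lets the platform treat the wrapper as transparent.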

Automated Test Plan

  • make sure shipping to a TLS master with an untrusted cert works
  • make sure shipping to a TLS master with a cert not matching its url works
  • make sure signal forwarding works
  • make sure normal exits in the child process work
  • make sure signal exits in the child process work
  • make sure that if the shipper hangs, we stop waiting after DET_LOG_WAIT_TIME seconds
  • make sure ordering is preserved, or at least almost preserved, between stdout and stderr. They can't be perfectly preserved any time that you're collecting them as two separate streams.
  • make sure that failures to start the child process are shipped to the master, even though the log shipping machinery can't be stood up until after the process starts successfully.
  • make sure that there are no dependencies outside of the python standard library
  • make sure proxy cases are handled
  • make sure the bindmount-based debug stream works.

Manual Test plan

  • make sure shipping to a TLS master with a trusted cert works (hard to fake this in CI)

netlify bot commented Sep 22, 2023

Deploy Preview for determined-ui canceled.

Latest commit: 1ab3c4b
Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/6525d997454e530007f42651

@rb-determined-ai rb-determined-ai marked this pull request as draft September 22, 2023 22:14
@rb-determined-ai rb-determined-ai changed the title draft: write log shipper in python fix: write log shipper in python [DET-9867] Sep 27, 2023
@rb-determined-ai rb-determined-ai marked this pull request as ready for review September 27, 2023 22:46
@rb-determined-ai rb-determined-ai requested a review from a team as a code owner October 2, 2023 23:10

metadata["source"] = "task"

# XXX: remove this after ensuring it has been cleaned up master-side
rb-determined-ai (Member Author):

there are a couple of XXXs left in here that I need to remember

@stoksc (Contributor) left a comment

reviewed everything except ship_logs and its test

set -e

if [ -z "$DET_PYTHON_EXECUTABLE" ]; then
export DET_PYTHON_EXECUTABLE="python3"
Contributor:

maybe `type -P "$DET_PYTHON_EXECUTABLE"` and bail with a special exit code? (or whatever `whence -p python` is in bash).

Member Author:

Can you elaborate? I'm not really sure what you mean.

Contributor:

sorry, can we check if python exists here?

Member Author:

We could, but bash already emits a special exit code for command-not-found.

I could make this emit a different exit code in that case, is that what you mean?

Contributor:

That is what I mean. I know it does, this just felt like a very special case of "command not found". What do you think? Not worth it?

Member Author:

personally, I feel "not worth it", but it's not a bad idea. If you want it I will do it.

@rb-determined-ai rb-determined-ai changed the title fix: write log shipper in python [DET-9867] fix: write log shipper in python [MLG-993] Oct 3, 2023
@stoksc (Contributor) left a comment

looks good to backend

master_url = os.environ["DET_MASTER"]
cert_name = os.environ.get("DET_MASTER_CERT_NAME", "")
cert_file = os.environ.get("DET_MASTER_CERT_FILE", "")
# TODO(rb): fix DET_USER_TOKEN to support tokens with lifetimes tied to an allocation, and
Contributor:

Is this TODO worth a ticket?

Member Author:

Personally I don't think so, because I don't see anybody doing it but me... for better or worse.

If you want a ticket I can make one, but you'll have to tell me instead of ask me if I want to lol.

try:
stdout = Collector(p.stdout, "stdout", emit_stdout_logs, metadata, logq, doneq)
stderr = Collector(p.stderr, "stderr", emit_stdout_logs, metadata, logq, doneq)
waiter = threading.Thread(target=lambda: doneq.put(("waiter", p.wait(), None)))
Contributor:

I think there's some cuteness going on here, but I haven't worked out, yet, whether it's the good kind of cuteness or the bad kind. Is there any significance to the second, "blocking" argument in the doneq.put, or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq?

Member Author:

or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq

Yeah that's the whole point. You have to get notified as soon as the process is dead, so you can start a watchdog timer, so you don't wait forever logging to a crashed master or something.

For that to work, the wait has to happen on a separate thread.

I don't see it as cute, more like "the reason that this needs to be a thread in the first place".

Contributor:

The "cuteness" for me revolves around this:

Is there any significance to the second, "blocking" argument in the doneq.put, or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq?

I believe it's there to make the waiter write to the doneq only when cmd is complete, in which case this feels like an abuse of the signature of doneq.put. If that's the case, I think intention is more obvious if the lambda is like p.wait(); doneq.put(...). But this difference isn't something I care much about if instead you leave a comment.

Or if actually you do want p's exit code to be the value for block in doneq.put, then I haven't actually understood how this part of the code works, yet.

Contributor:

Oh... I'm just now seeing the extra parens and that it's passing a 3-tuple to doneq as its item, rather than passing all three of the arguments that Queue.put takes. Ok. Obvious during the get maybe easy to miss when looking at the put.

It's slightly weird that Shipper and main both separately know how to marshal who, what, error. Slight preference here for a dict or a namedtupleish item. It'd be nice to tell the reader what those puts represent.

Names are maybe also a little opaque. What makes error special while exit_code is just a general "what"?

Member Author:

Sure, I used a NamedTuple.

Also, I decided that the inline-lambda was missing the consistent error handling that the other thread classes had, and it wasn't on-par with the "consider every possible exception case, even remote possibilities" attitude of the rest of the code. So I expanded it into a (short) subclass of threading.Thread (thank you for challenging me on that, you were right).

Side-note, and maybe you already knew this since you didn't ask for it, but the annotation I'd like to have (doneq: queue.Queue[DoneMsg]) isn't possible until python 3.9.

Lastly, re error vs exit_code: the answer is that in the context we're writing, a nonzero exit code is not an error. It's a return value from a user-owned script which we must preserve. User code failing is not considered exceptional by the platform, but the Shipper failing to ship, or the Collector failing to collect, or the Waiter failing to wait? Those are all exceptional circumstances and will result in a RuntimeError.

Contributor:

Thanks. I like the DoneMsg. I think it makes the payload significantly clearer.

Also, I buy your reasoning for why error is special and not just another message. Thanks.
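A rough sketch of what the NamedTuple refactor discussed above could look like. This is purely illustrative; the field names are guessed from the "who, what, error" discussion, and the real Waiter subclass in the PR may differ.

```python
import queue
import subprocess
import threading
import typing


class DoneMsg(typing.NamedTuple):
    who: str                               # which component finished ("waiter", "stdout", ...)
    exit_code: typing.Optional[int]        # child's status; nonzero is NOT an error here
    error: typing.Optional[BaseException]  # exceptional failure in the shipper machinery itself


class Waiter(threading.Thread):
    """Wait on the child in a dedicated thread so the main loop learns of
    the exit immediately and can start its watchdog timer."""

    def __init__(self, p: subprocess.Popen, doneq: "queue.Queue") -> None:
        super().__init__()
        self.p = p
        self.doneq = doneq

    def run(self) -> None:
        try:
            self.doneq.put(DoneMsg(who="waiter", exit_code=self.p.wait(), error=None))
        except Exception as e:
            # Mirror the "consider every possible exception case" attitude:
            # even a failure inside wait() gets reported to the main loop.
            self.doneq.put(DoneMsg(who="waiter", exit_code=None, error=e))
```

The payload is now self-describing at both the put and the get, which is the clarity win discussed above.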


# Wait for an event, possibly with a deadline (if the child process already exited).
try:
timeout = None if deadline is None else deadline - time.time()
if timeout is not None and timeout <= 0:
Contributor:

I can't quite work out when this could happen.

Member Author:

If the deadline passed in-between calls to doneq.get(), since the deadline is a fixed point in time that could be passed by any given line of code.

Unfortunately, q.get(timeout=-1) isn't interpreted as "oh, your deadline already passed"; it's a ValueError. And q.get(timeout=0) isn't interpreted as "oh, your deadline is right now" either.

That decision makes sense from the standard library because it's fairly common to see timeout=-1 or timeout=0 as ways to specify "block forever" kind of behavior, especially in C where you might use int or unsigned int for your timeout (no None value available).

But this decision is also annoying because this limitation comes up every time you want this "pop until deadline" kind of behavior.
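The "pop until a fixed deadline" workaround being described can be sketched like this. `get_with_deadline` is an invented name for illustration, not a function from the PR.

```python
import queue
import time
import typing


def get_with_deadline(q: "queue.Queue", deadline: typing.Optional[float]):
    """Like q.get(), but against an absolute deadline (time.time()-based).

    The deadline may already have passed between successive calls, and
    queue.Queue.get rejects negative timeouts with ValueError, so an
    expired deadline is translated into a non-blocking get instead.
    """
    if deadline is None:
        return q.get()  # no deadline: block until an item arrives
    timeout = deadline - time.time()
    if timeout <= 0:
        # Deadline already passed; don't block at all.
        return q.get_nowait()
    return q.get(timeout=timeout)
```

Both `get_nowait` and the timed `get` raise queue.Empty once the deadline has expired, so callers only need to handle one exception either way.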

data = json.dumps(logs).encode("utf8")

# Try to ship for about ten minutes.
backoffs = [0, 1, 5, 10, 15, 15, 15, 15, 15, 15, 15, 60, 60, 60, 60, 60, 60, 60, 60, 60]
Contributor:

It's not weird to me that ship takes backoffs as a parameter (though if you wanted to make it unconfigurable, I think that would be fine, too). But I think the result where ship and ship_special have different backoffs is a surprising result.

Member Author:

I am not going to pretend this is a Great Architecture Decision, rather this was more of a "where do I refactor to maximize DRY" type of decision.

In particular, I thought they should have quite different backoffs, because ship_special happens right after the task was launched, and it indicates nothing actually ran. In my mind, that is the sort of case that is most likely going to be hit when somebody is doing heavy interactive development on a new image, new workflow, etc., and having a 10-minute timeout if somebody e.g. configured an image wrong would be really unfortunate. The normal log shipping code might start failing mid-task if, say, a master upgrade took a few minutes to run a migration. I wanted that timeout to be pretty generous.

And so, starting with that position, having the backoffs get passed into ship made the most sense.
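The shape of that design, where the caller chooses the retry schedule, can be sketched as follows. `ship_with_backoff` is a hypothetical helper for illustration; the real ship() posts a batch of logs to the master.

```python
import time
import typing


def ship_with_backoff(
    send: typing.Callable[[], None],
    backoffs: typing.Iterable[float],
) -> bool:
    """Attempt send() once per entry in backoffs, sleeping first each time.

    Passing the schedule in lets normal shipping retry generously (roughly
    ten minutes with the list above) while the ship_special failure path
    can give up much sooner.  A leading 0 in the schedule makes the first
    attempt immediate.
    """
    for delay in backoffs:
        time.sleep(delay)
        try:
            send()
            return True
        except Exception:
            continue  # transient failure; try again after the next delay
    return False
```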

@wes-turner (Contributor) left a comment

LGTM

@rb-determined-ai rb-determined-ai enabled auto-merge (squash) October 10, 2023 23:09
@rb-determined-ai rb-determined-ai merged commit 3b42b91 into main Oct 10, 2023
68 of 83 checks passed
@rb-determined-ai rb-determined-ai deleted the rb/pylogs branch October 10, 2023 23:37
@jerryharrow (Contributor) commented

This has issues on the -ee side. I have a fix for the compilation error (https://github.com/determined-ai/determined-ee/pull/1085), but no logs are shipped at all, and there are no apparent error messages in the Slurm error or output logs.

@rb-determined-ai (Member Author) commented

There's a hotfix PR.

@dannysauer dannysauer added this to the 0.26.2 milestone Feb 6, 2024