
fix: write log shipper in python [MLG-993] #7974

Merged: 1 commit into main on Oct 10, 2023
Conversation

@rb-determined-ai (Member) commented Sep 22, 2023

Instead of orchestrating multiple processes with bash and
confusing/complex process substitution, with all the trapping of signals
and `printf x > ...` nonsense, just write a wrapper process in a real
programming language. This reduces the complexity of our logging
solution, and prevents the out-of-order bugs inherent in having two
separate log shippers, one for stdout and one for stderr.

The new ship_logs.py has the following features:

  • it launches and monitors a child command
  • it shares a log buffer for shipping stdout and stderr
  • it has the same log parsing regexes as enrich_logging.py
  • it converts carriage returns to newlines
  • it forwards signals to its child process
  • it exits after a maximum of DET_LOG_WAIT_TIME or 30 seconds
  • it depends on the python interpreter, but only the interpreter (all
    imports are from the standard library)
  • in the special case that the child process can't be started, it ships
    an explanation of what happened to the master and exits with standard
    bash exit codes
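The core wrapper idea described above (launch a child, forward signals to it, report its exit code) can be reduced to a few lines. This is an illustrative sketch only, not the actual ship_logs.py: `run_wrapped` is a name invented here, and the real log collection, parsing, shipping, and exit-deadline logic are omitted.

```python
import signal
import subprocess


def run_wrapped(cmd):
    """Launch cmd, relay SIGTERM/SIGINT to it, and return its exit code.

    A minimal sketch of the wrapper-process idea; the real ship_logs.py
    additionally collects stdout/stderr into a shared buffer and ships
    them to the master.
    """
    p = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Forward the signal to the child instead of dying ourselves,
        # so the child gets a chance to clean up and set its exit code.
        p.send_signal(signum)

    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)
    return p.wait()
```

Exiting with the child's own exit code is what lets the platform treat the wrapper as transparent.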

Automated Test Plan

  • make sure shipping to a TLS master with an untrusted cert works
  • make sure shipping to a TLS master with a cert not matching its url works
  • make sure signal forwarding works
  • make sure normal exits in the child process work
  • make sure signal exits in the child process work
  • make sure that if the shipper hangs, we stop waiting after DET_LOG_WAIT_TIME seconds
  • make sure ordering is preserved, or at least almost preserved, between stdout and stderr. They can't be perfectly preserved any time that you're collecting them as two separate streams.
  • make sure that failures to start the child process are shipped to the master, even though the log shipping machinery can't be stood up until after the process starts successfully.
  • make sure that there are no dependencies outside of the python standard library
  • make sure proxy cases are handled
  • make sure the bindmount-based debug stream works.

Manual Test plan

  • make sure shipping to a TLS master with a trusted cert works (hard to fake this in CI)

netlify bot commented Sep 22, 2023

Deploy Preview for determined-ui canceled.

Latest commit: 1ab3c4b
Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/6525d997454e530007f42651

@rb-determined-ai rb-determined-ai marked this pull request as draft September 22, 2023 22:14
@rb-determined-ai rb-determined-ai changed the title draft: write log shipper in python fix: write log shipper in python [DET-9867] Sep 27, 2023
@rb-determined-ai rb-determined-ai marked this pull request as ready for review September 27, 2023 22:46
@rb-determined-ai rb-determined-ai requested a review from a team as a code owner October 2, 2023 23:10

metadata["source"] = "task"

# XXX: remove this after ensuring it has been cleaned up master-side
rb-determined-ai (Member Author):

there are a couple of XXXs left in here that I need to remember

@stoksc (Contributor) left a comment

reviewed everything except ship_logs and its test

set -e

if [ -z "$DET_PYTHON_EXECUTABLE" ]; then
export DET_PYTHON_EXECUTABLE="python3"
Contributor:

maybe `type -P "$DET_PYTHON_EXECUTABLE"` and bail with a special exit code? (or whatever `whence -p python` is in bash).

Member Author:

Can you elaborate? I'm not really sure what you mean.

Contributor:

sorry, can we check if python exists here?

Member Author:

We could, but bash already emits a special exit code for command-not-found.

I could make this emit a different exit code in that case, is that what you mean?

Contributor:

That is what I mean. I know it does, this just felt like a very special case of "command not found". What do you think? Not worth it?

Member Author:

personally, I feel "not worth it", but it's not a bad idea. If you want it I will do it.

@rb-determined-ai rb-determined-ai changed the title fix: write log shipper in python [DET-9867] fix: write log shipper in python [MLG-993] Oct 3, 2023
@stoksc (Contributor) left a comment

looks good to backend

master_url = os.environ["DET_MASTER"]
cert_name = os.environ.get("DET_MASTER_CERT_NAME", "")
cert_file = os.environ.get("DET_MASTER_CERT_FILE", "")
# TODO(rb): fix DET_USER_TOKEN to support tokens with lifetimes tied to an allocation, and
Contributor:

Is this TODO worth a ticket?

Member Author:

Personally I don't think so, because I don't see anybody doing it but me... for better or worse.

If you want a ticket I can make one, but you'll have to tell me instead of ask me if I want to lol.

try:
stdout = Collector(p.stdout, "stdout", emit_stdout_logs, metadata, logq, doneq)
stderr = Collector(p.stderr, "stderr", emit_stdout_logs, metadata, logq, doneq)
waiter = threading.Thread(target=lambda: doneq.put(("waiter", p.wait(), None)))
Contributor:

I think there's some cuteness going on here, but I haven't worked out, yet, whether it's the good kind of cuteness or the bad kind. Is there any significance to the second, "blocking" argument in the doneq.put, or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq?

Member Author:

or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq

Yeah that's the whole point. You have to get notified as soon as the process is dead, so you can start a watchdog timer, so you don't wait forever logging to a crashed master or something.

For that to work, the wait has to happen on a separate thread.

I don't see it as cute, more like "the reason that this needs to be a thread in the first place".

Contributor:

The "cuteness" for me revolves around this:

Is there any significance to the second, "blocking" argument in the doneq.put, or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq?

I believe it's there to make the waiter write to the doneq only when cmd is complete, in which case this feels like an abuse of the signature of doneq.put. If that's the case, I think intention is more obvious if the lambda is like p.wait(); doneq.put(...). But this difference isn't something I care much about if instead you leave a comment.

Or if actually you do want p's exit code to be the value for block in doneq.put, then I haven't actually understood how this part of the code works, yet.

Contributor:

Oh... I'm just now seeing the extra parens and that it's passing a 3-tuple to doneq as its item, rather than passing all three of the arguments that Queue.put takes. Ok. Obvious during the get maybe easy to miss when looking at the put.

It's slightly weird that Shipper and main both separately know how to marshal who, what, error. Slight preference here for a dict or a namedtupleish item. It'd be nice to tell the reader what those puts represent.

Names are maybe also a little opaque. What makes error special while exit_code is just a general "what"?

Member Author:

Sure, I used a NamedTuple.

Also, I decided that the inline-lambda was missing the consistent error handling that the other thread classes had, and it wasn't on-par with the "consider every possible exception case, even remote possibilities" attitude of the rest of the code. So I expanded it into a (short) subclass of threading.Thread (thank you for challenging me on that, you were right).

Side-note, and maybe you already knew this since you didn't ask for it, but the annotation I'd like to have (doneq: queue.Queue[DoneMsg]) isn't possible until python 3.9.

Lastly, re error vs exit_code: the answer is that in the context we're writing, a nonzero exit code is not an error. It's a return value from a user-owned script which we must preserve. User code failing is not considered exceptional by the platform, but the Shipper failing to ship, or the Collector failing to collect, or the Waiter failing to wait? Those are all exceptional circumstances and will result in a RuntimeError.

Contributor:

Thanks. I like the DoneMsg. I think it makes the payload significantly clearer.

Also, I buy your reasoning for why error is special and not just another message. Thanks.
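A rough sketch of what the NamedTuple refactor discussed above could look like. This is purely illustrative; the field names are guessed from the "who, what, error" discussion, and the real Waiter subclass in the PR may differ.

```python
import queue
import subprocess
import threading
import typing


class DoneMsg(typing.NamedTuple):
    who: str                               # which component finished ("waiter", "stdout", ...)
    exit_code: typing.Optional[int]        # child's status; nonzero is NOT an error here
    error: typing.Optional[BaseException]  # exceptional failure in the shipper machinery itself


class Waiter(threading.Thread):
    """Wait on the child in a dedicated thread so the main loop learns of
    the exit immediately and can start its watchdog timer."""

    def __init__(self, p: subprocess.Popen, doneq: "queue.Queue") -> None:
        super().__init__()
        self.p = p
        self.doneq = doneq

    def run(self) -> None:
        try:
            self.doneq.put(DoneMsg(who="waiter", exit_code=self.p.wait(), error=None))
        except Exception as e:
            # Mirror the "consider every possible exception case" attitude:
            # even a failure inside wait() gets reported to the main loop.
            self.doneq.put(DoneMsg(who="waiter", exit_code=None, error=e))
```

The payload is now self-describing at both the put and the get, which is the clarity win discussed above.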


# Wait for an event, possibly with a deadline (if the child process already exited).
try:
timeout = None if deadline is None else deadline - time.time()
if timeout is not None and timeout <= 0:
Contributor:

I can't quite work out when this could happen.

Member Author:

If the deadline passed in-between calls to doneq.get(), since the deadline is a fixed point in time that could be passed by any given line of code.

Unfortunately, q.get(timeout=-1) isn't interpreted as "oh, your deadline already passed"; it's a ValueError. And q.get(timeout=0) isn't interpreted as "oh, your deadline is right now" either.

That decision makes sense from the standard library because it's fairly common to see timeout=-1 or timeout=0 as ways to specify "block forever" kind of behavior, especially in C where you might use int or unsigned int for your timeout (no None value available).

But this decision is also annoying because this limitation comes up every time you want this "pop until deadline" kind of behavior.
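The "pop until a fixed deadline" workaround being described can be sketched like this. `get_with_deadline` is an invented name for illustration, not a function from the PR.

```python
import queue
import time
import typing


def get_with_deadline(q: "queue.Queue", deadline: typing.Optional[float]):
    """Like q.get(), but against an absolute deadline (time.time()-based).

    The deadline may already have passed between successive calls, and
    queue.Queue.get rejects negative timeouts with ValueError, so an
    expired deadline is translated into a non-blocking get instead.
    """
    if deadline is None:
        return q.get()  # no deadline: block until an item arrives
    timeout = deadline - time.time()
    if timeout <= 0:
        # Deadline already passed; don't block at all.
        return q.get_nowait()
    return q.get(timeout=timeout)
```

Both `get_nowait` and the timed `get` raise queue.Empty once the deadline has expired, so callers only need to handle one exception either way.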

data = json.dumps(logs).encode("utf8")

# Try to ship for about ten minutes.
backoffs = [0, 1, 5, 10, 15, 15, 15, 15, 15, 15, 15, 60, 60, 60, 60, 60, 60, 60, 60, 60]
Contributor:

It's not weird to me that ship takes backoffs as a parameter (though if you wanted to make it unconfigurable, I think that would be fine, too). But I think the result where ship and ship_special have different backoffs is a surprising result.

Member Author:

I am not going to pretend this is a Great Architecture Decision, rather this was more of a "where do I refactor to maximize DRY" type of decision.

In particular, I thought they should have quite different backoffs, because ship_special happens right after the task was launched, and it indicates nothing actually ran. In my mind, that is the sort of case that is most likely going to be hit when somebody is doing heavy interactive development on a new image, new workflow, etc., and having a 10-minute timeout if somebody e.g. configured an image wrong would be really unfortunate. The normal log shipping code might start failing mid-task if, say, a master upgrade took a few minutes to run a migration. I wanted that timeout to be pretty generous.

And so, starting with that position, having the backoffs get passed into ship made the most sense.
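The shape of that design, where the caller chooses the retry schedule, can be sketched as follows. `ship_with_backoff` is a hypothetical helper for illustration; the real ship() posts a batch of logs to the master.

```python
import time
import typing


def ship_with_backoff(
    send: typing.Callable[[], None],
    backoffs: typing.Iterable[float],
) -> bool:
    """Attempt send() once per entry in backoffs, sleeping first each time.

    Passing the schedule in lets normal shipping retry generously (roughly
    ten minutes with the list above) while the ship_special failure path
    can give up much sooner.  A leading 0 in the schedule makes the first
    attempt immediate.
    """
    for delay in backoffs:
        time.sleep(delay)
        try:
            send()
            return True
        except Exception:
            continue  # transient failure; try again after the next delay
    return False
```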

@wes-turner (Contributor) left a comment

LGTM

@rb-determined-ai rb-determined-ai enabled auto-merge (squash) October 10, 2023 23:09
@rb-determined-ai rb-determined-ai merged commit 3b42b91 into main Oct 10, 2023
68 of 83 checks passed
@rb-determined-ai rb-determined-ai deleted the rb/pylogs branch October 10, 2023 23:37
@jerryharrow (Contributor) commented

This has issues on the -ee side. I have a fix for the compilation error (https://github.com/determined-ai/determined-ee/pull/1085), but no logs are shipped at all, and there are no apparent error messages in the Slurm error or output logs.

@rb-determined-ai (Member Author) commented

There's a hotfix PR.

@dannysauer dannysauer added this to the 0.26.2 milestone Feb 6, 2024