fix: write log shipper in python [MLG-993] #7974
Conversation
master/static/srv/ship_logs.py (Outdated)

    metadata["source"] = "task"

    # XXX: remove this after ensuring it has been cleaned up master-side
there are a couple of XXXs left in here that I need to remember
reviewed everything except ship_logs and its test
master/static/srv/ship_logs.sh (Outdated)

    set -e

    if [ -z "$DET_PYTHON_EXECUTABLE" ]; then
        export DET_PYTHON_EXECUTABLE="python3"
maybe `type -P "$DET_PYTHON_EXECUTABLE"` and bail with a special exit code? (or whatever `whence -p python` is in bash)
Can you elaborate? I'm not really sure what you mean.
sorry, can we check if python exists here?
We could, but bash already emits a special exit code for command-not-found.
I could make this emit a different exit code in that case; is that what you mean?
That is what I mean. I know it does, this just felt like a very special case of "command not found". What do you think? Not worth it?
personally, I feel "not worth it", but it's not a bad idea. If you want it I will do it.
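For illustration, the "special exit code" idea could be handled on the Python side too. This is a hedged sketch, not the PR's actual code: `launch_child` is a hypothetical helper that maps the two startup failures to the exit codes bash itself uses (127 for "command not found", 126 for "found but not executable"):

```python
import subprocess
import sys

def launch_child(cmd: list) -> subprocess.Popen:
    """Start the wrapped command, mapping startup failures to shell-style exit codes."""
    try:
        return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError:
        # bash uses 127 for "command not found"
        sys.exit(127)
    except PermissionError:
        # bash uses 126 for "found but not executable"
        sys.exit(126)
```

This keeps the exit codes familiar to anyone used to reading shell exit statuses, without inventing a new convention.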
looks good to backend
    master_url = os.environ["DET_MASTER"]
    cert_name = os.environ.get("DET_MASTER_CERT_NAME", "")
    cert_file = os.environ.get("DET_MASTER_CERT_FILE", "")
    # TODO(rb): fix DET_USER_TOKEN to support tokens with lifetimes tied to an allocation, and
Is this TODO worth a ticket?
Personally, I don't think so, because I don't see anybody doing it but me... for better or worse.
If you want a ticket I can make one, but you'll have to tell me instead of asking me whether I want one, lol.
master/static/srv/ship_logs.py (Outdated)

    try:
        stdout = Collector(p.stdout, "stdout", emit_stdout_logs, metadata, logq, doneq)
        stderr = Collector(p.stderr, "stderr", emit_stdout_logs, metadata, logq, doneq)
        waiter = threading.Thread(target=lambda: doneq.put(("waiter", p.wait(), None)))
I think there's some cuteness going on here, but I haven't worked out, yet, whether it's the good kind of cuteness or the bad kind. Is there any significance to the second, "blocking" argument in the `doneq.put`, or is it there solely to make the waiter thread wait until the `cmd` subprocess is done before it publishes to `doneq`?
> or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq
Yeah that's the whole point. You have to get notified as soon as the process is dead, so you can start a watchdog timer, so you don't wait forever logging to a crashed master or something.
For that to work, the wait has to happen on a separate thread.
I don't see it as cute, more like "the reason that this needs to be a thread in the first place".
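The waiter arrangement being discussed can be reconstructed minimally like this (a hypothetical sketch; the child command here is a stand-in). The lambda blocks in `p.wait()`, so the `doneq.put` fires only once the child is dead, which is exactly the notification needed to start the watchdog timer:

```python
import queue
import subprocess
import sys
import threading

doneq: queue.Queue = queue.Queue()
# Stand-in child process that exits with code 3.
p = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])

# The put() runs only after p.wait() returns, i.e. as soon as the child dies.
waiter = threading.Thread(target=lambda: doneq.put(("waiter", p.wait(), None)))
waiter.start()

who, exit_code, error = doneq.get()  # unblocks the moment the child exits
waiter.join()
```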
The "cuteness" for me revolves around this:

> Is there any significance to the second, "blocking" argument in the doneq.put, or is it there solely to make the waiter thread wait until the cmd subprocess is done before it publishes to doneq?

I believe it's there to make the `waiter` write to the `doneq` only when `cmd` is complete, in which case this feels like an abuse of the signature of `doneq.put`. If that's the case, I think the intention is more obvious if the lambda is like `p.wait(); doneq.put(...)`. But this difference isn't something I care much about if instead you leave a comment.

Or if actually you do want `p`'s exit code to be the value for `block` in `doneq.put`, then I haven't actually understood how this part of the code works, yet.
Oh... I'm just now seeing the extra parens and that it's passing a 3-tuple to `doneq` as its item, rather than passing all three of the arguments that `Queue.put` takes. Ok. Obvious during the `get`, maybe easy to miss when looking at the `put`.

It's slightly weird that `Shipper` and `main` both separately know how to marshal `who, what, error`. Slight preference here for a dict or a `namedtuple`-ish item. It'd be nice to tell the reader what those `put`s represent.

Names are maybe also a little opaque. What makes `error` special while `exit_code` is just a general "what"?
Sure, I used a NamedTuple.

Also, I decided that the inline lambda was missing the consistent error handling that the other thread classes had, and it wasn't on par with the "consider every possible exception case, even remote possibilities" attitude of the rest of the code. So I expanded it into a (short) subclass of threading.Thread (thank you for challenging me on that, you were right).

Side note, and maybe you already knew this since you didn't ask for it, but the annotation I'd like to have (`doneq: queue.Queue[DoneMsg]`) isn't possible until Python 3.9.

Lastly, re `error` vs `exit_code`: the answer is that in the context we're writing, a nonzero exit code is not an error. It's a return value from a user-owned script, which we must preserve. User code failing is not considered exceptional by the platform, but the Shipper failing to ship, or the Collector failing to collect, or the Waiter failing to wait? Those are all exceptional circumstances and will result in a RuntimeError.
Thanks. I like the `DoneMsg`. I think it makes the payload significantly clearer.

Also, I buy your reasoning for why `error` is special and not just another message. Thanks.
    # Wait for an event, possibly with a deadline (if the child process already exited).
    try:
        timeout = None if deadline is None else deadline - time.time()
        if timeout is not None and timeout <= 0:
I can't quite work out when this could happen.
If the deadline passed in between calls to `doneq.get()`, since the deadline is a fixed point in time that could be passed by any given line of code.

Unfortunately, `q.get(timeout=-1)` isn't interpreted as "oh, your deadline already passed"; it's a ValueError. So an already-expired deadline has to be special-cased before it ever reaches `q.get()`.

That decision makes sense from the standard library's point of view, because it's fairly common to see `timeout=-1` or `timeout=0` as ways to specify "block forever" behavior, especially in C, where you might use an `int` or `unsigned int` for your timeout (no `None` value available).

But this decision is also annoying, because this limitation comes up every time you want this "pop until deadline" kind of behavior.
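The "pop until deadline" pattern being described could be factored into a helper roughly like this (hypothetical name and shape, not the PR's code). It special-cases the already-expired deadline so a negative timeout never reaches `q.get()`:

```python
import queue
import time
from typing import Any, Optional

def get_until_deadline(q: queue.Queue, deadline: Optional[float]) -> Any:
    """Block on q.get(), but never past a fixed point in time."""
    if deadline is None:
        return q.get()  # no deadline: block forever
    timeout = deadline - time.time()
    if timeout <= 0:
        # The deadline passed between calls; a negative timeout would be a
        # ValueError, so do a non-blocking check instead.
        return q.get_nowait()  # raises queue.Empty if nothing is ready
    return q.get(timeout=timeout)
```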
    data = json.dumps(logs).encode("utf8")

    # Try to ship for about ten minutes.
    backoffs = [0, 1, 5, 10, 15, 15, 15, 15, 15, 15, 15, 60, 60, 60, 60, 60, 60, 60, 60, 60]
It's not weird to me that `ship` takes backoffs as a parameter (though if you wanted to make it unconfigurable, I think that would be fine, too). But I think the result where `ship` and `ship_special` have different backoffs is a surprising result.
I am not going to pretend this is a Great Architecture Decision; rather, this was more of a "where do I refactor to maximize DRY" type of decision.

In particular, I thought they should have quite different backoffs, because ship_special happens right after the task was launched, and it indicates nothing actually ran. In my mind, that is the sort of case most likely to be hit when somebody is doing heavy interactive development on a new image, new workflow, etc., and having a 10-minute timeout if somebody e.g. configured an image wrong would be really unfortunate. The normal log-shipping code might start failing mid-task if, say, a master upgrade took a few minutes to run a migration. I wanted that timeout to be pretty generous.

And so, starting with that position, having the backoffs get passed into `ship` made the most sense.
LGTM
Instead of orchestrating multiple processes with bash and confusing/complex process substitution, with all the trapping of signals and `printf x > ...` nonsense, just write a wrapper process in a real programming language. This reduces the complexity of our logging solution, and prevents the out-of-order bugs inherent in having two separate log shippers, one for stdout and one for stderr.

The new ship_logs.py has the following features:
- it launches and monitors a child command
- it shares a log buffer for shipping stdout and stderr
- it has the same log parsing regexes as enrich_logging.py
- it converts carriage returns to newlines
- it forwards signals to its child process
- it exits after a maximum of DET_LOG_WAIT_TIME or 30 seconds
- it depends on the python interpreter, but only the interpreter (all imports are from the standard library)
- in the special case that the child process can't be started, it ships an explanation of what happened to the master and exits with standard bash exit codes
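The signal-forwarding behavior mentioned in the commit message can be sketched minimally like this (POSIX assumed; a hypothetical illustration, not the PR's code). The wrapper installs handlers that re-send any signal it receives to its child:

```python
import os
import signal
import subprocess
import sys

# Stand-in child that just sleeps.
p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

def forward(signum, frame):
    # Re-send whatever signal the wrapper received to its child.
    p.send_signal(signum)

for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, forward)

# Simulate the wrapper receiving SIGTERM: it is forwarded to the child.
os.kill(os.getpid(), signal.SIGTERM)
exit_code = p.wait()  # on POSIX, a negative signal number
```

This is the piece that replaces the bash `trap`-based forwarding the old script needed.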
This has issues on the -ee side. I have a fix for the compilation error (https://github.com/determined-ai/determined-ee/pull/1085), but no logs are shipped at all, and there are no apparent error messages in the Slurm error or output logs.

there's a hotfix PR