
[core][state] Task log - Improve log tailing from log_client and support tailing from offsets [2/4] #28188

Merged
merged 26 commits into from
May 5, 2023

Conversation

rickyyx
Contributor

@rickyyx rickyyx commented Aug 30, 2022

Why are these changes needed?

With verbose logging, the log file size can grow significantly. This PR prevents gRPC buffer overflow when tailing with a large number of lines specified:

  • Instead of reading the last X lines into memory, it finds the start offset of the last X lines and reads from there.
  • Always streams log data in chunks.
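The chunked-streaming idea can be sketched as follows (a minimal illustration, not the PR's actual implementation; `stream_file_in_chunks` and the 1 MiB chunk size are assumptions):

```python
import io

def stream_file_in_chunks(f: io.BufferedIOBase, start: int, end: int,
                          chunk_size: int = 1 << 20):
    """Yield the bytes in f[start:end) in fixed-size chunks so that no
    single message exceeds the gRPC buffer limit."""
    f.seek(start)
    remaining = end - start
    while remaining > 0:
        data = f.read(min(chunk_size, remaining))
        if not data:
            break  # file was truncated underneath us
        remaining -= len(data)
        yield data
```

Each yielded chunk would map to one message on the streaming gRPC response, so memory use stays bounded regardless of how many lines are requested.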

Related issue number

Closes #27009

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests

@rickyyx
Contributor Author

rickyyx commented Sep 16, 2022

@rkooo567 🏓

@rickyyx
Contributor Author

rickyyx commented Sep 16, 2022

TODO: rerun the release test after review

@rkooo567
Contributor

Let me review this asap

Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
@@ -276,7 +276,7 @@ def write_log(self, log_file_size_byte: int):

time_taken = 0
t_start = time.perf_counter()
-for s in get_log(actor_id=actor._actor_id.hex(), tail=-1):
+for s in get_log(actor_id=actor._actor_id.hex(), tail=1000000000):
I believe this was the original failure case

@rickyyx
Contributor Author

rickyyx commented Feb 25, 2023

It was a bit over-engineered for the original tailing use cases, so I was planning to simplify it. But with the recent requirement of streaming logs from offsets (begin -> end) for task logs, the added complexity is actually needed, so I decided to just revive this PR.

cc @rkooo567

@rickyyx rickyyx changed the title [Core][State Observability] Improve log tailing from log_client [Core][State Observability] Improve log tailing from log_client and support tailing from offsets [2/4] Mar 20, 2023
@rickyyx rickyyx changed the title [Core][State Observability] Improve log tailing from log_client and support tailing from offsets [2/4] [Core][State Observability] Task log - Improve log tailing from log_client and support tailing from offsets [2/4] Mar 20, 2023
@rickyyx rickyyx changed the title [Core][State Observability] Task log - Improve log tailing from log_client and support tailing from offsets [2/4] [core][state] Task log - Improve log tailing from log_client and support tailing from offsets [2/4] Mar 20, 2023

@rkooo567 rkooo567 left a comment


Should I assume that most of the logic has changed when reviewing the PR? It looks like there are a lot of removals.

@rickyyx
Contributor Author

rickyyx commented Mar 21, 2023

Should I assume that most of the logic has changed when reviewing the PR? It looks like there are a lot of removals.

Yes, the end behavior should be the same.
The change is in how we find the last X lines to tail:

  • Before, we read chunk by chunk, collected the chunks into a single string in memory, split it on newlines to find the last X lines, and sent them in a single gRPC message.
  • Now we still read chunk by chunk, but only keep track of offsets (similar to how Linux's tail is implemented) rather than keeping the contents of all chunks in memory. We then find the start and end offsets of the range to stream, and stream the file through the streaming gRPC API.

I added quite a few tests to make sure the offset bookkeeping is correct.
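A minimal sketch of the offset-only backward scan described above (illustrative only; the PR's actual helper is `find_start_offset_last_n_lines_from_offset` and differs in details):

```python
import io

def find_tail_start(f, num_lines, block_size=4096):
    """Return the byte offset where the last `num_lines` lines of `f` begin,
    scanning backwards block by block while keeping only offsets in memory.
    `f` must be opened in binary mode."""
    f.seek(0, io.SEEK_END)
    file_end = f.tell()
    if num_lines <= 0 or file_end == 0:
        return file_end
    pos = file_end
    # A single trailing newline terminates the last line; don't count it.
    f.seek(pos - 1)
    if f.read(1) == b"\n":
        pos -= 1
    lines_needed = num_lines
    while pos > 0:
        read_from = max(0, pos - block_size)
        f.seek(read_from)
        block = f.read(pos - read_from)
        idx = len(block)
        while True:
            idx = block.rfind(b"\n", 0, idx)
            if idx == -1:
                break
            lines_needed -= 1
            if lines_needed == 0:
                # The tail starts right after this newline.
                return read_from + idx + 1
        pos = read_from
    return 0  # the file has fewer than num_lines lines
```

Once the start offset is known, the file can be streamed from there in chunks without ever materializing the tail in memory.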

@rkooo567
Contributor

Btw, is this API sufficient to implement the following logic?

  1. Print the last 1000 lines
  2. Print the previous 1000 lines
  3. Repeat until the beginning of the file

Can you tell me how we can achieve this?

@rickyyx
Contributor Author

rickyyx commented Mar 22, 2023

Btw, is this API sufficient to implement the following logic?

  1. Print the last 1000 lines
  2. Print the previous 1000 lines
  3. Repeat until the beginning of the file

Can you tell me how we can achieve this?

So, for pagination by lines, we will probably need some code (another PR on top of this) to translate a line count into specific offsets. I think the frontend needs to keep track of the line count. An example API could be:

  1. StreamLog (REST equivalent): start_lines=-0, end_lines=-1000
  2. StreamLog: start_lines=-1000, end_lines=-2000
    ...

This will probably require finding the offsets corresponding to start_lines each time. Alternatively, we could return the offsets together with the file content:

  1. PrepStreamLog: start_offset=-0, lines=1000 -> end_offset
  2. StreamLog: start_offset=-0, end_offset=<end_offset>

@rkooo567
Contributor

Hmm, maybe the -1000 approach could return duplicated logs depending on the situation? Maybe we can make the API return the absolute offset, so we can do

start = prev_absolute_offset - 1000, end = prev_absolute_offset

But it makes sense that we can build on top of this as a follow-up!
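The absolute-offset pagination idea could look roughly like this (a sketch under assumptions: `fetch_page` is a hypothetical API, not one this PR adds, that returns a page's content together with its absolute start offset):

```python
def paginate_backwards(fetch_page, file_size, page_lines=1000):
    """Yield pages of a log from the end towards the beginning with no
    duplicates: each request ends exactly where the previous page started.
    `fetch_page(end_offset, lines)` is a hypothetical call returning
    (content, absolute_start_offset) for the `lines` lines that end at
    `end_offset`."""
    end = file_size
    while end > 0:
        content, start = fetch_page(end, page_lines)
        yield content
        if start >= end:
            break  # no progress; avoid looping forever on a malformed page
        end = start
```

Because each page's end offset is the previous page's absolute start offset, no line is ever fetched twice, which is the duplication concern raised above.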

def tail(f: io.TextIOBase, lines: int):
"""Tails the given file (in 'rb' mode)
# Default stream entire file
start_offset = 0
This should be configurable if we want to make it work with task log right?

Yeah, will add fields when supporting it.

dashboard/modules/log/log_agent.py (resolved thread)
Return:
Async generator of StreamReply
"""
assert "b" in file.mode, "Only binary file is supported."
Is this correct? What are the other file formats? (Is a regular log also a binary file?)

We opened it in binary mode for offset bookkeeping. Added comments.

cur_offset = start_offset

# Until gRPC is done
while not context.done():
sorry I forgot the semantics lol... When is the context set to be "done"?


I guess it just means the streaming context has not been closed.

dashboard/modules/log/log_agent.py (outdated; resolved thread)
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 23, 2023
@rickyyx
Contributor Author

rickyyx commented Apr 26, 2023

  • Refactored find_start_offset_last_n_lines_from_offset to be simpler.
  • Added find_end_offset_next_n_lines_from_offset to support pagination in the future.
  • Added better testing.

@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 26, 2023

@rkooo567 rkooo567 left a comment


Generally LGTM! I will look into it one more time tomorrow

return end


def find_end_offset_next_n_lines_from_offset(

is it test only?


Not used for now, but I suppose it will be used to support pagination.

old_pos = file.tell()  # store the current position
file.seek(0, io.SEEK_END)  # move the file pointer to the end of the file
end = file.tell()  # end-of-file offset
file.seek(old_pos, io.SEEK_SET)  # restore the original position
consider making it a context manager?


To guarantee the API semantics.


wdym here?


Hmm, I thought we could make it seek back if an exception occurs. But I think it should be fine in this case, because if this raises an exception the whole operation will just fail.
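The context-manager suggestion could look like this (a sketch; `preserve_position` and `file_size` are illustrative names, not helpers the PR adds):

```python
import contextlib
import io

@contextlib.contextmanager
def preserve_position(f):
    """Save the file's current offset and restore it on exit, even if the
    body raises, so callers always get back the offset they started with."""
    old_pos = f.tell()
    try:
        yield f
    finally:
        f.seek(old_pos, io.SEEK_SET)

def file_size(f):
    """Example user: find the end-of-file offset without disturbing the
    caller's position."""
    with preserve_position(f):
        f.seek(0, io.SEEK_END)
        return f.tell()
```

The `finally` block is what gives the seek-back-on-exception behavior discussed above.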

dashboard/modules/log/log_agent.py (outdated; resolved thread)
# in the block.
# Use `split` here to split away the leading lines that do not belong in the tail
lines = block_data.split(b"\n", num_lines - lines_more)
# len(lines[0]) + 1 accounts for the newline character consumed by the split

Hmm, not sure I understood this comment. Also, isn't this supposed to be sum(len(line) for line in lines)?


Oh, since it's tailing, only the last lines in the block need to be accounted for in nbytes_from_end.

line 1 \n line 2 \n line 3 \n line 4\n

If lines_more = 1, only line 4 needs to be included in the tail, so we split with num_lines - lines_more = 3 as the maxsplit, producing 4 parts:

  1. line 1
  2. line 2
  3. line 3
  4. line 4

and we include only line 4's length in the tail result.
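The maxsplit arithmetic above can be checked with a tiny example (values are illustrative, not the PR's actual variables):

```python
block_data = b"line 1\nline 2\nline 3\nline 4\n"
num_lines, lines_more = 4, 1

# Split at most num_lines - lines_more = 3 times: the first three elements
# are the lines we do NOT need, and the final element holds everything
# after the third newline, i.e. the bytes that belong in the tail.
parts = block_data.split(b"\n", num_lines - lines_more)
tail_bytes = len(parts[-1])  # bytes of this block that count towards the tail
```

Note that the final element keeps its trailing newline, since `split` with a maxsplit leaves the remainder of the data untouched.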

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 26, 2023
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 3, 2023
@rkooo567 rkooo567 merged commit 0a15649 into ray-project:master May 5, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…ort tailing from offsets [2/4] (ray-project#28188)

Successfully merging this pull request may close these issues.

[Core][State Observability][Log] gRPC max size limit when ray logs --tail 1000000