-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] Basic structured logging #47210
[data] Basic structured logging #47210
Conversation
18a921c
to
7c6313d
Compare
@@ -0,0 +1,34 @@ | |||
version: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this yaml file format? Is this somethign specific to ray data? is there documentation about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The YAML file format is specific to python (more docs here). We just directly load the yaml with something like
with open(config_path) as file:
config = yaml.safe_load(file)
logging.config.dictConfig(config)
Probably could do a better job explaining that this could be used in the brief logging docs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment in this file explaining the origin/format of this YAML file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No blocker for me, but we should probably add some tests to ensure the behavior, and follow up with product and doc changes.
Also, maybe out of scope, but right now everything is configured through env var. Does it make sense to create a logging config for users to do other configurations such as log level or log file locations?
Users could create their own |
Yep, understand there are essentially unlimited degrees of freedom for users to define their own yaml, but then they would have to specify everything on their own right. Just wondering if here are something in between if user just need a small tweak. But totally understand if there are no such usecase for ray data, no need to over engineer this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I think to keep things simple will leave as is. My understanding is that the vast majority of users will use the default settings (maybe not even JSON file logging) and those who do need something custom will likely have the expertise to set things up in whatever way they desire. Can revisit in the future if this becomes a pain point. |
python/ray/data/_internal/logging.py
Outdated
os.path.join(os.path.dirname(__file__), "logging_json.yaml") | ||
) | ||
|
||
# Environment variable to specify the encoding of the log messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also comment what options are available?
python/ray/data/_internal/logging.py
Outdated
environment variable. If the variable isn't set, this function loads the | ||
"logging.yaml" file that is adjacent to this module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring is slightly out-of-date with the code
@@ -0,0 +1,34 @@ | |||
version: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment in this file explaining the origin/format of this YAML file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @omatthew98, based on the logs in the PR description, it seems that the core context is not being added to the logs in ray-data.log. The main use case for structured logging is to enable users to associate Job ID, Actor ID, and Task ID with the logs.
cc @omatthew98 I also met this problem when I was doing the train structured logging. I think the fix is to use |
|
||
filters: | ||
console_filter: | ||
(): ray.data._internal.logging.HiddenRecordFilter | ||
core_context_filter: | ||
class: ray._private.ray_logging.filters.CoreContextFilter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(): ray._private.ray_logging.filters.CoreContextFilter
will enable this core context filter, I think.
@kevin85421 I updated this PR with the suggestion from @hongpeng-guo and it now seems to be including those tags. The example above should be updated if you want to confirm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
class: ray.data._internal.logging.SessionFileHandler | ||
formatter: ray_json | ||
filename: ray-data.log | ||
filters: [core_context_filter] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to the core context filter for the default (text) log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. The core context filter doesn't actually filter anything out, it just adds metadata to the logs but for text logs this is not shown.
python/ray/data/_internal/logging.py
Outdated
If "RAY_DATA_LOG_ENCODING" is specified as "JSON" we will enable JSON reading mode | ||
if using the default logging config. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about what JSON reading mode refers to -- does this mean JSON log format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, it should be JSON logging mode not JSON reading.
python/ray/data/_internal/logging.py
Outdated
# Env. variable to specify the encoding of the file logs when using the default config. | ||
RAY_DATA_LOG_ENCODING = "RAY_DATA_LOG_ENCODING" | ||
|
||
# Env. variable to specify the logging config path use defaults if not set | ||
RAY_DATA_LOGGING_CONFIG_PATH = "RAY_DATA_LOGGING_CONFIG_PATH" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe rename to make it clear these refer to environment variable names and not the values? I'd think RAY_DATA_LOGGING_CONFIG_PATH
refers to the actual path and not the name of an environment variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to suffix both of these with _ENV_VAR_NAME
to be super explicit.
python/ray/data/_internal/logging.py
Outdated
# After configuring logger, warn if RAY_DATA_LOGGING_CONFIG_PATH is used with | ||
# RAY_DATA_LOG_ENCODING, because they are not both supported together. | ||
if config_path is not None and log_encoding is not None: | ||
logger = logging.getLogger("ray.data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: What's the motivation for using the parent logger rather than this module's logger? Feel like it's more conventional to do logging.getLogger(__name__)
unless there's a strong reason to do otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think there is a solid reason to use the parent logger, will switch to the module logger.
python/ray/data/_internal/logging.py
Outdated
logger.warning( | ||
"Using `RAY_DATA_LOG_ENCODING` is not supported with " | ||
+ "`RAY_DATA_LOGGING_CONFIG_PATH`" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: If these aren't supported, should we just error? Feel like it might be confusing if allow unsupported combinations (e.g., what's the behavior if you do this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they specify RAY_DATA_LOGGING_CONFIG
, then this will be used exactly as the logging configuration. If they try to use RAY_DATA_LOG_ENCODING
alongside this, this env var will essentially be ignored. My thought was if the user is using a custom configuration that should be the expected behavior. This warning is meant to tell them that this env var doesn't magically make JSON logging work, but they should still get logs exactly how they have them configured in the custom config
python/ray/data/_internal/logging.py
Outdated
# Dynamically load env vars | ||
config_path = os.environ.get("RAY_DATA_LOGGING_CONFIG_PATH") | ||
log_encoding = os.environ.get("RAY_DATA_LOG_ENCODING") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding, what happens if we don't load these dynamically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main difference would be if you were editing the environment variables and then recalled configure_logging
, this would pick up the changes vs. loading them once on module initialization would not. In our tests we are doing something similar where we modify the variables and then call configure_logging
so this works better with that, but in general this will just ensure the most recent environment variable values are used.
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Co-authored-by: Scott Lee <[email protected]> Signed-off-by: Matthew Owen <[email protected]>
Co-authored-by: Hao Chen <[email protected]> Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
4f35375
to
b4e0e6a
Compare
Adds structured logging to Ray Data. This will allow users to configure logging to use any of the following: * A user's custom logging file (existing functionality) * A default TEXT logger (existing functionality) * A default JSON logger (new functionality) --------- Signed-off-by: Matthew Owen <[email protected]> Signed-off-by: Matthew Owen <[email protected]> Co-authored-by: Scott Lee <[email protected]> Co-authored-by: Hao Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
Adds structured logging to Ray Data. This will allow users to configure logging to use any of the following:
Examples:
Code snippet:
JSON logging (new)
Console output:
ray-data.log:
TEXT logging (unchanged)
Console output:
ray-data.log:
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.