Simplify logging #906

Closed
Tracked by #1103
jlaneve opened this issue Apr 4, 2024 · 8 comments

@jlaneve (Collaborator) commented Apr 4, 2024

We add colors and other formatting to the logs that make them very difficult to read. This came up in the #airflow-dbt channel on the Airflow Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1711624722823329

We should look at how we do logging today and potentially think about an overhaul. We should be able to remove the colors, duplicate timestamps, etc.

@dosubot dosubot bot added the area:logging Related to logging, like log levels, log formats, error logging, etc label Apr 4, 2024
@dwreeves (Collaborator) commented Apr 4, 2024

This is small, but a simple starting point would be to move to the LoggingMixin instead of a global logger. I believe this would still be backwards compatible and wouldn't change any behavior, but it at least gives users an opportunity to override logging behavior, and it is more idiomatic in the Airflow world.

@tatiana tatiana added this to the 1.6.0 milestone Apr 30, 2024
@tatiana tatiana modified the milestones: 1.6.0, 1.5.0 May 17, 2024
@pankajkoti (Contributor)

This issue was reported again by the community:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1715739144301439

This was the Slack message body in case we lose the above thread data

Hi everyone,
Actually I'm using astronomer-cosmos==0.6.2 and the dbt models run correctly in Astro Airflow. The only problem is that the scheduler logs keep getting bigger and bigger, and almost all of the logs contain the same info (check the sample below). As I mentioned, the process executes as expected and the diagram is correct. Does anyone know how to avoid or fix this? The scheduler's log size is almost ~5 GB per day.
[2024-05-14T16:56:57.780+0000] {logging_mixin.py:154} INFO - [2024-05-14T16:56:57.780+0000] {render.py:214} ERROR - Dependency stg_talent_companies not found for model DbtModel(name='stg_goal_processes', type='DbtModelType.DBT_MODEL', path='/usr/local/airflow/dbt/ch_analytics/models/staging/stg_talent/stg_goal_processes.sql', config=DbtModelConfig(config_selectors=set(), upstream_models={'stg_talent_companies', 'talent_innpulsa'}))

Tentatively adding this to the 1.5 milestone.

@tatiana tatiana added triage-needed Items need to be reviewed / assigned to milestone and removed triage-needed Items need to be reviewed / assigned to milestone labels May 17, 2024
@dwreeves
Copy link
Collaborator

Big post incoming

Issue

Users have been seeing duplicate logs and other logging-related issues (see #639, #747, #906) for a while.

In my experience, logging issues have gotten a little out of hand. When running the dev container, I noticed that the logs were being tripled. That was new to me; I didn't recall that happening the last time I worked with Cosmos. After running the main branch of Cosmos with a completely blank slate, here is what my logs look like:

[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`

Python logging idioms

There are a handful of issues causing this, but I think the simplest way to make some progress toward addressing the root problem would be to note how this stuff is normally done.

Let's say you do this:

# a/b/c.py
import logging

logger = logging.getLogger(__name__)
logger.info("Hello, world!")

The logger's namespace is a.b.c; the logger has no handlers attached, and propagate = True.

This means that the logger.info() call does basically nothing, since there are no handlers.

... Except, what I just said is not entirely true! Because of propagate = True, the logs get passed up the chain of command. Logs in a.b.c get passed to a.b if such a logger exists, then to a, and finally to the root logger, which has no namespace at all (i.e. None). This assumes all of them are propagating, which is true by default.

The reason logging.getLogger()'s propagate attribute is True by default is that you're expected to set up a root logger (without a namespace) that has a StreamHandler() or some other handler, and then a bunch of loggers without any handlers at all will pass their logs up the chain to that root logger. For example, the httpx library makes a bunch of logging calls, but you don't see them until you set up your own root logger.

Full demo:

>>> import logging
>>> import httpx
>>> root = logging.getLogger()
>>> root.info("Hello world")
None

>>> # Nothing happens
>>> res = httpx.get("https://google.com/")
None

>>> root.addHandler(logging.StreamHandler())
>>> root.info("Hello world")
None

>>> root.setLevel(logging.INFO)
>>> root.info("Hello world")
Hello world

>>> # httpx uses a logging.Logger w/o a handler, so it relies on our root logger's handler to print
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"

>>> # getLogger() with no name returns the same root logger, so this attaches
>>> # a second StreamHandler to it; with 2x stream handlers, logs double
>>> root_v2 = logging.getLogger()
>>> root_v2.addHandler(logging.StreamHandler())
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"

>>> # Note that adding a handler under a new namespace ("a") doesn't _triple_
>>> # the httpx logs, since records from "httpx" never pass through "a".
>>> namespaced_logger = logging.getLogger("a")
>>> namespaced_logger.addHandler(logging.StreamHandler())
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"

>>> # However, a logger under "a.b" will triple, since its records also pass
>>> # through the StreamHandler on the "a" logger above.
>>> namespaced_logger_v2 = logging.getLogger("a.b")
>>> namespaced_logger_v2.info("Hello world")
Hello world
Hello world
Hello world

Issues with Cosmos's logging

For context: the justification for the way Cosmos's logging is currently set up is to add a colored (astronomer-cosmos) in front of the logs inside the formatter object, which requires a StreamHandler object. This helps to make parsing the scheduler logs a little easier. That said, this has caused a lot of issues for users.

Under the hood, I would argue the reason Cosmos's logging is a little strange is that it doesn't adhere to the idioms mentioned above, in a few ways:

  • It creates multiple StreamHandlers. Every single invocation of get_logger() creates a stream handler, but it would make more sense to either create zero StreamHandlers (the typical approach, e.g. how httpx approaches logging) or at most exactly one StreamHandler, and then rely on propagation to the root logger for all the Cosmos sub-modules (see the sketch after this list).
  • Some of the loggers don't have a namespace. Specifically the logger in cosmos/__init__.py and cosmos/cache.py. This is actually the reason for the tripling of the logs: without a namespace, you have anywhere up to 3 different loggers that will log from any namespace. Each log in Cosmos should have a namespace.
  • The propagate_logs config option feels like a hack. The problem is that in some contexts there are multiple StreamHandlers printing somewhere, and in other contexts there may not be.
  • It is more typically the responsibility of the formatter/handler to expose metadata such as where a log line is coming from. For example, the log formatter logging.Formatter("%(asctime)s - %(name)s.%(module)s:%(lineno)d - %(levelname)s - %(message)s") would expose the fact that Cosmos is the entity sending out particular logging messages.
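
To make the first bullet concrete, here is a minimal sketch (not Cosmos's actual code) of the handler-per-call pattern and why it multiplies output:

# A sketch approximating a get_logger() that attaches a handler on every call.
import logging

def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("(astronomer-cosmos) - %(message)s"))
    logger.addHandler(handler)
    return logger

a = get_logger("cosmos")
b = get_logger("cosmos.dbt")
c = get_logger("cosmos.dbt.graph")
# A record emitted here is printed by c's handler, then again by b's and a's
# handlers as it propagates upward.
c.warning("hello")  # printed three times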

Solutions

I'm sorting these in order from least controversial to most controversial. They're also ordered in terms of "steps".

1. All Loggers must be namespaced + only the root logger gets a StreamHandler.

There are a few ways to do this but the simplest is probably to just replace each one of these:

from cosmos.log import get_logger

logger = get_logger(__name__)

with:

import logging

logger = logging.getLogger(__name__)

Another option would be to do something like if name != "cosmos": ... inside get_logger(), but this is too different from the typical approach of just doing logging.getLogger(__name__).

Thus, the only logger that would get a StreamHandler is the one inside cosmos/__init__.py, or alternatively wherever else you'd like to configure logger = logging.getLogger("cosmos").
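
For illustration, the single root handler could be configured once at the package root, something like the sketch below (the formatter string is illustrative, not Cosmos's actual one):

# cosmos/__init__.py (sketch)
import logging

logger = logging.getLogger("cosmos")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("(astronomer-cosmos) - %(levelname)s - %(message)s"))
logger.addHandler(handler)
# All cosmos.* submodule loggers propagate up to this one handler.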

This is essentially a refactor, but it makes for a nicer pattern going forward.

2. Use LoggingMixin when appropriate.

Basically, inside task instance executions, the LoggingMixin makes more sense to use than module-scoped logger objects.

Because of how namespacing works, just like step 1, this will ultimately be mostly a refactor: each LoggingMixin's log name is self.__class__.__module__ + '.' + self.__class__.__name__, so e.g. instead of cosmos.operators.local, it becomes cosmos.operators.local.DbtLocalBaseOperator. The namespace still pares down to cosmos, which is where the StreamHandler lives.
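
As a minimal sketch of what this looks like in practice (the operator below is hypothetical; BaseOperator already inherits Airflow's LoggingMixin, so self.log is available):

# Hypothetical operator for illustration.
from airflow.models.baseoperator import BaseOperator

class ExampleDbtOperator(BaseOperator):
    def execute(self, context):
        # The logger is namespaced <module>.<class>, e.g.
        # cosmos.operators.local.DbtLocalBaseOperator in Cosmos.
        self.log.info("Running dbt command")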

3. Decouple scheduler logs from task instance logs + treat these separately

This is where things get a little spicy. The above changes do nothing to fix the core issue of duplicate logs; they just refactor things to be cleaner and a teeny bit more idiomatic. But logs will still duplicate.

The issue ultimately is that we want the scheduler logs to look nicer, but we don't want them to duplicate, and we also don't want the task instance logs to be impacted.

You have two options for this as far as I am concerned:

3a. StreamHandler for specific namespaces only

One way to achieve this: for any logs emitted as part of the scheduler's work, we can:

  • Define a logger via getLogger("cosmos.dbt")
  • Add a StreamHandler to the cosmos.dbt logger
  • Set propagate = [the config option] on the cosmos.dbt logger

This ensures that logs do not get duplicated, but they do get written, and the custom logging stuff is isolated to the scheduler only.

As part of this, you'd also remove the StreamHandler from the getLogger("cosmos") logger.

There are a few other places in the code where the "scheduler" stuff occurs, in addition to cosmos.dbt. There's also cosmos.converter, for example. It's a little all over the place.

Each one of these places where logging is used by the scheduler should either get its own specific namespace, or alternatively each one should just be __name__ and get a StreamHandler:

# cosmos/dbt/__init__.py
from cosmos.log import get_logger_for_scheduler

logger = get_logger_for_scheduler(__name__)

# cosmos/dbt/graph.py
import logging

logger = logging.getLogger(__name__)  # <- note: logs propagate to `cosmos.dbt`, so this is fine.
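
A hypothetical sketch of what get_logger_for_scheduler could look like (the helper name comes from the snippet above; the exact config read is an assumption, not an existing Cosmos API):

import logging

from airflow.configuration import conf

def get_logger_for_scheduler(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    # Exactly one StreamHandler, attached at the scheduler-facing namespace.
    logger.addHandler(logging.StreamHandler())
    # Keep propagation configurable so users whose root logger also has a
    # handler can avoid double-printing.
    logger.propagate = conf.getboolean("cosmos", "propagate_logs", fallback=True)
    return logger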

Lastly, tests should assert that logs are not being duplicated.

3b. Just nuke it?

This is a lot of extra complexity just to prepend (astronomer-cosmos) to each log line. It's not unreasonable to remove the feature entirely and handle logging more "typically". It is not my place to make that call; my place is mainly to provide feedback and then implement the repository owners' and community's vision, but I will throw it out there as a suggestion regardless.

@tatiana tatiana removed their assignment Jun 3, 2024
@tatiana tatiana modified the milestones: Cosmos 1.5.0, Cosmos 1.6.0 Jun 6, 2024
@gbatiz commented Jun 13, 2024

Do you guys have a suggested workaround or fix we can implement for the time being?
We managed to suppress node_converter warnings flooding the logs, but every scheduler log line is still repeated 3 times, which makes debugging quite inconvenient. Also, some of the scheduler logs from Cosmos seem to show up in unrelated task execution logs, creating confusion.

@dwreeves (Collaborator)

@gbatiz #1047 puts a bandaid on the issue. Combined with setting cosmos.propagate_logs = False in the config, you should be all set when the next alpha release of 1.5.0 comes out.


That said, I feel like it is still a bandaid. The more I think about it, the more I am in favor of scrapping the custom logging entirely, and ditching the whole (astronomer-cosmos) thing. But this needs to occur with community input.

@gbatiz commented Jun 21, 2024

Thanks!
Personally I also think it would be best to get rid of the whole thing. The small nicety is not worth the complexity.

@ajsquared

I'd +1 removing it entirely too.

@dwreeves dwreeves mentioned this issue Jul 21, 2024
tatiana pushed a commit that referenced this issue Aug 15, 2024
This PR addresses #906 and fixes issues in the Cosmos logging once and
for all.\*

> \* Actually, there is another issue with logs being polluted with
warnings about "source" nodes in 1.5.0, but that is a separate matter!

I have a long explanation of how the `logging` module in Python works,
and the sort of idioms it expects of end users of the module, here:
#906 (comment)

The choices I made, explained:

- Although I don't know that I entirely agree with adding
`(astronomer-cosmos)` to all the logs, clearly at least one user, and
possibly many more, want it, and I don't believe we should remove it.
The objective of this PR was therefore to preserve the feature while
future-proofing against future issues.
- Why I can't say I'm a fan of it: It seems that adding
`(astronomer-cosmos)` to logs is a symptom of other problems with the
Cosmos library, specifically how it impacts performance when users do
not set it up effectively. And the prefix was added as a way to assist
people in diagnosing these issues. I think ultimately we want to move
away from this. Other components of the Airflow ecosystem do not feel
compelled to do things like this. Also, the module path is something
that can be handled in the `log_format` if users really want it.
- How I future-proofed: As per the long post I link above, basically the
issue is that there should not be tons of StreamHandlers being created.
The proper and typical use of the logging module, with few exceptions,
is to allow for logs to propagate upwards to a root logger. The reason
the Cosmos logs presented issues for so long was because it deviated a
lot from this.
- I think default behavior being the "least astonishing" means making no
modifications to the base logging behavior whatsoever. This is also less
likely to morph into future issues if any further changes are made to
the custom logging.
- One thing I never mentioned: I found it odd that by default Cosmos did
not "work out of the box" and that, despite using Astronomer's own
Airflow platform (!), I had to set a config option to keep Cosmos
logging from being a nightmare (i.e. set `propagate_logs` = false).
Previous reports referenced the Celery Executor as having issues, even
though it is one of the two most popular ways to run Airflow in
production. Something like this should just work out of the box for a
majority of users!
- For task execution, Cosmos should make use of the more
Airflow-idiomatic `LoggingMixin` class whenever appropriate. This can
also be used in scheduler / webserver related logging contexts but I
think it is less out-of-place there to use globally scoped loggers.
- These will not use the `get_logger()` implementation. That is
intentional and probably desirable. These logs do not need to be
"enriched" because they are isolated in the task execution logs.

Oh also, I fixed an issue in the `project.entry_points` in the
`pyproject.toml` while I was at it.

## Breaking Change?

- Removes the `propagate_logs` conf option, although removing it will
not break users' builds. There is now a `rich_logging` conf option
instead, which is disabled by default.
@tatiana (Collaborator) commented Aug 16, 2024

Closed in #1108

@tatiana tatiana closed this as completed Aug 16, 2024
@tatiana tatiana mentioned this issue Aug 16, 2024