BatchLogRecordProcessor worker thread deadlock causing force_flush Timeout when shutting down LoggingHandler #3193

Open
nigel-gott opened this issue Feb 21, 2023 · 2 comments
Labels: bug, logging

nigel-gott commented Feb 21, 2023

Describe the bug

  • Configure a process with otel logging, a BatchLogRecordProcessor, and an OTLPLogExporter, which sends logs from its own worker thread using requests.
  • Add an opentelemetry.sdk._logs.LoggingHandler backed by the batch processor to logging so it intercepts and exports all logs.
  • In the main thread, clear logging's registered handlers. This happens, for example, when you apply a new logging config via logging.config.dictConfig(NEW_CONFIG).
  • In the main thread:
    • logging will acquire logging._lock to prevent other concurrent changes to the config
    • logging will call .shutdown on opentelemetry.sdk._logs.LoggingHandler
    • BatchLogRecordProcessor tries to flush its worker thread
  • Back in the worker thread, OTLPLogExporter will try to export remaining logs to the endpoint using requests
    • requests tries to do logger.debug
    • logger.debug blocks, waiting to acquire logging._lock, but it is now held by the main thread.
  • Finally, the main thread times out and logs a "Timeout was exceeded in force_flush()" warning. (A minimal sketch of this lock ordering, without OpenTelemetry, follows below.)
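
A minimal sketch of this lock ordering with no OpenTelemetry involved, assuming (as described above) that logging.config.dictConfig holds the module-level logging lock while it flushes and closes the existing handlers:

import logging
import logging.config
import threading

class FlushingHandler(logging.Handler):
    # Stand-in for LoggingHandler: close() waits on a worker thread that logs,
    # just as BatchLogRecordProcessor waits on its export thread.
    def close(self):
        done = threading.Event()

        def worker():
            # Stand-in for the exporter thread: requests/urllib3 log at DEBUG,
            # and getLogger()/isEnabledFor() need the logging module lock,
            # which the main thread is holding inside dictConfig.
            logging.getLogger("urllib3").debug("exporting batch")
            done.set()

        threading.Thread(target=worker, daemon=True).start()
        done.wait(timeout=5)  # blocks for the full timeout; the worker cannot take the lock
        super().close()

logging.getLogger().addHandler(FlushingHandler())
logging.config.dictConfig({"version": 1})  # hangs ~5 s inside FlushingHandler.close()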

Steps to reproduce

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs._internal.export import BatchLogRecordProcessor
import logging.config
import logging

# To see the deadlock try using: https://github.com/niccokunzmann/hanging_threads
# pip install hanging_threads 
# 
# from hanging_threads import start_monitoring
#
# monitoring_thread = start_monitoring()


logger_provider = LoggerProvider()
set_logger_provider(logger_provider)
exporter = OTLPLogExporter()
logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
handler = LoggingHandler(
    level=logging.INFO,
    logger_provider=logger_provider,
)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[handler, logging.StreamHandler()],
)

logger = logging.getLogger(__name__)
logger.info("test")

## reconfigure!
logging.config.dictConfig(
    {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "console": {
                "format": "%(levelname)s %(asctime)s %(name)s.%(funcName)s:%(lineno)s- %("
                "message)s "
            },
        },
        "handlers": {
            "console": {
                "class": "logging.StreamHandler",
                "formatter": "console",
            },
        },
        "root": {
            "handlers": ["console"],
            "level": "INFO",
        },
    }
)
# We now block for BatchLogRecordProcessor's full force_flush timeout while it waits for its worker thread to finish the flush.
logger.info("test")

What did you expect to see?
The above script ideally shouldn't just block and hang, but I'm not exactly sure what the fix could be, or even whether it should instead be raised against logging.

What did you see instead?
The above script hangs until the otel worker times out and logs a "Timeout was exceeded in force_flush()" warning.

What version did you use?
Version: 0.36b0

What config did you use?

receivers:
  otlp:
    protocols:
      grpc: # port 4317
      http: # port 4318

processors:
  batch:
    send_batch_size: 1000

exporters:
  otlp:
    endpoint: "api.honeycomb.io:443"
    headers:
      "x-honeycomb-team": "${env:HONEYCOMB_API_KEY}"
  otlp/metrics:
    endpoint: "api.honeycomb.io:443"
    headers:
      "x-honeycomb-team": "${env:HONEYCOMB_API_KEY}"
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp,logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/metrics,logging]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp,logging]

extensions:
  health_check:

Environment
OS: Ubuntu 22.04 LTS

Additional context

All of the above is a constructed example. The real situation where I hit this was trying to configure Django to work with otel logging. I was accidentally adding the Otel LoggingHandler before Django was set up. Django would then configure its own logging by calling logging.config.dictConfig. All I initially experienced was the entire Django application failing to start up for the length of the timeout.

For future Django users encountering this problem, make sure you configure Django's LOGGING setting itself to work with the Otel handler, and don't try to do something custom like I was. Alternatively, make sure to set up the otel logging after django.setup has been run (say, after get_xyz_application is called in wsgi.py or asgi.py).
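
For example, a hypothetical settings.py snippet along those lines, assuming the global logger provider is set (via set_logger_provider) before Django applies this config, so LoggingHandler can fall back to it when constructed without an explicit logger_provider:

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
        "otel": {
            # Hypothetical wiring: letting Django's own dictConfig create the
            # handler means there is no pre-existing otel handler for it to
            # flush and close while holding the logging lock.
            "class": "opentelemetry.sdk._logs.LoggingHandler",
            "level": "INFO",
        },
    },
    "root": {"handlers": ["console", "otel"], "level": "INFO"},
}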

nigel-gott added the bug label on Feb 21, 2023
nigel-gott (Author) commented:

One potential fix would be to do the shutdown flushing not in the worker thread but in the thread where BatchLogRecordProcessor.shutdown is called, since that thread is guaranteed to be holding logging._lock and so can safely log/use requests.
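
Roughly, with hypothetical names (not the actual SDK internals), the idea is:

def shutdown(self):
    # Hypothetical sketch: stop the worker without asking it to export,
    # then drain and export the remaining records from the calling thread,
    # which already holds logging's module lock and can therefore log safely.
    self._shutdown = True
    self._wake_worker()             # hypothetical: let the worker loop exit
    self._worker_thread.join()
    while self._queue:
        self._exporter.export(self._drain_one_batch())  # hypothetical helper
    self._exporter.shutdown()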

lzchen added the logging label on Apr 27, 2023
lzchen mentioned this issue on Sep 25, 2023
kraftp added a commit to dbos-inc/dbos-transact-py that referenced this issue Sep 4, 2024
Due to
open-telemetry/opentelemetry-python#3193, the
OTLP exporter does not flush logs that were buffered when another
process reconfigured the logger.

To mitigate this:

- During initialization, flush all DBOS logs immediately
- Only export non-DBOS logs to OTLP during `launch` after startup
completes.
MikeGoldsmith (Member) commented:

Hello - is there any update on this issue or potential workarounds? The suggested workarounds didn't help the app I'm helping out on.
