Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extend component run info #8436

Closed
wants to merge 6 commits into from
Closed

Conversation

tstadel
Copy link
Member

@tstadel tstadel commented Oct 2, 2024

Related Issues

  • fixes logging component run info via logging extra in addition to tracing + adds additional information (e.g. input and output lengths).

Background:
Currently there is little to no information about a component's run invocation in our logs. We get a log Running component writer without any further information. Additional information is only available via tracing. Traces are not always available or suitable in certain situations as for example user-facing logs.

Proposed Changes:

  • emit component run info via logging extra in addition to tracing
  • add input lengths
  • add output types
  • add output lengths

How did you test it?

Notes for the reviewer

Checklist

@tstadel tstadel requested a review from a team as a code owner October 2, 2024 15:57
@tstadel tstadel requested review from julian-risch and removed request for a team October 2, 2024 15:57
@github-actions github-actions bot added topic:core type:documentation Improvements on the docs labels Oct 2, 2024
@silvanocerza
Copy link
Contributor

Are inputs and outputs lengths really necessary? Which use case would stop people from calculating it themselves? 🤔

@silvanocerza silvanocerza requested review from silvanocerza and shadeMe and removed request for julian-risch October 2, 2024 16:14
@tstadel
Copy link
Member Author

tstadel commented Oct 2, 2024

Are inputs and outputs lengths really necessary? Which use case would stop people from calculating it themselves? 🤔

@silvanocerza Not sure what you mean with "calculating it themselves" exactly.
Input and output lengths can vary heavily througout the pipeline and additonally can depend heavily on the actual input.
For example take the following pipeline in which you

  • start with one file (e.g. a pdf) -> 1 path
  • split the pdf into 1 image per page -> n_pages images
  • run OCR using an llm per page -> n_pages documents
  • build strides of 3 pages -> n_pages-2 strides
  • extract some information for each stride using llm -> n_pages-2 information
  • join all information -> n_information
  • write the information to document_store

To actually monitor progress and to decide whether an invocation takes too long or is within bounds you need to know input and output lengths of the individual components. OCR might be quick for 5 pages, but take an hour for 1k pages.

@coveralls
Copy link
Collaborator

coveralls commented Oct 4, 2024

Pull Request Test Coverage Report for Build 11178179903

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 17 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 90.312%

Files with Coverage Reduction New Missed Lines %
core/pipeline/pipeline.py 17 81.63%
Totals Coverage Status
Change from base Build 11144701295: 0.01%
Covered Lines: 7467
Relevant Lines: 8268

💛 - Coveralls

Copy link
Collaborator

@shadeMe shadeMe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this change falls outside the purview of the the Pipeline class. The current tracing hooks expose the inputs of each component run invocation, and that's the extent of the class' involvement - It's up to the consumer of the tracing instrumentation to house the logic that reasons about those inputs.

We can additionally log the parameters that we currently passing to the tracing instrumentation (and add the output types to that), but logic to infer lengths doesn't belong here.

@tstadel
Copy link
Member Author

tstadel commented Oct 7, 2024

I'm afraid this change falls outside the purview of the the Pipeline class. The current tracing hooks expose the inputs of each component run invocation, and that's the extent of the class' involvement - It's up to the consumer of the tracing instrumentation to house the logic that reasons about those inputs.

We can additionally log the parameters that we currently passing to the tracing instrumentation (and add the output types to that), but logic to infer lengths doesn't belong here.

@shadeMe So how would I log the number of input and output items (which I believe is an important number for debugging/monitoring pipelines)?

@shadeMe
Copy link
Collaborator

shadeMe commented Oct 8, 2024

@shadeMe So how would I log the number of input and output items (which I believe is an important number for debugging/monitoring pipelines)?

Not sure what your constraints are, but perhaps by creating a custom tracer that inspects the inputs and output and writes their characteristics to to the logs? The haystack.tracing module should have the required scaffolding.

On second thought, such a tracer would be a useful tool for Haystack users too.

@anakin87 anakin87 mentioned this pull request Oct 9, 2024
@tstadel
Copy link
Member Author

tstadel commented Oct 9, 2024

@shadeMe So how would I log the number of input and output items (which I believe is an important number for debugging/monitoring pipelines)?

Not sure what your constraints are, but perhaps by creating a custom tracer that inspects the inputs and output and writes their characteristics to to the logs? The haystack.tracing module should have the required scaffolding.

On second thought, such a tracer would be a useful tool for Haystack users too.

@shadeMe If the LoggingTracer is the way to go, do you think adding the extra to the existing log lines in this PR makes still sense?

@shadeMe
Copy link
Collaborator

shadeMe commented Oct 16, 2024

@shadeMe If the LoggingTracer is the way to go, do you think adding the extra to the existing log lines in this PR makes still sense?

I don't see an issue with logging the input and output types, if only for completeness' sake.

@silvanocerza
Copy link
Contributor

LoggingTracer introduced with #8447 should solve the problem this PR is trying to solve. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants