Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timing breakdown #588

Merged
merged 26 commits into from
Jul 19, 2019
Merged

Timing breakdown #588

merged 26 commits into from
Jul 19, 2019

Conversation

felixbarny
Copy link
Member

@felixbarny felixbarny commented Apr 16, 2019

closes #556
depends on #587, elastic/apm-server#2086

TODO

  • guard against concurrent reads (report metrics to APM Server) and writes (tracking of metrics)
  • Javadoc
  • Report self_time on transactions/spans?
  • Feature flag
  • Limit timers to 1000
  • Thread-safe publication for Labels#immutableCopy

Open questions

@codecov-io
Copy link

codecov-io commented Apr 17, 2019

Codecov Report

Merging #588 into master will increase coverage by <.01%.
The diff coverage is 79.25%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #588      +/-   ##
============================================
+ Coverage     63.82%   63.83%   +<.01%     
  Complexity       68       68              
============================================
  Files           205      207       +2     
  Lines          8254     8496     +242     
  Branches       1056     1062       +6     
============================================
+ Hits           5268     5423     +155     
- Misses         2673     2741      +68     
- Partials        313      332      +19
Impacted Files Coverage Δ Complexity Δ
...lastic/apm/agent/report/ReporterConfiguration.java 100% <ø> (ø) 0 <0> (ø) ⬇️
.../apm/agent/concurrent/ExecutorInstrumentation.java 46.93% <ø> (ø) 0 <0> (ø) ⬇️
...m/agent/impl/async/SpanInScopeRunnableWrapper.java 100% <ø> (ø) 0 <0> (ø) ⬇️
...m/agent/impl/async/SpanInScopeCallableWrapper.java 100% <ø> (ø) 0 <0> (ø) ⬇️
...va/co/elastic/apm/agent/impl/transaction/Span.java 78.02% <100%> (+3.02%) 0 <0> (ø) ⬇️
...ic/apm/agent/metrics/builtin/JvmMemoryMetrics.java 100% <100%> (ø) 0 <0> (ø) ⬇️
...va/co/elastic/apm/agent/impl/ElasticApmTracer.java 74.05% <100%> (ø) 0 <0> (ø) ⬇️
...astic/apm/agent/impl/transaction/AbstractSpan.java 83.63% <100%> (+7.01%) 0 <0> (ø) ⬇️
...lastic/apm/agent/metrics/builtin/JvmGcMetrics.java 92.85% <100%> (ø) 0 <0> (ø) ⬇️
...astic/apm/agent/metrics/builtin/ThreadMetrics.java 100% <100%> (ø) 0 <0> (ø) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8aaaf6a...f84852a. Read the comment docs.

@felixbarny
Copy link
Member Author

@bmorelli25 the last commit contains documentation for the metrics. I'd love to get some feedback from you 🙂

docs/metrics.asciidoc Outdated Show resolved Hide resolved
docs/metrics.asciidoc Outdated Show resolved Hide resolved
Copy link
Member

@bmorelli25 bmorelli25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docs look almost as good as your balcony in the summer ☀️

@felixbarny felixbarny marked this pull request as ready for review June 11, 2019 11:50
Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is very big and complicated.
So I did my best to follow and provide initial feedback.
We can continue the review together, adding notes here as necessary.

NOTICE Show resolved Hide resolved
timer.update(duration);
if (finished) {
// in case end()->trackMetrics() has been called concurrently
// don't leak timers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not thread safe:

Thread A:
      AbstractSpan#end() -> Transaction#doEnd -> trackMetrics() // all timers are reset, finished == false

Thread B:
      AbstractSpan#end() -> Span#doEnd // creates an put a Timer, doesn't clean because finished == false

Thread A:
      AbstractSpan#end() -> Transaction#doEnd (continue) 
      AbstractSpan#end() -> sets finished = true 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will set finished = true earlier

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed now

* Returns a mutable {@link List}, roughly equal to the {@link #keySet()}.
* <p>
* Note that in concurrent scenarios, the key list may be a subset of the values of the respective {@link #keySet()}.
* Entries added via the {@code compute*} family of methods are not reflected in the list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the compute methods?
I think this list in general may not reflect the underlying keySet() as it is not thread-safe with regard to the underlying map, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example java.util.concurrent.ConcurrentHashMap#computeIfAbsent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synchronizing on add and remove operations should also make it thread-safe, right? That would make the map more applicable for other use-cases where modifying the list does not happen under contention.

* It can be used to iterate over the map's keys without allocating an {@link java.util.Iterator}
*/
public class KeyListConcurrentHashMap<K, V> extends ConcurrentHashMap<K, V> {
private final List<K> keyList = Collections.synchronizedList(new ArrayList<K>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure the map can only grow in size- throw unsupported operation exceptions from remove and clear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't go thoroughly through tests, but mostly through everything else.

update(durationUs, 1);
}

public void update(long durationUs, long count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment that must be used within code that guarantees atomicity, or make atomic, so that can't be queried while updated

final Timer timer = spanTimings.get(spanType);
if (timer.getCount() > 0) {
labels.spanType(spanType.getSpanType()).spanSubType(spanType.getSpanSubType());
metricRegistry.updateTimer("span.self_time", labels, timer.getTotalTimeUs(), timer.getCount());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, Timer update is not atomic, so reading here is not guaranteed to get fully updated Timer state (ie updated halfway by incrementTimer)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should now be fixed

}

@Override
public void resetState() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same- not atomic and may occur concurrently with reads or other writes

if (timer.getCount() > 0) {
labels.spanType(spanType.getSpanType()).spanSubType(spanType.getSpanSubType());
metricRegistry.updateTimer("span.self_time", labels, timer.getTotalTimeUs(), timer.getCount());
timer.resetState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same- may be executed concurrently with updates

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should now be fixed

serializeValue(key, ".count", timer.getCount(), jw);
jw.writeByte(JsonWriter.COMMA);
serializeValue(key, ".sum", timer.getTotalTimeMs(), jw);
timer.resetState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these resets seem awkward here, much better fit in metricSet.onAfterReport(). Is this done here in order to prevent additional iteration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that the idea was to prevent another iteration. But it's probably better to trade off in favor of readability here.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A really epic PR! 👏 👏

I think this change should come with a visible warning that high cardinality of span types and subtypes is highly discouraged as it will make this feature useless. It will also result with increased resource consumption (memory, cpu and bandwidth) but not sure how significant that would be.

@@ -48,6 +60,9 @@
*/
private final TransactionContext context = new TransactionContext();
private final SpanCount spanCount = new SpanCount();
// type: subtype: timer
private final KeyListConcurrentHashMap<String, KeyListConcurrentHashMap<String, Timer>> spanTimings = new KeyListConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment that this map does not get recycled and instead accumulates all span types and subtypes.
Maybe an additional ALL-CAPITALS comment about that in the resetState method, so to not let anyone fall for this one.

protected AtomicInteger references = new AtomicInteger();
protected volatile boolean finished = true;

public int getReferenceCount() {
return references.get();
}

private static class ReentrantTimer implements Recyclable {

private AtomicInteger nestingLevel = new AtomicInteger();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this name confusing. I thought it is about nested children, but it is more about counting concurrent-updaters of only first level (ie direct children) spans.

@felixbarny felixbarny merged commit d383730 into elastic:master Jul 19, 2019
@felixbarny felixbarny deleted the timing-breakdown branch July 19, 2019 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for timing breakdowns
4 participants