
HADOOP-16830. IOStatistics API. #2069

Closed

Conversation

steveloughran
Contributor

@steveloughran steveloughran commented Jun 11, 2020

IOStatistics API in hadoop-common and the S3A filesystem implementation. @mehakmeet has been working on an ABFS version.

  1. This is HADOOP-16830. IOStatistics API. #1982 rebased to trunk.
  2. I do plan to submit the S3A side of the stats separately, but having it together helps me work on an API which is useful for apps.
  3. HADOOP-16830. IOStatistics API. #1982 discusses the evolution of the design. Specifically:

Having written the new extensible design, I've decided I don't like it. It is too complex because I'm trying to support arbitrary-arity tuples of any kind of statistic, which makes iterating/parsing this stuff way too complex.

Here's a better idea: we only support a limited set:

  • counter: long
  • min: long
  • max: long
  • mean: (double, long)
  • gauge: long
  1. all but gauge have simple aggregation; for gauge I'll add values up too, on the assumption that they will be positive (e.g. 'number of active reads')
  2. and every set will have its own iterator.

what do people think?
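To make the proposal above concrete, here is a hypothetical sketch of the aggregation rules for the five statistic types. The class and method names are illustrative only, not the real Hadoop API; treating -1 as "unset" for min/max is one plausible convention (the one this PR later settles on).

```java
// Hypothetical sketch of the proposed aggregation rules; names are
// illustrative, not the actual Hadoop classes.
public class StatAggregation {

  public static final long UNSET = -1;

  // counters and gauges: simple addition (gauges on the assumption
  // that they hold positive values, e.g. 'number of active reads').
  public static long aggregateCounters(long x, long y) {
    return x + y;
  }

  // min/max: ignore unset (-1) operands, otherwise take the min/max.
  public static long aggregateMin(long x, long y) {
    if (x == UNSET) { return y; }
    if (y == UNSET) { return x; }
    return Math.min(x, y);
  }

  public static long aggregateMax(long x, long y) {
    return Math.max(x, y);
  }

  // mean: combine two (mean, sample count) pairs into a weighted mean.
  public static double aggregateMean(double m1, long n1, double m2, long n2) {
    long samples = n1 + n2;
    return samples == 0 ? 0.0 : (m1 * n1 + m2 * n2) / samples;
  }
}
```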

Superseded by #2323 and #2324

@steveloughran steveloughran added enhancement fs/s3 changes related to hadoop-aws; submitter must declare test endpoint labels Jun 11, 2020
@apache apache deleted a comment from hadoop-yetus Jun 15, 2020
@steveloughran steveloughran force-pushed the s3/HADOOP-16830-iostatistics branch 2 times, most recently from 8d1f9b4 to b44f338 Compare June 25, 2020 12:14
@apache apache deleted a comment from hadoop-yetus Jun 29, 2020
@apache apache deleted a comment from hadoop-yetus Jun 29, 2020
@apache apache deleted a comment from hadoop-yetus Jun 29, 2020
@apache apache deleted a comment from hadoop-yetus Jun 30, 2020
@apache apache deleted a comment from hadoop-yetus Jul 3, 2020
@apache apache deleted a comment from hadoop-yetus Jul 3, 2020
@apache apache deleted a comment from hadoop-yetus Jul 3, 2020
@apache apache deleted a comment from hadoop-yetus Jul 3, 2020
@steveloughran
Contributor Author

The latest patch wires up stats collection from the workers on an S3A committer job, marshals them as JSON in .pending/.pendingset files, and then finally aggregates them into the _SUCCESS job summary file. Here's an example of a test run.

2020-07-03 16:47:08,981 [JUnit-ITestMagicCommitProtocol-testOutputFormatIntegration] INFO  commit.AbstractCommitITest (AbstractCommitITest.java:loadSuccessFile(503)) - Loading committer success file s3a://stevel-ireland/test/ITestMagicCommitProtocol-testOutputFormatIntegration/_SUCCESS. Actual contents=
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1593791227415,
  "date" : "Fri Jul 03 16:47:07 BST 2020",
  "hostname" : "stevel-mbp15-13176.local",
  "committer" : "magic",
  "description" : "Task committer attempt_200707120821_0001_m_000000_0",
...
  "diagnostics" : {
    "fs.s3a.authoritative.path" : "",
    "fs.s3a.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
    "fs.s3a.committer.magic.enabled" : "true",
    "fs.s3a.metadatastore.authoritative" : "false"
  },
  "filenames" : [ "/test/ITestMagicCommitProtocol-testOutputFormatIntegration/part-m-00000" ],
  "iostatistics" : {
    "counters" : {
      "committer_bytes_committed" : 4,
      "committer_bytes_uploaded" : 0,
      "committer_commits_aborted" : 0,
      "committer_commits_completed" : 1,
      "committer_commits_created" : 0,
      "committer_commits_failed" : 0,
      "committer_commits_reverted" : 0,
      "committer_jobs_completed" : 1,
      "committer_jobs_failed" : 0,
      "committer_tasks_completed" : 1,
      "committer_tasks_failed" : 0,
      "stream_write_block_uploads" : 1,
      "stream_write_block_uploads_data_pending" : 0,
      "stream_write_bytes" : 4,
      "stream_write_exceptions" : 0,
      "stream_write_exceptions_completing_uploads" : 0,
      "stream_write_queue_duration" : 0,
      "stream_write_total_data" : 4,
      "stream_write_total_time" : 0
    },
    "gauges" : {
      "stream_write_block_uploads_data_pending" : 4,
      "stream_write_block_uploads_pending" : 0
    },
    "minimums" : { },
    "maximums" : { },
    "meanStatistics" : { }
  }
}

I'm in a good mood here. Time for others to take a look.


| Category | Aggregation |
|------------------|-------------|
| `counter` | `min(0, x) + min(0, y)` |
Contributor

Counters should be a simple addition, right?

Contributor Author

yeah, but if something negative has got in I don't want it tainting the results

Contributor

don't you mean max(0, x)?

Contributor Author

yeah -fixed
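A minimal sketch of the fix the thread converges on: clamp negative inputs to zero before adding, so a stray negative counter cannot taint the aggregate. The class name is illustrative, not the actual Hadoop code.

```java
// Illustrative sketch: counter aggregation that guards against
// negative values leaking into the total.
public class CounterAggregation {
  public static long aggregate(long x, long y) {
    // clamp each operand to zero, then add
    return Math.max(0, x) + Math.max(0, y);
  }
}
```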

/**
 * Builder of the CounterIOStatistics class.
 */
public interface CounterIOStatisticsBuilder {
Contributor

nit: Don't you think the name is a bit confusing here? We are providing support for min, max, and mean here as well.

Contributor Author

yeah, it's obsolete. Removed

snapshotIOStatistics(IOStatistics statistics) {

IOStatisticsSnapshot stats = new IOStatisticsSnapshot(statistics);
stats.snapshot(statistics);
Contributor

Why do we need to call snapshot() again? I can see the snapshot calculation just happened in the previous call during constructor initialisation.

Contributor Author

you are right. cut this.
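A minimal stand-in illustrating the fix agreed above: the copy taken in the constructor is itself the snapshot, so a second snapshot() call is redundant. SnapshotDemo here is a hypothetical stand-in, not the real IOStatisticsSnapshot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: the snapshot is taken once, in the
// constructor, so callers must not snapshot again.
public class SnapshotDemo {
  private final Map<String, Long> counters;

  public SnapshotDemo(Map<String, Long> source) {
    this.counters = new HashMap<>(source);  // the snapshot happens here
  }

  public long counter(String key) {
    return counters.getOrDefault(key, 0L);
  }
}
```

Later mutations of the source do not affect the snapshot, which is the point of taking it in the constructor.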

@apache apache deleted a comment from hadoop-yetus Jul 15, 2020
@apache apache deleted a comment from hadoop-yetus Jul 15, 2020
@steveloughran steveloughran force-pushed the s3/HADOOP-16830-iostatistics branch 2 times, most recently from fad3d9d to 243f186 Compare July 20, 2020 14:28
@apache apache deleted a comment from hadoop-yetus Jul 20, 2020
+ and ability to set gauges, min, max, means
+ methods to access/update these, and to get the
  raw refs to ease migration, pass them around,
  and better performance on oft-updated values (no map lookup)

s3a instrumentation using this to update gauges on
s3a block output stats; adding a base class for all those
stats impls now they have a lot more commonality than
a set of long counters.

Change-Id: I0456b200509ffdb184c4ae86976b0b2789670c82
really rounding this out and moving s3a instrumentation to it

Change-Id: I749a8ef78b33b4ea85de4d93eeefc7c7d71294a8
Change-Id: Id1d47856ce57ea8515b2a3d1de33cf88cd934389
all streams which serve up statistics declare themselves as having
the stream capability "iostatistics"; all those which forward
getIOStatistics also do the same for stream capabilities.

This gives a consistent way to probe for streams supporting the statistics
even without invoking it.

BufferedIOStatisticsOutputStream forwards syncable() calls after
flushing itself.

This benefits work in HADOOP-13327, where we need
a BufferedOutputStream subclass to forward Syncable operations.

Change-Id: Ib015f48396819873859b3fa644d4afd133e576d8
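A sketch of the capability probe described above. StreamCapabilities is redeclared here as a minimal stand-in so the example is self-contained; in Hadoop the real interface is org.apache.hadoop.fs.StreamCapabilities.

```java
// Illustrative sketch: probe a stream for the "iostatistics"
// capability without invoking getIOStatistics() at all.
public class CapabilityProbe {

  // simplified stand-in for Hadoop's StreamCapabilities interface
  public interface StreamCapabilities {
    boolean hasCapability(String capability);
  }

  public static final String IOSTATISTICS = "iostatistics";

  // True iff the stream advertises IOStatistics support.
  public static boolean supportsIOStatistics(Object stream) {
    return stream instanceof StreamCapabilities
        && ((StreamCapabilities) stream).hasCapability(IOSTATISTICS);
  }
}
```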
Change-Id: I1dd0b9e46d3e24c19f29f8a9e0bd16e762b74345
WiP there, but making it part of the public API so it can
be used in other classes (e.g. pendingset files).

snapshot implements the aggregate action covered in the specification.

Change-Id: I4dfc04f4fe8f97a9c64f4fa5a636fa1da68bbb43
TODO: full round trip of a statistics instance
It's feeling way overengineered here, but at least now I can embed it
in something like my committer stats. Which, if we get off
the output streams on the write, isn't completely useless.

Change-Id: Ifbeb3f546d1884cb92a853da5a54139cb98c03b2
This does full end-to-end integration with the stats from
individual file uploads (magic only) and task/job commit counters
being aggregated through .pending and .pendingset writes, before
making it through to _SUCCESS.

Most of the work is in the hadoop-aws module; changes done in
hadoop-common are focused on making that implementation
straightforward, identifying troublespots etc.

Overall, other than the fact we don't have worker thread level of
statistic collection (hence no input stream stats), we do now have
the ability to collect and aggregate statistics.

Change-Id: I2cd82f7cfb5b1937cb69fb85555836a2b7b20f98
fully embracing the new stats in the S3A input stream, including
unbuffer. Some test failures there related to total counts - something
is overcounting.

Lots of tuning of the implementation classes here; main API seems OK

Change-Id: I0a627a54b421594c941458b82a12925383d865e3
Rebase onto trunk and merge changes which went in with HDFS-13934.

That's already had a stats API intended for integration - but this wraps it up.

That uploader is going to be the first place where I add some
min/max/mean tracking of upload performance because it is simple code.
This is driving the changes in CounterStatisticsImpl (TODO: rename?)
to make it easier to track and update these.

New unit tests

Change-Id: I66dab375cfe4d0ec386521dbfd83534f0ec32f40
* easy to track duration of operations
* instrumented S3AInputStream to track time of GET requests
* Instrumented S3A List operations to track time/count of LIST requests;
  The remote iterators returned in the S3A code all now serve these results.

To aid with collection in applications, more classes implement
IOStatisticsSource and pass down to their inner streams:

* CompressionInputStream
* CompressionOutputStream
* LineReader

LocatedFileStatusFetcher will aggregate stats coming from its filesystem
list operations, giving a total count/performance of operations against
a store.

See ITestLocatedFileStatusFetcher for use

Change-Id: Iebe49341ae515c387194040c861e3002e1f366a2
- Improving API for duration tracking; testing it too
- SelectInputStream counting right stats
- both min and max stats have -1 as unset.
Mixed feelings here, but setting min to 2^32 sucks too

Change-Id: I4b1ba62dc619ea65308614f1a930642d4ce9c59b
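A sketch of the "-1 means unset" convention discussed above for min/max statistics, instead of initialising min to a huge sentinel value. The class is illustrative, not the real Hadoop implementation.

```java
// Illustrative min/max tracker using -1 as the "unset" marker.
public class MinMaxStat {
  public static final long UNSET = -1;
  private long min = UNSET;
  private long max = UNSET;

  public void sample(long value) {
    if (min == UNSET || value < min) { min = value; }
    if (max == UNSET || value > max) { max = value; }
  }

  public long getMin() { return min; }
  public long getMax() { return max; }
}
```

The trade-off the comment alludes to: a -1 sentinel means readers must special-case "no samples yet", but it avoids publishing a bogus huge minimum before the first sample arrives.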
still failing ITestS3AUnbuffer.testUnbufferStreamStatistics

issue: valid regression, or was the test previously broken?
Change-Id: I218b484347737ef38822f9bf78fa18218254f4a7
-fixed failing unbuffer test
-checkstyles
-fighting findbugs
-listing iterators toString methods print stats
-as do others

Change-Id: I06b1b21b2f6686d18ed80de24d4c30835a7521ba
This is actually based on my PoC of using this code through Parquet -
we need to move it to using RemoteIterable through its chained work,
and we want to keep collecting statistics.

This is a movement and extension of the S3A listing code to do this.
Making this available for Parquet &c forced me to make some of the functional
classes in fs.impl public (the FunctionsRaisingIOE and WrappedIOException).

-new package fs.functional for this stuff
-moved the function declarations
-created superclass for WrappedIOE, in case anything was using it.

Really needs some tests of this stuff, as while Listing validates the
correct codepaths, they don't try to break them.

Change-Id: I178509cc60e19b6f2fab09ffd15c6b6986c79095
Change-Id: I31a7a25624f6e10711ac6e7fa205be4bd7515a85
+marker tools will show stream stats @ verbose, shows that we
can pick this up in our own tooling

Change-Id: I4b0437b5179a43a65f2c5cc6d604dfe3593653f8
By making the methods to create a DurationTracker their own interface, and by
moving that and the DurationTracker interface into the public package, it
becomes possible to pass references to a factory into other classes, without
having to know whether it's been implemented by an IOStatisticsStore, or
a simple stub.

This is used in Listing to call into ListingCallbacks; it would also be
how something like the SemaphoredExecutor could track wait times

Change-Id: If75eb753bc84777aaaeb6e99a2ce1dad07eec8f6
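A hypothetical sketch of the factory pattern described above: code takes a DurationTrackerFactory and never needs to know whether it is backed by a real statistics store or a no-op stub. All names here are illustrative, not the actual Hadoop interfaces.

```java
// Illustrative sketch of a duration-tracker factory with a no-op stub.
public class DurationTracking {

  public interface DurationTracker extends AutoCloseable {
    @Override
    void close();  // ends the tracked duration; no checked exception
  }

  public interface DurationTrackerFactory {
    DurationTracker trackDuration(String key);
  }

  // Stub factory for code paths with no statistics store behind them.
  public static final DurationTrackerFactory STUB =
      key -> () -> { /* no-op */ };

  // Example use: track how long an operation takes, whatever the backing.
  public static void timed(DurationTrackerFactory factory, String key,
                           Runnable operation) {
    DurationTracker tracker = factory.trackDuration(key);
    try {
      operation.run();
    } finally {
      tracker.close();
    }
  }
}
```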
SemaphoreExecutor does take a duration tracker factory, most
of the S3A Statistics are now such factories. So we can track the
duration of block acquisition time, etc.

We're lagging here on how well this stuff is being incorporated back
into the core filesystem statistics.

I'm thinking

* S3A FS implements IOStatsSource
* There's an uber source which tracks everything
* On close() of a stats source, it's updated
* And S3A Instrumentation and Statistics are updated to match.
  Really Statistics should just serve up the counters.

? What does this do for Metrics?

Unsure.

? How to propagate stats from long-lived S3 subsidiary classes?

We do that already in places. For duration tracking though, we'd want
something which could either create two DurationTrackers (inefficient)
or chain back into a second one which would let us forward this stuff
from the inner to the outer, so using exactly the same System.currentTime values

Change-Id: If4b404e45129fea4384b1943d3993d42e2474853
Change-Id: Ib54ff0505cf6f98c308da553f6c5b2b9b0e11daf
* If we add a forEach() method, we can log @ debug any IOStatistics, without
the calling code needing to know anything about the interface.
* If this is done for operations which go through remote iterators in our
  own code, we start logging operation costs ourselves. Slick.
* Done that switch in S3AUtils, but not in production code elsewhere.

Change-Id: I31167419c7a2755d6910127400820d6fee4945d6
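A sketch of the forEach() idea above: callers can log every statistic at debug level by walking (name, value) pairs, without knowing anything else about the statistics interface. This is an illustrative stand-in, not the Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative stand-in: expose counters only through forEach,
// so logging code needs no knowledge of the interface shape.
public class IOStatsForEach {
  private final Map<String, Long> counters = new LinkedHashMap<>();

  public void increment(String key, long value) {
    counters.merge(key, value, Long::sum);
  }

  // The whole logging contract: hand each (name, value) pair to the caller.
  public void forEachCounter(BiConsumer<String, Long> action) {
    counters.forEach(action);
  }
}
```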
* Docs
* Javadocs
* Ongoing work on making multithreading resilient to changes in the maps,
  so lining up for support for dynamic map expansion

Change-Id: Id8220e40cdba9c91137bf0229f0daea067387103
-move to hadoop.util.functional
-add unit tests
-use in more code to see how well they work
-tune close() and factory names

I'm happy with them internally; we do need to mark them as unstable at first,
though.

Change-Id: I6493412c2de53468f8c6f6f2184be07db12c00b0
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 30s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 3s No case conflicting files found.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 39 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 3m 22s Maven dependency ordering for branch
+1 💚 mvninstall 25m 50s trunk passed
+1 💚 compile 19m 21s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 compile 16m 42s trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
+1 💚 checkstyle 2m 53s trunk passed
+1 💚 mvnsite 3m 14s trunk passed
+1 💚 shadedclient 20m 39s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 49s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javadoc 2m 52s trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
+0 🆗 spotbugs 1m 13s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 4m 45s trunk passed
-0 ⚠️ patch 1m 35s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 26s Maven dependency ordering for patch
+1 💚 mvninstall 1m 57s the patch passed
+1 💚 compile 18m 46s the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javac 18m 46s root-jdkUbuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 generated 0 new + 2061 unchanged - 1 fixed = 2061 total (was 2062)
+1 💚 compile 16m 54s the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
+1 💚 javac 16m 54s root-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu218.04-b01 with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu218.04-b01 generated 0 new + 1955 unchanged - 1 fixed = 1955 total (was 1956)
-0 ⚠️ checkstyle 2m 45s root: The patch generated 18 new + 266 unchanged - 26 fixed = 284 total (was 292)
+1 💚 mvnsite 3m 15s the patch passed
-1 ❌ whitespace 0m 0s The patch has 14 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 💚 xml 0m 1s The patch has no ill-formed XML file.
+1 💚 shadedclient 14m 11s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 48s the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
-1 ❌ javadoc 1m 37s hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu218.04-b01 with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu218.04-b01 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1)
+1 💚 javadoc 0m 35s hadoop-mapreduce-client-core in the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.
+1 💚 javadoc 0m 41s hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu218.04-b01 with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu218.04-b01 generated 0 new + 0 unchanged - 4 fixed = 0 total (was 4)
-1 ❌ findbugs 2m 21s hadoop-common-project/hadoop-common generated 9 new + 0 unchanged - 0 fixed = 9 total (was 0)
_ Other Tests _
+1 💚 unit 9m 36s hadoop-common in the patch passed.
+1 💚 unit 7m 2s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 1m 32s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 56s The patch does not generate ASF License warnings.
189m 33s
Reason Tests
FindBugs module:hadoop-common-project/hadoop-common
Inconsistent synchronization of org.apache.hadoop.fs.statistics.IOStatisticsSnapshot.counters; locked 60% of time Unsynchronized access at IOStatisticsSnapshot.java:60% of time Unsynchronized access at IOStatisticsSnapshot.java:[line 179]
Inconsistent synchronization of org.apache.hadoop.fs.statistics.IOStatisticsSnapshot.gauges; locked 60% of time Unsynchronized access at IOStatisticsSnapshot.java:60% of time Unsynchronized access at IOStatisticsSnapshot.java:[line 184]
Inconsistent synchronization of org.apache.hadoop.fs.statistics.IOStatisticsSnapshot.maximums; locked 60% of time Unsynchronized access at IOStatisticsSnapshot.java:60% of time Unsynchronized access at IOStatisticsSnapshot.java:[line 194]
Inconsistent synchronization of org.apache.hadoop.fs.statistics.IOStatisticsSnapshot.meanStatistics; locked 60% of time Unsynchronized access at IOStatisticsSnapshot.java:60% of time Unsynchronized access at IOStatisticsSnapshot.java:[line 199]
Inconsistent synchronization of org.apache.hadoop.fs.statistics.IOStatisticsSnapshot.minimums; locked 60% of time Unsynchronized access at IOStatisticsSnapshot.java:60% of time Unsynchronized access at IOStatisticsSnapshot.java:[line 113]
Inconsistent synchronization of org.apache.hadoop.fs.statistics.MeanStatistic.samples; locked 68% of time Unsynchronized access at MeanStatistic.java:68% of time Unsynchronized access at MeanStatistic.java:[line 122]
Inconsistent synchronization of org.apache.hadoop.fs.statistics.MeanStatistic.sum; locked 69% of time Unsynchronized access at MeanStatistic.java:69% of time Unsynchronized access at MeanStatistic.java:[line 214]
org.apache.hadoop.fs.statistics.MeanStatistic.getSamples() is unsynchronized, org.apache.hadoop.fs.statistics.MeanStatistic.setSamples(long) is synchronized At MeanStatistic.java:synchronized At MeanStatistic.java:[line 122]
org.apache.hadoop.fs.statistics.MeanStatistic.getSum() is unsynchronized, org.apache.hadoop.fs.statistics.MeanStatistic.setSum(long) is synchronized At MeanStatistic.java:synchronized At MeanStatistic.java:[line 114]
Subsystem Report/Notes
Docker ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/artifact/out/Dockerfile
GITHUB PR #2069
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle markdownlint xml
uname Linux 0e5467e1e178 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / e5fe326
Default Java Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
checkstyle https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/artifact/out/diff-checkstyle-root.txt
whitespace https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/artifact/out/whitespace-eol.txt
javadoc https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/artifact/out/diff-javadoc-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.txt
findbugs https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/artifact/out/new-findbugs-hadoop-common-project_hadoop-common.html
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/testReport/
Max. process+thread count 1510 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2069/18/console
versions git=2.17.1 maven=3.6.0 findbugs=4.0.6
Powered by Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

Closing this; best to rebuild as a new PR atop trunk.

For the next PR I plan to split into

  • hadoop-common
  • hadoop-aws

Trickier than I'd like, but it means we can get the common one reviewed and in early.
