HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. #6318

ArafatKhan2198 · 2024-03-02T20:08:47Z

What changes were proposed in this pull request?

This pull request enhances the Recon disk usage endpoint to improve usability and performance when dealing with large datasets:
- Top Entities Focus: The endpoint now sorts sub-paths by size and returns only the top entities. This helps users identify the largest space consumers, since visualizing thousands of records in a single view is impractical.
- Efficient Sorting with Parallel Streams: To sort large numbers of records, the endpoint uses parallel stream processing.
Key advantages of using parallel streams include:
1. Better Utilization of Multi-core Processors: Sorting runs concurrently across multiple cores, which can cut processing time for large datasets.
2. Optimized for Large Datasets: The overhead of parallelism is amortized over a large number of elements, which suits our use case.
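
As a rough illustration of the approach described above, the following minimal sketch sorts entries by size in descending order with a parallel stream and keeps only the first N. The `SubPath` class and the `topN` method are hypothetical names used for illustration only; they are not taken from the Recon code in this patch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TopNBySize {

  /** Hypothetical stand-in for one sub-path entry in the response. */
  static final class SubPath {
    final String path;
    final long size;

    SubPath(String path, long size) {
      this.path = path;
      this.size = size;
    }
  }

  /** Returns at most n entries, largest size first, sorted via a parallel stream. */
  static List<SubPath> topN(List<SubPath> subPaths, int n) {
    return subPaths.parallelStream()
        .sorted(Comparator.comparingLong((SubPath s) -> s.size).reversed())
        .limit(n)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Paths and sizes mirror the manual test data in this description.
    List<SubPath> subPaths = Arrays.asList(
        new SubPath("/volumetest/buckettest/dir1/key10kb", 10_000L),
        new SubPath("/volumetest/buckettest/dir1/key100MB", 100_000_000L),
        new SubPath("/volumetest/buckettest/dir1/key1MB", 1_000_000L),
        new SubPath("/volumetest/buckettest/dir1/key10mb", 10_000_000L));
    for (SubPath s : topN(subPaths, 2)) {
      System.out.println(s.path + " " + s.size);
    }
  }
}
```

Because the stream is ordered, `sorted` followed by `limit` keeps the N largest entries regardless of how the work is split across threads.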

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10452

How was this patch tested?

Tested the API manually and with integration tests.

Results from manual testing:

Created 4 files of 100MB, 10MB, 1MB and 10KB under dir1:

{
  "status": "OK",
  "path": "/volumetest/buckettest/dir1",
  "size": 111010000,
  "sizeWithReplica": -1,
  "subPathCount": 4,
  "subPaths": [
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key100MB",
      "size": 100000000,
      "sizeWithReplica": -1,
      "isKey": true
    },
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key10mb",
      "size": 10000000,
      "sizeWithReplica": -1,
      "isKey": true
    },
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key1MB",
      "size": 1000000,
      "sizeWithReplica": -1,
      "isKey": true
    },
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key10kb",
      "size": 10000,
      "sizeWithReplica": -1,
      "isKey": true
    }
  ],
  "sizeDirectKey": 111010000
}


SaketaChalamchala · 2024-03-04T17:11:59Z

@devmadhuu and @dombizita could you please take a look?

smitajoshi12 · 2024-03-05T06:13:29Z

@ArafatKhan2198
Arafat, can you set a limit on the API?

devmadhuu

Thanks @ArafatKhan2198 for working on this patch. A few comments.

...ne/recon/src/test/java/org/apache/hadoop/ozone/recon/api/TestNSSummaryDiskUsageOrdering.java

ArafatKhan2198 · 2024-03-08T19:30:56Z

@devmadhuu @adoroszlai @smitajoshi12

Could you please review the latest changes? Here's a quick summary:

* Switched to Parallel Sorting: To improve performance, we're now using parallel sorting. More details are in the description.
* Added a Toggle for Sorting: There's a new boolean flag to turn sorting on or off.
* Set a Limit of 30 Records: We've added a constant to limit the response to the top 30 records in Disk Usage.
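
A minimal sketch of how a sorting toggle and a fixed record limit could fit together is shown below. The constant name `TOP_RECORDS_LIMIT`, the `sortSubPaths` parameter name, and the choice to return the list untouched when sorting is off are all assumptions made for illustration; the thread only states that a boolean flag and a limit of 30 exist.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortToggleSketch {

  // Hypothetical constant name; the thread only states the value 30.
  static final int TOP_RECORDS_LIMIT = 30;

  /**
   * Returns the sizes unchanged when sorting is disabled (an assumption),
   * otherwise the largest TOP_RECORDS_LIMIT sizes in descending order.
   */
  static List<Long> selectSizes(List<Long> sizes, boolean sortSubPaths) {
    if (!sortSubPaths) {
      return sizes;
    }
    return sizes.parallelStream()
        .sorted(Comparator.reverseOrder())
        .limit(TOP_RECORDS_LIMIT)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Long> sizes = new ArrayList<>();
    for (long i = 0; i < 100; i++) {
      sizes.add(i);
    }
    System.out.println(selectSizes(sizes, true).size());
    System.out.println(selectSizes(sizes, false).size());
  }
}
```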

dombizita

Thanks for working on this @ArafatKhan2198, overall it looks good to me, I'd like to make the javadoc and comments more accurate, please see my comments inline.

hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/NSSummaryEndpoint.java

...zone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/handlers/BucketEntityHandler.java

...e/recon/src/main/java/org/apache/hadoop/ozone/recon/api/handlers/DirectoryEntityHandler.java

...-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/handlers/RootEntityHandler.java

...zone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/handlers/VolumeEntityHandler.java

devmadhuu · 2024-03-19T04:24:47Z

Thanks @ArafatKhan2198 for handling some points. However, I am not sure that parallel streams always improve performance. Sometimes they add overhead and do more harm than good. I would like you to have a look here.

devmadhuu

Some comments are still open. Pls handle them.

...ne/recon/src/test/java/org/apache/hadoop/ozone/recon/api/TestNSSummaryDiskUsageOrdering.java

ArafatKhan2198 · 2024-03-26T08:03:49Z

Thanks a lot, @devmadhuu, for the comment and the article! I've read through it carefully and here's my analysis:

Parallel streaming concern:

Parallel streams introduce overhead for managing multiple threads.
This overhead can outweigh the benefits of parallel processing for small datasets or simple operations.

After going through the article, I can summarise the following:

Factors affecting performance:
- Data size: Parallel streams benefit from large datasets where the overhead is justified.
  - This sort is applied to the response objects at a single level of the file system hierarchy, which in the worst case could contain millions of items.
- Computation intensity: Operations involving complex calculations benefit more from parallelization.
  - Sorting is a moderately complex operation in the context of parallelization.
- Stream source: Easily splittable sources like arrays perform better in parallel streams.
  - We are using Lists as our source.

adoroszlai · 2024-03-26T08:08:49Z

@ArafatKhan2198 @devmadhuu Please omit @mention when quoting the message that asked for review. Including it re-subscribes folks mentioned who may have already unsubscribed from the discussion (sorry, I don't have time to review this).

devmadhuu · 2024-03-26T08:24:46Z

Do we have any performance data for at least 1 million records, with and without parallel streaming? I am emphasizing this because, in my experience, even with a few tens of thousands of records, parallel streaming can do more harm than good. So I would suggest publishing performance figures with and without parallel streaming for at least 1 million records.

devmadhuu · 2024-03-26T08:29:42Z

Please check in the UI what maximum limit the dropdown sets and uses. I think it has changed to 10k+. Please check and confirm.

ArafatKhan2198 · 2024-04-02T10:31:57Z

Thanks for the comments, @devmadhuu. I tested this out on a cluster with 10 million keys.
These were the results:

Sequential sort time: 7657 ms
Parallel sort time: 1279 ms

I believe we could go with parallel sort.
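
For readers who want to try a comparison of this kind locally, here is a rough timing sketch. It is not the harness used for the figures above, which were measured on a cluster. The class and method names are hypothetical, the input is 1 million random values (the minimum the reviewer asked for), and a single wall-clock measurement like this is only indicative; a benchmark framework such as JMH would give more reliable numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SortTimingSketch {

  /** Sorts a copy of the sizes in descending order, sequentially or in parallel. */
  static List<Long> sortDescending(List<Long> sizes, boolean parallel) {
    Stream<Long> stream = parallel ? sizes.parallelStream() : sizes.stream();
    return stream.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
  }

  /** Measures one sort run in milliseconds of wall-clock time. */
  static long timeMillis(List<Long> sizes, boolean parallel) {
    long start = System.nanoTime();
    sortDescending(sizes, parallel);
    return (System.nanoTime() - start) / 1_000_000L;
  }

  public static void main(String[] args) {
    Random random = new Random(42L);
    List<Long> sizes = new ArrayList<>();
    for (int i = 0; i < 1_000_000; i++) {
      sizes.add(random.nextLong());
    }
    // One warm-up run per mode so JIT compilation skews the timings less.
    timeMillis(sizes, false);
    timeMillis(sizes, true);
    System.out.println("Sequential sort time: " + timeMillis(sizes, false) + " ms");
    System.out.println("Parallel sort time: " + timeMillis(sizes, true) + " ms");
  }
}
```

The results depend heavily on core count, heap size, and list size, so it is worth running this with sizes close to what the endpoint is expected to handle.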

devmadhuu · 2024-04-02T15:32:02Z

Thanks @ArafatKhan2198 for testing this out and publishing the figures. This looks promising.

devmadhuu

Changes LGTM +1. Pls resolve conflicts.

devmadhuu

A minor comment.

...zone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/handlers/BucketEntityHandler.java

devmadhuu

Thanks @ArafatKhan2198 for working on this patch. Changes LGTM +1

ArafatKhan2198 · 2024-04-13T07:18:20Z

@dombizita Could you please take a final look at it?
I believe we are done and can merge it.

...ne/recon/src/test/java/org/apache/hadoop/ozone/recon/api/TestNSSummaryDiskUsageOrdering.java

devmadhuu · 2024-04-16T05:28:25Z

Thanks @ArafatKhan2198 for working on this patch.


ArafatKhan2198 added 3 commits March 3, 2024 01:21

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size.

7e8bb75

Fixed checkstyle and bugs

5a03fba

Added licence

a11e44f

ArafatKhan2198 marked this pull request as ready for review March 4, 2024 08:29

ArafatKhan2198 marked this pull request as draft March 4, 2024 08:29

ArafatKhan2198 marked this pull request as ready for review March 4, 2024 08:33

adoroszlai added the recon label Mar 4, 2024

kerneltime requested review from devmadhuu and dombizita March 4, 2024 17:57

ArafatKhan2198 added 3 commits March 5, 2024 03:22

Fixed errors and bugs

65b3053

Added licence

32625da

Fixed checkstyle isssues

1c292f1

devmadhuu reviewed Mar 5, 2024


ArafatKhan2198 added 3 commits March 8, 2024 21:55

Added a flag to enable/disable sorting

994a9e5

Changed the sorting algorithm to parallel sorting

0876f35

Fixed checkstyle issues

aeeabce

ArafatKhan2198 requested a review from devmadhuu March 8, 2024 19:18

dombizita reviewed Mar 11, 2024


Made review comments

07de9c2

ArafatKhan2198 requested a review from dombizita March 14, 2024 05:40

devmadhuu reviewed Mar 19, 2024

...ne/recon/src/test/java/org/apache/hadoop/ozone/recon/api/TestNSSummaryDiskUsageOrdering.java

devmadhuu reviewed Apr 2, 2024


devmadhuu reviewed Apr 3, 2024

...zone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/handlers/BucketEntityHandler.java

Merge branch 'master' into HDDS-10452

1d5a90f

ArafatKhan2198 force-pushed the HDDS-10452 branch from def81a0 to 1d5a90f Compare April 12, 2024 06:16

devmadhuu approved these changes Apr 12, 2024


dombizita reviewed Apr 15, 2024

...ne/recon/src/test/java/org/apache/hadoop/ozone/recon/api/TestNSSummaryDiskUsageOrdering.java

ArafatKhan2198 requested a review from dombizita April 15, 2024 15:04

dombizita approved these changes Apr 15, 2024


devmadhuu merged commit 93a2489 into apache:master Apr 16, 2024
40 of 51 checks passed

smitajoshi12 mentioned this pull request Apr 16, 2024

HDDS-9626. [Recon] Disk Usage page with high number of key/bucket/volume #6535

Merged

Tejaskriya pushed a commit to Tejaskriya/ozone that referenced this pull request Apr 17, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318)

2419158

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318) (cherry picked from commit 93a2489)

691247a

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318) (cherry picked from commit 93a2489)

2f9ff5b

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318) (cherry picked from commit 93a2489)

018a784

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318) (cherry picked from commit 93a2489)

9016c1b

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318) (cherry picked from commit 93a2489)

a71b56d

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024

HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. (apache#6318) (cherry picked from commit 93a2489)

bb4f066

xichen01 mentioned this pull request Jul 18, 2024

[DO NOT MERGE] Backport some fixes, performance optimizations from master to ozone-1.4 #6929 #6964

Merged
