
Visualize Dataset statistics in metadata panel #1472

Merged (40 commits) on Aug 14, 2023

Conversation

ravi-kumar-pilla
Contributor

@ravi-kumar-pilla ravi-kumar-pilla commented Aug 2, 2023

Description

Resolves #662

Development notes

(Attachment: Dataset_stats_compressed)

Screenshots:

All stats -

Screenshot 2023-08-02 at 8 18 07 AM

File size only -

Screenshot 2023-08-02 at 8 22 02 AM

File size not available because no file path is configured -

Screenshot 2023-08-02 at 8 23 24 AM

Overflow Dataset stats -

Screenshot 2023-08-02 at 8 25 01 AM

QA notes

Steps to QA:

  1. Check out feature/viz-size-datasets
  2. pip install -e package
  3. cd demo-project
  4. Execute kedro run
  5. A stats.json file will be created in demo-project, along with changes in the data folder from the new kedro run
  6. cd .. (back into the kedro-viz folder)
  7. Start the backend server by executing make run
  8. Start the frontend by executing npm start
  9. Click on dataset nodes (only instances of pd.DataFrame are covered). You should see the file statistics (rows, columns and file size). If the file size is not available, it is shown as N/A

Note:

  1. There needs to be further discussion on where to extract the file size, as it is not available at the hook level (or I do not know how to get the file path at the hook level). For now, I am using fsspec inside the flowchart model to get the file size.
  2. Transcoded data names are stored as ingestion.int_typed_shuttles@pandas2

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

@amandakys

For the overflow case, I thought 29000 rows would be displayed as 29K? Or is that just a forced example to show the overflow behaviour?

Would you be able to provide a screenshot of the skeleton loader in action?

@ravi-kumar-pilla
Contributor Author

For the overflow case, I thought 29000 rows would be displayed as 29K? Or is that just a forced example to show the overflow behaviour?

I wanted to show you how the overflow looks. If you want it on a single row, I will implement short forms for rows and columns (like 29K). During our call we thought this wouldn't show the exact number (unless we add a tooltip or something similar). If the two rows aren't bad, we can keep it this way for now.

Would you be able to provide a screenshot of the skeleton loader in action?

A skeleton loader is not needed for this implementation, as the dataset statistics are pre-calculated. However, we might need it as a general improvement for the API call, which is not related to this specific ticket. I will create a new ticket to track the skeleton loader implementation.

Member

@merelcht merelcht left a comment


I've done a first review and left initial comments. It's a lot of changes, so I'll do a second review asap!

In terms of using hooks to get the file size, you could potentially do something like:

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        datasets = catalog.datasets
        for ds_name, ds in datasets.items():
            try:
                fp = ds._filepath
            except Exception:
                logger.error("Something went wrong trying to get the filepath for %s", ds_name)

This is super rough and I haven't tried it properly, but it's a way to get the filepath for each dataset in the catalog after the catalog has been created.

Resolved review threads (outdated) on:
  • package/kedro_viz/api/rest/router.py
  • package/kedro_viz/data_access/managers.py
  • package/kedro_viz/integrations/kedro/data_loader.py
  • package/kedro_viz/integrations/kedro/hooks.py
  • package/kedro_viz/models/flowchart.py
Member

@tynandebold tynandebold left a comment


I'll approve this PR from the JS side and will let Merel, Nok, and others weigh in on the Python part.

I've left a couple of smaller comments that I hope you can get to before merging.

Well done! Looking forward to this being released.

Resolved review threads (outdated) on:
  • src/components/metadata/styles/metadata.scss
  • src/utils/index.js
  • src/components/metadata/metadata.js
@ravi-kumar-pilla
Contributor Author

Hi @tynandebold , @vladimir-mck
I have modified the frontend code to account for design changes due to overflow. Please review

Thank you

Member

@merelcht merelcht left a comment


Thanks @ravi-kumar-pilla , the python tests & code look good now! 👍 ⭐

@tynandebold
Member

Hi @tynandebold , @vladimir-mck I have modified the frontend code to account for design changes due to overflow. Please review

Thank you

Where can we view that in action? I clicked on several nodes but didn't see anything overflowing.

@ravi-kumar-pilla
Contributor Author

Hi @tynandebold , @vladimir-mck I have modified the frontend code to account for design changes due to overflow. Please review
Thank you

Where can we view that in action? I clicked on several nodes but didn't see anything overflowing.

You can check the datasets int typed companies, Experiment params, Regressor, R2 score, Model Input Table, and Prm Spine Table.

@ravi-kumar-pilla
Contributor Author

ravi-kumar-pilla commented Aug 10, 2023

Hi @stichbury, I need your suggestion on documenting this feature once it is pushed in the new release -

What happens

Users can see statistics (rows, columns, file size) for datasets that are instances of a pandas DataFrame. If a stat is not available, or if the dataset is not a pandas DataFrame, users will see N/A for that stat. This information is displayed under the Dataset statistics row in the metadata panel.

What should users do

Step 1 -
Install the latest kedro-viz package - pip install kedro-viz

Step 2 -
Navigate to the kedro project and execute kedro run

Note: A stats file (stats.json) will be created in the kedro project folder
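For reference, stats.json maps each dataset name to its statistics. A hypothetical file might look like the following (the dataset names echo this PR's demo project, but all values are illustrative):

```json
{
  "companies": {
    "rows": 77096,
    "columns": 5,
    "file_size": 1810602
  },
  "ingestion.int_typed_shuttles@pandas2": {
    "rows": 823,
    "columns": 4,
    "file_size": 549127
  }
}
```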

Step 3 -
Visualize the stats by running Kedro-Viz from the project directory - kedro viz

The command opens a browser tab to serve the visualisation at http://127.0.0.1:4141/

Click on a dataset node and check the metadata panel. Users should see something like the below -

[screenshot: metadata panel showing the Dataset statistics row]

@stichbury is https://github.com/kedro-org/kedro/tree/main/docs/source/visualisation a good place to document this?
@noklam can you please confirm I am not missing any steps? Thank you

Resolved review thread (outdated) on package/kedro_viz/integrations/kedro/hooks.py
    try:
        with open("stats.json", "w", encoding="utf8") as file:
            sorted_stats_data = {
                dataset_name: stats_order(stats)
Contributor


I found the name stats_order slightly confusing. Maybe format_json or prettify_json?
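For context, the helper under discussion essentially fixes the key order before the stats are written out; a minimal sketch (hypothetical implementation, not the PR's code) is:

```python
def stats_order(stats: dict) -> dict:
    """Return stats with keys in a fixed display order: rows, columns,
    file_size. Hypothetical sketch of the helper being discussed,
    not the PR's actual implementation.
    """
    order = ["rows", "columns", "file_size"]
    return {key: stats[key] for key in order if key in stats}
```

Since Python 3.7 dicts preserve insertion order, so rebuilding the dict in the desired order is enough to control how json.dump serialises it.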

Contributor


And can it just be a helper method that stays within the hook class? I don't see why we'd want to put it in utils.py, since it is very specific to formatting the JSON produced by the hook itself.

Contributor


The same applies to the other functions. Can we avoid having utils.py?

Comment on lines 50 to 53
    def get_file_size(file_path: Union[str, None]) -> Union[int, None]:
        """Get the dataset file size using fsspec. If the file_path is a directory,
        get the latest file created (this corresponds to the latest run)

Contributor


Why can't we calculate this from the hook? It should be much simpler to do this if you have the Dataset object. That way you don't need to re-create all the fsspec logic.

Contributor Author


I tried doing that using _filepath in the after_context_created hook. But since AbstractDataset does not have an API supporting that, you mentioned it would be fragile, so I thought it would be better to move it here.

@ravi-kumar-pilla
Contributor Author

@noklam I have modified the hooks to include file_size, though I am calculating the file_size only if the dataset instance is a pd.DataFrame. Earlier, I calculated it for every data node and transcoded data node that contains a file_path. Please have a look and let me know if any changes are needed. Thank you!

Contributor

@noklam noklam left a comment


Approved with non-blocking comments

@ravi-kumar-pilla ravi-kumar-pilla merged commit 3c50980 into main Aug 14, 2023
13 of 17 checks passed
@ravi-kumar-pilla ravi-kumar-pilla deleted the feature/viz-size-datasets branch August 14, 2023 17:06
@tynandebold tynandebold mentioned this pull request Aug 17, 2023
Comment on lines +96 to +105
if isinstance(data, pd.DataFrame):
    self._stats[stats_dataset_name]["rows"] = int(data.shape[0])
    self._stats[stats_dataset_name]["columns"] = int(data.shape[1])

    current_dataset = self.datasets.get(dataset_name, None)

    if current_dataset:
        self._stats[stats_dataset_name]["file_size"] = self.get_file_size(
            current_dataset
        )


Good start! However, it only works with DataFrames.

There are 2 additional cases that could easily be implemented:

  • a PartitionedDataSet that loads pd.DataFrames
  • loading multiple sheets of an Excel file (load_args: sheet_name: None)

I think it's not too difficult to account for those cases by adding additional logic on data, and it would avoid just showing N/A in the UI.

Do you see how to implement that?
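Both cases above surface at load time as a dict of DataFrames (a PartitionedDataSet once its partitions are materialised, or pd.read_excel(..., sheet_name=None)), so the extra logic could be sketched like this (illustrative only; the function name and aggregation choices are mine, not part of this PR):

```python
import pandas as pd

def compute_row_column_stats(data):
    """Return (rows, columns) for a DataFrame or a dict of DataFrames;
    (None, None) for anything else. Illustrative sketch only."""
    if isinstance(data, pd.DataFrame):
        return int(data.shape[0]), int(data.shape[1])
    if isinstance(data, dict) and data and all(
        isinstance(df, pd.DataFrame) for df in data.values()
    ):
        # Sum rows across partitions/sheets; report the widest one as
        # the column count, since schemas may differ between partitions
        rows = sum(int(df.shape[0]) for df in data.values())
        columns = max(int(df.shape[1]) for df in data.values())
        return rows, columns
    return None, None
```

Note that a PartitionedDataSet's load() actually returns a dict of load callables rather than DataFrames, so a real implementation would need to decide whether materialising every partition just for stats is acceptable.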

Contributor Author


Thank you @oulianov,

We will try to include more use cases for visualizing datasets in future releases. Created #1511 for tracking. Please feel free to comment any additional cases there.


Nice! Thank you.

@lukaszdz

lukaszdz commented Aug 30, 2023

I'm not sure this fully resolves the request (#662). Ideally, we would want to see dataset sizes directly in the graph view, so we can spot issues with the pipeline without having to click through each node in the graph. Even better would be some way to set up rules to colour the nodes (e.g. if N=0, colour the node red).

Please correct me in case I didn't fully understand the update. Added more details: #662 (comment)

@ravi-kumar-pilla
Contributor Author

I'm not sure this fully resolves the request (#662). Ideally, we would want to see dataset sizes directly in the graph view, so we can spot issues with the pipeline without having to click through each node in the graph. Even better would be some way to set up rules to colour the nodes (e.g. if N=0, colour the node red).

Please correct me in case I didn't fully understand the update. Added more details: #662 (comment)

Hi @lukaszdz, we are planning to integrate the statistics into the graph, which would resolve this issue completely. We are exploring how to approach this without cluttering the flowchart view. This is the first step towards visualizing statistics in Kedro-Viz. Thank you!

Development

Successfully merging this pull request may close these issues.

Visualize size of processed datasets
10 participants