
Visualize Dataset statistics in metadata panel #1472

Merged (40 commits) on Aug 14, 2023

Conversation

ravi-kumar-pilla
Contributor

@ravi-kumar-pilla ravi-kumar-pilla commented Aug 2, 2023

Description

Resolves #662

Development notes

(Attachment: Dataset_stats_compressed)

Screenshots:

All stats -

Screenshot 2023-08-02 at 8 18 07 AM

File size only -

Screenshot 2023-08-02 at 8 22 02 AM

File size not available because no file path is configured -

Screenshot 2023-08-02 at 8 23 24 AM

Overflow Dataset stats -

Screenshot 2023-08-02 at 8 25 01 AM

QA notes

Steps to QA:

  1. Check out feature/viz-size-datasets
  2. pip install -e package
  3. cd demo-project
  4. Execute kedro run
  5. A stats.json file will be created in demo-project, along with changes in the data folder from the new kedro run
  6. cd .. (back into the kedro-viz folder)
  7. Start the backend server by executing make run
  8. Start the frontend by executing npm start
  9. Click on dataset nodes (only instances of pd.DataFrame are covered). You should see the file statistics (rows, columns and file size). If the file size is not available, it is shown as N/A

Note:

  1. There needs to be further discussion on where to extract the file size, as it is not available at the hook level (or I do not know how to get the file path at the hook level). For now, I am using fsspec inside the flowchart model to get the file size.
  2. Transcoded data names are stored as ingestion.int_typed_shuttles@pandas2

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

@amandakys

For the overflow case, I thought 29000 rows would be displayed as 29K? Or is that just a forced example to show the overflow behaviour?

Would you be able to provide a screenshot of the skeleton loader in action?

@ravi-kumar-pilla
Contributor Author

For the overflow case, I thought 29000 rows would be displayed as 29K? Or is that just a forced example to show the overflow behaviour?

I wanted to show you how the overflow looks. If you want it on a single row, I will implement short forms for rows and columns (like 29K). During our call we thought this wouldn't show the exact number (unless we add a tooltip or something similar). If the two rows aren't bad, we can keep it this way for now.

Would you be able to provide a screenshot of the skeleton loader in action?

A skeleton loader is not needed for this implementation, as the dataset statistics are pre-calculated. However, we might need it as a general improvement for the API call, which is not related to this specific ticket. I will create a new ticket to track the skeleton loader implementation.

Member

@merelcht merelcht left a comment


I've done a first review and left initial comments. It's a lot of changes, so I'll do a second review asap!

In terms of using hooks to get the file size, you could potentially do something like:

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        datasets = catalog.datasets
        for ds_name, ds in datasets.items():
            try:
                fp = ds._filepath
            except Exception:
                logger.error("Something went wrong trying to get the filepath for %s", ds_name)

This is super rough and I haven't tried it properly, but it's a way to get the filepath for each dataset in the catalog after the catalog has been created.

Resolved review threads (outdated) on:
  • package/kedro_viz/api/rest/router.py
  • package/kedro_viz/data_access/managers.py
  • package/kedro_viz/integrations/kedro/data_loader.py
  • package/kedro_viz/integrations/kedro/hooks.py
  • package/kedro_viz/models/flowchart.py
Member

@tynandebold tynandebold left a comment


I'll approve this PR from the JS side and will let Merel, Nok, and others weigh in on the Python part.

I've left a couple of smaller comments that I hope you can get to before merging.

Well done! Looking forward to this being released.

Resolved review threads (outdated) on:
  • src/components/metadata/styles/metadata.scss
  • src/utils/index.js
  • src/components/metadata/metadata.js
@ravi-kumar-pilla
Contributor Author

Hi @tynandebold , @vladimir-mck
I have modified the frontend code to account for design changes due to overflow. Please review

Thank you

Member

@merelcht merelcht left a comment


Thanks @ravi-kumar-pilla , the python tests & code look good now! 👍 ⭐

@tynandebold
Member

Hi @tynandebold , @vladimir-mck I have modified the frontend code to account for design changes due to overflow. Please review

Thank you

Where can we view that in action? I clicked on several nodes but didn't see anything overflowing.

@ravi-kumar-pilla
Contributor Author

Hi @tynandebold , @vladimir-mck I have modified the frontend code to account for design changes due to overflow. Please review
Thank you

Where can we view that in action? I clicked on several nodes but didn't see anything overflowing.

You can check the datasets int typed companies, Experiment params, Regressor, R2 score, Model Input Table, and Prm Spine Table.

@ravi-kumar-pilla
Contributor Author

ravi-kumar-pilla commented Aug 10, 2023

Hi @stichbury, I need your suggestion on documenting this feature once it is pushed in the new release -

What happens

Users can see statistics (rows, columns, file size) for datasets that are instances of a pandas DataFrame. If a stat is not available, or if the dataset is not a pandas DataFrame, users will see N/A for that stat. This information is displayed under the Dataset statistics row in the metadata panel.

What should users do

Step 1 -
Install the latest kedro-viz package - pip install kedro-viz

Step 2 -
Navigate to the kedro project and execute kedro run

Note: A stats file (stats.json) will be created in the kedro project folder
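For reference, stats.json maps each dataset name to its statistics. A hypothetical file might look like the following (the dataset names echo this PR's demo project, but all values are illustrative):

```json
{
  "companies": {
    "rows": 77096,
    "columns": 5,
    "file_size": 1810602
  },
  "ingestion.int_typed_shuttles@pandas2": {
    "rows": 823,
    "columns": 4,
    "file_size": 549127
  }
}
```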

Step 3 -
Visualize the stats by running Kedro-Viz from the project directory - kedro viz

The command opens a browser tab to serve the visualisation at http://127.0.0.1:4141/

Click on a dataset node and check the metadata panel. Users should see something like the below -

[screenshot: metadata panel showing the Dataset statistics row]

@stichbury is https://github.com/kedro-org/kedro/tree/main/docs/source/visualisation a good place to document this?
@noklam can you please confirm I am not missing any steps? Thank you

Resolved review thread (outdated) on package/kedro_viz/integrations/kedro/hooks.py
    try:
        with open("stats.json", "w", encoding="utf8") as file:
            sorted_stats_data = {
                dataset_name: stats_order(stats)
Contributor


I found the name stats_order slightly confusing. Maybe format_json or prettify_json?
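For context, the helper under discussion essentially fixes the key order before the stats are written out; a minimal sketch (hypothetical implementation, not the PR's code) is:

```python
def stats_order(stats: dict) -> dict:
    """Return stats with keys in a fixed display order: rows, columns,
    file_size. Hypothetical sketch of the helper being discussed,
    not the PR's actual implementation.
    """
    order = ["rows", "columns", "file_size"]
    return {key: stats[key] for key in order if key in stats}
```

Since Python 3.7 dicts preserve insertion order, so rebuilding the dict in the desired order is enough to control how json.dump serialises it.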

Contributor


And can it just be a helper method that stays within the hook class? I don't see why we'd want to put it in utils.py, since it is very specific to formatting the JSON produced by the hook itself.

Contributor


The same applies to the other functions. Can we avoid having utils.py?

Comment on lines 50 to 53
    def get_file_size(file_path: Union[str, None]) -> Union[int, None]:
        """Get the dataset file size using fsspec. If the file_path is a directory,
        get the latest file created (this corresponds to the latest run)

Contributor


Why can't we calculate this from the hook? It should be much simpler to do this if you have the Dataset object. That way you don't need to re-create all the fsspec logic.

Contributor Author


I tried doing that using _filepath in the after_context_created hook. But since AbstractDataset does not have an API supporting that, you mentioned it would be fragile, so I thought it would be better to move it here.

@ravi-kumar-pilla
Contributor Author

@noklam I have modified the hooks to include file_size, though I am calculating the file_size only if the dataset instance is a pd.DataFrame. Earlier, I calculated it for every data node and transcoded data node that contains a file_path. Please have a look and let me know if any changes are needed. Thank you!

Contributor

@noklam noklam left a comment


Approved with non-blocking comments

@ravi-kumar-pilla ravi-kumar-pilla merged commit 3c50980 into main Aug 14, 2023
13 of 17 checks passed
@ravi-kumar-pilla ravi-kumar-pilla deleted the feature/viz-size-datasets branch August 14, 2023 17:06
@tynandebold tynandebold mentioned this pull request Aug 17, 2023
Comment on lines +96 to +105
if isinstance(data, pd.DataFrame):
    self._stats[stats_dataset_name]["rows"] = int(data.shape[0])
    self._stats[stats_dataset_name]["columns"] = int(data.shape[1])

    current_dataset = self.datasets.get(dataset_name, None)

    if current_dataset:
        self._stats[stats_dataset_name]["file_size"] = self.get_file_size(
            current_dataset
        )


Good start! However, it only works with DataFrames.

There are 2 additional cases that could easily be implemented:

  • a PartitionedDataSet that loads pd.DataFrames
  • loading multiple sheets of an Excel file (load_args: sheet_name: None)

I think it's not too difficult to account for those cases by adding additional logic on data, and it would avoid just showing N/A in the UI.

Do you see how to implement that?
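Both cases above surface at load time as a dict of DataFrames (a PartitionedDataSet once its partitions are materialised, or pd.read_excel(..., sheet_name=None)), so the extra logic could be sketched like this (illustrative only; the function name and aggregation choices are mine, not part of this PR):

```python
import pandas as pd

def compute_row_column_stats(data):
    """Return (rows, columns) for a DataFrame or a dict of DataFrames;
    (None, None) for anything else. Illustrative sketch only."""
    if isinstance(data, pd.DataFrame):
        return int(data.shape[0]), int(data.shape[1])
    if isinstance(data, dict) and data and all(
        isinstance(df, pd.DataFrame) for df in data.values()
    ):
        # Sum rows across partitions/sheets; report the widest one as
        # the column count, since schemas may differ between partitions
        rows = sum(int(df.shape[0]) for df in data.values())
        columns = max(int(df.shape[1]) for df in data.values())
        return rows, columns
    return None, None
```

Note that a PartitionedDataSet's load() actually returns a dict of load callables rather than DataFrames, so a real implementation would need to decide whether materialising every partition just for stats is acceptable.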

Contributor Author


Thank you @oulianov,

We will try to include more use cases for visualizing datasets in future releases. Created #1511 for tracking. Please feel free to comment any additional cases there.


Nice! Thank you.

@lukaszdz

lukaszdz commented Aug 30, 2023

I'm not sure this fully resolves the request (#662). Ideally, we would want to see dataset sizes directly in the graph view, so we can spot issues with the pipeline without having to click through each node in the graph. Even better would be some way to set up rules to colour the nodes (e.g. if N=0, colour the node red).

Please correct me in case I didn't fully understand the update. Added more details: #662 (comment)

@ravi-kumar-pilla
Contributor Author

I'm not sure this fully resolves the request (#662). Ideally, we would want to see dataset sizes directly in the graph view, so we can spot issues with the pipeline without having to click through each node in the graph. Even better would be some way to set up rules to colour the nodes (e.g. if N=0, colour the node red).

Please correct me in case I didn't fully understand the update. Added more details: #662 (comment)

Hi @lukaszdz, we are planning to integrate the statistics into the graph, which would resolve this issue completely. We are exploring how to approach this without cluttering the flowchart view. This is the first step towards visualizing statistics in Kedro-Viz. Thank you!

Development

Successfully merging this pull request may close these issues.

Visualize size of processed datasets
10 participants