#703 Inspect API - Listing columns with their types. #761

Merged: 26 commits, Mar 20, 2023
45 changes: 21 additions & 24 deletions external-docs/docs/quick-start/inspect.md
@@ -1,6 +1,6 @@
# Inspecting a CSV-W

This page is designed to help you inspect an existing CSV-W.
This page is designed to help you inspect an existing CSV-W.

## A transcribed video walkthrough

@@ -38,22 +38,19 @@ All being well we get the below output. A detailed explanation of this output is
- Identifier: Sweden At Eurovision No Missing
- Comment: None
- Description: None

- The data cube has the following data structure definition:
- Dataset Label: Sweden At Eurovision No Missing
- Number of Components: 9
- Components:
Property                                                     Property Label   Property Type  Column Title  Observation Value Column Titles  Required
sweden-at-eurovision-no-missing.csv#dimension/year           Year             Dimension      Year                                           True
sweden-at-eurovision-no-missing.csv#dimension/entrant        Entrant          Dimension      Entrant                                        True
sweden-at-eurovision-no-missing.csv#dimension/song           Song             Dimension      Song                                           True
sweden-at-eurovision-no-missing.csv#dimension/language       Language         Dimension      Language                                       True
http://purl.org/linked-data/cube#measureType                                  Dimension      Measure                                        True
sweden-at-eurovision-no-missing.csv#measure/final-points     Final Points     Measure                                                       True
sweden-at-eurovision-no-missing.csv#measure/final-rank       Final Rank       Measure                                                       True
sweden-at-eurovision-no-missing.csv#measure/people-on-stage  People on Stage  Measure                                                       True
http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure                   Attribute      Unit                                           True
- Columns where suppress output is true: None

- The data cube has the following column component information:
- Dataset Label: Sweden at Eurovision
- Columns:
Title     Type          Required  Property URL                                                 Observations Column Titles
Year      Dimension     True      sweden-at-eurovision.csv#dimension/year
Entrant   Dimension     True      sweden-at-eurovision.csv#dimension/entrant
Song      Dimension     True      sweden-at-eurovision.csv#dimension/song
Language  Dimension     True      sweden-at-eurovision.csv#dimension/language
Value     Observations  True      sweden-at-eurovision.csv#measure/{+measure}
Measure   Measures      True      http://purl.org/linked-data/cube#measureType
Unit      Units         True      http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure
- Columns where suppress output is true: None

- The data cube has the following code list information:
- Number of Code Lists: 4
@@ -67,7 +64,7 @@ language.csv#code-list Language
- The data cube has the following dataset information:
- Number of Observations: 178
- Number of Duplicates: 0
- First 10 Observations:
- First 10 Observations:
Year Entrant Song Language Value Measure Unit
1958 alice-babs lilla-stjarna swedish 4 final-rank ordinal
1958 alice-babs lilla-stjarna swedish 10 final-points points
@@ -79,7 +76,7 @@ language.csv#code-list Language
1960 siw-malmkvist alla-andra-far-varann swedish 4 final-points points
1960 siw-malmkvist alla-andra-far-varann swedish 1 people-on-stage people
1961 lill-babs april-april swedish 14 final-rank ordinal
- Last 10 Observations:
- Last 10 Observations:
Year Entrant Song Language Value Measure Unit
2017 robin-bengtsson i-can-t-go-on english 6 people-on-stage people
2018 benjamin-ingrosso dance-you-off english 7 final-rank ordinal
@@ -91,7 +88,7 @@ language.csv#code-list Language
2021 tusse voices english 14 final-rank ordinal
2021 tusse voices english 109 final-points points
2021 tusse voices english 6 people-on-stage people


- The data cube has the following value counts:
- Value counts broken-down by measure and unit (of measure):
@@ -129,22 +126,22 @@ All being well we get the below output. A detailed explanation of this output is
- Identifier: Language
- Comment: None
- Description: None


- The code list has the following dataset information:
- Number of Concepts: 3
- Number of Duplicates: 0
- First 10 Concepts:
- First 10 Concepts:
Label Notation Parent Notation Sort Priority Description
English english NaN 0 NaN
Multiple multiple NaN 1 NaN
Swedish swedish NaN 2 NaN
- Last 10 Concepts:
- Last 10 Concepts:
Label Notation Parent Notation Sort Priority Description
English english NaN 0 NaN
Multiple multiple NaN 1 NaN
Swedish swedish NaN 2 NaN


- The code list has the following concepts information:
- Concepts hierarchy depth: 1
12 changes: 5 additions & 7 deletions src/csvcubed/cli/inspect/inspect.py
@@ -10,8 +10,6 @@
from pathlib import Path
from typing import Tuple

import rdflib

from csvcubed.cli.inspect.metadatainputvalidator import MetadataValidator
from csvcubed.cli.inspect.metadataprinter import MetadataPrinter
from csvcubed.models.csvcubedexception import FailedToLoadRDFGraphException
@@ -48,19 +46,19 @@ def inspect(csvw_metadata_json_path: Path) -> None:
(
type_printable,
catalog_metadata_printable,
dsd_info_printable,
codelist_info_printable,
dataset_observations_printable,
val_counts_by_measure_unit_printable,
codelist_hierarchy_info_printable,
column_component_info_printable,
) = _generate_printables(
csvw_rdf_manager.csvw_inspector,
)

print(f"{linesep}{type_printable}")
print(f"{linesep}{catalog_metadata_printable}")
if csvw_type == CSVWType.QbDataSet:
print(f"{linesep}{dsd_info_printable}")
print(f"{linesep}{column_component_info_printable}")
print(f"{linesep}{codelist_info_printable}")
print(f"{linesep}{dataset_observations_printable}")
if csvw_type == CSVWType.QbDataSet:
@@ -92,8 +90,8 @@ def _generate_printables(

type_info_printable: str = metadata_printer.type_info_printable
catalog_metadata_printable: str = metadata_printer.catalog_metadata_printable
dsd_info_printable: str = (
metadata_printer.dsd_info_printable if csvw_type == CSVWType.QbDataSet else ""
column_component_info_printable: str = (
metadata_printer.column_component_info_printable
)
codelist_info_printable: str = (
metadata_printer.codelist_info_printable
@@ -117,9 +115,9 @@ def _generate_printables(
return (
type_info_printable,
catalog_metadata_printable,
dsd_info_printable,
codelist_info_printable,
dataset_observations_info_printable,
dataset_val_counts_by_measure_unit,
codelist_hierarchy_info_printable,
column_component_info_printable,
)
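
The hunks above move `column_component_info_printable` to the end of the returned tuple, so the unpacking in `inspect` and the `return` in `_generate_printables` have to stay in the same order. As a sketch only, the following shows one way to make that pairing independent of position by reading fields by name. `CsvwKind`, `Printables`, and `sections_to_print` are hypothetical stand-ins and not csvcubed names, the example strings are illustrative, and only three of the printables are modelled.

```python
from enum import Enum, auto
from typing import List, NamedTuple


class CsvwKind(Enum):
    """Hypothetical stand-in for csvcubed's CSVWType."""

    QB_DATA_SET = auto()
    CODE_LIST = auto()


class Printables(NamedTuple):
    """Hypothetical container; fields are read by name, so their order does not matter."""

    type_info: str
    catalog_metadata: str
    column_component_info: str


def sections_to_print(kind: CsvwKind, printables: Printables) -> List[str]:
    """Return the sections in print order, adding column information only for data cubes."""
    sections = [printables.type_info, printables.catalog_metadata]
    if kind == CsvwKind.QB_DATA_SET:
        sections.append(printables.column_component_info)
    return sections


example = Printables(
    type_info="- This file is a data cube.",
    catalog_metadata="- The data cube has the following catalog metadata:",
    column_component_info="- The data cube has the following column component information:",
)

for section in sections_to_print(CsvwKind.QB_DATA_SET, example):
    print(section)
```

With named fields, adding or reordering a printable only touches the container definition and the call sites that actually use the new field.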
93 changes: 67 additions & 26 deletions src/csvcubed/cli/inspect/metadataprinter.py
@@ -8,6 +8,7 @@
from dataclasses import dataclass, field
from os import linesep
from pathlib import Path
from textwrap import indent
from typing import Dict, List, Optional, Tuple, Union

from pandas import DataFrame
@@ -47,6 +48,7 @@
get_codelist_col_title_from_col_name,
)
from csvcubed.utils.sparql_handler.code_list_inspector import CodeListInspector
from csvcubed.utils.sparql_handler.column_component_info import ColumnComponentInfo
from csvcubed.utils.sparql_handler.csvw_inspector import CsvWInspector
from csvcubed.utils.sparql_handler.data_cube_inspector import DataCubeInspector

@@ -64,8 +66,8 @@ class MetadataPrinter:
dataset: DataFrame = field(init=False)

result_catalog_metadata: CatalogMetadataResult = field(init=False)
result_column_component_infos: List[ColumnComponentInfo] = field(init=False)
primary_cube_table_identifiers: CubeTableIdentifiers = field(init=False)
result_qube_components: QubeComponentsResult = field(init=False)
primary_csv_column_definitions: List[ColumnDefinition] = field(init=False)
result_primary_csv_code_lists: CodelistsResult = field(init=False)
result_dataset_observations_info: DatasetObservationsInfoResult = field(init=False)
@@ -75,6 +77,13 @@ class MetadataPrinter:
result_code_list_cols: List[ColumnDefinition] = field(init=False)
result_concepts_hierachy_info: CodelistHierarchyInfoResult = field(init=False)

def __post_init__(self):
self.generate_general_results()
if self.state.csvw_inspector.csvw_type == CSVWType.QbDataSet:
self.get_datacube_results()
elif self.state.csvw_inspector.csvw_type == CSVWType.CodeList:
self.generate_codelist_results()

@staticmethod
def get_csvw_type_str(csvw_type: CSVWType) -> str:
if csvw_type == CSVWType.QbDataSet:
@@ -145,19 +154,18 @@ def get_datacube_results(self):
"""
assert isinstance(self.state, DataCubeInspector) # Make pyright happier

self.result_qube_components = self.state.get_dsd_qube_components_for_csv(
self.primary_csv_url
)

self.primary_cube_table_identifiers = self.state.get_cube_identifiers_for_csv(
self.primary_csv_url
)

self.primary_csv_column_definitions = (
self.state.csvw_inspector.get_column_definitions_for_csv(
self.primary_csv_url
)
)

self.result_column_component_infos = self.state.get_column_component_info(
self.primary_csv_url
)
self.result_primary_csv_code_lists = self.state.get_code_lists_and_cols(
self.primary_csv_url
)
@@ -170,7 +178,9 @@ def get_datacube_results(self):
self.state,
self.dataset,
self.primary_csv_url,
self.result_qube_components.qube_components,
self.state.get_dsd_qube_components_for_csv(
self.primary_csv_url
).qube_components,
)
self.result_dataset_value_counts = get_dataset_val_counts_info(
canonical_shape_dataset, measure_col, unit_col
@@ -210,12 +220,32 @@ def generate_codelist_results(self):
self.dataset, parent_col_title, label_col_title, unique_identifier
)

def __post_init__(self):
self.generate_general_results()
if self.state.csvw_inspector.csvw_type == CSVWType.QbDataSet:
self.get_datacube_results()
elif self.state.csvw_inspector.csvw_type == CSVWType.CodeList:
self.generate_codelist_results()
@staticmethod
def _get_column_component_info_for_output(
column_component_infos: List[ColumnComponentInfo],
) -> List[Dict[str, Union[str, bool, None]]]:
"""
Returns the column component and column definitions information ready for outputting into a table.
"""

return [
{
"Title": c.column_definition.title,
"Type": c.column_type.name,
"Required": c.column_definition.required,
"Property URL": c.column_definition.property_url,
"Observations Column Titles": ""
if c.component is None
else ", ".join(
[
c.title
for c in c.component.used_by_observed_value_columns
if c.title is not None
]
),
}
for c in column_component_infos
]

@property
def type_info_printable(self) -> str:
@@ -232,32 +262,43 @@ def type_info_printable(self) -> str:
return "- This file is a code list."

@property
def catalog_metadata_printable(self) -> str:
def column_component_info_printable(self) -> str:
"""
Returns a printable of catalog metadata (e.g. title, description).
Returns a printable of the column titles and types.

Member of :class:`./MetadataPrinter`.

:return: `str` - user-friendly string which will be output to CLI.
"""
return f"- The {self.csvw_type_str} has the following catalog metadata:{self.result_catalog_metadata.output_str}"
primary_csv_suppressed_columns = [
column_definition.title
for column_definition in self.primary_csv_column_definitions
if column_definition.suppress_output and column_definition.title is not None
]
formatted_column_info = get_printable_tabular_str_from_list(
self._get_column_component_info_for_output(
self.result_column_component_infos
)
)
return (
f" - The {self.csvw_type_str} has the following column component information: \n"
+ indent(
f"- Dataset Label: {self.result_catalog_metadata.label}\n"
+ f"- Columns: \n{formatted_column_info}\n"
+ f"- Columns where suppress output is true: {get_printable_list_str(primary_csv_suppressed_columns)}",
prefix=" ",
)
)

@property
def dsd_info_printable(self) -> str:
def catalog_metadata_printable(self) -> str:
"""
Returns a printable of data structure definition (DSD).
Returns a printable of catalog metadata (e.g. title, description).

Member of :class:`./MetadataPrinter`.

:return: `str` - user-friendly string which will be output to CLI.
"""
primary_csv_suppressed_columns = [
column_definition.title
for column_definition in self.primary_csv_column_definitions
if column_definition.suppress_output and column_definition.title is not None
]

return f"- The {self.csvw_type_str} has the following data structure definition:\n- Dataset Label: {self.result_catalog_metadata.title}{self.result_qube_components.output_str}\n- Columns where suppress output is true: {get_printable_list_str(primary_csv_suppressed_columns)}"
return f"- The {self.csvw_type_str} has the following catalog metadata:{self.result_catalog_metadata.output_str}"

@property
def codelist_info_printable(self) -> str:
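
The new `column_component_info_printable` property combines two steps: `_get_column_component_info_for_output` turns each column into a table row, and `textwrap.indent` nests the resulting bullets under a heading. The sketch below reproduces that pattern in isolation. `Column`, `Component`, `ColumnInfo`, `rows_for_output`, and `render_section` are hypothetical stand-ins for the csvcubed types and helpers, the `Region` and `Sales` columns are invented sample data, and the table rendering is simplified to one line per column rather than using `get_printable_tabular_str_from_list`.

```python
from dataclasses import dataclass, field
from textwrap import indent
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, bool, None]]


@dataclass
class Column:
    """Hypothetical stand-in for a CSV-W column definition."""

    title: Optional[str]
    required: bool
    property_url: Optional[str]


@dataclass
class Component:
    """Hypothetical stand-in for a component linked to observed-value columns."""

    used_by_observed_value_columns: List[Column] = field(default_factory=list)


@dataclass
class ColumnInfo:
    """Hypothetical stand-in for ColumnComponentInfo."""

    column_definition: Column
    column_type: str
    component: Optional[Component] = None


def rows_for_output(infos: List[ColumnInfo]) -> List[Row]:
    """Build one table row per column; untitled observation columns are skipped."""
    rows: List[Row] = []
    for info in infos:
        if info.component is None:
            obs_titles = ""
        else:
            obs_titles = ", ".join(
                col.title
                for col in info.component.used_by_observed_value_columns
                if col.title is not None
            )
        rows.append(
            {
                "Title": info.column_definition.title,
                "Type": info.column_type,
                "Required": info.column_definition.required,
                "Property URL": info.column_definition.property_url,
                "Observations Column Titles": obs_titles,
            }
        )
    return rows


def render_section(label: str, rows: List[Row]) -> str:
    """Nest the bullets beneath a heading line using textwrap.indent."""
    table = "\n".join(f"{row['Title']}: {row['Type']}" for row in rows)
    body = f"- Dataset Label: {label}\n- Columns:\n{table}"
    return "- Column component information:\n" + indent(body, prefix="    ")


# Invented sample data: a dimension that is used by one observed-value column.
sales = Column("Sales", True, "example.csv#measure/sales")
region = Column("Region", True, "example.csv#dimension/region")
rows = rows_for_output(
    [
        ColumnInfo(region, "Dimension", Component([sales])),
        ColumnInfo(sales, "Observations"),
    ]
)
print(render_section("Demo", rows))
```

The sketch uses distinct loop names (`info`, `col`) for the outer and inner iterations; the helper in the diff reuses `c` for both, which works but is harder to read.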
5 changes: 5 additions & 0 deletions src/csvcubed/definitions.py
@@ -23,3 +23,8 @@
"_sourceRow",
"_name",
]

QB_MEASURE_TYPE_DIMENSION_URI: str = "http://purl.org/linked-data/cube#measureType"
SDMX_ATTRIBUTE_UNIT_URI: str = (
"http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"
)
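
In the sample output earlier in this diff, the `Measure` column carries the first of these URIs and is typed `Measures`, while the `Unit` column carries the second and is typed `Units`. A minimal sketch of how such constants could drive that labelling follows. `special_column_kind` is a hypothetical helper, and whether csvcubed classifies columns this way is an assumption rather than something the diff shows.

```python
from typing import Optional

QB_MEASURE_TYPE_DIMENSION_URI: str = "http://purl.org/linked-data/cube#measureType"
SDMX_ATTRIBUTE_UNIT_URI: str = (
    "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"
)


def special_column_kind(property_url: Optional[str]) -> Optional[str]:
    """Hypothetical helper: label the two well-known property URLs, else return None."""
    if property_url == QB_MEASURE_TYPE_DIMENSION_URI:
        return "Measures"
    if property_url == SDMX_ATTRIBUTE_UNIT_URI:
        return "Units"
    return None


print(special_column_kind("http://purl.org/linked-data/cube#measureType"))
```

Keeping the URIs in one module means comparisons like these do not repeat the string literals at each call site.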