Add a class to differentiate between Tabular and Graph CSV files #517

MisterPNP · 2022-07-07T18:28:35Z

This is a basic, simple method to differentiate between tabular and graph CSV files. It adds a new class, GraphDifferentiator, intended as a base that later development can build on. Three functions are added:

- is_match: determines whether the CSV file is a graph/network dataset
  - Combines the following two functions to detect keywords in the header column names
  - Only returns true if both a target and a source are detected
  - The list of keywords can easily be changed or extended as needed
- find_target_string_in_column: helper function to detect keywords in column names
  - Checks column names for keywords (source, target, destination, etc.)
  - Checks that the keyword is a standalone word rather than part of a longer word (column names may use separators such as _, ., or -)
- csv_column_names: grabs the column-name header from the CSV
  - Strips extraneous whitespace from column names, which is why it builds on the similar existing function in the project
  - Returns the names as a list of processed strings
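
The three functions described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the function names come from the description, but the signatures, the keyword lists beyond source/target/destination, and the tokenizing regex are all assumptions.

```python
import csv
import io
import re

# Assumed keyword lists; only "source", "target" and "destination"
# are mentioned in the PR description, the rest are hypothetical.
SOURCE_KEYWORDS = ("source", "src")
TARGET_KEYWORDS = ("target", "destination", "dst")


def csv_column_names(header_line, delimiter=","):
    """Split a CSV header line and strip whitespace from each name."""
    row = next(csv.reader(io.StringIO(header_line), delimiter=delimiter), [])
    return [name.strip() for name in row]


def find_target_string_in_column(column_names, keywords):
    """Return True if any keyword is a whole token of some column name.

    Names are split on non-alphanumeric separators (_, ., -, ...), so
    "node_source" matches "source" but "resource" does not.
    """
    for name in column_names:
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        if any(keyword in tokens for keyword in keywords):
            return True
    return False


def is_match(header_line):
    """Treat the header as a graph header only if it has both a source and a target."""
    names = csv_column_names(header_line)
    return (
        find_target_string_in_column(names, SOURCE_KEYWORDS)
        and find_target_string_in_column(names, TARGET_KEYWORDS)
    )
```

With these assumptions, a header such as `node_source, node_target, weight` is accepted, while `resource,targeted,weight` is rejected because neither keyword appears as a standalone token.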

This PR also includes tests for each of the aforementioned functions.

CLAassistant · 2022-07-07T18:28:40Z

All committers have signed the CLA.

dataprofiler/data_readers/graph_differentiator.py

taylorfturner

Some comments (don't be dismayed if it feels like a lot). The main recommendation is to make more use of the CSVData class from csv_data.py in here. Some of this code repeats functionality that already exists in the Data Profiler.

Also, we will hold off on merging so that the black and isort changes from @jakleh's pre-commit functionality are all merged into your branch.


JGSweets · 2022-07-11T14:55:38Z

dataprofiler/data_readers/graph_data.py

+
+        if options is None:
+            options = dict()
+        if CSVData.is_match(file_path, options):


I may have misspoken before.
I think this should be:
if not CSVData.is_match(file_path, options):
^^ Are we not guaranteeing that it is a CSV to be read?
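
A minimal sketch of the guard being suggested here, assuming a hypothetical `looks_like_csv` stand-in for `CSVData.is_match` and a simplified exact-match header check (neither is the project's actual code):

```python
import csv


def looks_like_csv(file_path, options=None):
    """Hypothetical stand-in for CSVData.is_match: extension check only."""
    return file_path.lower().endswith(".csv")


def is_graph_match(file_path, options=None):
    """Return True only for CSV files whose header has source and target columns."""
    if options is None:
        options = dict()
    # Guard first: anything that is not a CSV cannot be a graph CSV.
    if not looks_like_csv(file_path, options):
        return False
    # Only now is it safe to read the first row as a CSV header.
    with open(file_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    names = {name.strip().lower() for name in header}
    return "source" in names and "target" in names
```

With the negated check, a non-CSV input such as a JSON file is rejected before any header parsing is attempted.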


JGSweets · 2022-07-11T19:02:11Z

dataprofiler/data_readers/graph_data.py

+        BaseData.__init__(self, input_file_path, data, options)
+
+        if options is None:
+            options = dict()


We can refactor this in the next PR.

dataprofiler/data_readers/graph_data.py

taylorfturner

just a couple comments...

dataprofiler/data_readers/graph_data.py

taylorfturner · 2022-07-11T20:26:26Z

dataprofiler/data_readers/graph_data.py

+            options.update(column_name = column_names)
+            return True
+
+        return False


new line EOF

taylorfturner · 2022-07-11T20:26:49Z

dataprofiler/tests/data/csv/graph-data-input-json.json

@@ -0,0 +1 @@
+{"name":"John", "age":30, "car":null}


json serializable should be a "null", no?
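
For reference, the difference between the bare `null` literal and the quoted string `"null"` can be checked with Python's standard `json` module. This is only a small illustration using the test file's contents, not part of the PR:

```python
import json

# Bare null is a valid JSON literal and is decoded to Python's None.
record = json.loads('{"name":"John", "age":30, "car":null}')
print(record["car"] is None)

# A quoted "null" is an ordinary string, not a missing value.
quoted = json.loads('{"car":"null"}')
print(type(quoted["car"]).__name__)

# Serializing None produces the bare literal again.
print(json.dumps({"car": None}))
```

So the file as written parses cleanly; quoting the value would change its meaning from a missing value to a four-character string.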

taylorfturner · 2022-07-11T20:27:49Z

dataprofiler/tests/data_readers/test_csv_graph_data.py

+    def test_is_graph_positive_1(self):
+        """
+        Determine if the input CSV file can automatically be recognized as being a graph
+        """
+        for input_file in self.file_or_buf_list:
+            self.assertTrue(GraphData.is_match(input_file["path"]))
+
+    # test is_match for false output w/ different options
+    def test_is_graph_negative_1(self):


do we need these _1 in the names?

MisterPNP added 6 commits June 29, 2022 17:19

use networkx to differentiate graph

48361ca

use networkx to differentiate graph

04f7a28

add tests, update is_match

da3d524

rebase, accept changes to tests

7d72c6e

simplify graph differentiator, add tests

78b5803

Merge remote-tracking branch 'origin/main' into graph

c195341

MisterPNP requested review from JGSweets, ksneab7, taylorfturner, micdavis and tyfarnan as code owners July 7, 2022 18:28

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_differentiator.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_differentiator.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_differentiator.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_differentiator.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_differentiator.py Outdated

remove extra comments, clean up imports/usr/bin/python3

fdd955c

taylorfturner enabled auto-merge (squash) July 7, 2022 18:43

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_differentiator.py Outdated

taylorfturner reviewed Jul 7, 2022

add input options handling, add tests, reformat and condense file

2a484b0

auto-merge was automatically disabled July 7, 2022 20:18
Head branch was pushed to by a user without write access

MisterPNP added 2 commits July 7, 2022 16:22

remove outdated test file

833ef04

add EOF new line

33d0f38

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_data.py

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_data.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_data.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_data.py Outdated

JGSweets reviewed Jul 7, 2022

dataprofiler/data_readers/graph_data.py Outdated

JGSweets reviewed Jul 8, 2022

dataprofiler/tests/data_readers/test_csv_graph_data.py Outdated

JGSweets reviewed Jul 8, 2022

dataprofiler/tests/data_readers/test_csv_graph_data.py Outdated

JGSweets added the Work In Progress, Medium Priority, and New Feature labels Jul 8, 2022

MisterPNP added 3 commits July 11, 2022 10:15

cleanup test file

3feb32d

cleanup GraphData, integrated options

7636af7

Merge branch 'graph' of https://github.com/MisterPNP/DataProfiler into graph

fa954eb

auto-merge was automatically disabled July 11, 2022 14:49
Head branch was pushed to by a user without write access

JGSweets reviewed Jul 11, 2022


JGSweets enabled auto-merge (squash) July 11, 2022 18:31

options now updated in is_match, CSVData.is_match call is properly executing in Graph Data (issue with csv files), tests were cleaned up

f2f7975

auto-merge was automatically disabled July 11, 2022 18:49
Head branch was pushed to by a user without write access

remove superfluous code

353c710

JGSweets reviewed Jul 11, 2022

dataprofiler/data_readers/graph_data.py Outdated

remove headers, remove update on delimiter in is_match

5359ccf

JGSweets enabled auto-merge (squash) July 11, 2022 19:16

add networkx to requirements

81e2c53

auto-merge was automatically disabled July 11, 2022 19:29
Head branch was pushed to by a user without write access

JGSweets enabled auto-merge (squash) July 11, 2022 19:34

add networkx 2.5.1 to requirements

c226077

auto-merge was automatically disabled July 11, 2022 19:59
Head branch was pushed to by a user without write access

Merge branch 'main' into graph

388e075

JGSweets enabled auto-merge (squash) July 11, 2022 20:01

JGSweets approved these changes Jul 11, 2022

JGSweets removed the Work In Progress label Jul 11, 2022

taylorfturner reviewed Jul 11, 2022

taylorfturner approved these changes Jul 11, 2022

JGSweets merged commit 89f69a2 into capitalone:main Jul 11, 2022
