Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a class to differentiate between Tabular and Graph CSV files #517

Merged
merged 22 commits into from
Jul 11, 2022

Conversation

MisterPNP
Copy link
Contributor

This is a basic and simple method to differentiate between tabular and graph CSV files. This adds a new class, GraphDifferentiator, that serves as a good base for later development. It can be easily built upon later. Three functions are added:

is_match: determines whether the CSV file is a graph/network dataset

  • Combines the following two functions to detect keywords in the header column names
  • Only returns true if both a target and source are detected
  • List of keywords is easily changeable or added upon depending on need
    find_target_string_in_column: helper function to detect keywords in column names
  • Checks column names for keywords (source, target, destination, etc...)
  • Checks whether the keyword is not a subset of another word (using column names that include _, ., -, etc.)
    csv_column_names: grabs the column name header from the CSV.
  • This includes deleting extraneous whitespaces in column names, thus warranting the need to build upon the similar existing function in the project.
  • Returns the names as a list of processed strings

This PR also includes tests for each functionality in the aforementioned functions.

@CLAassistant
Copy link

CLAassistant commented Jul 7, 2022

CLA assistant check
All committers have signed the CLA.

@taylorfturner taylorfturner enabled auto-merge (squash) July 7, 2022 18:43
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments (don't be dismayed if you feel like its a lot). The one recommendation is to integrate more of the CSVData class in the csv_data.py file more in here. Some of this code is repeate of functionalities that already exist in the Data Profiler.

Also, we will hold off on merging so that the black and isort in @jakleh's pre-commit functionalites is all updated into your branch.

dataprofiler/data_readers/graph_differentiator.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/graph_differentiator.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/graph_differentiator.py Outdated Show resolved Hide resolved
subset.csv Outdated Show resolved Hide resolved
dataprofiler/data_readers/graph_differentiator.py Outdated Show resolved Hide resolved
auto-merge was automatically disabled July 7, 2022 20:18

Head branch was pushed to by a user without write access

@JGSweets JGSweets added Work In Progress Solution is being developed Medium Priority Significant improvement or bug / feature reducing overall performance New Feature A feature addition not currently in the library labels Jul 8, 2022
auto-merge was automatically disabled July 11, 2022 14:49

Head branch was pushed to by a user without write access


if options is None:
options = dict()
if CSVData.is_match(file_path, options):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have misspoke before.
I think this should be:
if not CSVData.is_match(file_path, options):
^^ Are we not guaranteeing that it is a CSV to be read?

@JGSweets JGSweets enabled auto-merge (squash) July 11, 2022 18:31
…ecuting in Graph Data (issue with csv files), tests were cleaned up
auto-merge was automatically disabled July 11, 2022 18:49

Head branch was pushed to by a user without write access

Comment on lines 14 to 17
BaseData.__init__(self, input_file_path, data, options)

if options is None:
options = dict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can refactor this next PR.

@JGSweets JGSweets enabled auto-merge (squash) July 11, 2022 19:16
auto-merge was automatically disabled July 11, 2022 19:29

Head branch was pushed to by a user without write access

@JGSweets JGSweets enabled auto-merge (squash) July 11, 2022 19:34
auto-merge was automatically disabled July 11, 2022 19:59

Head branch was pushed to by a user without write access

@JGSweets JGSweets enabled auto-merge (squash) July 11, 2022 20:01
@JGSweets JGSweets removed the Work In Progress Solution is being developed label Jul 11, 2022
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple comments...

dataprofiler/data_readers/graph_data.py Outdated Show resolved Hide resolved
options.update(column_name = column_names)
return True

return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line EOF

@@ -0,0 +1 @@
{"name":"John", "age":30, "car":null}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json serializable should be a "null", no?

Comment on lines +77 to +85
def test_is_graph_positive_1(self):
"""
Determine if the input CSV file can automatically be recognized as being a graph
"""
for input_file in self.file_or_buf_list:
self.assertTrue(GraphData.is_match(input_file["path"]))

# test is_match for false output w/ different options
def test_is_graph_negative_1(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need these _1 in the names?

@JGSweets JGSweets merged commit 89f69a2 into capitalone:main Jul 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Medium Priority Significant improvement or bug / feature reducing overall performance New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants