Add a class to differentiate between Tabular and Graph CSV files #517
Some comments (don't be dismayed if it feels like a lot). The one recommendation is to integrate more of the CSVData class from the csv_data.py file in here. Some of this code repeats functionality that already exists in the Data Profiler.
Also, we will hold off on merging so that the black and isort changes from @jakleh's pre-commit functionality are all updated into your branch.
dataprofiler/tests/data/csv/graph-differentiator-input-positive.csv
dataprofiler/tests/data/csv/graph-differentiator-input-negative.csv
Head branch was pushed to by a user without write access
    if options is None:
        options = dict()
    if CSVData.is_match(file_path, options):
I may have misspoken before.
I think this should be:
if not CSVData.is_match(file_path, options):
^^ Are we not guaranteeing that it is a CSV to be read?
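A minimal sketch of the guard being suggested here, assuming the intent is to return early when the input is not a CSV at all. `looks_like_csv` is a hypothetical stand-in for `CSVData.is_match` so the snippet runs without the Data Profiler installed, and the `source`/`target` header check is an assumed heuristic, not the PR's actual logic.

```python
def looks_like_csv(text):
    """Hypothetical stand-in for CSVData.is_match: every non-empty line
    has the same, non-zero number of commas."""
    rows = [line for line in text.splitlines() if line.strip()]
    counts = {line.count(",") for line in rows}
    return len(counts) == 1 and min(counts) >= 1


def is_graph_match(text, options=None):
    if options is None:
        options = dict()
    # Guard first: input that is not a CSV cannot be a graph CSV.
    if not looks_like_csv(text):
        return False
    header = text.strip().splitlines()[0].lower().split(",")
    # Assumed heuristic: an edge list names both endpoints in its header.
    return "source" in header and "target" in header
```

With the guard inverted as suggested, the graph-specific checks only ever run on input that has already passed the CSV check.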
…ecuting in Graph Data (issue with csv files), tests were cleaned up
    BaseData.__init__(self, input_file_path, data, options)

    if options is None:
        options = dict()
We can refactor this in the next PR.
just a couple comments...
        options.update(column_name=column_names)
        return True

    return False
Add a newline at end of file.
@@ -0,0 +1 @@
{"name":"John", "age":30, "car":null}
JSON serializable should be a "null", no?
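For reference, a quick check with Python's standard `json` module (not part of the PR) shows how the two spellings behave: a bare `null` parses to `None`, while a quoted `"null"` parses to a four-character string.

```python
import json

line = '{"name":"John", "age":30, "car":null}'

# A bare null is a valid JSON literal and maps to Python's None.
record = json.loads(line)
bare_is_none = record["car"] is None

# Quoting it changes the meaning: the value becomes an ordinary string.
quoted = json.loads('{"car":"null"}')
quoted_is_string = isinstance(quoted["car"], str)

# Serializing None writes the bare literal back out.
round_trip = json.dumps(record)
```

Which form the test fixture should use depends on whether it is meant to exercise a missing value or a literal string.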
    def test_is_graph_positive_1(self):
        """
        Determine if the input CSV file can automatically be recognized as being a graph
        """
        for input_file in self.file_or_buf_list:
            self.assertTrue(GraphData.is_match(input_file["path"]))

    # test is_match for false output w/ different options
    def test_is_graph_negative_1(self):
Do we need these _1 suffixes in the names?
This PR adds a basic method for differentiating between tabular and graph CSV files. It introduces a new class, GraphDifferentiator, intended as a base that later development can build on. Three functions are added:
is_match: determines whether the CSV file is a graph/network dataset
find_target_string_in_column: helper function that detects keywords in column names
csv_column_names: reads the column-name header row from the CSV
This PR also includes tests for each of these functions.
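The three functions could fit together roughly as follows. This is a sketch under stated assumptions rather than the PR's code: it reads from an in-memory text buffer using only the standard library, the signatures are guesses, and the `GRAPH_KEYWORDS` list is hypothetical since the actual keywords are not shown in this thread.

```python
import csv
import io

# Hypothetical keyword list; the PR's real keywords are not shown here.
GRAPH_KEYWORDS = ("source", "target", "node", "edge")


def csv_column_names(buffer, delimiter=","):
    """Return the header row of a CSV text buffer as a list of names."""
    buffer.seek(0)
    reader = csv.reader(buffer, delimiter=delimiter)
    return next(reader, [])  # empty input yields an empty header


def find_target_string_in_column(column_names, keyword_list):
    """Return True if any column name contains one of the keywords."""
    lowered = [name.strip().lower() for name in column_names]
    return any(keyword in name for name in lowered for keyword in keyword_list)


def is_match(buffer, options=None):
    """Return True if the header suggests a graph/network dataset."""
    if options is None:
        options = dict()
    column_names = csv_column_names(buffer, options.get("delimiter", ","))
    if find_target_string_in_column(column_names, GRAPH_KEYWORDS):
        # Record the header so callers can reuse it, as in the diff above.
        options.update(column_name=column_names)
        return True
    return False
```

Substring matching keeps the sketch short but is easy to fool: a tabular column named `resource` contains `source` and would be flagged as a graph column, so a stricter token-based comparison may be worth considering.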