[MRG] Added tests for the function load_pathlist_from_file #1469

Merged
merged 21 commits into from
May 21, 2021

Conversation

keyabarve
Contributor

Fixes #1430

@codecov

codecov bot commented Apr 16, 2021

Codecov Report

Merging #1469 (c48bfca) into latest (a782b1d) will increase coverage by 4.99%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #1469      +/-   ##
==========================================
+ Coverage   90.20%   95.20%   +4.99%     
==========================================
  Files         126       99      -27     
  Lines       21194    17553    -3641     
  Branches     1595     1600       +5     
==========================================
- Hits        19118    16711    -2407     
+ Misses       1844      609    -1235     
- Partials      232      233       +1     
Flag Coverage Δ
python 95.20% <100.00%> (+<0.01%) ⬆️
rust ?

Impacted Files Coverage Δ
src/sourmash/sourmash_args.py 94.01% <100.00%> (-0.40%) ⬇️
tests/test_cmd_signature.py 100.00% <100.00%> (ø)
tests/test_sourmash.py 99.72% <100.00%> (+<0.01%) ⬆️
src/core/src/index/sbt/mhbt.rs
src/core/src/index/search.rs
src/core/src/wasm.rs
src/core/src/sketch/hyperloglog/estimators.rs
src/core/src/sketch/minhash.rs
src/core/src/signature.rs
src/core/src/ffi/minhash.rs
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a782b1d...c48bfca.

@keyabarve
Contributor Author

I wrote 2 tests for Badly Formatted Files and 1 test for Duplicate Files. I came across some interesting errors that need to be worked on.
@luizirber

@ctb
Contributor

ctb commented Apr 23, 2021

hi @keyabarve could you summarize the errors here in the comments, and (if you're up to it) propose some solutions? thanks!

@keyabarve
Contributor Author

keyabarve commented Apr 24, 2021

@ctb @luizirber
Tests are written in tests/test_sourmash.py. The following are the different issues in each of the test cases that need to be fixed. I'm not entirely sure why these issues are arising and how to fix them.

  • Function test_load_pathlist_from_file_empty:
    This function is testing if the given file is empty. The file is written with no contents ("").
    An IndexError is raised, stating "list index out of range".

  • Function test_load_pathlist_from_file_badly_formatted:
    The function is testing if the given file is badly formatted. A bad input ("{'a':1}") is written into the file.
    ValueError is raised, stating "first element of list-of-files does not exist".

  • Function test_load_pathlist_from_file_badly_formatted_2:
    The function is testing if the second line of the file is badly formatted, given that the first line of the file is valid. The contents of an existing file ("compare/genome-s10.fa.gz.sig") are copied into the first line of the file and a bad input ("{'a':1}") is written in the second line.
    assert len(sigs) returns a value of 2.

  • Function test_load_pathlist_from_file_duplicate:
    The function is testing if the same file is passed more than once. The contents of an existing file ("compare/genome-s10.fa.gz.sig") are copied into the first line as well as the second line of the file.
    assert len(sigs) returns a value of 2.

    from sourmash.sourmash_args import load_pathlist_from_file
    with pytest.raises(ValueError) as e:
        load_pathlist_from_file("")
        assert "cannot open file ''" in e.message

Contributor

note, you should de-indent here, because the assert statement is never reached - this is because load_pathlist_from_file raises a ValueError, which ends the block and is caught by the with statement.
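
A minimal sketch of this point, using the standard library's `assertRaises` context manager as a stand-in for `pytest.raises` (both leave the `with` block as soon as the expected exception is raised). `load_stub` is a hypothetical placeholder that always fails; it is not the sourmash function.

```python
import unittest


def load_stub(filename):
    """Hypothetical stand-in that always fails, like a loader given a bad path."""
    raise ValueError(f"cannot open file '{filename}'")


tc = unittest.TestCase()
reached = []

with tc.assertRaises(ValueError) as ctx:
    load_stub("")
    reached.append("inside")  # never runs: the call above already raised

# De-indented code runs after the context manager has caught the exception.
reached.append("outside")
assert "cannot open file ''" in str(ctx.exception)
```

With `pytest.raises(ValueError) as e`, the equivalent de-indented check would read the message from `str(e.value)`, as the later comments in this thread do.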

Contributor Author

Which lines should I de-indent?

Contributor

I'll ask a question instead ;).

try putting an assert 0 after the load_pathlist_from_file call - does the test fail, as you'd expect it would?

also see:

python exceptions and try/except blocks may also be useful.

And note that you can put print statements in tests and the output will be displayed only when the test fails, so they're useful for debugging tests.

@ctb
Contributor

ctb commented Apr 24, 2021

thanks! These are good checks, and I agree load_pathlist_from_file should return better errors on them!

To fix them, you'd want to have load_pathlist_from_file raise ValueError with appropriate error messages for each of them.

For example, to fix the empty error, you could add another if statement in https://github.com/dib-lab/sourmash/blob/beb3d5963332aba162c530cf612739206d7efbd6/src/sourmash/sourmash_args.py#L389 that checks to see if file_list is empty or not, and if it is, raises a ValueError.

Is this something you want to pursue?
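
A rough sketch of that suggestion, under stated assumptions: `read_pathlist_sketch` is a hypothetical, simplified reader (not the actual code in `sourmash_args.py`), and the wording of the error message is made up for illustration. It only shows where an emptiness check could sit so that an empty file produces a `ValueError` instead of the `IndexError` reported above.

```python
import os
import tempfile


def read_pathlist_sketch(filename):
    """Hypothetical, simplified reader; not the actual sourmash code."""
    with open(filename, "rt") as fp:
        file_list = [line.rstrip("\r\n") for line in fp]

    # An empty list is falsy, so this catches a file with no lines
    # before anything tries to index file_list[0].
    if not file_list:
        raise ValueError(f"pathlist file '{filename}' is empty")
    return file_list


# Usage: write an empty file and confirm the reader raises ValueError.
with tempfile.TemporaryDirectory() as tmpdir:
    empty_path = os.path.join(tmpdir, "empty.txt")
    with open(empty_path, "wt") as fp:
        fp.write("")
    try:
        read_pathlist_sketch(empty_path)
        outcome = "no error"
    except ValueError as exc:
        outcome = str(exc)
```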

@keyabarve
Contributor Author

Yes, I would like to continue working on this. I will try fixing it.

@ctb
Contributor

ctb commented Apr 25, 2021

> Yes, I would like to continue working on this. I will try fixing it.

excellent!

I think a good high level summary of what to do overall is "take any of these conditions or errors and turn them into statements that raise ValueError (with informative error messages) in the load_pathlist_from_file function."

@keyabarve
Contributor Author

@ctb @luizirber
I had a few questions regarding the function load_pathlist_from_file:

  • What exactly does the argument "filename" passed to the function load_pathlist_from_file contain? Is "filename" a single file, which contains a list of files?
  • What does the variable "file_list" that is returned by this function contain? Is it an array containing the list of files from "filename"?
  • When we are checking for all the test cases (file doesn't exist, empty file, badly formatted file, and duplicate file), are we checking all these for each of the files inside "filename"?
  • So, while writing the different if statements in this function, should I loop through "file_list" to check each of the files?

@ctb
Contributor

ctb commented Apr 27, 2021 via email

@keyabarve
Contributor Author

Alright, thanks!

@luizirber
Member

Related discussion: #1410

@keyabarve
Contributor Author

keyabarve commented Apr 30, 2021

@ctb @luizirber
Whenever I test any of the test cases, it gives me a type error, saying TypeError: stat: path should be string, bytes, os.PathLike or integer, not list
This error is coming from the function:

def exists(path):
        """Test whether a path exists.  Returns False for broken symbolic links"""
        try:
>           os.stat(path)
E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

Could you provide some tips on how I can fix this?
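
For reference, a small sketch of how this kind of `TypeError` can arise, assuming (as the traceback suggests) that a whole list was passed to `os.path.exists` instead of a single path. The file names are hypothetical and do not need to exist.

```python
import os

# Hypothetical pathlist contents; neither file needs to exist for this demo.
file_list = ["first.sig", "second.sig"]

# Passing the whole list reproduces the TypeError from the traceback,
# because os.path.exists expects one path, not a list of them.
try:
    os.path.exists(file_list)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Checking one entry at a time returns a plain bool per path.
exists_flags = [os.path.exists(path) for path in file_list]
```

Looping over the entries (or indexing a single one) keeps each call to `os.path.exists` on a string.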

@ctb
Contributor

ctb commented Apr 30, 2021 via email

@keyabarve
Contributor Author

Where should I write this print statement? Should I write it in load_pathlist_from_file? There is no path variable in this function, so I'm not sure if it will print anything.

@ctb
Contributor

ctb commented Apr 30, 2021 via email

@keyabarve
Contributor Author

@ctb @luizirber
I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the test_load_pathlist_from_file_badly_formatted_2 test case?

@ctb
Contributor

ctb commented Apr 30, 2021

> @ctb @luizirber
> I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the test_load_pathlist_from_file_badly_formatted_2 test case?

what about collapsing file_list automatically, using an ordered dictionary so that the files remain in the same order but get collapsed? You could also do that within your for loop using a set underneath to see if we've already seen a file.

We probably don't need to emit a warning in the case of duplicate files, because if you passed duplicate files into most sourmash commands (search/gather/sketch), sourmash would invisibly deduplicate them anyway.
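
Both ideas can be sketched roughly as follows. The function names are hypothetical, and the first variant relies on plain `dict` preserving insertion order (guaranteed since Python 3.7), which plays the role of the ordered dictionary mentioned above.

```python
def collapse_with_dict(file_list):
    """Drop repeated paths, keeping the first occurrence of each in order."""
    return list(dict.fromkeys(file_list))


def collapse_with_set(file_list):
    """Same result, tracking already-seen paths in a set inside the loop."""
    seen = set()
    unique = []
    for path in file_list:
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique


# Usage with hypothetical file names.
paths = ["a.sig", "b.sig", "a.sig"]
collapsed = collapse_with_dict(paths)
```

The set-based variant fits naturally into an existing `for` loop that already validates each entry, while the dict-based one is a one-liner if no per-entry checks are needed.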

@keyabarve
Contributor Author

@ctb @luizirber
> I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the test_load_pathlist_from_file_badly_formatted_2 test case?
>
> > what about collapsing file_list automatically, using an ordered dictionary so that the files remain in the same order but get collapsed? You could also do that within your for loop using a set underneath to see if we've already seen a file.
> >
> > We probably don't need to emit a warning in the case of duplicate files, because if you passed duplicate files into most sourmash commands (search/gather/sketch), sourmash would invisibly deduplicate them anyway.

What do you mean by collapsing file_list? Should I be doing this for the duplicate files?

Also, should I remove the test case for the duplicate files entirely?

@ctb
Contributor

ctb commented Apr 30, 2021 via email


    if not os.path.exists(file_list[0]):
        raise ValueError("first element of list-of-files does not exist")
    if len(file_list) == 0:

Contributor

note, this can be abbreviated more Pythonically as:

    if not file_list:
        ...

(in Python, empty lists evaluate to False)

Contributor Author

Okay, I'll fix that!

Contributor

bump!

Contributor

    if not os.path.exists(file_list[i]):
        cnt = cnt + 1
    if cnt > 0:
        raise ValueError("list-of-files contains a badly formatted file")

Contributor

Is this error message right? You're checking if the file exists; if I were a user I would not know that "badly formatted" means "does not exist"

Also, suggest using "pathlist" rather than "list of files".

Contributor Author

Could you suggest a way to check for badly formatted files?

Contributor

I think it's probably fine to just say, "file '' does not exist".

@ctb
Contributor

ctb commented Apr 30, 2021 via email

@keyabarve
Contributor Author

@ctb @luizirber
This is the latest version of my code. According to this, all the test cases pass. I still need to incorporate some of the changes suggested.

@ctb
Contributor

ctb commented May 18, 2021

hi @keyabarve looks like there's still a test failing 😭 . Could you take a look and give it a stab, or ask some questions?

@keyabarve
Contributor Author

@ctb Sure! I'll take a look at it! Which test is failing exactly?

@ctb
Contributor

ctb commented May 18, 2021 via email

@ctb
Contributor

ctb commented May 20, 2021

hi @keyabarve you could usefully add a test to catch this situation: #1537

@ctb
Contributor

ctb commented May 20, 2021

sorry, a fix AND a test :)

@keyabarve
Contributor Author

@ctb My latest code already takes care of the issue and I already have a test for it! :)

@@ -199,7 +252,6 @@ def test_do_compare_quiet(c):
testdata1 = utils.get_test_data('short.fa')
testdata2 = utils.get_test_data('short2.fa')
c.run_sourmash('compute', '-k', '31', testdata1, testdata2)

Contributor

note - please try to avoid changes in bits of the code that aren't relevant to the issues tackled by this PR - it adds a bit of confusion to the review :)

Contributor Author

Sorry, it's probably something I did before and forgot to revert! Should I fix it and push again?

Contributor

no need, just for future consideration!

@keyabarve
Contributor Author

Is this ready to be merged? Or is there anything else that needs to be fixed?

@ctb
Contributor

ctb commented May 21, 2021

I think this error message is misleading --

                 raise ValueError(f"file '{checkfile}' listed inside the pathlist contains a badly formatted file")

and should be replaced with the same text as below,

         raise ValueError(f"pathlist file '{filename}' does not exist")    

@keyabarve
Contributor Author

keyabarve commented May 21, 2021

Is this in the load_pathlist_from_file function? Should I remove the for loop then? I'm having a hard time understanding why it needs to be replaced.

@keyabarve
Contributor Author

@ctb This is the error I'm getting if I change the statement:

 assert "pathlist file '' does not exist" in "pathlist file '{'a':1}' does not exist"
E        +  where "pathlist file '{'a':1}' does not exist" = str(ValueError("pathlist file '{'a':1}' does not exist"))
E        +    where ValueError("pathlist file '{'a':1}' does not exist") = <ExceptionInfo ValueError("pathlist file '{'a':1}' does not exist") tblen=2>.value

@ctb
Contributor

ctb commented May 21, 2021

I'm not sure how to comment, other than to say that this logic looks inherently problematic to me :)

        for checkfile in file_list:
            if not os.path.exists(checkfile):
                raise ValueError(f"file '{checkfile}' listed inside the pathlist contains a badly formatted file")

since the error message is saying "this is a badly formatted file" but the if statement is checking to see if that file exists... I think the useful thing to say is "this file doesn't exist".

@keyabarve
Contributor Author

Hmm I see. Should I do this then: raise ValueError(f"file '{checkfile}' inside the pathlist does not exist")

@keyabarve
Contributor Author

keyabarve commented May 21, 2021

The problem that I'm facing is that it's giving me the error I mentioned above. I'm confused about how to fix that.
Is there any other way to check for a badly formatted file in the function itself?

@keyabarve
Contributor Author

keyabarve commented May 21, 2021

The assert statement in my test case is this: assert "file '' inside the pathlist does not exist" in str(e.value). If I change that to assert "file '{'a':1}' inside the pathlist does not exist" in str(e.value), then the tests pass. So should I do that then?
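
A small sketch of why the first assertion fails and the second passes. `check_paths_exist` is a hypothetical helper built around the loop and message quoted in this thread, not the merged sourmash code, and it assumes no file literally named `{'a':1}` exists in the working directory.

```python
import os


def check_paths_exist(file_list):
    """Hypothetical helper: raise ValueError naming the first missing path."""
    for checkfile in file_list:
        if not os.path.exists(checkfile):
            raise ValueError(
                f"file '{checkfile}' inside the pathlist does not exist"
            )
    return file_list


bad_entry = "{'a':1}"
try:
    check_paths_exist([bad_entry])
    message = ""
except ValueError as exc:
    message = str(exc)

# The message embeds the offending entry between the quotes, so a pattern
# with nothing between the quotes is not a substring of it.
empty_quotes_match = "file '' inside the pathlist does not exist" in message
full_name_match = f"file '{bad_entry}' inside the pathlist does not exist" in message
suffix_match = "inside the pathlist does not exist" in message
```

Asserting on the full interpolated message checks that the right entry is reported; asserting only on the fixed suffix is a looser alternative if the entry name varies between tests.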

@ctb
Contributor

ctb commented May 21, 2021 via email

@keyabarve
Contributor Author

Please review. @ctb

Contributor

@ctb ctb left a comment

Looks good to me! I'll update from latest and then one of us can merge when all the tests pass!

Nice work - I know it took a while ;)

@ctb ctb changed the title from "[WIP] Added tests for the function load_pathlist_from_file" to "[MRG] Added tests for the function load_pathlist_from_file" on May 21, 2021
@keyabarve
Contributor Author

Alright! Thanks! By merge, you mean 'Squash and merge' right?

@keyabarve
Contributor Author

Ready for merge. @ctb

@ctb ctb merged commit 8e7df87 into latest May 21, 2021
@ctb ctb deleted the KB_1430 branch May 21, 2021 23:05
@ctb
Contributor

ctb commented May 21, 2021

🎉

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add tests in tests/test_sourmash.py for the function load_pathlist_from_file
3 participants