[MRG] Added tests for the function `load_pathlist_from_file` #1469

keyabarve · 2021-04-16T21:14:00Z

codecov · 2021-04-16T21:19:56Z

Codecov Report

Merging #1469 (c48bfca) into latest (a782b1d) will increase coverage by 4.99%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #1469      +/-   ##
==========================================
+ Coverage   90.20%   95.20%   +4.99%     
==========================================
  Files         126       99      -27     
  Lines       21194    17553    -3641     
  Branches     1595     1600       +5     
==========================================
- Hits        19118    16711    -2407     
+ Misses       1844      609    -1235     
- Partials      232      233       +1

| Flag | Coverage Δ | |
|---|---|---|
| python | `95.20% <100.00%> (+<0.01%)` | ⬆️ |
| rust | `?` | |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/sourmash/sourmash_args.py | `94.01% <100.00%> (-0.40%)` | ⬇️ |
| tests/test_cmd_signature.py | `100.00% <100.00%> (ø)` | |
| tests/test_sourmash.py | `99.72% <100.00%> (+<0.01%)` | ⬆️ |
| src/core/src/index/sbt/mhbt.rs | | |
| src/core/src/index/search.rs | | |
| src/core/src/wasm.rs | | |
| src/core/src/sketch/hyperloglog/estimators.rs | | |
| src/core/src/sketch/minhash.rs | | |
| src/core/src/signature.rs | | |
| src/core/src/ffi/minhash.rs | | |
| ... and 20 more | | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a782b1d...c48bfca.

keyabarve · 2021-04-22T20:40:54Z

I wrote 2 tests for Badly Formatted Files and 1 test for Duplicate Files. I came across some interesting errors that need to be worked on.
@luizirber

ctb · 2021-04-23T13:00:13Z

hi @keyabarve could you summarize the errors here in the comments, and (if you're up to it) propose some solutions? thanks!

keyabarve · 2021-04-24T23:43:37Z

@ctb @luizirber
Tests are written in tests/test_sourmash.py. The following are the different issues in each of the test cases that need to be fixed. I'm not entirely sure why these issues are arising and how to fix them.

- Function `test_load_pathlist_from_file_empty`:
  - This function is testing if the given file is empty. The file is written with no contents ("").
  - An IndexError is raised, stating "list index out of range".
- Function `test_load_pathlist_from_file_badly_formatted`:
  - The function is testing if the given file is badly formatted. A bad input ("{'a':1}") is written into the file.
  - ValueError is raised, stating "first element of list-of-files does not exist".
- Function `test_load_pathlist_from_file_badly_formatted_2`:
  - The function is testing if the second line of the file is badly formatted, given that the first line of the file is valid. The contents of an existing file ("compare/genome-s10.fa.gz.sig") are copied into the first line of the file and a bad input ("{'a':1}") is written in the second line.
  - `assert len(sigs)` returns a value of 2.
- Function `test_load_pathlist_from_file_duplicate`:
  - The function is testing if the same file is passed more than once. The contents of an existing file ("compare/genome-s10.fa.gz.sig") are copied into the first line as well as the second line of the file.
  - `assert len(sigs)` returns a value of 2.

ctb · 2021-04-24T23:50:52Z

tests/test_sourmash.py

+    from sourmash.sourmash_args import load_pathlist_from_file
+    with pytest.raises(ValueError) as e:
+        load_pathlist_from_file("")
+        assert "cannot open file ''" in e.message


note, you should de-indent here, because the assert statement is never reached - this is because load_pathlist_from_file raises a ValueError, which ends the block and is caught by the with statement.

Which lines should I de-indent?

I'll ask a question instead ;).

try putting an assert 0 after the load_pathlist_from_file call - does the test fail, as you'd expect it would?

also see:

asserting with the assert statement

with pytest.raises documentation

[context managers](https://www.python.org/dev/peps/pep-0343/#id36)

python exceptions and try/except blocks may also be useful.

And note that you can put print statements in tests and the output will be displayed only when the test fails, so they're useful for debugging tests.
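
To make the de-indent point concrete, here is a minimal sketch. It assumes the standard library's `unittest` `assertRaises` context manager as a stand-in for `pytest.raises` (both leave the `with` body at the first raised exception), and `fake_load_pathlist` is a hypothetical stub, not sourmash's function.

```python
import unittest


def fake_load_pathlist(filename):
    """Hypothetical stand-in for load_pathlist_from_file; always raises."""
    raise ValueError(f"cannot open file '{filename}'")


tc = unittest.TestCase()
reached = []

# Indented version: the line after the call never runs, because the
# ValueError leaves the with block immediately.
with tc.assertRaises(ValueError) as cm:
    fake_load_pathlist("")
    reached.append("inside")  # skipped; an `assert 0` here would also be skipped

# De-indented version: this code runs after the with block caught the error.
reached.append("outside")
message = str(cm.exception)  # with pytest.raises, this would be str(e.value)
assert "cannot open file ''" in message
```

Only `"outside"` ends up in `reached`, which is why an assertion placed inside the block cannot fail the test.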

ctb · 2021-04-24T23:54:26Z

thanks! These are good checks, and I agree load_pathlist_from_file should return better errors on them!

To fix them, you'd want to have load_pathlist_from_file raise ValueError with appropriate error messages for each of them.

For example, to fix the empty error, you could add another if statement in https://github.com/dib-lab/sourmash/blob/beb3d5963332aba162c530cf612739206d7efbd6/src/sourmash/sourmash_args.py#L389 that checks to see if file_list is empty or not, and if it is, raises a ValueError.

Is this something you want to pursue?

keyabarve · 2021-04-24T23:57:21Z

Yes, I would like to continue working on this. I will try fixing it.

ctb · 2021-04-25T00:35:40Z

> Yes, I would like to continue working on this. I will try fixing it.

excellent!

I think a good high level summary of what to do overall is "take any of these conditions or errors and turn them into statements that raise ValueError (with informative error messages) in the load_pathlist_from_file function."
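
As a rough illustration of that summary, the sketch below turns each failure condition into a `ValueError` with a message. It is not the sourmash implementation: `load_pathlist_sketch` and `error_for` are hypothetical names, and the wording of the empty-file message is an assumption.

```python
import os
import tempfile


def load_pathlist_sketch(filename):
    """Hypothetical sketch: return the paths listed one per line in filename."""
    try:
        with open(filename, "rt") as fp:
            file_list = [line.strip() for line in fp if line.strip()]
    except OSError:
        # Covers a missing or unreadable pathlist file, including "".
        raise ValueError(f"cannot open file '{filename}'")

    if not file_list:
        # Assumed wording for the empty-file case.
        raise ValueError(f"pathlist file '{filename}' is empty")

    for path in file_list:
        if not os.path.exists(path):
            raise ValueError(f"file '{path}' inside the pathlist does not exist")
    return file_list


def error_for(filename):
    """Return the ValueError message for filename, or None if loading succeeds."""
    try:
        load_pathlist_sketch(filename)
    except ValueError as exc:
        return str(exc)
    return None


# Exercise the empty-file and cannot-open cases.
with tempfile.TemporaryDirectory() as tmpdir:
    empty_path = os.path.join(tmpdir, "empty.txt")
    open(empty_path, "w").close()
    empty_message = error_for(empty_path)

missing_message = error_for("")
```

Keeping every failure as a `ValueError` means callers and tests only need to handle one exception type and can check the message text.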

keyabarve · 2021-04-27T00:17:38Z

@ctb @luizirber
I had a few questions regarding the function load_pathlist_from_file:

- What exactly does the argument "filename" passed to the function load_pathlist_from_file contain? Is "filename" a single file, which contains a list of files?
- What does the variable "file_list" that is returned by this function contain? Is it an array containing the list of files from "filename"?
- When we are checking for all the test cases (file doesn't exist, empty file, badly formatted file, and duplicate file), are we checking all these for each of the files inside "filename"?
- So, while writing the different if statements in this function, should I loop through "file_list" to check each of the files?

ctb · 2021-04-27T00:29:17Z

On Mon, Apr 26, 2021 at 05:17:55PM -0700, Keya Barve wrote:

> - What exactly does the argument "filename" passed to the function `load_pathlist_from_file` contain? Is "filename" a single file, which contains a list of files?

yes.

> - What does the variable "file_list" that is returned by this function contain? Is it an array containing the list of files from "filename"?

that is defined by the function, right? there's no mystery here, you have the full source code :). (what is the answer, based on the current code?)

> - When we are checking for all the test cases (file doesn't exist, empty file, badly formatted file, and duplicate file), are we checking all these for each of the files inside "filename"?

I believe that most of those conditions are handled by existing test code, so you don't need to check them explicitly. The only exception is duplicate file. I think perhaps we should collapse duplicate entries in the file by returning a set rather than a list.

> - So, while writing the different if statements in this function, should I loop through "file_list" to check each of the files?

Nope. You might be interested in reading up on "duck typing" in Python, BTW; as it is, the load_pathlist function takes any stringlike object and returns an iterable, but those are the only operational constraints at the moment. And that's intentional; there's no reason to be more specific about what this function takes and returns.

keyabarve · 2021-04-27T00:48:49Z

Alright, thanks!

luizirber · 2021-04-29T04:05:52Z

Related discussion: #1410

keyabarve · 2021-04-30T18:24:02Z

@ctb @luizirber
Whenever I test any of the test cases, it gives me a type error, saying TypeError: stat: path should be string, bytes, os.PathLike or integer, not list
This error is coming from the function:

def exists(path):
        """Test whether a path exists.  Returns False for broken symbolic links"""
        try:
>           os.stat(path)
E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

Could you provide some tips on how I can fix this?

ctb · 2021-04-30T18:25:37Z

On Fri, Apr 30, 2021 at 11:24:24AM -0700, Keya Barve wrote:

> @ctb @luizirber
> Whenever I test any of the test cases, it gives me a type error, saying ```TypeError: stat: path should be string, bytes, os.PathLike or integer, not list```
> This error is coming from the function:
> ```
> def exists(path):
>         """Test whether a path exists.  Returns False for broken symbolic links"""
>         try:
> >           os.stat(path)
> E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not list
> ```

what is path? do: print((path,), type(path))
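
A minimal sketch of that debugging step, under the assumption that a whole list was passed where a single path was expected. The file names and the `describe` helper are hypothetical.

```python
import os


def describe(value):
    """Return the repr and type name, mirroring print((path,), type(path))."""
    return f"{(value,)!r} {type(value).__name__}"


paths = ["a.sig", "b.sig"]  # hypothetical pathlist contents

# Passing the whole list to os.stat reproduces the kind of TypeError
# shown in the traceback.
try:
    os.stat(paths)
    stat_error = None
except TypeError as exc:
    stat_error = str(exc)

print(describe(paths))  # shows that a list, not a single path, was passed

# Checking one entry at a time gives os.path.exists the string it expects.
results = [os.path.exists(p) for p in paths]
```

Wrapping the value in a tuple before printing makes empty strings and stray whitespace visible in the output.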

keyabarve · 2021-04-30T18:28:13Z

Where should I write this print statement? Should I write it in load_pathlist_from_file? There is no path variable in this function, so I'm not sure if it will print anything.

ctb · 2021-04-30T18:32:57Z

On Fri, Apr 30, 2021 at 11:28:34AM -0700, Keya Barve wrote:

> Where should I write this print statement? Should I write it in `load_pathlist_from_file`? There is no path variable in this function, so I'm not sure if it will print anything.

either where you call load_pathlist_from_file, or in load_pathlist_from_file. please remember you control all the code here, so you can print out variables where you want :)

keyabarve · 2021-04-30T18:58:58Z

@ctb @luizirber
I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the test_load_pathlist_from_file_badly_formatted_2 test case?

ctb · 2021-04-30T19:31:21Z

> @ctb @luizirber
> I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the test_load_pathlist_from_file_badly_formatted_2 test case?

what about collapsing file_list automatically, using an ordered dictionary so that the files remain in the same order but get collapsed? You could also do that within your for loop using a set underneath to see if we've already seen a file.

We probably don't need to emit a warning in the case of duplicate files, because if you passed duplicate files into most sourmash commands, sourmash would invisibly deduplicate them anyway (in search/gather/sketch).
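
The two collapsing approaches mentioned above could look roughly like this. Both function names are hypothetical, and `other.sig` is a made-up entry added only to show that order is kept.

```python
from collections import OrderedDict


def collapse_with_ordered_dict(paths):
    """Drop duplicates, keeping each path at its first-seen position."""
    return list(OrderedDict.fromkeys(paths))


def collapse_with_set(paths):
    """Same result, using a set inside a loop to remember what was seen."""
    seen = set()
    unique = []
    for path in paths:
        if path in seen:
            continue  # silently ignore the duplicate, no warning
        seen.add(path)
        unique.append(path)
    return unique


paths = [
    "compare/genome-s10.fa.gz.sig",
    "compare/genome-s10.fa.gz.sig",
    "other.sig",  # hypothetical second entry
]
collapsed_a = collapse_with_ordered_dict(paths)
collapsed_b = collapse_with_set(paths)
```

The loop version fits naturally if the function already iterates over the entries to validate them. Returning a plain `set` would also remove duplicates but would not preserve the order of the file.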

keyabarve · 2021-04-30T19:57:50Z

> > @ctb @luizirber
> > I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the test_load_pathlist_from_file_badly_formatted_2 test case?
>
> what about collapsing file_list automatically, using an ordered dictionary so that the files remain in the same order but get collapsed? You could also do that within your for loop using a set underneath to see if we've already seen a file.
>
> We probably don't need to emit a warning in the case of duplicate files, because if you passed duplicate files into most sourmash commands, sourmash would invisibly deduplicate them anyway (in search/gather/sketch).

What do you mean by collapsing file_list? Should I be doing this for the duplicate files?

Also, should I remove the test case for the duplicate files entirely?

ctb · 2021-04-30T20:10:10Z

On Fri, Apr 30, 2021 at 12:58:09PM -0700, Keya Barve wrote:

> > @ctb @luizirber
> > I think the code works fine for all the test cases except for the duplicate files. Since I have created a loop, should I remove the `test_load_pathlist_from_file_badly_formatted_2` test case?
> >
> > what about collapsing `file_list` automatically, using an [ordered dictionary](https://docs.python.org/3/library/collections.html#collections.OrderedDict) so that the files remain in the same order but get collapsed? You could also do that within your for loop using a `set` underneath to see if we've already seen a file.
> >
> > We probably don't need to emit a warning in the case of duplicate files, because if you passed duplicate files into most sourmash commands, sourmash would invisibly deduplicate them anyway (in search/gather/sketch).
>
> What do you mean by collapsing `file_list`? Should I be doing this for the duplicate files?

collapsing => ignoring duplicates.

> Also, should I remove the test case for the duplicate files entirely?

I would only remove tests if they are entirely redundant or serve no purpose whatsoever.

ctb · 2021-04-30T20:23:57Z

src/sourmash/sourmash_args.py

-
-        if not os.path.exists(file_list[0]):
-            raise ValueError("first element of list-of-files does not exist")
+        if len(file_list) == 0:


note, this can be abbreviated more Pythonically as:

if not file_list: ...

(in Python, empty lists evaluate to False)
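
A small sketch comparing the two spellings. The helper names are hypothetical and exist only to put the two checks side by side.

```python
def is_empty_explicit(file_list):
    """The original check from the diff."""
    return len(file_list) == 0


def is_empty_pythonic(file_list):
    """The abbreviated check: an empty list is falsy."""
    return not file_list


empty = []
non_empty = ["compare/genome-s10.fa.gz.sig"]

same_for_empty = is_empty_explicit(empty) == is_empty_pythonic(empty)
same_for_non_empty = is_empty_explicit(non_empty) == is_empty_pythonic(non_empty)

# One difference worth knowing: `not None` is True, while len(None) raises TypeError.
none_is_falsy = is_empty_pythonic(None)
```

For lists the two checks agree. The abbreviated form also treats `None` as empty, which may or may not be what the caller wants.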

Okay, I'll fix that!

Please revisit this comment, thanks - https://github.com/dib-lab/sourmash/pull/1469/files#r624171173

ctb · 2021-04-30T20:25:20Z

src/sourmash/sourmash_args.py

+            if not os.path.exists(file_list[i]):
+                cnt = cnt + 1
+        if cnt > 0:
+            raise ValueError("list-of-files contains a badly formatted file")


Is this error message right? You're checking if the file exists; if I were a user I would not know that "badly formatted" means "does not exist"

Also, suggest using "pathlist" rather than "list of files".

Could you suggest a way to check for badly formatted files?

I think it's probably fine to just say, "file '' does not exist".

ctb · 2021-04-30T20:28:15Z

> Could you suggest a way to check for badly formatted files?

I'm afraid I don't really have any good suggestions off the top of my head!

keyabarve · 2021-04-30T20:34:59Z

@ctb @luizirber
This is the latest version of my code. According to this, all the test cases pass. I still need to incorporate some of the changes suggested.

ctb · 2021-05-18T01:24:20Z

hi @keyabarve looks like there's still a test failing 😭 . Could you take a look and give it a stab, or ask some questions?

keyabarve · 2021-05-18T01:25:37Z

@ctb Sure! I'll take a look at it! Which test is failing exactly?

ctb · 2021-05-18T01:27:42Z

On Mon, May 17, 2021 at 06:25:53PM -0700, Keya Barve wrote:

> @ctb Sure! I'll take a look at it! Which test is failing exactly?

you can dig around in the "some checks were not successful" to find out, or run all the tests yourself, locally.

ctb · 2021-05-20T12:57:26Z

hi @keyabarve you could usefully add a test to catch this situation: #1537

ctb · 2021-05-20T12:57:36Z

sorry, a fix AND a test :)

keyabarve · 2021-05-20T20:23:50Z

@ctb My latest code already takes care of the issue and I already have a test for it! :)

ctb · 2021-05-21T20:16:10Z

tests/test_sourmash.py

@@ -199,7 +252,6 @@ def test_do_compare_quiet(c):
    testdata1 = utils.get_test_data('short.fa')
    testdata2 = utils.get_test_data('short2.fa')
    c.run_sourmash('compute', '-k', '31', testdata1, testdata2)
-


note - please try to avoid changes in bits of the code that aren't relevant to the issues tackled by this PR - it adds a bit of confusion to the review :)

Sorry, it's probably something I did before and forgot to revert! Should I fix it and push again?

no need, just for future consideration!

keyabarve · 2021-05-21T20:24:28Z

Is this ready to be merged? Or is there anything else that needs to be fixed?

ctb · 2021-05-21T20:26:31Z

I think this error message is misleading --

                 raise ValueError(f"file '{checkfile}' listed inside the pathlist contains a badly formatted file")

and should be replaced with the same text as below,

         raise ValueError(f"pathlist file '{filename}' does not exist")

keyabarve · 2021-05-21T20:27:56Z

Is this in the load_pathlist_from_file function? Should I remove the for loop then? I'm having a hard time understanding why it needs to be replaced.

keyabarve · 2021-05-21T21:03:51Z

@ctb This is the error I'm getting if I change the statement:

 assert "pathlist file '' does not exist" in "pathlist file '{'a':1}' does not exist"
E        +  where "pathlist file '{'a':1}' does not exist" = str(ValueError("pathlist file '{'a':1}' does not exist"))
E        +    where ValueError("pathlist file '{'a':1}' does not exist") = <ExceptionInfo ValueError("pathlist file '{'a':1}' does not exist") tblen=2>.value

ctb · 2021-05-21T21:22:21Z

I'm not sure how to comment, other than to say that this logic looks inherently problematic to me :)

        for checkfile in file_list:
            if not os.path.exists(checkfile):
                raise ValueError(f"file '{checkfile}' listed inside the pathlist contains a badly formatted file")

since the error message is saying "this is a badly formatted file" but the if statement is checking to see if that file exists... I think the useful thing to say is "this file doesn't exist".

keyabarve · 2021-05-21T21:25:02Z

Hmm I see. Should I do this then: raise ValueError(f"file '{checkfile}' inside the pathlist does not exist")

keyabarve · 2021-05-21T21:26:28Z

The problem that I'm facing is that it's giving me the error I mentioned above. I'm confused about how to fix that.
Is there any other way to check for a badly formatted file in the function itself?

keyabarve · 2021-05-21T21:29:22Z

The assert statement in my test case is this: assert "file '' inside the pathlist does not exist" in str(e.value). If I change that to assert "file '{'a':1}' inside the pathlist does not exist" in str(e.value), then the tests pass. So should I do that then?

ctb · 2021-05-21T21:30:29Z

On Fri, May 21, 2021 at 02:29:36PM -0700, Keya Barve wrote:

> The assert statement in my test case is this: `assert "file '' inside the pathlist does not exist" in str(e.value)`. If I change that to `assert "file '{'a':1}' inside the pathlist does not exist" in str(e.value)`, then the tests pass. So should I do that then?

Sure, give it a try.
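
The reason the first assertion fails and the second passes is plain substring matching: the raised message embeds the bad entry, so the expected text has to embed it too. A minimal sketch, using the message wording proposed earlier in the thread and a hypothetical helper name:

```python
def missing_file_message(checkfile):
    """Build the error text for a pathlist entry that does not exist."""
    return f"file '{checkfile}' inside the pathlist does not exist"


# The bad entry written into the pathlist by the test.
message = missing_file_message("{'a':1}")

# Expected text with an empty filename is not a substring of the message...
matches_empty_name = "file '' inside the pathlist does not exist" in message

# ...while text that includes the bad entry is.
matches_bad_entry = "file '{'a':1}' inside the pathlist does not exist" in message

# A looser check on the fixed part of the message also passes.
matches_suffix = "inside the pathlist does not exist" in message
```

Asserting on the full message, including the entry, is the stricter test, since it also confirms that the error names the right file.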

keyabarve · 2021-05-21T21:53:22Z

Please review. @ctb

ctb

Looks good to me! I'll update from latest and then one of us can merge when all the tests pass!

Nice work - I know it took a while ;)

keyabarve · 2021-05-21T22:29:53Z

Alright! Thanks! By merge, you mean 'Squash and merge' right?

keyabarve · 2021-05-21T22:51:33Z

Ready for merge. @ctb

ctb · 2021-05-21T23:05:14Z

🎉

Added 2 test cases

8971219

Badly formatted and Duplicate Files tests

bbdef96

ctb reviewed Apr 24, 2021

Made changes to the function to handle some test cases

9706c5b

All test cases except duplicate files fixed

45b4895

ctb reviewed Apr 30, 2021

All test cases pass

e2bbc91

taylorreiter mentioned this pull request May 19, 2021

Sourmash load_file_as_signatures on empty signatures produces IndexError #1537

Closed

Fixed failing test case

077884f

ctb reviewed May 21, 2021

Made changes for badly formatted file test

544da7e

ctb approved these changes May 21, 2021

Merge branch 'latest' into KB_1430

c48bfca

ctb changed the title ~~[WIP] Added tests for the function load_pathlist_from_file~~ [MRG] Added tests for the function load_pathlist_from_file May 21, 2021

ctb merged commit 8e7df87 into latest May 21, 2021

ctb deleted the KB_1430 branch May 21, 2021 23:05

ctb mentioned this pull request May 21, 2021

Draft release notes for v4.1.1 #1535

Closed

ctb mentioned this pull request Jun 15, 2021

what is the right way to validate a file containing a list-of-files? #1410

Closed
