ERROR: CSV parse error #14

bernd-wechner · 2021-03-18T01:04:37Z

I'm trying to diff two CSV files and csv-diff just responds with:

ERROR: CSV parse error on line 2

So I did the same thing using it as a Python package (that is, I wrote a Python script that loads my two files and runs csv-diff on them as per the README) and I got a different error:

KeyError: 'my_key'

I double-checked the key and it is there, as column 1 in both files, which load fine in LibreOffice Calc and in Excel and look fine in a text editor.

So I looked at the file encoding, and Python's magic library tells me:

'UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators'

so if I open the file with an encoding of "utf-8-sig" all works fine.
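A minimal sketch of what is probably going on with the KeyError, using only the standard library. The temporary file and the first_header helper are made up for illustration; my_key is the column name from the error above. When a file with a BOM is decoded as plain "utf-8", the BOM survives as an invisible U+FEFF character stuck to the front of the first column name, so a lookup of 'my_key' fails even though the header looks right in an editor.

```python
import csv
import os
import tempfile


def first_header(path, encoding):
    """Return the first column name that csv.DictReader sees in the file."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return csv.DictReader(f).fieldnames[0]


# Write a small CSV the way some exporters do: UTF-8 with a BOM, CRLF line ends.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfmy_key,value\r\n1,a\r\n")

plain = first_header(path, "utf-8")          # BOM stays glued to the header
with_sig = first_header(path, "utf-8-sig")   # BOM is stripped while decoding
os.remove(path)

print(repr(plain))     # '\ufeffmy_key'
print(repr(with_sig))  # 'my_key'
```

Whether the CLI's "CSV parse error on line 2" has the same root cause is not something this sketch can confirm.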

Seems to me to be a file encoding issue, and one I have encountered in Python a lot, so I wrote this:

import magic  # the python-magic package


def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python. Alas.
    
    A quick summary:
    
    1. I come across CSV files with a UTF-8 or UTF-16 encoding regularly enough.
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is optional in UTF-8 and irrelevant to it
    5. In fact Unicode standards recommend against including a BOM with UTF-8
        https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python assumes it's not there
    7. Some CSV sources though write with a BOM
    8. The encoding must therefore be specified as:
        utf-16     for UTF-16 files
        utf-8       for UTF-8 files with no BOM
        utf-8-sig for UTF-8 files with a BOM
    9. The "magic" library reliably determines the encoding efficiently by looking
       at the magic numbers at the start of a file
    10. Alas it returns a rich string describing the encoding.
    11. It contains either UTF-16 or UTF-8
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle a quick function to translate "magic" output
        to standard encoding names is here.
    
    :param filepath: The path to a file
    '''
    m = magic.from_file(filepath)
    utf16 = "UTF-16" in m
    utf8 = "UTF-8" in m
    bom = "(with BOM)" in m
    
    if utf16:
        return "utf-16"
    elif utf8:
        if bom:
            return "utf-8-sig"
        else:
            return "utf-8"
    # Anything else: return None, so open() falls back to the platform default.
    return None

and then if I run:

from csv_diff import load_csv, compare

with open(File1, "r", encoding=file_encoding(File1), newline='') as f1:
    csv1 = load_csv(f1, key=key)
    
with open(File2, "r", encoding=file_encoding(File2), newline='') as f2:
    csv2 = load_csv(f2, key=key)

diff = compare(csv1, csv2)

all is good and I get a reliable diff.

I can't work out how to debug the CLI in PyDev, alas. I'm a tad green in this space, it seems. But setup.py build just creates a build folder with a lib folder with __init__.py and cli.py in it. Yet my Windows box (man, I hate Windows, but I'm stuck there right now) runs a csvdiff.exe which was presumably installed by pip when I installed csv-diff (pip install csv-diff). But I can't see how to run the CLI from the source. Guess I could do some reading on click and setuptools, but hey, for the moment I have it working via its Python package interface and can run with that.

If the CLI error is in fact related to this encoding issue (hard to know for sure), then it could of course be fixed by including an encoding check as above and opening the files with their appropriate encoding. Frankly, it'd be nice if Python's open() could better guess the encoding (the way magic can).
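For what it's worth, a check like that does not strictly need the magic library. Below is a rough sketch, using only the standard library, that picks a codec name from the BOM bytes at the start of the file. The name sniff_bom_encoding and the "utf-8" default are assumptions for illustration, not anything csv-diff provides.

```python
import codecs


def sniff_bom_encoding(filepath, default="utf-8"):
    """Guess a codec name from the BOM at the start of a file (hypothetical helper)."""
    with open(filepath, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: this sketch cannot tell encodings apart, so it assumes the default.
    return default
```

It would be used the same way as file_encoding() above, e.g. open(File1, "r", encoding=sniff_bom_encoding(File1), newline=''). If only UTF-8 files are expected, an even simpler option is to always open with encoding="utf-8-sig", since that codec also reads UTF-8 files that have no BOM.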

patric-r · 2021-08-23T17:33:18Z

Having this feature would be awesome.

mikecoop83 · 2021-08-24T00:00:38Z

Having this feature would be awesome.

If you get a chance, could you try out my PR to see if it solves your problem?

rene-schwabe · 2021-10-28T13:52:32Z

Any chance the PR from @mikecoop83 gets merged?

mikecoop83 linked a pull request Apr 7, 2021 that will close this issue

fix comparing of csv files with non-default file encoding #19

Open
