I'm trying to diff two CSV files and csv-diff just responds with:
ERROR: CSV parse error on line 2
So I do the same thing using it as a Python package (that is, I write a Python script that loads my two files and runs csv-diff on them as per the README) and I get a different error:
KeyError: 'my_key'
I double-checked the key and it is there, as column 1 in the files, which load fine in LibreOffice Calc and in Excel and look fine in a text editor.
So I look at the file encoding and Python's magic library tells me:
'UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators'
so if I open the file with an encoding of "utf-8-sig" all works fine.
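For anyone wanting to reproduce the KeyError without my files, here is a minimal sketch using only the standard library. The two-row CSV and the `first_header` helper are made up for illustration; the point is just that decoding a BOM-prefixed file as plain utf-8 leaves the BOM character glued to the first column name.

```python
import csv
import io

# Hypothetical two-row CSV, encoded as UTF-8 with a leading BOM.
raw = "my_key,value\r\n1,a\r\n".encode("utf-8-sig")

def first_header(data, encoding):
    """Decode CSV bytes with the given encoding and return the first column name."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return csv.DictReader(text).fieldnames[0]

print(repr(first_header(raw, "utf-8")))      # '\ufeffmy_key'
print(repr(first_header(raw, "utf-8-sig")))  # 'my_key'
```

With plain utf-8 the header is '\ufeffmy_key', so a lookup of 'my_key' fails even though the column looks right in an editor.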
Seems to me to be a file encoding issue, and one I have encountered in Python a lot, so I wrote this:
import magic  # the python-magic package

def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python. Alas.

    A quick summary:

    1. I come across CSV files with a UTF-8 or UTF-16 encoding regularly enough.
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is optional in UTF-8 and irrelevant to it
    5. In fact the Unicode standard neither requires nor recommends a BOM with UTF-8
       https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python assumes it's not there
    7. Some CSV sources though write with a BOM
    8. The encoding must therefore be specified as:
         utf-16     for UTF-16 files
         utf-8      for UTF-8 files with no BOM
         utf-8-sig  for UTF-8 files with a BOM
    9. The "magic" library reliably determines the encoding efficiently by looking
       at the magic numbers at the start of a file
    10. Alas it returns a rich string describing the encoding.
    11. It contains either UTF-16 or UTF-8
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle a quick function to translate "magic" output
        to standard encoding names is here.

    :param filepath: The path to a file
    '''
    m = magic.from_file(filepath)
    utf16 = "UTF-16" in m
    utf8 = "UTF-8" in m
    bom = "(with BOM)" in m

    if utf16:
        return "utf-16"
    elif utf8:
        if bom:
            return "utf-8-sig"
        else:
            return "utf-8"
    # Anything else: return None so that open() uses its default encoding
    return None
and then if I run:
from csv_diff import load_csv, compare

# File1, File2 and key are defined earlier in my script
with open(File1, "r", encoding=file_encoding(File1), newline='') as f1:
    csv1 = load_csv(f1, key=key)

with open(File2, "r", encoding=file_encoding(File2), newline='') as f2:
    csv2 = load_csv(f2, key=key)

diff = compare(csv1, csv2)
all is good and I get a reliable diff.
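If installing libmagic is a nuisance (it can be on Windows), a rough stdlib-only alternative might look like the sketch below. `sniff_encoding` is a hypothetical helper, not part of csv-diff or python-magic. It only inspects a leading BOM, so a UTF-16 file written without a BOM would fall through to the default, which is weaker than what magic does.

```python
import codecs
import os
import tempfile

def sniff_encoding(filepath, default="utf-8"):
    """Guess a text encoding from a BOM at the start of the file (hypothetical helper)."""
    with open(filepath, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM bytes.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: assume the caller's default (an assumption, not a detection).
    return default

# Quick check against a temporary file written with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom.csv")
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write("my_key,value\r\n1,a\r\n")
    print(sniff_encoding(path))  # utf-8-sig
```

It could be dropped in wherever file_encoding is called, at the cost of not recognising BOM-less UTF-16.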
I can't work out how to debug the CLI in PyDev, alas. I'm a tad green in this space, it seems. setup.py build just creates a build folder with a lib folder containing __init__.py and cli.py. Yet my Windows box (man I hate Windows but I'm stuck there right now) runs a csvdiff.exe, which was presumably installed by pip when I installed csv-diff (pip install csv-diff). I can't see how to run the CLI from the source. I guess I could do some reading on click and setuptools, but for the moment I have it working via the Python package interface and can run with that.
If the CLI error is in fact related to this encoding issue (hard to know for sure), then it could be fixed by including an encoding check as above and opening the files with their appropriate encoding. Frankly, it'd be nice if Python's open() could better guess the encoding (the way magic can).
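A possibly simpler fix, offered as a sketch only: when decoding, Python's utf-8-sig codec skips a BOM if one is present and otherwise behaves like plain utf-8. So if the inputs are known to be UTF-8, opening them with utf-8-sig might cover both flavours without any detection step. I haven't checked how cli.py opens its files, so where this would go is an assumption. The sample strings are made up.

```python
# The same hypothetical header line, encoded with and without a BOM.
with_bom = "my_key,value\r\n".encode("utf-8-sig")
without_bom = "my_key,value\r\n".encode("utf-8")

# utf-8-sig strips the BOM when it is there and is a no-op when it is not.
decoded_with = with_bom.decode("utf-8-sig")
decoded_without = without_bom.decode("utf-8-sig")

print(decoded_with == decoded_without)  # True
```

This would not help with UTF-16 inputs, which still need their own encoding.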