Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle seeds with utf-8 BOM (#1177) #1452

Merged
merged 4 commits into from
May 10, 2019

Conversation

beckjake
Copy link
Contributor

Fixes #1177

Instead of passing agate a filepath, open the file at that path, look at the first byte, and seek back if it's not a BOM. Regardless, pass the file handle to agate instead.

That way when agate gets the file handle, if there is a BOM, it won't see it.

Jacob Beck added 3 commits May 9, 2019 17:56
Fixes some neat windows behavior where the default code page can be cp1252
Add BOM unit test
@@ -41,4 +42,7 @@ def as_matrix(table):


def from_csv(abspath):
return agate.Table.from_csv(abspath, column_types=DEFAULT_TYPE_TESTER)
with dbt.compat.open_file(abspath) as fp:
if fp.read(1) != u'\ufeff':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm explicitly opening the value as unicode (that's what open_file does), so examining the unicode code point rather than the byte sequence is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not suggesting that your comparison is wrong, just that you could replace the magic value u'\ufeff' with codecs.BOM_UTF8.decode('utf-8')

@beckjake beckjake merged commit d5774b3 into dev/wilt-chamberlain May 10, 2019
@beckjake beckjake deleted the feature/strip-seed-bom branch May 10, 2019 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strip BOM from Excel-generated UTF-8 CSV files
2 participants