Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error with white spaces & other wrong characters in column names #14

Closed
iuiu34 opened this issue Aug 27, 2018 · 6 comments
Closed

error with white spaces & other wrong characters in column names #14

iuiu34 opened this issue Aug 27, 2018 · 6 comments

Comments

@iuiu34
Copy link

iuiu34 commented Aug 27, 2018

When the schema is created, column names with spaces are writen as they are.
Therefore, when uploading to bq generates the following error
<BigQuery error in load operation: Invalid field name "utm_medium-partners". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.>

Would be posible to substitute blank spaces and other wrong characters with '_' as the '--autodetect' option does?
For example:
'Column.example 1' is written as 'Column_example_1'

@bxparks
Copy link
Owner

bxparks commented Aug 27, 2018

Hi,
Can you describe the bq load command where you see the automatic normalization of field names? Because when I run bq load --autodetect ... with invalid field names, it prints an error message. I don't see it automatically normalizing the field.

$ cat invalid_name.json.data
{ "Column.example 1": [1, 2] }
{ "example2": 3 }

$  bq load --source_format NEWLINE_DELIMITED_JSON --ignore_unknown_values --autodetect tmp.invalid invalid_name.data.json

Error in query string: Error processing job 'vital-future-582:bqjob_r6c4ce23bd29c0cf4_000001657d7b2300_1': Invalid field name "Column.example 1". Fields must
contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. Table:
invalid_e9a4f223_84d6_4ca3_900d_7e17c34bf5e4_source

It's easy enough to make generate_schema.py perform this normalization, if bq load is able to make use of it.

@bxparks
Copy link
Owner

bxparks commented Aug 27, 2018

I also tried to upload using the web interface in the BigQuery Console, with the "auto detect" checkbox selected, and I get the same error:

Invalid field name "Column.example 1". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. Table: invalid_11c968ff_2e40_4c8a_a8ec_eca2fd442215_source

@iuiu34
Copy link
Author

iuiu34 commented Aug 28, 2018

You're right. I mistake with csv.
I'll ask for the feature in the Google Platform: that the ingestion automatically remove invalid characters in columns.

@bxparks
Copy link
Owner

bxparks commented Aug 28, 2018

I suspect that they will just tell you to pre-process your JSON file to modify your invalid column names. The code to do that is fairly straightforward. For example, save the following as convert.py, and run it like this:

$ ./convert.py < raw.data.json > converted.data.json

convert.py:

#!/usr/bin/env python3

from collections import OrderedDict
import json
import logging
import re
import sys

FIELD_MATCHER = re.compile(r'(^[^a-zA-Z])|(\W)')

line_number = 0


def normalize_file(file):
    global line_number
    for line in file:
        line_number += 1
        json_object = json.loads(line)

        if not isinstance(json_object, dict):
            logging.error('%s: Top level record must be a dict but was a %s',
                         line_number, type(json_object))
            continue

        try:
            normalized_dict = normalize_dict(json_object)
        except Exception as e:
            logging.error('%s: %s', line_number, e)
            continue
        json.dump(normalized_dict, sys.stdout)
        print()
    logging.info("Processed %s lines", line_number)


def normalize_dict(json_dict):
    normalized_dict = OrderedDict()
    for key, value in json_dict.items():
        key = FIELD_MATCHER.sub('_', key)
        if isinstance(value, dict):
            value = normalize_dict(value)
        normalized_dict[key] = value
    return normalized_dict


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    normalize_file(sys.stdin)

@bxparks
Copy link
Owner

bxparks commented Oct 25, 2018

Hi Iuiu,
Can I close this issue, or do you have additional questions/comments?

@bxparks
Copy link
Owner

bxparks commented Dec 17, 2018

Closing with no activity for 4 months. Let me know if you have further questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants