Support newline-delimited JSON for seeds #2365
Comments
@drewbanin Are you saying this request is a dupe of #2276? I don’t understand the relationship. Asking for newline-delimited JSON as a seed format doesn’t seem related to Jinja blocks at all. |
Hey @stewartbryson - the issue I linked to includes an example of a newline-delimited json seed file. The included example looks like:
I do think there's an opportunity to support newline-delimited json seed files without building jinja seed blocks, but it's probably not a change we're going to prioritize in isolation. Can you tell me more about what you're looking to use newline-delimited json seed files for? That might help us better prioritize an issue like this one |
@jml Would you be able to share how you currently handle it outside of dbt? I agree this would be useful; it could help with CSV quoting issues, readability (if there were support for other delimiters), and representing unstructured data. |
Sure. We use BigQuery-specific APIs.

```python
import logging
import time
from typing import IO, Dict, List, Sequence

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery
from google.cloud.bigquery import LoadJob, SchemaField

# `config` and `get_bigquery_client` are project-internal helpers (not shown here).


def load_file_to_bigquery(
    client: bigquery.Client,
    source_file: IO[bytes],
    destination_table: str,
    load_config: bigquery.LoadJobConfig,
    rewind: bool = False,
) -> LoadJob:
    logging.info("Starting load into %s in BigQuery", destination_table)
    job = client.load_table_from_file(
        source_file, destination_table, job_config=load_config, rewind=rewind
    )
    logging.info("Loading data into BigQuery")
    duration = _wait_for_job(job)
    logging.info(
        "Done in %0.1f seconds! %s rows and %s bytes were loaded",
        duration,
        job.output_rows,
        job.output_bytes,
    )
    return job


def _wait_for_job(job: bigquery.LoadJob) -> float:
    start_time = time.monotonic()
    try:
        # Starts the load asynchronously and polls until completed, raising an exception in case of problems
        job.result()
    except BadRequest:
        logging.error(
            "Errors occurred during loading: %s", job.errors, extra={"errors": job.errors}
        )
    return time.monotonic() - start_time


def load_json_line_data_to_bigquery(
    project: str,
    location: str,
    destination_table: str,
    filename: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Load line-separated JSON files from data/ to BigQuery."""
    client = get_bigquery_client(project=project, location=location)
    path = config.RAW_DATA_PATH.joinpath(filename)
    with path.open(mode="rb") as f:
        load_file_to_bigquery(
            client=client,
            source_file=f,
            destination_table=destination_table,
            load_config=bigquery.LoadJobConfig(
                autodetect=True,
                source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
                write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            ),
        )
    update_schema_documentation(client, destination_table, table_description, field_descriptions)


def update_schema_documentation(
    client: bigquery.Client,
    table_name: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Update the documentation for a BigQuery table.

    Known limitations:
    - Does not support nested field definitions

    Feel free to fix these limitations!
    """
    table = client.get_table(table_name)
    table.description = table_description
    table.schema = _document_schema(table_name, table.schema, field_descriptions)
    client.update_table(table, ["description", "schema"])


def _document_schema(
    table_name: str, schema: Sequence[SchemaField], field_descriptions: Dict[str, str]
) -> List[SchemaField]:
    """Create a documented version of the given BigQuery schema."""
    existing_fields = set()
    new_schema = []
    for field in schema:
        description = field_descriptions.get(field.name)
        new_schema.append(
            SchemaField(
                name=field.name,
                field_type=field.field_type,
                mode=field.mode,
                description=description,
                fields=field.fields,
            )
        )
        existing_fields.add(field.name)
    undescribed_fields = existing_fields - set(field_descriptions)
    # TODO: Also raise an exception if we have described fields that don't exist.
    if undescribed_fields:
        # TODO: Raise a more specific exception so we don't have to pass table_name in.
        raise RuntimeError(f"Unexpected fields defined in {table_name}: {undescribed_fields}")
    return new_schema
```

Happy to answer any questions. Memrise Limited is making this code available under the Apache License 2.0. |
If we're willing to stick with …, we'd need to update the …. Those are implementation details. I'd still like to hear a bit more from users about the rationale for supporting NDJSON—ease of quoting/escaping? ease of debugging?—rather than requiring conversion to CSV as a preprocessing step. Unless I'm missing something big, I don't think JSON seeds would have a leg up in terms of support for semi-/unstructured data, since the ultimate aim is always conversion to tabular format (data frame). |
For us it's about having a harder-to-get-wrong format. It's really easy to get CSV wrong (mostly quoting/escaping, as you say). There's also a migration factor. Before we switched to dbt, we were loading data from NDJSON files. Now, switching to dbt, we have to make those files worse (FSVO 'worse'), which feels off, given that everything else about dbt has been an improvement. |
Our use case: Currently we maintain such JSON "seeds" as a string literal of a JSON array of these objects, parse them via Snowflake's parse_json() and flatten() the outer array. It would be much more convenient to be able to use NDJSON here, so the editor can provide syntax highlighting etc. |
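For illustration, a minimal sketch of that pattern (hypothetical model name, keys, and values; Snowflake syntax assumed) might look like this, with the JSON array kept as a string literal inside a dbt model:

```sql
-- Illustrative sketch only: parse a JSON array literal and flatten it into rows
with raw as (

    select parse_json('[
        {"code": "A", "label": "Alpha"},
        {"code": "B", "label": "Beta"}
    ]') as doc

)

select
    f.value:code::string  as code,
    f.value:label::string as label
from raw,
    lateral flatten(input => raw.doc) f
```

An NDJSON seed holding the same objects would remove the string-literal indirection and let the editor treat the file as JSON.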
If a flat table is the result that we are aiming for after loading the seed, we could live with the nested fields being included as escaped text, though, as that could be easily converted downstream. So, yes, type and quoting safety are the big benefits of JSON here. CSV is error-prone and too loosely standardized, imho. |
Same as what some others have mentioned. We use BigQuery and miss having support for nested data in seeds. One use case was testing, where it would have been nice if the seeds mirrored the structure of what was being tested; without nesting in seeds, it added an extra step to first create that nested structure from the CSV after reading it. |
Hello, we have JSONB columns in our Postgres database that we use as a source for our data. Managing JSON inside CSVs is painful, whereas having it inside NDJSON or YAML would make it much easier to manage. That is our use case for wanting something other than CSV as the seed file format. |
We use Snowflake source tables that heavily use VARIANT columns. Using CSV seeds to add sample data for testing transformations is resulting in an error. We are overriding the snowflake load macro. |
@naga-kondapaturi-zefr Are you trying to use seed-related functionality for a table that already exists in your data warehouse? Or in order to load a CSV containing a column with a JSON value? (Perhaps loading it as a string and casting it within a staging model?) |
I worked with a dbt user recently who uses Snowflake and had an interesting way to work around the lack of JSON seeds (ND or otherwise): they create dbt models that use the PARSE_JSON function to persist VARIANT-type tables, which they can then use as static datasets. For example:
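A minimal sketch of what such a model might look like (made-up names and values, not the user's actual code):

```sql
-- models/static_lookup.sql: persist an inline dataset as a VARIANT-typed table
{{ config(materialized='table') }}

select parse_json(column1) as payload
from values
    ('{"id": 1, "attributes": {"color": "red"}}'),
    ('{"id": 2, "attributes": {"color": "blue"}}')
```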
|
@drewbanin is this one on the roadmap? This would be a helpful feature for Meru for the same reason @jml mentioned:
|
@naga-kondapaturi-zefr would you be able to share the override implementation of that macro? |
Anyone in need of the hack solution: it involves overriding the Snowflake load macro.
|
@jtcohen6 , my use case is very similar to what @naga-kondapaturi-zefr mentioned above. I have a raw table in Snowflake that receives events through Kafka from an operational system. The table has only two columns, both of VARIANT type:

```sql
create table if not exists PROD.RAW.EVENTS (
    RECORD_METADATA VARIANT,
    RECORD_CONTENT VARIANT
);
```

The first column is generated by a Snowflake Connector for Kafka and the second contains the actual JSON events. We are using …

One thing to mention here: in my use case I have huge JSON events, with lots of nested levels of varying depth, that should be loaded to the VARIANT column as they are, without being parsed and loaded into specific columns, as is the case for CSV files (and I assume is the case for some people in this thread who would like to get this feature). Perhaps that is something to bear in mind for whoever works on this feature: load the data depending on the destination table (VARIANT datatype or not). This message seems to have irrelevant details, but it's for the purpose of providing more context and understanding the use case. |
@jtcohen6 / @dbeatty10 please can you advise what the current status is of considering support for JSON seeds? This issue is pretty long so I've not read all of the proposals / use cases, but it seems to me like a very reasonable and generic thing to want: loading seed data that contains nested data. At which point, you're dead in the water with our primitive friend, CSV. As far as I can tell, the only workaround right now is to create a CSV with some JSON columns and then to create an intermediate model on top of the seed that parses the JSON - it makes for ugly maintenance of the CSV data though, as a developer would find it much easier to just maintain a JSON doc. |
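For illustration, a hedged sketch of that workaround (a hypothetical seed `my_seed` with a string column `payload_json` holding escaped JSON, parsed in a staging model on Snowflake):

```sql
-- models/stg_my_seed.sql: parse the escaped-JSON column from a CSV seed
select
    id,
    parse_json(payload_json) as payload
from {{ ref('my_seed') }}
```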
Not sure if it's worth adding another issue for this, but YAML would also be good here for multiline strings. E.g. I want to store prompts in a database and use seeds to add them; the YAML would look something like this:

```yaml
- id: 9ryuf89wh
  prompt: |
    your task is to do x
    you should do it via y
    here is the data: {data}
- id: i8d3h89
  prompt: |
    your task is to do c
    you should do it via b
    here is the data: {data}
```
|
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
I removed the stale label. |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Commenting on the issue to keep it open. We'd still like this functionality. |
Describe the feature
Allow dbt users to load in data from newline-delimited JSON files, rather than just CSV.
CSV is barely a format, and has lots of ambiguities around quoting, names, and so forth. Supporting newline-delimited JSON would make things easier and less error-prone.
Describe alternatives you've considered
Additional context
Loading newline-delimited data into BigQuery is really easy: there's an API for it.
Who will this benefit?
Anyone looking into seeds who already has their data in newline-delimited JSON format (hi!). Anyone who has ever tried to debug dodgy quoting issues in CSV.