Support newline-delimited JSON for seeds #2365
Comments
@drewbanin Are you saying this request is a dupe of #2276? I don’t understand the relationship. Asking for newline-delimited JSON as a seed format doesn’t seem related to Jinja blocks at all. |
Hey @stewartbryson - the issue I linked to includes an example of a newline-delimited json seed file. The included example looks like:
I do think there's an opportunity to support newline-delimited json seed files without building jinja seed blocks, but it's probably not a change we're going to prioritize in isolation. Can you tell me more about what you're looking to use newline-delimited json seed files for? That might help us better prioritize an issue like this one |
@jml Would you be able to share how you currently handle it outside of dbt? I agree this would be useful; it could help with CSV quoting issues, readability (if there were support for other delimiters), and representing unstructured data. |
Sure. We use BigQuery-specific APIs.

```python
import logging
import time
from typing import IO, Dict, List, Sequence

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery
from google.cloud.bigquery import LoadJob, SchemaField

# `config` and `get_bigquery_client` are project-internal helpers (not shown here).


def load_file_to_bigquery(
    client: bigquery.Client,
    source_file: IO[bytes],
    destination_table: str,
    load_config: bigquery.LoadJobConfig,
    rewind: bool = False,
) -> LoadJob:
    logging.info("Starting load into %s in BigQuery", destination_table)
    job = client.load_table_from_file(
        source_file, destination_table, job_config=load_config, rewind=rewind
    )
    logging.info("Loading data into BigQuery")
    duration = _wait_for_job(job)
    logging.info(
        "Done in %0.1f seconds! %s rows and %s bytes were loaded",
        duration,
        job.output_rows,
        job.output_bytes,
    )
    return job


def _wait_for_job(job: bigquery.LoadJob) -> float:
    start_time = time.monotonic()
    try:
        # Starts the load asynchronously and polls until completed, raising an exception in case of problems
        job.result()
    except BadRequest:
        logging.error(
            "Errors occurred during loading: %s", job.errors, extra={"errors": job.errors}
        )
    return time.monotonic() - start_time


def load_json_line_data_to_bigquery(
    project: str,
    location: str,
    destination_table: str,
    filename: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Load line-separated JSON files from data/ to BigQuery."""
    client = get_bigquery_client(project=project, location=location)
    path = config.RAW_DATA_PATH.joinpath(filename)
    with path.open(mode="rb") as f:
        load_file_to_bigquery(
            client=client,
            source_file=f,
            destination_table=destination_table,
            load_config=bigquery.LoadJobConfig(
                autodetect=True,
                source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
                write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            ),
        )
    update_schema_documentation(client, destination_table, table_description, field_descriptions)


def update_schema_documentation(
    client: bigquery.Client,
    table_name: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Update the documentation for a BigQuery table.

    Known limitations:
    - Does not support nested field definitions

    Feel free to fix these limitations!
    """
    table = client.get_table(table_name)
    table.description = table_description
    table.schema = _document_schema(table_name, table.schema, field_descriptions)
    client.update_table(table, ["description", "schema"])


def _document_schema(
    table_name: str, schema: Sequence[SchemaField], field_descriptions: Dict[str, str]
) -> List[SchemaField]:
    """Create a documented version of the given BigQuery schema."""
    existing_fields = set()
    new_schema = []
    for field in schema:
        description = field_descriptions.get(field.name)
        new_schema.append(
            SchemaField(
                name=field.name,
                field_type=field.field_type,
                mode=field.mode,
                description=description,
                fields=field.fields,
            )
        )
        existing_fields.add(field.name)
    undescribed_fields = existing_fields - set(field_descriptions)
    # TODO: Also raise an exception if we have described fields that don't exist.
    if undescribed_fields:
        # TODO: Raise a more specific exception so we don't have to pass table_name in.
        raise RuntimeError(f"Unexpected fields defined in {table_name}: {undescribed_fields}")
    return new_schema
```

Happy to answer any questions. Memrise Limited is making this code available under the Apache License 2.0. |
If we're willing to stick with …, we'd need to update the …. Those are implementation details. I'd still like to hear a bit more from users about the rationale for supporting NDJSON—ease of quoting/escaping? ease of debugging?—rather than requiring conversion to CSV as a preprocessing step. Unless I'm missing something big, I don't think JSON seeds would have a leg up in terms of support for semi-/unstructured data, since the ultimate aim is always conversion to tabular format (data frame). |
For us it's about having a harder-to-get-wrong format. It's really easy to get CSV wrong (mostly quoting/escaping, as you say). There's also a migration factor. Before we switched to dbt, we were loading data from NDJSON files. Now, switching to dbt, we have to make those files worse (FSVO 'worse'), which feels off, given that everything else about dbt has been an improvement. |
Our use case: Currently we maintain such JSON "seeds" as a string literal of a JSON array of these objects, parse them via Snowflake's parse_json() and flatten() the outer array. It would be much more convenient to be able to use NDJSON here, so the editor can provide syntax highlighting etc. |
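For illustration, a minimal sketch of that pattern (hypothetical model name, keys, and values; Snowflake syntax assumed) might look like this, with the JSON array kept as a string literal inside a dbt model:

```sql
-- Illustrative sketch only: parse a JSON array literal and flatten it into rows
with raw as (

    select parse_json('[
        {"code": "A", "label": "Alpha"},
        {"code": "B", "label": "Beta"}
    ]') as doc

)

select
    f.value:code::string  as code,
    f.value:label::string as label
from raw,
    lateral flatten(input => raw.doc) f
```

An NDJSON seed holding the same objects would remove the string-literal indirection and let the editor treat the file as JSON.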
If a flat table is the result that we are aiming for after loading the seed, we could live with the nested fields being included as escaped text, though, as that could be easily converted downstream. So, yes, type and quoting safety are the big benefits of JSON here. CSV is error-prone and too loosely standardized, imho. |
Same as what some others have mentioned. We use BigQuery and miss having support for nested data in seeds. One use case was testing, where it would have been nice if the seeds mirrored the structure of what was being tested; without nesting in seeds, it added an extra step to first create that nested structure from the CSV after reading it. |
Hello, we have JSONB columns in our Postgres database that we use as a source for our data. Managing JSON inside CSVs is painful, whereas having it inside NDJSON or YAML would make it much easier to manage. That is our use case for wanting something other than CSV as the seed file format. |
We use Snowflake source tables that heavily use VARIANT columns. Using CSV seeds to add sample data for testing transformations is resulting in an error. We are overriding the snowflake load macro. |
@naga-kondapaturi-zefr Are you trying to use seed-related functionality for a table that already exists in your data warehouse? Or in order to load a CSV containing a column with a JSON value? (Perhaps loading it as a string and casting it within a staging model?) |
I worked with a dbt user recently who uses Snowflake and had an interesting way to work around the lack of JSON seeds (ND or otherwise): they create dbt models that use the PARSE_JSON function to persist VARIANT-type tables, which they can then use as static datasets. For example:
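A minimal sketch of what such a model might look like (made-up names and values, not the user's actual code):

```sql
-- models/static_lookup.sql: persist an inline dataset as a VARIANT-typed table
{{ config(materialized='table') }}

select parse_json(column1) as payload
from values
    ('{"id": 1, "attributes": {"color": "red"}}'),
    ('{"id": 2, "attributes": {"color": "blue"}}')
```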
|
@drewbanin is this one on the roadmap? This would be a helpful feature for Meru for the same reason @jml mentioned:
|
@naga-kondapaturi-zefr would you be able to share the override implementation of that macro? |
Anyone in need of the hack solution: it involves overriding the Snowflake load macro.
|
@jtcohen6 , my use case is very similar to what @naga-kondapaturi-zefr mentioned above. I have a raw table in Snowflake that receives events through Kafka from an operational system. The table has only two columns, both of VARIANT type:

```sql
create table if not exists PROD.RAW.EVENTS (
    RECORD_METADATA VARIANT,
    RECORD_CONTENT VARIANT
);
```

The first column is generated by a Snowflake Connector for Kafka and the second contains the actual JSON events. We are using …

One thing to mention here: in my use case I have huge JSON events, with lots of nested levels of varying depth, that should be loaded to the VARIANT column as they are, without being parsed and loaded into specific columns, as is the case for CSV files (and I assume is the case for some people in this thread who would like to get this feature). Perhaps that is something to bear in mind for whoever works on this feature: load the data depending on the destination table (VARIANT datatype or not). This message seems to have irrelevant details, but it's for the purpose of providing more context and understanding the use case. |
@jtcohen6 / @dbeatty10 please can you advise what the current status is of considering support for JSON seeds? This issue is pretty long so I've not read all of the proposals / use cases, but it seems to me like a very reasonable and generic thing to want: loading seed data that contains nested data. At which point, you're dead in the water with our primitive friend, CSV. As far as I can tell, the only workaround right now is to create a CSV with some JSON columns and then to create an intermediate model on top of the seed that parses the JSON - it makes for ugly maintenance of the CSV data though, as a developer would find it much easier to just maintain a JSON doc. |
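For illustration, a hedged sketch of that workaround (a hypothetical seed `my_seed` with a string column `payload_json` holding escaped JSON, parsed in a staging model on Snowflake):

```sql
-- models/stg_my_seed.sql: parse the escaped-JSON column from a CSV seed
select
    id,
    parse_json(payload_json) as payload
from {{ ref('my_seed') }}
```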
Not sure if it's worth adding another issue for this, but YAML would also be good here for multiline strings. E.g. I want to store prompts in a database and use seeds to add them; the YAML would look something like this:

```yaml
- id: 9ryuf89wh
  prompt: |
    your task is to do x
    you should do it via y
    here is the data: {data}
- id: i8d3h89
  prompt: |
    your task is to do c
    you should do it via b
    here is the data: {data}
```
|
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
I removed the stale label. |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Commenting on the issue to keep it open. We'd still like this functionality. |
Describe the feature
Allow dbt users to load in data from newline-delimited JSON files, rather than just CSV.
CSV is barely a format, and has lots of ambiguities around quoting, names, and so forth. Supporting newline-delimited JSON would make things easier and less error-prone.
Describe alternatives you've considered
Additional context
Loading newline-delimited data into BigQuery is really easy: there's an API for it.
Who will this benefit?
Anyone looking into seeds who already has their data in newline-delimited JSON format (hi!). Anyone who has ever tried to debug dodgy quoting issues in CSV.