
[CT-725] [CT-630] Sources are not passed into catalog.json #368

Closed
hanna-liashchuk opened this issue May 11, 2022 · 16 comments · May be fixed by #589
Labels
bug Something isn't working Stale

Comments

@hanna-liashchuk
Contributor

Hi, I'm working with dbt-spark. I have a dbt project with a source and a model that uses this source. I'm generating docs and ingesting the metadata into DataHub. My issue is that the schema for the source is not transferred.

Source definition:

version: 2

sources:
  - name: filialinfo
    description: Info about filials
    tables:
      - name: tab_filialfiscalregisters
        description: Fiscal registers in filials table
        columns:
          - name: filID
            description: Foreign key of the filials table
          - name: frID
            description: Fiscal register ID
          - name: frNum
            description: Fiscal register number
          - name: modified
            description: modified datetime
          - name: frType
            description: Fiscal register type
          - name: frVersion
            description: Fiscal register version
          - name: FACTORYSERIALNUMBER
            description: Factory serial number

Model definition:

{{ config(
    materialized = 'table'
) }}

WITH source_data AS (

    SELECT
        filID,
        frID,
        frNum,
        date_format(
            modified,
            "yyyy-MM-dd HH:mm:ss"
        ) AS modified,
        CASE
            WHEN frType == ''
            OR frType == 'NONE' THEN NULL
            ELSE frType
        END AS frType,
        CASE
            WHEN frVersion == ''
            OR frVersion == 'NONE' THEN NULL
            ELSE frVersion
        END AS frVersion,
        CASE
            WHEN factoryserialnumber == ''
            OR factoryserialnumber == 'NONE' THEN NULL
            ELSE factoryserialnumber
        END AS factoryserialnumber
    FROM
        {{ source(
            'filialinfo',
            'tab_filialfiscalregisters'
        ) }}
)
SELECT
    *
FROM
    source_data

dbt_project.yml


# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'palefat_dbt'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'palefat_dbt'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  palefat_dbt:
    # Config indicated by + and applies to all files under models/example/
    staging:
      +materialized: view
    marts:
      +materialized: table
      +file_format: parquet

According to the DataHub documentation, catalog.json should contain the schema. I can see that it's generated without the sources' schema, while manifest.json does have this information. When I added the sources into catalog.json manually, the metadata was ingested correctly.

@github-actions github-actions bot changed the title Sources are not passed into catalog.json [CT-630] Sources are not passed into catalog.json May 11, 2022
@jtcohen6 jtcohen6 added bug Something isn't working triage labels May 11, 2022
@emmyoop
Member

emmyoop commented May 11, 2022

@hanna-liashchuk thank you for writing this up and providing details on your setup. I'm not exactly sure what's happening here, as I cannot reproduce this behavior, so I have a few follow-up questions.

  1. Does dbt run execute successfully for the model?
  2. If you look at the compiled sql for the model using the source, does it look as expected? The compiled sql can be found at target/compiled/path/to/model.

This is a bit of an educated guess without more information, but you may need to specify the schema on the source if it lives somewhere other than where dbt expects, or is named differently. If the source does not live in the target database, you can also define the database where it does live.
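For reference, a minimal sketch of what that could look like in the source definition (the schema and database values here are hypothetical; dbt sources do accept schema: and database: properties):

```yaml
version: 2

sources:
  - name: filialinfo
    # Hypothetical: the schema where the tables actually live,
    # if it differs from the source name.
    schema: filialinfo_raw
    # Hypothetical: on adapters with a database/catalog level,
    # the database can be set explicitly too.
    database: analytics
    tables:
      - name: tab_filialfiscalregisters
```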

@hanna-liashchuk
Contributor Author

@emmyoop

  1. Yes, dbt run works OK; the models are created in the target database.
  2. Compiled SQL looks as I would expect:
with source_data as (

    select 
    filID,
    frID,
    frNum,
    date_format(modified, "yyyy-MM-dd HH:mm:ss") as modified,
    case when frType == '' or frType == 'NONE' then NULL else frType end as frType,
    case when frVersion == '' or frVersion == 'NONE' then NULL else frVersion end as frVersion,
    case when FACTORYSERIALNUMBER == '' or FACTORYSERIALNUMBER == 'NONE' then NULL else FACTORYSERIALNUMBER end as FACTORYSERIALNUMBER
    from filialinfo.tab_filialfiscalregisters

)

select *
from source_data

The source's schema is already defined in the source definition. Or should I also create a separate schema entry under models? If so, under what name, since the source is not materialised?

If I open catalog.json after dbt docs generate, I cannot see my source: the "sources" key is there, but it's empty ("sources": {}).

@emmyoop
Member

emmyoop commented May 12, 2022

Well that all seems as expected. We will have to dig a bit more!

Can you see the "catalog" query being run for the schema where your source tables live? To find the catalog query, check logs/dbt.log. On Spark, that query generally looks like:

show table extended like '*' in <schemaname>

So you should be able to find:

show table extended like '*' in filialinfo

Once you let me know if you can find that we may know where to look next.

@hanna-liashchuk
Contributor Author

Hi @emmyoop
sorry for the late response and thanks for helping me out!
I can see show table extended in filialinfo like '*' in the dbt.log file

@emmyoop
Member

emmyoop commented May 16, 2022

@hanna-liashchuk would you be willing to share your logs and artifacts (catalog.json & manifest.json) with me via email? Since I can't reproduce this behavior, it may help me determine what's happening.

@hanna-liashchuk
Contributor Author

@emmyoop yes, sure

@emmyoop
Member

emmyoop commented May 17, 2022

Great! Just zip up the relevant log, manifest and catalog and send over to [email protected].

@hanna-liashchuk
Contributor Author

@emmyoop, the zip is sent :)

@emmyoop emmyoop self-assigned this May 18, 2022
@jtcohen6
Contributor

jtcohen6 commented May 30, 2022

@hanna-liashchuk Thanks for sharing those logs!

So we can indeed see:

  • dbt runs the show table extended statement for the filialinfo schema at 18:17:10.620947
  • dbt logs Getting table schema for relation filialinfo.tab_filialfiscalregisters at 18:17:10.620947 — that is, it seems to find exactly one source table in that schema, and tries to parse the column schema out of the query result using this regex:
    INFORMATION_COLUMNS_REGEX = re.compile(r"^ \|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)

So it seems likely that either the column info is missing from that metadata query's result, or it's showing up in a text format that's slightly different from the one our regex is expecting. Could I ask you to try running these queries yourself, and share the results:

  • show table extended in filialinfo like '*'
  • describe table extended filialinfo.tab_filialfiscalregisters

If we find that this is indeed due to missing or mismatched info in those statements, we'll want to transfer this issue over to the dbt-spark repository, and see if we can nail down a straightforward reproduction case.

@jtcohen6 jtcohen6 removed the triage label May 30, 2022
@hanna-liashchuk
Contributor Author

hi @jtcohen6
show table extended in filialinfo like '*' returns

tableName:tab_filialfiscalregisters
owner:xxx
location:hdfs://xxx/xx/xx/filialinfo/tab_filialfiscalregisters
inputformat:io.delta.hive.DeltaInputFormat
outputformat:io.delta.hive.DeltaOutputFormat
columns:struct columns { i32 filid, i32 frid, decimal(13,0) frnum, timestamp modified, string frtype, string frversion, string factoryserialnumber}
partitioned:false
partitionColumns:
totalNumberFiles:1831
totalFileSize:45428870
maxFileSize:65796
minFileSize:27
lastAccessTime:1654453481662
lastUpdateTime:1654376846290

describe table extended filialinfo.tab_filialfiscalregisters returns

+-------------------------------+----------------------------------------------------+----------+
|           col_name            |                     data_type                      | comment  |
+-------------------------------+----------------------------------------------------+----------+
| filID                         | int                                                | NULL     |
| frID                          | int                                                | NULL     |
| frNum                         | decimal(13,0)                                      | NULL     |
| modified                      | timestamp                                          | NULL     |
| frType                        | string                                             | NULL     |
| frVersion                     | string                                             | NULL     |
| FACTORYSERIALNUMBER           | string                                             | NULL     |
|                               |                                                    |          |
| # Detailed Table Information  |                                                    |          |
| Database                      | filialinfo                                         |          |
| Table                         | tab_filialfiscalregisters                          |          |
| Owner                         | xx                                          |          |
| Created Time                  | Sun Jun 05 00:22:26 EEST 2022                      |          |
| Last Access                   | UNKNOWN                                            |          |
| Created By                    | Spark 2.2 or prior                                 |          |
| Type                          | EXTERNAL                                           |          |
| Provider                      | DELTA                                              |          |
| Table Properties              | [bucketing_version=2, storage_handler=io.delta.hive.DeltaStorageHandler] |          |
| Statistics                    | 45105443 bytes                                     |          |
| Location                      | hdfs://xxx/filialinfo/tab_filialfiscalregisters |          |
| Serde Library                 | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |          |
| InputFormat                   | io.delta.hive.DeltaInputFormat                     |          |
| OutputFormat                  | io.delta.hive.DeltaOutputFormat                    |          |
+-------------------------------+----------------------------------------------------+----------+

@jtcohen6
Contributor

jtcohen6 commented Jun 6, 2022

Indeed, it looks like the columns info returned by show table extended:

struct columns { i32 filid, i32 frid, decimal(13,0) frnum, timestamp modified, string frtype, string frversion, string factoryserialnumber}

does not match the regex that dbt-spark is currently using:

    INFORMATION_COLUMNS_REGEX = re.compile(r"^ \|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)

We're expecting an output looking like this:

Database: dbt_jcohen
Table: dim_listings
Owner: root Created Time: Fri May 06 15:05:18 UTC 2022
Last Access: UNKNOWN
Created By: Spark 3.2.1
Type: MANAGED
Provider: delta
Location: dbfs:/user/hive/warehouse/dbt_jcohen.db/dim_listings
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Partition Provider: Catalog
Schema: root
|-- listing_id: integer (nullable = true)
|-- host_id: integer (nullable = true)
|-- listing_url: string (nullable = true)
...

It looks like you're using a Delta table... though perhaps not a managed one? Which versions of Apache Spark + Databricks Runtime are you using?

(I'm going to transfer this issue to the dbt-spark repo, since that's where we'd need to adjust the regex / post-processing logic)
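To make the mismatch concrete, here is a small self-contained check. The regex is the one quoted above, and the sample strings are abridged from the outputs shown earlier in this thread:

```python
import re

# Regex as quoted from dbt-spark in this thread.
INFORMATION_COLUMNS_REGEX = re.compile(
    r"^ \|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE
)

# The format dbt-spark expects (managed Delta / Databricks style):
expected = (
    "Schema: root\n"
    " |-- listing_id: integer (nullable = true)\n"
    " |-- host_id: integer (nullable = true)\n"
)

# The format open-source Delta actually returned here (abridged):
actual = (
    "columns:struct columns { i32 filid, i32 frid, decimal(13,0) frnum, "
    "timestamp modified, string frtype }\n"
)

print(INFORMATION_COLUMNS_REGEX.findall(expected))
# → [('listing_id', 'integer', 'true'), ('host_id', 'integer', 'true')]
print(INFORMATION_COLUMNS_REGEX.findall(actual))
# → []  (no matches, so no columns end up in catalog.json)
```

So the catalog query runs, but the column extraction silently finds nothing for this table.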

@jtcohen6 jtcohen6 removed the triage label Jun 6, 2022
@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Jun 6, 2022
@github-actions github-actions bot changed the title [CT-630] Sources are not passed into catalog.json [CT-725] [CT-630] Sources are not passed into catalog.json Jun 6, 2022
@hanna-liashchuk
Contributor Author

@jtcohen6 It's Spark 3.1.2 with Delta 1.0.
We are not using Databricks; it's the open-source version.

@harryharanb
Contributor

Open-source Delta Lake doesn't provide schema details for the query show table extended in databasename like '*'.
This was also an issue in model execution, which was fixed by my PR below.

#207

We may need a similar approach for docs generate as well.
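As an illustration of what such an approach might involve (a hypothetical sketch, not dbt-spark's actual code), the struct columns { ... } output quoted earlier in this thread could be parsed roughly like this:

```python
import re

def parse_struct_columns(metadata_line):
    """Hypothetical parser for open-source Delta output like
    'columns:struct columns { i32 filid, decimal(13,0) frnum}'.
    Returns a list of (column_name, data_type) tuples."""
    m = re.search(r"columns:struct columns \{(.*)\}", metadata_line)
    if not m:
        return []
    columns = []
    # Fields are separated by ', '; the comma inside type parameters
    # like decimal(13,0) has no trailing space, so it survives the split.
    for field in m.group(1).split(", "):
        dtype, name = field.strip().rsplit(" ", 1)
        columns.append((name, dtype))
    return columns

line = ("columns:struct columns { i32 filid, i32 frid, decimal(13,0) frnum, "
        "timestamp modified, string frtype, string frversion, "
        "string factoryserialnumber}")
print(parse_struct_columns(line))
# → [('filid', 'i32'), ('frid', 'i32'), ('frnum', 'decimal(13,0)'), ...]
```

Note that splitting on ', ' only works because type parameters such as decimal(13,0) contain no comma-space; a real implementation would want a proper tokenizer and tests against other metastore output variants.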

@harryharanb
Contributor

Also, this dbt docs issue is already being discussed in #295.

@emmyoop emmyoop removed their assignment Jul 26, 2022
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@VShkaberda
Contributor

Is there any chance of this issue being fixed? There is still no schema after generating docs and ingesting metadata into DataHub, and there is still no schema in the output of show table extended.
