
[CT-725] [CT-630] Sources are not passed into catalog.json #368

Closed
hanna-liashchuk opened this issue May 11, 2022 · 16 comments · May be fixed by #589
Labels
bug Something isn't working Stale

Comments

@hanna-liashchuk
Contributor

Hi, I'm working with dbt-spark. I have a dbt project with a source and a model that uses this source. I'm generating docs and ingesting the metadata into DataHub. My issue is that the schema for the source is not transferred.

Source definition:

version: 2

sources:
  - name: filialinfo
    description: Info about filials
    tables:
      - name: tab_filialfiscalregisters
        description: Fiscal registers in filials table
        columns:
          - name: filID
            description: Foreign key of the filials table
          - name: frID
            description: Fiscal register ID
          - name: frNum
            description: Fiscal register number
          - name: modified
            description: modified datetime
          - name: frType
            description: Fiscal register type
          - name: frVersion
            description: Fiscal register version
          - name: FACTORYSERIALNUMBER
            description: Factory serial number

Model definition:

{{ config(
    materialized = 'table'
) }}

WITH source_data AS (

    SELECT
        filID,
        frID,
        frNum,
        date_format(
            modified,
            "yyyy-MM-dd HH:mm:ss"
        ) AS modified,
        CASE
            WHEN frType == ''
            OR frType == 'NONE' THEN NULL
            ELSE frType
        END AS frType,
        CASE
            WHEN frVersion == ''
            OR frVersion == 'NONE' THEN NULL
            ELSE frVersion
        END AS frVersion,
        CASE
            WHEN factoryserialnumber == ''
            OR factoryserialnumber == 'NONE' THEN NULL
            ELSE factoryserialnumber
        END AS factoryserialnumber
    FROM
        {{ source(
            'filialinfo',
            'tab_filialfiscalregisters'
        ) }}
)
SELECT
    *
FROM
    source_data

dbt_project.yml


# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'palefat_dbt'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'palefat_dbt'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  palefat_dbt:
    # Config indicated by + and applies to all files under models/example/
    staging:
      +materialized: view
    marts:
      +materialized: table
      +file_format: parquet

According to the DataHub documentation, catalog.json should contain the schema. I can see that it's generated without the sources' schema, while manifest.json does have this information. When I added the sources into catalog.json manually, the metadata was ingested correctly.

@github-actions github-actions bot changed the title Sources are not passed into catalog.json [CT-630] Sources are not passed into catalog.json May 11, 2022
@jtcohen6 jtcohen6 added bug Something isn't working triage labels May 11, 2022
@emmyoop
Member

emmyoop commented May 11, 2022

@hanna-liashchuk thank you for writing this up and providing details on your setup. I'm not exactly sure what's happening here, as I cannot reproduce this behavior, so I have a few follow-up questions.

  1. Does dbt run execute successfully for the model?
  2. If you look at the compiled sql for the model using the source, does it look as expected? The compiled sql can be found at target/compiled/path/to/model.

This is a bit of an educated guess without more information, but you may need to specify the schema on the source if it lives somewhere other than where dbt expects, or is named differently. If the source does not live in the target database, you can also define the database where it does live.
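For reference, a minimal sketch of what that could look like in the source definition (the schema and database values here are hypothetical; dbt sources do accept schema: and database: properties):

```yaml
version: 2

sources:
  - name: filialinfo
    # Hypothetical: the schema where the tables actually live,
    # if it differs from the source name.
    schema: filialinfo_raw
    # Hypothetical: on adapters with a database/catalog level,
    # the database can be set explicitly too.
    database: analytics
    tables:
      - name: tab_filialfiscalregisters
```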

@hanna-liashchuk
Contributor Author

@emmyoop

  1. Yes, dbt run works OK; the models are created in the target database.
  2. Compiled SQL looks as I would expect:
with source_data as (

    select 
    filID,
    frID,
    frNum,
    date_format(modified, "yyyy-MM-dd HH:mm:ss") as modified,
    case when frType == '' or frType == 'NONE' then NULL else frType end as frType,
    case when frVersion == '' or frVersion == 'NONE' then NULL else frVersion end as frVersion,
    case when FACTORYSERIALNUMBER == '' or FACTORYSERIALNUMBER == 'NONE' then NULL else FACTORYSERIALNUMBER end as FACTORYSERIALNUMBER
    from filialinfo.tab_filialfiscalregisters

)

select *
from source_data

The source's schema is already defined in the source definition. Or should I also create a separate schema entry under models? If so, under what name, since the source is not materialised?

If I open catalog.json after dbt docs generate, I cannot see my source: the "sources" key is there, but it's empty ("sources": {}).

@emmyoop
Member

emmyoop commented May 12, 2022

Well that all seems as expected. We will have to dig a bit more!

Can you see the "catalog" query being run for the schema where your source tables live? To find the catalog query, check logs/dbt.log. On Spark, that query generally looks like:

show table extended like '*' in <schemaname>

So you should be able to find:

show table extended like '*' in filialinfo

Once you let me know if you can find that we may know where to look next.

@hanna-liashchuk
Contributor Author

Hi @emmyoop
sorry for the late response and thanks for helping me out!
I can see show table extended in filialinfo like '*' in the dbt.log file

@emmyoop
Member

emmyoop commented May 16, 2022

@hanna-liashchuk would you be willing to share your logs and artifacts (catalog.json & manifest.json) with me via email? Since I can't reproduce this behavior, it may help me determine what's happening.

@hanna-liashchuk
Contributor Author

@emmyoop yes, sure

@emmyoop
Member

emmyoop commented May 17, 2022

Great! Just zip up the relevant log, manifest and catalog and send over to [email protected].

@hanna-liashchuk
Contributor Author

@emmyoop, the zip is sent :)

@emmyoop emmyoop self-assigned this May 18, 2022
@jtcohen6
Contributor

jtcohen6 commented May 30, 2022

@hanna-liashchuk Thanks for sharing those logs!

So we can indeed see:

  • dbt runs the show table extended statement for the filialinfo schema at 18:17:10.620947
  • dbt logs Getting table schema for relation filialinfo.tab_filialfiscalregisters at 18:17:10.620947 — that is, it seems to find exactly one source table in that schema, and tries to parse the column schema out of the query result using this regex:
    INFORMATION_COLUMNS_REGEX = re.compile(r"^ \|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)

So it seems likely that either the column info is missing from that metadata query's result, or it's showing up in a text format that's slightly different from the one our regex is expecting. Could I ask you to try running these queries yourself, and share the results:

  • show table extended in filialinfo like '*'
  • describe table extended filialinfo.tab_filialfiscalregisters

If we find that this is indeed due to missing or mismatched info in those statements, we'll want to transfer this issue over to the dbt-spark repository, and see if we can nail down a straightforward reproduction case.

@jtcohen6 jtcohen6 removed the triage label May 30, 2022
@hanna-liashchuk
Contributor Author

hi @jtcohen6
show table extended in filialinfo like '*' returns

tableName:tab_filialfiscalregisters
owner:xxx
location:hdfs://xxx/xx/xx/filialinfo/tab_filialfiscalregisters
inputformat:io.delta.hive.DeltaInputFormat
outputformat:io.delta.hive.DeltaOutputFormat
columns:struct columns { i32 filid, i32 frid, decimal(13,0) frnum, timestamp modified, string frtype, string frversion, string factoryserialnumber}
partitioned:false
partitionColumns:
totalNumberFiles:1831
totalFileSize:45428870
maxFileSize:65796
minFileSize:27
lastAccessTime:1654453481662
lastUpdateTime:1654376846290

describe table extended filialinfo.tab_filialfiscalregisters returns

+-------------------------------+----------------------------------------------------+----------+
|           col_name            |                     data_type                      | comment  |
+-------------------------------+----------------------------------------------------+----------+
| filID                         | int                                                | NULL     |
| frID                          | int                                                | NULL     |
| frNum                         | decimal(13,0)                                      | NULL     |
| modified                      | timestamp                                          | NULL     |
| frType                        | string                                             | NULL     |
| frVersion                     | string                                             | NULL     |
| FACTORYSERIALNUMBER           | string                                             | NULL     |
|                               |                                                    |          |
| # Detailed Table Information  |                                                    |          |
| Database                      | filialinfo                                         |          |
| Table                         | tab_filialfiscalregisters                          |          |
| Owner                         | xx                                          |          |
| Created Time                  | Sun Jun 05 00:22:26 EEST 2022                      |          |
| Last Access                   | UNKNOWN                                            |          |
| Created By                    | Spark 2.2 or prior                                 |          |
| Type                          | EXTERNAL                                           |          |
| Provider                      | DELTA                                              |          |
| Table Properties              | [bucketing_version=2, storage_handler=io.delta.hive.DeltaStorageHandler] |          |
| Statistics                    | 45105443 bytes                                     |          |
| Location                      | hdfs://xxx/filialinfo/tab_filialfiscalregisters |          |
| Serde Library                 | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |          |
| InputFormat                   | io.delta.hive.DeltaInputFormat                     |          |
| OutputFormat                  | io.delta.hive.DeltaOutputFormat                    |          |
+-------------------------------+----------------------------------------------------+----------+

@jtcohen6
Contributor

jtcohen6 commented Jun 6, 2022

Indeed, it looks like the columns info returned by show table extended:

struct columns { i32 filid, i32 frid, decimal(13,0) frnum, timestamp modified, string frtype, string frversion, string factoryserialnumber}

does not match the regex that dbt-spark is currently using:

    INFORMATION_COLUMNS_REGEX = re.compile(r"^ \|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)

We're expecting an output looking like this:

Database: dbt_jcohen
Table: dim_listings
Owner: root Created Time: Fri May 06 15:05:18 UTC 2022
Last Access: UNKNOWN
Created By: Spark 3.2.1
Type: MANAGED
Provider: delta
Location: dbfs:/user/hive/warehouse/dbt_jcohen.db/dim_listings
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Partition Provider: Catalog
Schema: root
|-- listing_id: integer (nullable = true)
|-- host_id: integer (nullable = true)
|-- listing_url: string (nullable = true)
...

It looks like you're using a Delta table... though perhaps not a managed one? Which versions of Apache Spark + Databricks Runtime are you using?

(I'm going to transfer this issue to the dbt-spark repo, since that's where we'd need to adjust the regex / post-processing logic)
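To make the mismatch concrete, here is a small self-contained check. The regex is the one quoted above, and the sample strings are abridged from the outputs shown earlier in this thread:

```python
import re

# Regex as quoted from dbt-spark in this thread.
INFORMATION_COLUMNS_REGEX = re.compile(
    r"^ \|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE
)

# The format dbt-spark expects (managed Delta / Databricks style):
expected = (
    "Schema: root\n"
    " |-- listing_id: integer (nullable = true)\n"
    " |-- host_id: integer (nullable = true)\n"
)

# The format open-source Delta actually returned here (abridged):
actual = (
    "columns:struct columns { i32 filid, i32 frid, decimal(13,0) frnum, "
    "timestamp modified, string frtype }\n"
)

print(INFORMATION_COLUMNS_REGEX.findall(expected))
# → [('listing_id', 'integer', 'true'), ('host_id', 'integer', 'true')]
print(INFORMATION_COLUMNS_REGEX.findall(actual))
# → []  (no matches, so no columns end up in catalog.json)
```

So the catalog query runs, but the column extraction silently finds nothing for this table.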

@jtcohen6 jtcohen6 removed the triage label Jun 6, 2022
@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Jun 6, 2022
@github-actions github-actions bot changed the title [CT-630] Sources are not passed into catalog.json [CT-725] [CT-630] Sources are not passed into catalog.json Jun 6, 2022
@hanna-liashchuk
Contributor Author

@jtcohen6 It's Spark 3.1.2 with Delta 1.0.
We are not using Databricks; it's the open-source version.

@harryharanb
Contributor

Open-source Delta Lake doesn't provide schema details for the query show table extended in databasename like '*'.
This was also an issue in model execution, which was fixed by my PR below.

#207

We may need a similar approach for docs generate as well.
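As an illustration of what such an approach might involve (a hypothetical sketch, not dbt-spark's actual code), the struct columns { ... } output quoted earlier in this thread could be parsed roughly like this:

```python
import re

def parse_struct_columns(metadata_line):
    """Hypothetical parser for open-source Delta output like
    'columns:struct columns { i32 filid, decimal(13,0) frnum}'.
    Returns a list of (column_name, data_type) tuples."""
    m = re.search(r"columns:struct columns \{(.*)\}", metadata_line)
    if not m:
        return []
    columns = []
    # Fields are separated by ', '; the comma inside type parameters
    # like decimal(13,0) has no trailing space, so it survives the split.
    for field in m.group(1).split(", "):
        dtype, name = field.strip().rsplit(" ", 1)
        columns.append((name, dtype))
    return columns

line = ("columns:struct columns { i32 filid, i32 frid, decimal(13,0) frnum, "
        "timestamp modified, string frtype, string frversion, "
        "string factoryserialnumber}")
print(parse_struct_columns(line))
# → [('filid', 'i32'), ('frid', 'i32'), ('frnum', 'decimal(13,0)'), ...]
```

Note that splitting on ', ' only works because type parameters such as decimal(13,0) contain no comma-space; a real implementation would want a proper tokenizer and tests against other metastore output variants.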

@harryharanb
Contributor

Also, this dbt docs issue is already being discussed in #295.

@emmyoop emmyoop removed their assignment Jul 26, 2022
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@VShkaberda
Contributor

Is there any chance of this issue being fixed? There is still no schema after generating docs and ingesting metadata into DataHub, and there is still no schema in the output of show table extended.
