You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have been working with the one_hot_encoder macro on a feature store and I came across a bug when using it.
Bug
The bug that I encountered was that when the include_columns argument was not '*', the code for the one_hot_encoder would insert blanks as the columns in the select statement and the code would not compile because of the comma that would be alone there.
The compiled SQL file would provide an "unexpected ',' error" because the line would look like:
select , case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)
When it should be:
select GENDER, case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)
I looked at the code for the one_hot_encoder and I found the problem to be in lines 102-104 of the code. (In the deafult__one_hot_encoder macro).
For some reason, when include_columns is set to a list of columns we want, the "column" in "for column in col_list" is already the name of what is in the list. For example, if include_columns = ['ID'], column would be 'ID', so when you do column.name on that, it would be like 'ID'.name and it returns a blank for some reason.
It works fine when include_columns = "*" because that makes "column" something like SnowflakeColumn(column='name', dtype='VARCHAR', char_size = 16777216, numeric_precision=None, numeric_scale=None), so when you do column.name on it, it will give you the name.
Quick Solution
I have come up with a quick solution to this bug, and it is probably not the most elegant and robust. In the original code, under line 55, I made a new variable called 'flag' and set it equal to 'inc' like this: {% set flag = 'not_inc' %}. This will be a variable that tells the macro if the include_columns argument is "*" or not. Then under line 60 in the original code, I set flag = 'inc' because it would mean the include_columns variable was set by the programmer. So like this: {% set flag = 'inc' %}.
Then on line 70 of the original code, I added that variable into the adapter.dispatch thing, so the new line was:
Finally, for the default__one_hot_encoder macro on line 73 of the original code, I changed the logic up a bit to catch the instances where the include_columns argument was not "*". I replaced lines 76-78 of the original code with:
{%- if flag == 'not_inc' -%}
{% for column in col_list %}
{{ column.name }},
{%- endfor -%}
{%- elif flag == 'inc' -%}
{% for column in col_list %}
{{ column }},
{%- endfor -%}
{%- endif -%}
This logic makes sure that if include_arguments is something like ['ID'], then it will not do column.name on 'ID', but rather just take the column name straight from column. Otherwise, when include_columns="*", proceed like before.
This solution may not be robust, so I was hoping to find out why this bug occurs and maybe find a better fix for it.
The text was updated successfully, but these errors were encountered:
I have been working with the one_hot_encoder macro on a feature store and I came across a bug when using it.
Bug
The bug that I encountered was that when the include_columns argument was not '*', the code for the one_hot_encoder would insert blanks as the columns in the select statement and the code would not compile because of the comma that would be alone there.
For example, if the code was this:
gender as ( {{ dbt_ml_preprocessing.one_hot_encoder( source_table = ref('my_table'), source_column = 'GENDER', include_columns = ['ID']) }} )
The compiled SQL file would provide an "unexpected ',' error" because the line would look like:
select , case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)
When it should be:
select GENDER, case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)
I looked at the code for the one_hot_encoder and I found the problem to be in lines 102-104 of the code. (In the deafult__one_hot_encoder macro).
For some reason, when include_columns is set to a list of columns we want, the "column" in "for column in col_list" is already the name of what is in the list. For example, if include_columns = ['ID'], column would be 'ID', so when you do column.name on that, it would be like 'ID'.name and it returns a blank for some reason.
It works fine when include_columns = "*" because that makes "column" something like
SnowflakeColumn(column='name', dtype='VARCHAR', char_size = 16777216, numeric_precision=None, numeric_scale=None)
, so when you do column.name on it, it will give you the name.Quick Solution
I have come up with a quick solution to this bug, and it is probably not the most elegant and robust. In the original code, under line 55, I made a new variable called 'flag' and set it equal to 'inc' like this:
{% set flag = 'not_inc' %}
. This will be a variable that tells the macro if the include_columns argument is "*" or not. Then under line 60 in the original code, I set flag = 'inc' because it would mean the include_columns variable was set by the programmer. So like this:{% set flag = 'inc' %}
.Then on line 70 of the original code, I added that variable into the adapter.dispatch thing, so the new line was:
{{ adapter.dispatch('one_hot_encoder',packages=['dbt_ml_preprocessing'])(source_table, source_column, category_values, handle_unknown, col_list, flag) }}
as opposed to
{{ adapter.dispatch('one_hot_encoder',packages=['dbt_ml_preprocessing'])(source_table, source_column, category_values, handle_unknown, col_list) }}
Finally, for the default__one_hot_encoder macro on line 73 of the original code, I changed the logic up a bit to catch the instances where the include_columns argument was not "*". I replaced lines 76-78 of the original code with:
This logic makes sure that if include_arguments is something like ['ID'], then it will not do column.name on 'ID', but rather just take the column name straight from column. Otherwise, when include_columns="*", proceed like before.
This solution may not be robust, so I was hoping to find out why this bug occurs and maybe find a better fix for it.
The text was updated successfully, but these errors were encountered: