UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte #29457
Comments
To resolve the `UnicodeDecodeError`, ensure the timestamp normalization step is applied before the dataframe is serialized. Here is the relevant code snippet from Superset's dataframe post-processing:

```python
if not df.empty:
    utils.normalize_dttm_col(
        df=df,
        dttm_cols=tuple(
            [
                DateColumn.get_legacy_time_column(
                    timestamp_format=timestamp_format,
                    offset=self.datasource.offset,
                    time_shift=self.form_data.get("time_shift"),
                )
            ]
        ),
    )

    if self.enforce_numerical_metrics:
        self.df_metrics_to_num(df)

    df.replace([np.inf, -np.inf], np.nan, inplace=True)
```

Ensure that this normalization step is correctly applied to handle the timestamp columns and avoid decoding errors.

Additionally, the Pinot database connector is known to cause issues with TIMESTAMP columns, for which it carries a workaround:

```python
# Pinot driver infers TIMESTAMP column as LONG, so make the quick fix.
# When the Pinot driver fixes this bug, the current method could be removed.
if isinstance(sqla_column_type, types.TIMESTAMP):
    return sqla_column_type.compile().upper()
```

This fix ensures that TIMESTAMP columns are compiled to the expected type name.
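For context on the error class itself: the failing bytes are simply not valid UTF-8. A minimal, Superset-independent reproduction (the byte values below are made up to mirror the reported message):

```python
# Illustrative only: an 8-byte binary blob (e.g. a SQL Server rowversion)
# whose seventh byte is 0xd4. 0xd4 opens a two-byte UTF-8 sequence, but
# the next byte is not a continuation byte, so decoding fails.
blob = b"\x00\x01\x02\x03\x04\x05\xd4\x07"
blob.decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6:
# invalid continuation byte
```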
#28266 was first cherry-picked in 4.0.2. Please try upgrading and confirm the issue is fixed.
@mistercrunch unfortunately, this did not fix the issue. I upgraded to 4.0.2 and the error still occurs.
Additionally, I noticed that this issue occurs only when selecting columns with the TIMESTAMP datatype.
Full stacktrace please! Also curious which database engine/driver/version you are using.
Database engine: mssql+pyodbc
Stacktrace:
Oh, it appears 4.0.2 does not include the large JSON refactor that centralized all JSON serialization calls. This should make it into 4.1.x I believe. I don't recommend bringing in this large refactor as a cherry-pick as it'll merge-conflict heavily.
@Habeeb556 if you have the ability to test against the master branch, it would be great if you could confirm whether the issue is fixed there.
@mistercrunch, I have some good news and bad news. The good news is that I think I have successfully deployed master and the query no longer fails; the bad news is that the TIMESTAMP column values now render as garbled Chinese characters. I'm not sure if this is a bug or if my push was incorrect and missed something.
This is where the decoding happens. The Chinese characters would show if/when your binary blob is decodable to utf-8 or utf-16. What is in your binary blob? What do you expect to see? Maybe you're using some funky other encoding or "collation". At this point, if you're using something other than utf-N in this day and age, you may want to standardize, or wrap the column with some database function that brings things to a modern encoding.
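As a side note on why binary often surfaces as CJK text when it does decode: most two-byte values interpreted as UTF-16 land in or near the CJK range. A tiny illustration (arbitrary bytes, nothing Superset-specific):

```python
# Four arbitrary bytes that happen to form valid UTF-16 (big-endian):
blob = b"\x4e\x2d\x65\x87"
print(blob.decode("utf-16-be"))  # -> '中文' ("Chinese"), purely a coincidence of byte values
```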
But what's in there? Some other language/character set? Guessing these bytes represent something intelligible (?) Having worked with SQL Server a long time ago, I'm guessing this has to do with "collation" and MSFT SQL Server's deep support for different character sets. From my understanding, all of this is pretty much obsolete with the rise of the utf-8 / utf-16 standards. Given that, Apache Superset probably shouldn't go out of its way to support the intricacies of how different databases handle different character sets, and should just tell people to convert to utf-8.
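One database-side way to do that conversion, sketched under the assumption of SQL Server and a binary/rowversion column (the table and column names are placeholders, not from this thread):

```python
# SQL Server: CONVERT with style 1 renders binary as a '0x...' hex string,
# so downstream tools receive plain text instead of raw bytes. Wrapping
# this in a view keeps "SELECT *" safe for BI tools.
query = """
    SELECT CONVERT(VARCHAR(34), rv, 1) AS rv_hex
    FROM dbo.my_table
"""
```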
I agree with you. I'm not exactly sure about the business logic here since I'm a DBA focused on database support for analytical tools. They encountered the error because of a TIMESTAMP column, which in SQL Server is a rowversion, i.e. an 8-byte binary value rather than text. Overall, it's good that we can skip this error now when using the master build.
I am experiencing the same issue.
@OleksandrDikov I think it will be published in 4.1.
My guess is that you're using a database/driver that supports different character sets. Seems python itself can't know what's in that binary and therefore can't know what to do with it. Seems that you'll need to convert to utf8 at a lower level (either the database or driver-level). Which database is it? What's the exact data type of the columns that trigger the error? Any chance it's something similar to this -> https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16 . Note that Oracle has similar concepts, which I think they call "character sets". With the rise of the utf-N standards, there shouldn't be much or any need for these obsolete character sets.
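At the driver level, pyodbc exposes decoding overrides. A minimal sketch, assuming the legacy data lives in a known single code page (the DSN and code page below are placeholders):

```python
import pyodbc

# Override how pyodbc decodes CHAR/VARCHAR data coming from a database
# stored in a legacy (non-UTF-8) code page, so Python never receives
# bytes it cannot decode. NCHAR/NVARCHAR stays UTF-16 as usual.
cnxn = pyodbc.connect("DSN=legacy_mssql;UID=user;PWD=secret")  # placeholder DSN
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding="cp1252")  # assumed legacy code page
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding="utf-16le")
```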
This issue is related to Matomo (formerly Piwik). The error particularly occurs with the matomo_log_visit table. When I run the query SELECT * FROM matomo_log_visit; I receive the same UnicodeDecodeError. After some investigation, I found that the problem is related to binary fields. If I exclude these binary fields from the query, everything works fine.
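One way to automate that exclusion, as a hypothetical sketch (the connection URL is assumed; only the table name comes from the report above):

```python
from sqlalchemy import create_engine, inspect

# Build a column list that skips binary types, so the SELECT never pulls
# undecodable blobs. Connection URL is a placeholder.
engine = create_engine("mysql+pymysql://user:pass@host/matomo")
columns = inspect(engine).get_columns("matomo_log_visit")
safe = [
    c["name"]
    for c in columns
    if not any(t in str(c["type"]).upper() for t in ("BINARY", "BLOB"))
]
query = f'SELECT {", ".join(safe)} FROM matomo_log_visit'
```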
I'm curious whether the type is a BINARY-like type. Superset should not try to convert a BINARY-like type as text and fail; we should just show a label instead. I'd say nowadays, with python >= 3.x, all python dbapi drivers (or the database itself at a lower level) should handle the utf-8 conversion, and database administrators should be standardizing and killing all older/funky character set support on sight where possible. For old legacy apps, it should be easy enough to expose views that force the utf-8 conversion upfront so that tools like Superset don't have to deal with legacy/gibberish from old databases and drivers.
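A minimal sketch of that "show a label instead of failing" idea (illustrative only, not Superset's actual implementation; render_cell is a hypothetical name):

```python
def render_cell(value):
    """Render a result-set value for display without raising on binary."""
    if isinstance(value, (bytes, bytearray)):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Undecodable binary: show a label instead of failing the query.
            return f"[BINARY: {len(value)} bytes]"
    return value
```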
I agree that databases and their drivers should handle UTF-8 conversion natively to avoid such errors. Superset should not attempt to convert binary-like data to text. For older systems, exposing views that force UTF-8 conversion upfront could prevent similar issues in modern tools like Superset. |
Bug description
I encountered the following error when querying select * from table in SQL Lab. This issue occurred after upgrading from Superset version 2.1.3 to version 4.0.1.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte
How to reproduce the bug
Any table column with a TIMESTAMP datatype generates this error.

Screenshots/recordings
Superset version
4.0.1
Python version
3.11
Node version
I don't know
Browser
Chrome
Additional context
No response