UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte #29457

Open
2 of 3 tasks
Habeeb556 opened this issue Jul 2, 2024 · 18 comments
Labels
#bug:regression Bugs that are identified as regressions sqllab Namespace | Anything related to the SQL Lab

Comments

@Habeeb556
Contributor

Bug description

I encountered the following error when querying select * from table in SQL Lab. This issue occurred after upgrading from Superset version 2.1.3 to version 4.0.1.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte

How to reproduce the bug

Any table column with a TIMESTAMP datatype generates this error.

Screenshots/recordings

image

Superset version

4.0.1

Python version

3.11

Node version

I don't know

Browser

Chrome

Additional context

No response

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
@dosubot dosubot bot added the #bug:regression and sqllab labels Jul 2, 2024

dosubot bot commented Jul 2, 2024

To resolve the UnicodeDecodeError when querying a table with a TIMESTAMP datatype in SQL Lab after upgrading from Superset version 2.1.3 to version 4.0.1, ensure that the timestamp columns are properly normalized to a pandas-supported datetime format. This can be done using the utils.normalize_dttm_col function in the get_df method of the BaseViz class.

Here is the relevant code snippet from the get_df method:

if not df.empty:
    utils.normalize_dttm_col(
        df=df,
        dttm_cols=tuple(
            [
                DateColumn.get_legacy_time_column(
                    timestamp_format=timestamp_format,
                    offset=self.datasource.offset,
                    time_shift=self.form_data.get("time_shift"),
                )
            ]
        ),
    )

    if self.enforce_numerical_metrics:
        self.df_metrics_to_num(df)

    df.replace([np.inf, -np.inf], np.nan, inplace=True)

Ensure that this normalization step is correctly applied to handle the timestamp columns and avoid UnicodeDecodeError [1].

Additionally, the Pinot database connector is known to cause issues with TIMESTAMP columns in Superset. The Pinot driver infers TIMESTAMP columns as LONG, which can lead to problems such as UnicodeDecodeError. Here is a quick fix for the Pinot driver:

# Pinot driver infers TIMESTAMP column as LONG, so make the quick fix.
# When the Pinot driver fixes this bug, the current method could be removed.
if isinstance(sqla_column_type, types.TIMESTAMP):
    return sqla_column_type.compile().upper()

This fix ensures that TIMESTAMP columns are correctly interpreted and processed [2].


@mistercrunch
Member

#28266 was first cherry-picked in 4.0.2

Please try upgrading and confirm the issue is fixed

@Habeeb556
Contributor Author

@mistercrunch unfortunately, this did not fix the issue. I upgraded to version 4.0.2 and encountered the same error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte

Additionally, I noticed that this issue occurs only when selecting columns with the TIMESTAMP datatype. All other columns work fine. It worked correctly with version 2.1.3 when I switched back.

@mistercrunch
Member

mistercrunch commented Jul 3, 2024

Full stack trace please! Also curious which database engine/driver/version you are using.

@Habeeb556
Contributor Author

Database engine: mssql+pyodbc
Version: 5.1.0

Stack trace:

'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Traceback (most recent call last):
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
   rv = self.dispatch_request()
        ^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 127, in wraps
   raise ex
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 121, in wraps
   duration, response = time_function(f, self, *args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/core.py", line 1470, in time_function
   response = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/api/__init__.py", line 183, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/log.py", line 255, in wrapper
   value = f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/sqllab/api.py", line 346, in get_results
   payload = json.dumps(
             ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/__init__.py", line 395, in dumps
   **kw).encode(obj)
         ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 298, in encode
   chunks = self.iterencode(o)
            ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 379, in iterencode
   return _iterencode(o, 0)
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
2024-07-03 20:26:50,670:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Triggering query_id: 41782
2024-07-03 20:26:50,944:INFO:superset.commands.sql_lab.execute:Triggering query_id: 41782
Query 41782: Running query on a Celery worker
2024-07-03 20:26:50,954:INFO:superset.sqllab.sql_json_executer:Query 41782: Running query on a Celery worker
'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
2024-07-03 20:26:59,507:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte

@mistercrunch
Member

Oh it appears 4.0.2 does not include the large json refactor that centralized all calls to superset/utils/json.py here -> #28702

This should make 4.1.x, I believe. I don't recommend bringing in this large refactor as a cherry-pick, as it'll merge-conflict heavily.
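
For anyone following along, here is a minimal sketch of why the serialization step at the bottom of the traceback fails. It assumes that the SQL Server TIMESTAMP (rowversion) column reaches Python as an 8-byte binary value and that the JSON encoder tries to decode bytes as UTF-8. The byte values are hypothetical, picked only to reproduce the 0xff at position 6 seen in the log.

```python
# Hypothetical 8-byte value standing in for one TIMESTAMP (rowversion) cell.
raw = bytes.fromhex("000000000012ff3a")


def try_decode(value: bytes) -> str:
    """Attempt the UTF-8 decode that a text-oriented JSON encoder would do."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte, exc.reason the cause.
        return f"decode failed: {exc.reason} at position {exc.start}"


result = try_decode(raw)
print(result)
```

The first six bytes happen to be valid single-byte UTF-8, so the decode only breaks once it reaches the 0xff, which can never start a UTF-8 sequence.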

@mistercrunch
Member

mistercrunch commented Jul 3, 2024

@Habeeb556 if you have the ability to test against the master branch, you could confirm that it's working there. I'm tempted to close the issue, but will wait until you confirm the fix.

@Habeeb556
Contributor Author

@mistercrunch, I have some good news and bad news. The good news is that I think I have successfully pushed to the master branch, and the query is running fine. However, the bad news is that the output is incorrectly formatted with Chinese characters.

image

I'm not sure if this is a bug or if my push was incorrect and missed something.

@mistercrunch
Member

This is where the [bytes] come from:
https://github.com/apache/superset/blob/master/superset/utils/json.py#L102

The Chinese characters would show if/when your binary blob is decodable as utf-8 or utf-16.

What is in your binary blob? What do you expect to see?

Maybe you're using some funky other encoding or "collation". At this point, if you're using something other than utf-N in this day and age, you may want to standardize, or wrap the column with some database function that brings things to a modern encoding.
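
To illustrate the point about Chinese characters, here is a small sketch showing how arbitrary binary data can decode "successfully" as UTF-16 and come out as CJK text. The byte values are made up for the illustration and are not taken from the reporter's table.

```python
# Hypothetical 8-byte blob. Read as little-endian UTF-16, each byte pair
# is one code unit: 0x4E2D and 0x6587, repeated twice.
blob = bytes.fromhex("2d4e87652d4e8765")

text = blob.decode("utf-16-le")

# Every resulting character falls in the CJK Unified Ideographs block.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)

print(text, all_cjk)
```

Nothing in the bytes themselves says whether they are text, so a decoder that accepts them will happily render binary data as characters.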

@Habeeb556
Contributor Author

Yes, I checked this now with the old version 2.1.3, and it returned the same value, [bytes], when running. So I can confirm that this master push with version 4.x is working.

Regarding the binary blob, here's what I expect to see when running directly from the SQL server.

image

@mistercrunch
Member

But what's in there? Some other language/character set? I'm guessing these bytes represent something intelligible (?)

Having worked with SQL Server a long time ago, I'm guessing this has to do with "collation" and Microsoft SQL Server's deep support for different character sets. From my understanding, all this is pretty much obsolete with the rise of the utf-8 / utf-16 standards.

Given that, Apache Superset probably shouldn't go out of its way to support the intricacies of how different databases support different character sets, and just tell people to convert to utf-x (either physically in your tables or using casting in views) in order to get Superset to deal with non ASCII characters.

@Habeeb556
Contributor Author

I agree with you. I'm not exactly sure about the business logic here since I'm a DBA focused on database support for analytical tools. They encountered the error because of a SELECT * FROM table query, and they might not need that column, or it could reference something within the application — I'm not sure.

Overall, it's good that we can skip this error now when using SELECT *.

@OleksandrDikov

I am experiencing the same issue.
"Failed to execute query '470' - 'SELECT * from piwik_log_visit;': 'utf-8' codec can't decode byte 0x88 in position 0: invalid start byte"
I am currently running Superset via Helm (version 0.12.11) and attempted to manually update the containers to version 4.0.2. However, the problem persists.
Could you please let me know if a newer version resolves this issue? If so, could you update the Helm repository to the latest release?

@Habeeb556
Contributor Author

@OleksandrDikov I think it will be published in v4.1.x which is still in pre-release. We're waiting for 4.1.0rc2 to test.

@mistercrunch
Member

My guess is that you're using a database/driver that supports different character sets. It seems Python itself can't know what's in that binary and therefore can't know what to do with it.

It seems that you'll need to convert to utf8 at a lower level (either the database or driver level). Which database is it? What's the exact data type of the columns that trigger the error? Any chance it's something similar to this -> https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16 . Note that Oracle has similar concepts, which I think are called "character sets".

With the rise of utf-N standards, there shouldn't be much, if any, need for these obsolete character sets.

@OleksandrDikov

This issue is related to Matomo (formerly Piwik).
For troubleshooting, I set up a new instance of Matomo with a fresh database and encountered the same error.

The error particularly occurs with the matomo_log_visit table. When I run the query SELECT * from matomo_log_visit;

I receive the following error message:
Failed to execute query '548' - 'SELECT * from matomo_log_visit;': 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

After some investigation, I found that the problem is related to binary fields. If I exclude these binary fields from the query, everything works fine.
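
Rather than dropping the binary columns altogether, one option along the lines of the "wrap the column with some database function" suggestion earlier in this thread is a view that exposes them as hex text. Below is a rough sketch using SQLite purely as a stand-in so that it runs anywhere. The table, view, and column names are hypothetical, and the real database's own hex or cast function would take the place of SQLite's hex().

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with one binary column, loosely modelled on the report.
conn.execute("CREATE TABLE log_visit (idvisit INTEGER, idvisitor BLOB)")
conn.execute(
    "INSERT INTO log_visit VALUES (?, ?)",
    (1, bytes.fromhex("d0a1b2c3d4e5f607")),
)

# The view converts the binary column to hex text, so clients never see raw bytes.
conn.execute(
    "CREATE VIEW log_visit_text AS "
    "SELECT idvisit, hex(idvisitor) AS idvisitor_hex FROM log_visit"
)

rows = conn.execute("SELECT * FROM log_visit_text").fetchall()

# Serialization now succeeds because every value is an int or a str.
payload = json.dumps(rows)
print(payload)
```

Pointing SELECT * at the view instead of the base table keeps the column visible without asking the client to guess at an encoding.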

@mistercrunch
Member

mistercrunch commented Sep 3, 2024

I'm curious if the type is BINARY or some sort of STRING / TEXT-type data type.

Superset should not try to convert a BINARY-like type as text and fail. We should just show a label such as binary or blob when encountering those types, like in the data preview pane.

I'd say that nowadays, with Python >= 3.x, all Python DB-API drivers (or the database itself at a lower level) should handle the utf-8 conversion themselves. For example, if the database allows various prehistoric character sets for whatever reason, it should hide that from Python.

Database administrators should be standardizing and killing all older/funky character set support on sight where possible.

For old legacy apps, it should be easy enough to expose views that force the utf-8 conversion upfront so that tools like Superset don't have to deal with legacy/gibberish from old databases and drivers.
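
As a rough illustration of the "show a label instead of decoding" idea above, here is a minimal sketch built on the standard library's json module. It is not Superset's actual implementation. The function name, the "[bytes]" placeholder, and the sample row are assumptions made for the example.

```python
import json


def bytes_safe_default(obj):
    """Fallback for json.dumps: label binary values instead of decoding them."""
    if isinstance(obj, (bytes, bytearray, memoryview)):
        return "[bytes]"
    # Anything else that json cannot handle should still fail loudly.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical result row containing an undecodable 8-byte value.
rows = [{"id": 1, "row_ver": bytes.fromhex("000000000012ff3a")}]

payload = json.dumps(rows, default=bytes_safe_default)
print(payload)
```

Because the handler never attempts a decode, a SELECT * that includes a binary column serializes without raising, and the user sees a placeholder rather than garbled text.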

@CamilYed

Superset should not try to convert a BINARY-like type as text and fail. We should just show a label such as binary or blob when encountering those types, like in the data preview panel.

I agree that databases and their drivers should handle UTF-8 conversion natively to avoid such errors. Superset should not attempt to convert binary-like data to text. For older systems, exposing views that force UTF-8 conversion upfront could prevent similar issues in modern tools like Superset.
