UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte #29457

Habeeb556 · 2024-07-02T14:52:42Z

Bug description

I encountered the following error when querying select * from table in SQL Lab. This issue occurred after upgrading from Superset version 2.1.3 to version 4.0.1.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte

How to reproduce the bug

Any table column with a TIMESTAMP datatype generates this error.

Screenshots/recordings

Superset version

4.0.1

Python version

3.11

Node version

I don't know

Browser

Chrome

Additional context

No response

Checklist

I have searched Superset docs and Slack and didn't find a solution to my problem.
I have searched the GitHub issue tracker and didn't find a similar bug report.
I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.

dosubot · 2024-07-02T15:00:24Z

To resolve the UnicodeDecodeError when querying a table with a TIMESTAMP datatype in SQL Lab after upgrading from Superset version 2.1.3 to version 4.0.1, ensure that the timestamp columns are properly normalized to a pandas-supported datetime format. This can be done using the utils.normalize_dttm_col function in the get_df method of the BaseViz class.

Here is the relevant code snippet from the get_df method:

if not df.empty:
    utils.normalize_dttm_col(
        df=df,
        dttm_cols=tuple(
            [
                DateColumn.get_legacy_time_column(
                    timestamp_format=timestamp_format,
                    offset=self.datasource.offset,
                    time_shift=self.form_data.get("time_shift"),
                )
            ]
        ),
    )

    if self.enforce_numerical_metrics:
        self.df_metrics_to_num(df)

    df.replace([np.inf, -np.inf], np.nan, inplace=True)

Ensure that this normalization step is correctly applied to handle the timestamp columns and avoid UnicodeDecodeError ^[1].

Additionally, the Pinot database connector is known to cause issues with TIMESTAMP columns in Superset. The Pinot driver infers TIMESTAMP columns as LONG, which can lead to problems such as UnicodeDecodeError. Here is a quick fix for the Pinot driver:

# Pinot driver infers TIMESTAMP column as LONG, so make the quick fix.
# When the Pinot driver fixes this bug, the current method could be removed.
if isinstance(sqla_column_type, types.TIMESTAMP):
    return sqla_column_type.compile().upper()

This fix ensures that TIMESTAMP columns are correctly interpreted and processed ^[2].

mistercrunch · 2024-07-02T20:43:59Z

#28266 was first cherry-picked in 4.0.2

Please try upgrading and confirm the issue is fixed

Habeeb556 · 2024-07-03T08:47:22Z

@mistercrunch unfortunately, this did not fix the issue. I upgraded to version 4.0.2 and encountered the same error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte

Additionally, I noticed that this issue occurs only when selecting columns with the TIMESTAMP datatype. All other columns work fine. It worked correctly with version 2.1.3 when I switched back.

mistercrunch · 2024-07-03T15:23:23Z

Full stack trace please! Also curious which database engine/driver/version you are using.

Habeeb556 · 2024-07-03T17:32:31Z

Database engine: mssql+pyodbc
Version: 5.1.0

Stack trace:

'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Traceback (most recent call last):
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
   rv = self.dispatch_request()
        ^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 127, in wraps
   raise ex
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 121, in wraps
   duration, response = time_function(f, self, *args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/core.py", line 1470, in time_function
   response = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/api/__init__.py", line 183, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/log.py", line 255, in wrapper
   value = f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/sqllab/api.py", line 346, in get_results
   payload = json.dumps(
             ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/__init__.py", line 395, in dumps
   **kw).encode(obj)
         ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 298, in encode
   chunks = self.iterencode(o)
            ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 379, in iterencode
   return _iterencode(o, 0)
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
2024-07-03 20:26:50,670:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Traceback (most recent call last):
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
   rv = self.dispatch_request()
        ^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 127, in wraps
   raise ex
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 121, in wraps
   duration, response = time_function(f, self, *args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/core.py", line 1470, in time_function
   response = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/api/__init__.py", line 183, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/log.py", line 255, in wrapper
   value = f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/sqllab/api.py", line 346, in get_results
   payload = json.dumps(
             ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/__init__.py", line 395, in dumps
   **kw).encode(obj)
         ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 298, in encode
   chunks = self.iterencode(o)
            ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 379, in iterencode
   return _iterencode(o, 0)
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Triggering query_id: 41782
2024-07-03 20:26:50,944:INFO:superset.commands.sql_lab.execute:Triggering query_id: 41782
Query 41782: Running query on a Celery worker
2024-07-03 20:26:50,954:INFO:superset.sqllab.sql_json_executer:Query 41782: Running query on a Celery worker
'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Traceback (most recent call last):
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
   rv = self.dispatch_request()
        ^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 127, in wraps
   raise ex
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 121, in wraps
   duration, response = time_function(f, self, *args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/core.py", line 1470, in time_function
   response = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/api/__init__.py", line 183, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/log.py", line 255, in wrapper
   value = f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/sqllab/api.py", line 346, in get_results
   payload = json.dumps(
             ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/__init__.py", line 395, in dumps
   **kw).encode(obj)
         ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 298, in encode
   chunks = self.iterencode(o)
            ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 379, in iterencode
   return _iterencode(o, 0)
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
2024-07-03 20:26:59,507:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Traceback (most recent call last):
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
   rv = self.dispatch_request()
        ^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 127, in wraps
   raise ex
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 121, in wraps
   duration, response = time_function(f, self, *args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/core.py", line 1470, in time_function
   response = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/api/__init__.py", line 183, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/log.py", line 255, in wrapper
   value = f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/sqllab/api.py", line 346, in get_results
   payload = json.dumps(
             ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/__init__.py", line 395, in dumps
   **kw).encode(obj)
         ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 298, in encode
   chunks = self.iterencode(o)
            ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 379, in iterencode
   return _iterencode(o, 0)
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
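
The traceback ends inside simplejson's encoder, which, as far as I understand, tries to decode bytes values as UTF-8 by default. Below is a minimal standard-library sketch of that failure mode. It assumes the column is SQL Server's TIMESTAMP type, which is a synonym for rowversion (an 8-byte binary counter, not a date/time). The sample byte values and the helper name try_decode are made up for illustration and are not taken from the actual table:

```python
def try_decode(raw: bytes) -> str:
    """Decode raw as UTF-8, returning the error text instead of raising."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


# Hypothetical 8-byte rowversion-like values (assumption: not real data).
# 0xD4 starts a 2-byte UTF-8 sequence, but 0x31 is not a continuation byte.
sample_d4 = bytes([0, 0, 0, 0, 0, 0, 0xD4, 0x31])
# 0xFF can never start a UTF-8 sequence.
sample_ff = bytes([0, 0, 0, 0, 0, 0, 0xFF, 0x01])

print(try_decode(sample_d4))
print(try_decode(sample_ff))
```

Both messages match the shape of the errors reported above, which is consistent with the column being binary data rather than text.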

mistercrunch · 2024-07-03T20:35:28Z

Oh it appears 4.0.2 does not include the large json refactor that centralized all calls to superset/utils/json.py here -> #28702

This should make 4.1.x, I believe. I don't recommend bringing in this large refactor as a cherry-pick, as it'll merge-conflict heavily.

mistercrunch · 2024-07-03T20:36:25Z

@Habeeb556 if you have the ability to test against the master branch, you could confirm that it's working there. I'm tempted to close the issue, but will wait until you confirm the fix.

Habeeb556 · 2024-07-03T22:08:00Z

@mistercrunch, I have some good news and bad news. The good news is that I think I have successfully pushed to the master branch, and the query is running fine. However, the bad news is that the output is incorrectly formatted with Chinese characters.

I'm not sure if this is a bug or if my push was incorrect and missed something.

mistercrunch · 2024-07-08T17:54:09Z

This is where the [bytes] come from:
https://github.com/apache/superset/blob/master/superset/utils/json.py#L102

The Chinese characters would show if/when your binary blob is decodable as utf-8 or utf-16.

What is in your binary blob? What do you expect to see?

Maybe you're using some funky other encoding or "collation". At this point, if you're using something other than utf-N in this day and age, you may want to standardize, or wrap the column with some database function that brings things to a modern encoding.
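
To illustrate the two behaviours described above, here is a rough sketch of a "pessimistic" JSON fallback built on the standard json module. It is not Superset's actual code in superset/utils/json.py. The function name bytes_safe_default and the sample values are hypothetical:

```python
import json


def bytes_safe_default(obj):
    """Fallback for json.dumps: never fail on binary values."""
    if isinstance(obj, (bytes, bytearray, memoryview)):
        try:
            # Text stored as bytes decodes cleanly and is shown as-is.
            return bytes(obj).decode("utf-8")
        except UnicodeDecodeError:
            # Anything undecodable is replaced by a placeholder label.
            return "[bytes]"
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = json.dumps(
    {"row_ver": b"\x00\xff", "name": b"abc"},
    default=bytes_safe_default,
)
print(payload)

# Arbitrary bytes that happen to be valid UTF-16 can render as CJK text:
# 0x4E2D and 0x6587 (little-endian here) are common Chinese characters.
looks_chinese = b"\x2d\x4e\x87\x65".decode("utf-16-le")
print(looks_chinese)
```

The second part shows why a binary value can appear as Chinese characters: many byte pairs map to code points in the CJK range when read as UTF-16.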

Habeeb556 · 2024-07-09T07:55:54Z

Yes, I checked this now with the old version 2.1.3, and it returned the same value [bytes] when running. So, I can confirm that this master push with version 4.x is working.

Regarding the binary blob, here's what I expect to see when running directly from the SQL server.

mistercrunch · 2024-07-09T17:11:04Z

But what's in there? Some other language/character set? Guessing these bytes represent something intelligible (?)

Having worked with SQL Server a long time ago, I'm guessing this has to do with "collation" and Microsoft SQL Server's deep support for different character sets. From my understanding, all this is pretty much obsolete with the rise of the utf-8 / utf-16 standards.

Given that, Apache Superset probably shouldn't go out of its way to support the intricacies of how different databases support different character sets, and should just tell people to convert to utf-x (either physically in their tables or using casting in views) in order to get Superset to deal with non-ASCII characters.

Habeeb556 · 2024-07-09T18:33:19Z

I agree with you. I'm not exactly sure about the business logic here since I'm a DBA focused on database support for analytical tools. They encountered the error because of a SELECT * FROM table query, and they might not need that column, or it could reference something within the application — I'm not sure.

Overall, it's good that we can skip this error now when using SELECT *.

OleksandrDikov · 2024-08-13T16:03:58Z

I am experiencing the same issue.
"Failed to execute query '470' - 'SELECT * from piwik_log_visit;': 'utf-8' codec can't decode byte 0x88 in position 0: invalid start byte"
I am currently running Superset via Helm (version 0.12.11) and attempted to manually update the containers to version 4.0.2. However, the problem persists.
Could you please let me know if a newer version resolves this issue? If so, could you update the Helm repository to the latest release?

Habeeb556 · 2024-08-13T21:21:50Z

@OleksandrDikov I think it will be published in v4.1.x which is still in pre-release. We're waiting for 4.1.0rc2 to test.

mistercrunch · 2024-08-16T19:22:12Z

My guess is that you're using a database/driver that supports different character sets. It seems Python itself can't know what's in that binary and therefore can't know what to do with it.

It seems that you'll need to convert to utf8 at a lower level (either the database or driver level). Which database is it? What's the exact data type of the columns that trigger the error? Any chance it's something similar to this -> https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16 . Note that Oracle has similar concepts, which I think are called "character sets".

With the rise of utf-N standards, there shouldn't be much, if any, need for these obsolete character sets.
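
A small sketch of why Python cannot guess the encoding on its own: the same bytes decode to different text depending on the codec, and fail outright as UTF-8. The sample string and the choice of cp1252 and cp1251 are assumptions for illustration only:

```python
# Text written by a legacy application using the Windows-1252 code page.
raw = "café".encode("cp1252")  # the é is stored as the single byte 0xE9

# The right codec recovers the original text.
as_cp1252 = raw.decode("cp1252")

# A different single-byte codec silently yields different text.
as_cp1251 = raw.decode("cp1251")

# UTF-8 rejects the bytes: 0xE9 announces a multi-byte sequence that never arrives.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_cp1252, as_cp1251, utf8_ok)
```

Only the database or driver knows which character set a column was written in, which is why the conversion has to happen there.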

OleksandrDikov · 2024-08-28T13:22:49Z

This issue is related to Matomo (formerly Piwik).
For troubleshooting, I set up a new instance of Matomo with a fresh database and encountered the same error.

The error particularly occurs with the matomo_log_visit table. When I run the query SELECT * from matomo_log_visit;

I receive the following error message:
Failed to execute query '548' - 'SELECT * from matomo_log_visit;': 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

After some investigation, I found that the problem is related to binary fields. If I exclude these binary fields from the query, everything works fine.
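
As a sketch of that workaround, the snippet below builds a SELECT that skips binary columns instead of using SELECT *. It uses an in-memory SQLite database as a stand-in for the real server. The table name log_visit, its columns, and the helper non_binary_columns are hypothetical and do not reflect the actual Matomo schema:

```python
import sqlite3

BINARY_PREFIXES = ("BLOB", "BINARY", "VARBINARY")


def non_binary_columns(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return column names whose declared type is not binary-like."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so only pass trusted names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        name
        for _cid, name, col_type, *_rest in rows
        if not col_type.upper().startswith(BINARY_PREFIXES)
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit (idvisit INTEGER, idvisitor BINARY(8), referer_url TEXT)"
)
conn.execute(
    "INSERT INTO log_visit VALUES (?, ?, ?)",
    (1, b"\xd0\x01\x88", "https://example.com"),
)

cols = non_binary_columns(conn, "log_visit")
query = f"SELECT {', '.join(cols)} FROM log_visit"
rows = conn.execute(query).fetchall()
print(query, rows)
```

On another database the column metadata would come from that engine's own catalog (or a library such as SQLAlchemy), but the idea of filtering by declared type is the same.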

mistercrunch · 2024-09-03T21:37:39Z

I'm curious if the type is BINARY or some sort of STRING / TEXT-type data type.

Superset should not try to convert a BINARY-like type to text and fail. We should just show a label like "binary" or "blob" when encountering those types, like in the data preview pane.

I'd say that nowadays, with Python >= 3.x, all Python DB-API drivers (or the database itself at a lower level) should handle the utf-8 conversion themselves. If the database allows various prehistoric character sets for whatever reason, it should hide that from Python.

Database administrators should be standardizing and killing all older/funky character set support on sight where possible.

For old legacy apps, it should be easy enough to expose views that force the utf-8 conversion upfront so that tools like Superset don't have to deal with legacy/gibberish from old databases and drivers.
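
A rough sketch of the view idea, again with in-memory SQLite as a stand-in. Note that this swaps in a different conversion than the one described above: it exposes a binary column as hexadecimal text via SQLite's hex() function rather than re-encoding legacy text to utf-8, since SQLite has no collation conversion of that kind. The table, view, and column names are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_events (id INTEGER, row_ver BLOB)")
conn.execute(
    "INSERT INTO legacy_events VALUES (?, ?)",
    (1, bytes([0, 0, 0, 0, 0, 0, 0xD4, 0x31])),
)

# The view converts the binary column to text once, at the database level,
# so downstream tools only ever see JSON-friendly values.
conn.execute(
    "CREATE VIEW legacy_events_text AS "
    "SELECT id, hex(row_ver) AS row_ver_hex FROM legacy_events"
)

rows = conn.execute("SELECT * FROM legacy_events_text").fetchall()
payload = json.dumps(rows)
print(payload)
```

Querying the view with SELECT * then serializes without any special handling on the client side.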

CamilYed · 2024-09-27T11:07:28Z

Superset should not try to convert a BINARY-like type as text and fail. We should just show a label like "binary" or "blob" when encountering those types like in the data preview panel.

I agree that databases and their drivers should handle UTF-8 conversion natively to avoid such errors. Superset should not attempt to convert binary-like data to text. For older systems, exposing views that force UTF-8 conversion upfront could prevent similar issues in modern tools like Superset.

dosubot bot added the #bug:regression and sqllab labels on Jul 2, 2024

Habeeb556 mentioned this issue Jul 2, 2024

fix: use pessimistic json encoder in SQL Lab #28266

Merged
