Consider adding unicode support to SQL server #518

erikvona · 2022-11-04T13:13:29Z

Currently, the implementation for SQL server does not allow for unicode support when writing tables, quoting literals, or using parameters.

This leads to tables being created without unicode support or special unicode characters, and queries being created with literals without special unicode characters, leading to a comparison with a non-ascii unicode character returning the wrong results.

There have been some issues, like r-dbi/DBI#215, where it's suggested to encode strings as bytes, but that results in fragile code that relies on specific SQL server settings and locales, and only solves specific issues like writing tables. I'd like to write code that "just works" if at all possible.

library(DBI)
#> Warning: package 'DBI' was built under R version 4.1.3
library(odbc)
conn <- dbConnect(odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost", database = "example", trusted_connection = "yes", encoding = "utf-8")
unicode_string <- "\u2915"
odbc::dbQuoteString(conn, unicode_string) # Expected: <SQL> N'⤕'
#> <SQL> '<U+2915>'
odbc::odbcDataType(conn, unicode_string) # Expected: nvarchar(255)
#> [1] "varchar(255)"
dbGetQuery(conn, "SELECT ?", params = list(unicode_string)) # Expected: 1  '⤕'
#>    
#> 1 ?

# Note that the connection does support it and properly returns the expected character
dbGetQuery(conn, "SELECT N'\u2915'")
#>           
#> 1 <U+2915>

^{Created on 2022-11-04 with reprex v2.0.2}

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.1252
#>  ctype    Dutch_Netherlands.1252
#>  tz       Europe/Berlin
#>  date     2022-11-04
#>  pandoc   2.19.2 @ H:/Programs/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.0.2)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.0.2)
#>  blob          1.2.3   2022-04-10 [1] CRAN (R 4.1.3)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.1.3)
#>  DBI         * 1.1.3   2022-06-18 [1] CRAN (R 4.1.3)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.17    2022-10-07 [1] CRAN (R 4.1.3)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.0.3)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  hms           1.1.2   2022-08-19 [1] CRAN (R 4.1.3)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.1.3)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  odbc        * 1.3.3   2021-11-30 [1] CRAN (R 4.1.2)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
#>  purrr         0.3.5   2022-10-06 [1] CRAN (R 4.1.3)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.1.3)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.1.3)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.1.3)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.1.3)
#>  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.3)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown     2.17    2022-10-07 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.1.3)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.1.2)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.1.3)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.4.2   2022-09-29 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun          0.33    2022-09-12 [1] CRAN (R 4.1.3)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.3)
#> 
#>  [1] C:/r-checkpoint/no-checkpoint
#>  [2] \\vf-d3-home/d3home$/egjvonasmuth/Programs/R/R-4.1.2/library
#> 
#> ------------------------------------------------------------------------------

krlmlr · 2022-12-05T04:35:56Z

Thanks. Perhaps this works better with R 4.2.1?

hadley · 2023-12-12T14:45:35Z

Somewhat simpler/clearer reprex:

library(DBI)
con <- dbConnect(
  odbc::odbc(),
  dsn = "MicrosoftSQLServer",
  uid = "SA",
  pwd = "BoopBop123"
)

dbGetQuery(con, "SELECT ?", params = list("\u2915"))[[1]]
#> [1] "?"

df <- data.frame(x = "\u2915")
dbWriteTable(con, "testunicode", df)
dbReadTable(con, "testunicode")
#>   x
#> 1 ?
dbRemoveTable(con, "testunicode")

^{Created on 2023-12-12 with reprex v2.0.2.9000}

detule · 2024-09-16T09:47:20Z

Hi @erikvona:

Have you considered setting the AutoTranslate connection string argument to no?

conn <- dbConnect(
  odbc::odbc(),
  dsn = "mssql_oem_db",
  uid = "<snip>",
  pwd = "<snip>", AutoTranslate="no" )

> dbGetQuery(conn, "SELECT ?", params = list("\u2915"))[[1]]
[1] "⤕"
> df <- data.frame(x = "\u2915")
> dbWriteTable(conn, "#testunicode", df)
> DBI::dbReadTable(conn, "#testunicode")
  x

erikvona · 2024-09-16T10:30:48Z

@detule I have. However, after turning AutoTranslate off, you will only get the correct characters if the character set matches. Depending on the collation of the database, you may or may not get the correct character.

library(DBI)
library(odbc)
conn <- dbConnect(odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost",  trusted_connection = "yes", AutoTranslate = "no")


# Specifying collation can cause incorrect interpretation of non-wide characters
# Actually, getting the correct character is hard on my system and requires both changing the SQL server side and the R side
dbGetQuery(conn, "SELECT ?", params = list("\u2915"))[[1]]
#> [1] "G\xf1\xf2"
dbGetQuery(conn, "SELECT ? COLLATE  Latin1_General_100_CI_AI_SC_UTF8", params = list("\u2915"))[[1]]
#> [1] "â¤•"

# Specifying collation does not affect wide characters
dbGetQuery(conn, "SELECT N'\u2915'")[[1]]
#> [1] "⤕"
dbGetQuery(conn, "SELECT N'\u2915' COLLATE  Latin1_General_100_CI_AI_SC_UTF8")[[1]]
#> [1] "⤕"

# Reproduction information
sessionInfo()$locale
#> [1] "LC_COLLATE=Dutch_Netherlands.utf8;LC_CTYPE=Dutch_Netherlands.utf8;LC_MONETARY=Dutch_Netherlands.utf8;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.utf8"
dbGetQuery(conn, "SELECT CONVERT (varchar(256), SERVERPROPERTY('collation'))")
#>                               
#> 1 SQL_Latin1_General_CP1_CI_AS

dbDisconnect(conn)

If we have native support for unicode characters in ODBC, we should be able to get the correct character regardless of database collation. If we don't, we may get code working for a specific database and a specific configuration of R, but it will be very hard to get it working for all users regardless of machine language and database configuration.

detule · 2024-09-16T12:22:31Z

Hey @erikvona:

Thanks. I don't have a Windows machine, but will try to replicate your environment on my Linux box this week.

Before I try that, however, what version of SQL Server are you connecting to?

FWIW I would be interested in necking down what exactly the ask is. "Adding unicode support to SQL server" seems a bit overly broad.

erikvona · 2024-09-16T12:57:50Z

Most of this testing is done on SQL Server Developer version 16.0.1000.6.

I'm not that familiar with nanodbc, but using the Windows ODBC API, this would entail:

Making sure all character vectors containing non-ASCII characters are converted to UTF-16 and passed to the ODBC driver using the ODBC data types SQL_WCHAR, SQL_WVARCHAR or SQL_WLONGVARCHAR if the ODBC driver supports those data types
Possibly changing the assumed data type from VARCHAR to NVARCHAR for character vectors when connected to SQL server

I believe that in nanodbc, this would entail storing those strings as wide_string instead of string, but that's just a quick guess after glancing at the documentation.

detule · 2024-09-17T00:15:36Z

Thanks @erikvona:

I am not convinced, but happy to be wrong. Is the summary below correct?

You are not seeing any issues when writing and reading unicode data in wide string columns - NCHAR(n), NVARCHAR.
When creating a table using dbWriteTable, for example, you can override the inferred data type from VARCHAR to NVARCHAR using the field.types argument.
You are seeing issues when round-triping unicode data in VAR/CHAR columns. Looks like limited support for unicode VAR/CHAR was added starting with SQL Server 2019 ( v15? ) - so probably should be in the version you are using. I suspect this is something we need to understand better.

Does that sound reasonable? Thanks again for iterating. I want to make sure we are on the same page in terms of what you are seeing as not working.

Cheers

erikvona · 2024-09-17T08:03:21Z

@detule Thanks for your summary, and the time invested.

Indeed, if I'm working with strings where the data type is well-defined, then it's working, except if they have a non-unicode type.

The main problem I'm having is when working with parameters with an undefined data type, SELECT ? is the most simple example. Then, the ODBC client determines the data type when binding the parameter, as I understand it. And this is done as a non-unicode string.

In the following example, we can see using a SELECT ? INTO query, that the underlying data type is VARCHAR, and that this results in incorrect data being stored in the table.

library(DBI)
conn <- dbConnect(odbc::odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost",  trusted_connection = "yes", AutoTranslate = "no", encoding = "utf8")
dbExecute(conn, "SELECT ? As Col1 INTO tmp", params = list("\u2915"))
#> [1] 1
dbGetQuery(conn, "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'tmp'")
#>   COLUMN_NAME DATA_TYPE
#> 1        Col1   varchar
dbReadTable(conn, "tmp")
#> Error in eval(expr, envir, enclos): Incomplete multibyte sequence
#> Warning in dbClearResult(rs): Result already cleared
dbExecute(conn, "DROP TABLE tmp")
#> [1] 0

In other ODBC implementations, e.g. .NET/C/etc., I'm allowed to specify the data type when passing a parameter, so I can ensure the right type is used. E.g.:

OdbcCommand cmd = conn.CreateCommand();
cmd.CommandText = "SELECT ?";
cmd.Parameters.Add("@par1", OdbcType.NVarChar).Value = "\u2915"

Regarding unicode support for SQL server, as I understand it, it is supported as a storage medium, not as a code page. See the following example:

library(DBI)
conn <- dbConnect(odbc::odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost",  trusted_connection = "yes", AutoTranslate = "no", encoding = "utf8")
dbExecute(conn, "CREATE TABLE tmp (col1 VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);")
#> [1] 0
dbExecute(conn, "INSERT INTO tmp SELECT ? COLLATE Latin1_General_100_CI_AI_SC_UTF8", params = list("\u2915"))
#> [1] 1
dbReadTable(conn, "tmp")
#>   col1
#> 1  â¤•

As we can see, even though both the table and the insert query use an UTF-8 collation, and I specify encoding = "utf8" when connecting, the data inserted still gets mojibaked. As I understand it, the UTF-8 sequence still gets interpreted as Latin1 by SQL server as specified in the collation, but then gets stored as UTF-8 as specified in the collation.

AFAIK, all UTF-8 collations still use code pages to interpret data passed as VARCHAR (source indicating the code page part is separate from whether or not to use UTF-8). There unfortunately are no code page 65001 collations in SQL server (yet), all UTF-8 collations specify a different code page as well, and only specify the use of UTF-8 at the end, not in place of the code page. This means I can't adjust collation on the server to get this to work for all clients.

This may be a more niche problem than I originally thought, since now that I understand it better it is easier to work around and I thought initially that more users would encounter this, but it's still a problem.

detule · 2024-09-19T10:36:33Z

Thanks for the write up. I think I have a better idea what the issue you are facing is.

BLUF: If you have the ability to install the development/master@github version of this package, you could try the following

dbExecute(conn, "CREATE TABLE tmp (col1 VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);")
res <- dbSendQuery(conn, "INSERT INTO tmp SELECT ? COLLATE Latin1_General_100_CI_AI_SC_UTF8")
df <- data.frame( "param_index" = 1, "data_type" = -9, "column_size"=10, decimal_digits=0)
odbc:::result_describe_parameters(res@ptr, df)
dbBind(res, params = list("\u2915"))
dbClearResult(res)
dbReadTable(conn, "tmp")

In a bit more detail: You are right, package:odbc does not have a public/exported method allowing the user to specify the parameter meta data ( type, size, etc ). Rather, we rely on the SQLDescribeParam ODBC API endpoint to accurately do this inference for us. I am not sure how the closed OEM driver implements this endpoint, but I have a guess and I suspect what happens on the back end is something similar to:

 > dbGetQuery(conn, "EXEC sp_describe_undeclared_parameters @tsql=N'INSERT INTO tmp SELECT @par COLLATE Latin1_General_100_CI_AI_SC_UTF8'")$suggested_system_type_name
[1] "varchar(50)"

Not surprising, I guess, as this call happens before the data is bound to the buffer, and the driver can only guess based on the target. At any rate, historically we have had some issues with SQLDescribeParam and hence the private odbc:::result_describe_parameters.

In terms of what to do about this.

Arguably if you want good, reliable performance when storing UNICODE characters one should store these data in N/VAR/CHAR. Yes VARCHAR, in modern versions of SQL Server can handle UNICODE encoded data, but that seems like more superficial support; if you are trying to do something more sophisticated with parametrized queries, it starts to break down a bit.
I think making the API public is probably not the solution as that would be exposing the innards of ODBC to the end user, and I think that's a bit against the philosophy of this package.
We could internalize some of the complexity and when binding parameters to prepared statements we could inspect character columns and provide parameter description overrides if we detect unicode characters. There is a performance cost to this, and also we would need to then make sure that this pattern is well behaved across all the back ends, various drivers and architectures that this package supports. I am not saying we shouldn't do this, but it's a non trivial cost to support what is arguably somewhat specific use case ( one for which some work-a-rounds do exist, however ugly ).

At any rate, if you agree, I think we can close this issue and open a more targeted feature request. Perhaps something like "Add an argument to dbBind allowing the user to override parameter metadata."

krlmlr added the mssql Microsoft SQL Server label Dec 5, 2022

This comment was marked as outdated.

Sign in to view

detule added the encoding 🔤 label Jan 13, 2023

hadley added the bug an unexpected problem or unintended behavior label Apr 24, 2023

hadley added this to the v1.4.0 milestone Apr 24, 2023

This comment was marked as outdated.

Sign in to view

hadley modified the milestones: v1.4.0, v1.5.0 Dec 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider adding unicode support to SQL server #518

Consider adding unicode support to SQL server #518

erikvona commented Nov 4, 2022

krlmlr commented Dec 5, 2022

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

hadley commented Dec 12, 2023 •

edited

Loading

detule commented Sep 16, 2024

erikvona commented Sep 16, 2024 •

edited

Loading

detule commented Sep 16, 2024

erikvona commented Sep 16, 2024

detule commented Sep 17, 2024 •

edited

Loading

erikvona commented Sep 17, 2024 •

edited

Loading

detule commented Sep 19, 2024

Consider adding unicode support to SQL server #518

Consider adding unicode support to SQL server #518

Comments

erikvona commented Nov 4, 2022

krlmlr commented Dec 5, 2022

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

hadley commented Dec 12, 2023 • edited Loading

detule commented Sep 16, 2024

erikvona commented Sep 16, 2024 • edited Loading

detule commented Sep 16, 2024

erikvona commented Sep 16, 2024

detule commented Sep 17, 2024 • edited Loading

erikvona commented Sep 17, 2024 • edited Loading

detule commented Sep 19, 2024

hadley commented Dec 12, 2023 •

edited

Loading

erikvona commented Sep 16, 2024 •

edited

Loading

detule commented Sep 17, 2024 •

edited

Loading

erikvona commented Sep 17, 2024 •

edited

Loading