Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding unicode support to SQL server #518

Open
erikvona opened this issue Nov 4, 2022 · 12 comments
Open

Consider adding unicode support to SQL server #518

erikvona opened this issue Nov 4, 2022 · 12 comments
Labels
bug an unexpected problem or unintended behavior encoding 🔤 mssql Microsoft SQL Server
Milestone

Comments

@erikvona
Copy link

erikvona commented Nov 4, 2022

Currently, the implementation for SQL server does not allow for unicode support when writing tables, quoting literals, or using parameters.

This leads to tables being created without unicode support or special unicode characters, and queries being created with literals without special unicode characters, leading to a comparison with a non-ascii unicode character returning the wrong results.

There have been some issues, like r-dbi/DBI#215, where it's suggested to encode strings as bytes, but that results in fragile code that relies on specific SQL server settings and locales, and only solves specific issues like writing tables. I'd like to write code that "just works" if at all possible.

library(DBI)
#> Warning: package 'DBI' was built under R version 4.1.3
library(odbc)
conn <- dbConnect(odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost", database = "example", trusted_connection = "yes", encoding = "utf-8")
unicode_string <- "\u2915"
odbc::dbQuoteString(conn, unicode_string) # Expected: <SQL> N'⤕'
#> <SQL> '<U+2915>'
odbc::odbcDataType(conn, unicode_string) # Expected: nvarchar(255)
#> [1] "varchar(255)"
dbGetQuery(conn, "SELECT ?", params = list(unicode_string)) # Expected: 1  '⤕'
#>    
#> 1 ?

# Note that the connection does support it and properly returns the expected character
dbGetQuery(conn, "SELECT N'\u2915'")
#>           
#> 1 <U+2915>

Created on 2022-11-04 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.1252
#>  ctype    Dutch_Netherlands.1252
#>  tz       Europe/Berlin
#>  date     2022-11-04
#>  pandoc   2.19.2 @ H:/Programs/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.0.2)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.0.2)
#>  blob          1.2.3   2022-04-10 [1] CRAN (R 4.1.3)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.1.3)
#>  DBI         * 1.1.3   2022-06-18 [1] CRAN (R 4.1.3)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.17    2022-10-07 [1] CRAN (R 4.1.3)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.0.3)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  hms           1.1.2   2022-08-19 [1] CRAN (R 4.1.3)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.1.3)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  odbc        * 1.3.3   2021-11-30 [1] CRAN (R 4.1.2)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
#>  purrr         0.3.5   2022-10-06 [1] CRAN (R 4.1.3)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.1.3)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.1.3)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.1.3)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.1.3)
#>  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.3)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown     2.17    2022-10-07 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.1.3)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.1.2)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.1.3)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.4.2   2022-09-29 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun          0.33    2022-09-12 [1] CRAN (R 4.1.3)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.3)
#> 
#>  [1] C:/r-checkpoint/no-checkpoint
#>  [2] \\vf-d3-home/d3home$/egjvonasmuth/Programs/R/R-4.1.2/library
#> 
#> ------------------------------------------------------------------------------
@krlmlr
Copy link
Member

krlmlr commented Dec 5, 2022

Thanks. Perhaps this works better with R 4.2.1?

@krlmlr krlmlr added the mssql Microsoft SQL Server label Dec 5, 2022
@erikvona

This comment was marked as outdated.

@krlmlr

This comment was marked as outdated.

@hadley hadley added the bug an unexpected problem or unintended behavior label Apr 24, 2023
@hadley hadley added this to the v1.4.0 milestone Apr 24, 2023
@razz-matazz

This comment was marked as outdated.

@hadley
Copy link
Member

hadley commented Dec 12, 2023

Somewhat simpler/clearer reprex:

library(DBI)
con <- dbConnect(
  odbc::odbc(),
  dsn = "MicrosoftSQLServer",
  uid = "SA",
  pwd = "BoopBop123"
)

dbGetQuery(con, "SELECT ?", params = list("\u2915"))[[1]]
#> [1] "?"

df <- data.frame(x = "\u2915")
dbWriteTable(con, "testunicode", df)
dbReadTable(con, "testunicode")
#>   x
#> 1 ?
dbRemoveTable(con, "testunicode")

Created on 2023-12-12 with reprex v2.0.2.9000

@hadley hadley modified the milestones: v1.4.0, v1.5.0 Dec 14, 2023
@detule
Copy link
Collaborator

detule commented Sep 16, 2024

Hi @erikvona:

Have you considered setting the AutoTranslate connection string argument to no?

conn <- dbConnect(
  odbc::odbc(),
  dsn = "mssql_oem_db",
  uid = "<snip>",
  pwd = "<snip>", AutoTranslate="no" )

> dbGetQuery(conn, "SELECT ?", params = list("\u2915"))[[1]]
[1] "⤕"
> df <- data.frame(x = "\u2915")
> dbWriteTable(conn, "#testunicode", df)
> DBI::dbReadTable(conn, "#testunicode")
  x

@erikvona
Copy link
Author

erikvona commented Sep 16, 2024

@detule I have. However, after turning AutoTranslate off, you will only get the correct characters if the character set matches. Depending on the collation of the database, you may or may not get the correct character.

library(DBI)
library(odbc)
conn <- dbConnect(odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost",  trusted_connection = "yes", AutoTranslate = "no")


# Specifying collation can cause incorrect interpretation of non-wide characters
# Actually, getting the correct character is hard on my system and requires both changing the SQL server side and the R side
dbGetQuery(conn, "SELECT ?", params = list("\u2915"))[[1]]
#> [1] "G\xf1\xf2"
dbGetQuery(conn, "SELECT ? COLLATE  Latin1_General_100_CI_AI_SC_UTF8", params = list("\u2915"))[[1]]
#> [1] "⤕"

# Specifying collation does not affect wide characters
dbGetQuery(conn, "SELECT N'\u2915'")[[1]]
#> [1] "⤕"
dbGetQuery(conn, "SELECT N'\u2915' COLLATE  Latin1_General_100_CI_AI_SC_UTF8")[[1]]
#> [1] "⤕"

# Reproduction information
sessionInfo()$locale
#> [1] "LC_COLLATE=Dutch_Netherlands.utf8;LC_CTYPE=Dutch_Netherlands.utf8;LC_MONETARY=Dutch_Netherlands.utf8;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.utf8"
dbGetQuery(conn, "SELECT CONVERT (varchar(256), SERVERPROPERTY('collation'))")
#>                               
#> 1 SQL_Latin1_General_CP1_CI_AS

dbDisconnect(conn)

If we have native support for unicode characters in ODBC, we should be able to get the correct character regardless of database collation. If we don't, we may get code working for a specific database and a specific configuration of R, but it will be very hard to get it working for all users regardless of machine language and database configuration.

@detule
Copy link
Collaborator

detule commented Sep 16, 2024

Hey @erikvona:

Thanks. I don't have a Windows machine, but will try to replicate your environment on my Linux box this week.

Before I try that, however, what version of SQL Server are you connecting to?

FWIW I would be interested in necking down what exactly the ask is. "Adding unicode support to SQL server" seems a bit overly broad.

@erikvona
Copy link
Author

Most of this testing is done on SQL Server Developer version 16.0.1000.6.

I'm not that familiar with nanodbc, but using the Windows ODBC API, this would entail:

  • Making sure all character vectors containing non-ASCII characters are converted to UTF-16 and passed to the ODBC driver using the ODBC data types SQL_WCHAR, SQL_WVARCHAR or SQL_WLONGVARCHAR if the ODBC driver supports those data types
  • Possibly changing the assumed data type from VARCHAR to NVARCHAR for character vectors when connected to SQL server

I believe that in nanodbc, this would entail storing those strings as wide_string instead of string, but that's just a quick guess after glancing at the documentation.

@detule
Copy link
Collaborator

detule commented Sep 17, 2024

Thanks @erikvona:

I am not convinced, but happy to be wrong. Is the summary below correct?

  • You are not seeing any issues when writing and reading unicode data in wide string columns - NCHAR(n), NVARCHAR.
  • When creating a table using dbWriteTable, for example, you can override the inferred data type from VARCHAR to NVARCHAR using the field.types argument.
  • You are seeing issues when round-triping unicode data in VAR/CHAR columns. Looks like limited support for unicode VAR/CHAR was added starting with SQL Server 2019 ( v15? ) - so probably should be in the version you are using. I suspect this is something we need to understand better.

Does that sound reasonable? Thanks again for iterating. I want to make sure we are on the same page in terms of what you are seeing as not working.

Cheers

@erikvona
Copy link
Author

erikvona commented Sep 17, 2024

@detule Thanks for your summary, and the time invested.

Indeed, if I'm working with strings where the data type is well-defined, then it's working, except if they have a non-unicode type.

The main problem I'm having is when working with parameters with an undefined data type, SELECT ? is the most simple example. Then, the ODBC client determines the data type when binding the parameter, as I understand it. And this is done as a non-unicode string.

In the following example, we can see using a SELECT ? INTO query, that the underlying data type is VARCHAR, and that this results in incorrect data being stored in the table.

library(DBI)
conn <- dbConnect(odbc::odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost",  trusted_connection = "yes", AutoTranslate = "no", encoding = "utf8")
dbExecute(conn, "SELECT ? As Col1 INTO tmp", params = list("\u2915"))
#> [1] 1
dbGetQuery(conn, "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'tmp'")
#>   COLUMN_NAME DATA_TYPE
#> 1        Col1   varchar
dbReadTable(conn, "tmp")
#> Error in eval(expr, envir, enclos): Incomplete multibyte sequence
#> Warning in dbClearResult(rs): Result already cleared
dbExecute(conn, "DROP TABLE tmp")
#> [1] 0

In other ODBC implementations, e.g. .NET/C/etc., I'm allowed to specify the data type when passing a parameter, so I can ensure the right type is used. E.g.:

OdbcCommand cmd = conn.CreateCommand();
cmd.CommandText = "SELECT ?";
cmd.Parameters.Add("@par1", OdbcType.NVarChar).Value = "\u2915"

Regarding unicode support for SQL server, as I understand it, it is supported as a storage medium, not as a code page. See the following example:

library(DBI)
conn <- dbConnect(odbc::odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost",  trusted_connection = "yes", AutoTranslate = "no", encoding = "utf8")
dbExecute(conn, "CREATE TABLE tmp (col1 VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);")
#> [1] 0
dbExecute(conn, "INSERT INTO tmp SELECT ? COLLATE Latin1_General_100_CI_AI_SC_UTF8", params = list("\u2915"))
#> [1] 1
dbReadTable(conn, "tmp")
#>   col1
#> 1  ⤕

As we can see, even though both the table and the insert query use an UTF-8 collation, and I specify encoding = "utf8" when connecting, the data inserted still gets mojibaked. As I understand it, the UTF-8 sequence still gets interpreted as Latin1 by SQL server as specified in the collation, but then gets stored as UTF-8 as specified in the collation.

AFAIK, all UTF-8 collations still use code pages to interpret data passed as VARCHAR (source indicating the code page part is separate from whether or not to use UTF-8). There unfortunately are no code page 65001 collations in SQL server (yet), all UTF-8 collations specify a different code page as well, and only specify the use of UTF-8 at the end, not in place of the code page. This means I can't adjust collation on the server to get this to work for all clients.

This may be a more niche problem than I originally thought, since now that I understand it better it is easier to work around and I thought initially that more users would encounter this, but it's still a problem.

@detule
Copy link
Collaborator

detule commented Sep 19, 2024

Thanks for the write up. I think I have a better idea what the issue you are facing is.

BLUF: If you have the ability to install the development/master@github version of this package, you could try the following

dbExecute(conn, "CREATE TABLE tmp (col1 VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8);")
res <- dbSendQuery(conn, "INSERT INTO tmp SELECT ? COLLATE Latin1_General_100_CI_AI_SC_UTF8")
df <- data.frame( "param_index" = 1, "data_type" = -9, "column_size"=10, decimal_digits=0)
odbc:::result_describe_parameters(res@ptr, df)
dbBind(res, params = list("\u2915"))
dbClearResult(res)
dbReadTable(conn, "tmp")

In a bit more detail: You are right, package:odbc does not have a public/exported method allowing the user to specify the parameter meta data ( type, size, etc ). Rather, we rely on the SQLDescribeParam ODBC API endpoint to accurately do this inference for us. I am not sure how the closed OEM driver implements this endpoint, but I have a guess and I suspect what happens on the back end is something similar to:

 > dbGetQuery(conn, "EXEC sp_describe_undeclared_parameters @tsql=N'INSERT INTO tmp SELECT @par COLLATE Latin1_General_100_CI_AI_SC_UTF8'")$suggested_system_type_name
[1] "varchar(50)"

Not surprising, I guess, as this call happens before the data is bound to the buffer, and the driver can only guess based on the target. At any rate, historically we have had some issues with SQLDescribeParam and hence the private odbc:::result_describe_parameters.

In terms of what to do about this.

  • Arguably if you want good, reliable performance when storing UNICODE characters one should store these data in N/VAR/CHAR. Yes VARCHAR, in modern versions of SQL Server can handle UNICODE encoded data, but that seems like more superficial support; if you are trying to do something more sophisticated with parametrized queries, it starts to break down a bit.
  • I think making the API public is probably not the solution as that would be exposing the innards of ODBC to the end user, and I think that's a bit against the philosophy of this package.
  • We could internalize some of the complexity and when binding parameters to prepared statements we could inspect character columns and provide parameter description overrides if we detect unicode characters. There is a performance cost to this, and also we would need to then make sure that this pattern is well behaved across all the back ends, various drivers and architectures that this package supports. I am not saying we shouldn't do this, but it's a non trivial cost to support what is arguably somewhat specific use case ( one for which some work-a-rounds do exist, however ugly ).

At any rate, if you agree, I think we can close this issue and open a more targeted feature request. Perhaps something like "Add an argument to dbBind allowing the user to override parameter metadata."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior encoding 🔤 mssql Microsoft SQL Server
Projects
None yet
Development

No branches or pull requests

6 participants
@hadley @detule @krlmlr @razz-matazz @erikvona and others