Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"right truncation" error from sqlsrv_fetch_array of long VARCHAR(max) column containing 8-bit characters #1261

Closed
bjj opened this issue May 25, 2021 · 18 comments

Comments

@bjj
Copy link

bjj commented May 25, 2021

PHP Driver version or file name

sqlsrv support => enabled
ExtensionVer => 5.9.0+14119

Directive => Local Value => Master Value
sqlsrv.WarningsReturnAsErrors => On => On
sqlsrv.LogSeverity => 0 => 0
sqlsrv.LogSubsystems => 0 => 0
sqlsrv.ClientBufferMaxKBSize => 10240 => 10240
sqlsrv.SetLocaleInfo => 2 => 2

SQL Server version

15.0.2000.5 (earlier versions also repro)

Client operating system

Ubuntu 20.04.2 LTS

PHP version

PHP 7.4.3 (cli) (built: Oct 6 2020 15:47:56) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies
with Xdebug v2.9.2, Copyright (c) 2002-2020, by Derick Rethans

Microsoft ODBC Driver version

unixodbc-dev | 2.3.7 | https://packages.microsoft.com/ubuntu/20.04/prod focal/main amd64 Packages
unixodbc-dev | 2.3.6-0.1build1 | http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

Problem description

When using sqlsrv_fetch_array fails with String data, right truncation under the following conditions:

  • The column is type VARCHAR(max) or the column type is just barely big enough to fit the exact data
  • The column is over about 2048 bytes long
  • The column contains ASCII values > 127
  • The collation of the column is NOT LIKE '%UTF8' (i.e. it is something like SQL_Latin1_General... and not Latin1_General_100_CI_AS_SC_UTF8)
  • The connection options CharacterSet is not set or set to SQLSRV_ENC_CHAR (and not set to 'UTF-8')

In this case the column's binary data happens to be UTF-8, but I don't think that matters since all byte sequences are legal Latin1 and I think it's converting from suspected-Latin1 to UTF-8.

I suspect that some code that does dynamic allocation to fetch results is starting with a 2048 byte buffer and mis-handling the case where the Latin1->UTF-8 conversion causes the string to grow. In the VARCHAR(n) cases that same code probably does exact allocation and avoids the bug whenever n is big enough to contain the encoded result.

Expected behavior and actual behavior

Expected successful read of string data (possibly, in this case, with the 8-bit characters presumed to be Latin1 and actually turned into UTF-8 codepoints).

Actually got String data, right truncation

Repro code or steps to reproduce

Note that the 8 bit characters ñ represent n-with-tilde-on-top as raw UTF-8-expressed-as-Latin1

 $connInfo = array(
    'Database' => 'Test',
    'UID' => 'test',
    'PWD' => 'test',
    'LoginTimeout' => 5,
    'CharacterSet' => SQLSRV_ENC_CHAR
);

$conn = sqlsrv_connect($host, $connInfo);

# leading string >= 2045 leading to result length > 2048
# column must be VARCHAR(MAX) or VARCHAR(2048) (starts working with bigger VARCHAR(n), e.g. 2060)
# 'CharacterSet' connInfo must not be set to SQLSRV_ENC_CHAR (works with UTF-8)
# COLLATE must not be %UTF8% (e.g. Latin1_General_100_CI_AS_SC_UTF8 works)
$sql = "DROP TABLE IF EXISTS #tmpTest;
		SET NOCOUNT ON;
        DECLARE @val VARCHAR(8000) =  REPLICATE('a', 2045) + 'ñ';
        CREATE TABLE #tmpTest (testCol VARCHAR(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS);
        INSERT INTO #tmpTest (testCol) VALUES (@val);
        SELECT * from #tmpTest;";
$stmt = sqlsrv_query($conn, $sql);
$row = sqlsrv_fetch_array($stmt, SQLSRV_FETCH_ASSOC);
$errors = sqlsrv_errors(SQLSRV_ERR_ERRORS);
# String data, right truncation
var_dump($row, $errors);
@yitam
Copy link
Contributor

yitam commented May 25, 2021

Thanks @bjj for bringing this to our attention. We can reproduce it in Linux Ubuntu 18.04 as well and will get back to you after some investigation.

@yitam
Copy link
Contributor

yitam commented May 26, 2021

Hi @bjj , when connecting with CharacterSet set to SQLSRV_ENC_CHAR, the sqlsrv driver will not do any conversion for the characters. In Ubuntu (or most recent Linux distros), the default is UTF-8. Therefore, your php script is likely in a UTF-8 format. You can easily check by running file <filename>.

In your test scenario, because of the 8 bit character ñ, your php script should not be saved in UTF-8 format. In other words, make sure the output of the file command for your script looks something like this: sqlsrv_1261a.php: PHP script, ISO-8859 text.

Otherwise, if the output of file gives you PHP script, UTF-8 Unicode text, then you will get this string truncation issue. The reason is that the special A character is interpreted as a UTF-8 character that takes 4 bytes instead of 1 byte.

If you run this php script in command line, make sure your terminal uses the right character set and the correct locale has been generated and set as well (e.g. en_US.ISO-8859-1). If not, you might see a question mark instead of the special 8 bit character.

Once you make sure you have the right settings, you should not run into this truncation issue again.

@bjj
Copy link
Author

bjj commented May 27, 2021

Thanks for looking into this @yitam

when connecting with CharacterSet set to SQLSRV_ENC_CHAR, the sqlsrv driver will not do any conversion for the characters

I think that might be the root cause of the bug. It is doing conversion of the characters. You can easily observe (with a slightly modified version of that script) that there are 2 bytes of data in the table (which correspond to one desired UTF-8 codepoint) but because they get converted again to UTF-8 (as if they were something else, probably ISO-8859-1) they expand to 4 bytes (2 UTF-8 characters representing what the original binary happened to be). I have not found any settings that avoid this conversion on Linux. This same script works fine on Windows.

I created that minimum repro case just to report this problem. The encoding of that PHP script is not the cause of (or solution to) this problem. The original data that showed this issue was already in the database. It can be inserted into the database in other ways. You can verify that the correct bytes 195, 177 are at the end of the column with SELECT ASCII(RIGHT(testCol, 2)) as A, ASCII(RIGHT(testCol, 1)) as B FROM #tmpTest.

The reason is that the special A character is interpreted as a UTF-8 character that takes 4 bytes instead of 1 byte.

The special two binary bytes 195, 177 which are a valid UTF-8 string are interpreted as a valid ISO-8859-1 string and expanded to 4 bytes. That triggers some bug in sqlsrv. As I said, I think the root cause may well be that the change in encoding should not be happening (I can't find any way to turn it off on Linux, but on Windows it does not happen). I think something is assuming that there can be no change of encoding (due to SQLSRV_ENC_CHAR being selected) and allocating a fixed buffer length, but later something else erroneously changes the encoding, uses more bytes, runs out of space, and returns "right truncation"

@yitam
Copy link
Contributor

yitam commented May 27, 2021

Hi @bjj

I see your points, but because Linux uses UTF-8 as default please set your desired locale (this is not required in Windows) in your php script (must be done before you make the first connection):

$locale = "en_US.ISO-8859-1";       
setlocale(LC_ALL, $locale);

If you use both Windows and Linux, you may consider shifting to UTF-8.

@bjj
Copy link
Author

bjj commented May 27, 2021

Hi @yitam , sorry to let this comment thread get derailed by character encoding issues. While I do believe they are related, the core elements of this bug are:

  1. There is C++ calling into the ODBC API with a buffer with length chosen by the C++ code
  2. The ODBC API is returning an error to the C++ code based on that buffer length
  3. The C++ code is then letting this internal error leak out to the PHP layer

So while I think the C++ code in (1) is quite possibly choosing the wrong buffer length due to encoding assumption mismatches, that is not relevant to (3). I'm calling a PHP function whose contract is "return an arbitrary length string". There's no reason for that to return an error that effectively says "sorry, for reasons outside your control, the internal buffer allocated by the C++ implementation of this PHP function was not large enough to contain the string you wanted".

(also I did try your suggested setlocale idea with no change in observed behavior)

@yitam
Copy link
Contributor

yitam commented May 27, 2021

@bjj I see your points, but using setlocale() (must be set before making the first connection) seems to solve the issue on my side. Anyway, let me do some more investigations and get back to you.

@bjj
Copy link
Author

bjj commented May 27, 2021

using setlocale() (must be set before making the first connection) seems to solve the issue on my side

My mistake -- I did the test in a container that did not have en_US.ISO-8859-1 configured. I added it and I see now and it does make a difference.

@yitam
Copy link
Contributor

yitam commented May 28, 2021

Glad to hear that using setlocale() makes a difference for you @bjj :)

After consulting with the ODBC team, you might want to read the documentation about autotranslation when using different codepages

@bjj
Copy link
Author

bjj commented May 28, 2021

I would love to turn off AutoTranslate in this case, but that feature of ODBC is not exposed to sqlsrv_connect().

@yitam
Copy link
Contributor

yitam commented May 28, 2021

FYI, @bjj , ISO-8859-1 is not exactly the same as Windows 1252

If you're using both Windows and Linux or have no control how the data gets inserted, the clients and SQL Server might be using different ACPs. With AutoTranslate turned off, extended characters may be misinterpreted, as explained in Using Connection String Keywords.

The recommendation is to deliberately use setlocale in your php scripts (which some of our users do) or shift to UTF-8 to ensure data consistency.

@yitam
Copy link
Contributor

yitam commented Jun 3, 2021

@bjj FYI, in light of ODBC driver's autotranslation when using different codepages, we have decided to fix this issue in PHP drivers to make it more flexible in handling the possible UTF-8 characters when fetching ASCII data.

@yitam
Copy link
Contributor

yitam commented Sep 8, 2021

Closing this issue as per 5.10.0-beta1

@yitam yitam closed this as completed Sep 8, 2021
@acelaya
Copy link

acelaya commented Nov 4, 2021

I'm still getting this error when using 5.10.0-beta1 with PHP 8.1

https://github.com/shlinkio/shlink/runs/4110069594?check_suite_focus=true

@yitam
Copy link
Contributor

yitam commented Nov 4, 2021

Hi @acelaya can you provide repro steps or a script to help us investigate?

@acelaya
Copy link

acelaya commented Nov 5, 2021

Sure thing.

You can reproduce it on this repository: https://github.com/shlinkio/shlink

The workflow above is this one: https://github.com/shlinkio/shlink/blob/develop/.github/workflows/ci.yml#L179

It basically starts a SQL server instance via docker compose, and then runs some database integration tests (composer test:db:ms).

The jobs are run both with PHP 8 and pdo_sqlsrv 5.9, and with PHP 8.1 with pdo_sqlsrv 5.10 beta. In the second case they throw this error: Doctrine\DBAL\Exception\DriverException: An exception occurred while executing a query: SQLSTATE[22001]: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation.

The test failing is this one https://github.com/shlinkio/shlink/blob/develop/module/Core/test-db/Domain/Repository/DomainRepositoryTest.php, which covers this class https://github.com/shlinkio/shlink/blob/develop/module/Core/src/Domain/Repository/DomainRepository.php

I will try to isolate the failing query this afternoon after work, if the information above is not enough.

@yitam
Copy link
Contributor

yitam commented Nov 5, 2021

@acelaya, did you mean that with 5.10beta1 this exception is thrown but not with 5.9?
Please isolate the failing query and explain how to repro this by creating a new issue and filling in the questions.

@acelaya
Copy link

acelaya commented Nov 5, 2021

@acelaya, did you mean that with 5.10beta1 this exception is thrown but not with 5.9?

Yes, exactly.

Please isolate the failing query and explain how to repro this by creating a new issue and filling in the questions.

Perfect, will do.

@acelaya
Copy link

acelaya commented Nov 14, 2021

@yitam sorry for the delay, but I took me a lot of time to reproduce the issue on a small context, and I had to debug a lot on the DBAL library I use to figure out what was going on, but I have finally created the ticket #1329

I included a supre small repo with reproducible steps. I hope it simplifies debugging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants