"right truncation" error from sqlsrv_fetch_array of long VARCHAR(max) column containing 8-bit characters #1261

bjj · 2021-05-25T01:00:46Z

PHP Driver version or file name

sqlsrv support => enabled
ExtensionVer => 5.9.0+14119

Directive => Local Value => Master Value
sqlsrv.WarningsReturnAsErrors => On => On
sqlsrv.LogSeverity => 0 => 0
sqlsrv.LogSubsystems => 0 => 0
sqlsrv.ClientBufferMaxKBSize => 10240 => 10240
sqlsrv.SetLocaleInfo => 2 => 2

SQL Server version

15.0.2000.5 (earlier versions also repro)

Client operating system

Ubuntu 20.04.2 LTS

PHP version

PHP 7.4.3 (cli) (built: Oct 6 2020 15:47:56) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies
with Xdebug v2.9.2, Copyright (c) 2002-2020, by Derick Rethans

Microsoft ODBC Driver version

unixodbc-dev | 2.3.7 | https://packages.microsoft.com/ubuntu/20.04/prod focal/main amd64 Packages
unixodbc-dev | 2.3.6-0.1build1 | http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

Problem description

When using sqlsrv_fetch_array fails with String data, right truncation under the following conditions:

The column is type VARCHAR(max) or the column type is just barely big enough to fit the exact data
The column is over about 2048 bytes long
The column contains ASCII values > 127
The collation of the column is NOT LIKE '%UTF8' (i.e. it is something like SQL_Latin1_General... and not Latin1_General_100_CI_AS_SC_UTF8)
The connection options CharacterSet is not set or set to SQLSRV_ENC_CHAR (and not set to 'UTF-8')

In this case the column's binary data happens to be UTF-8, but I don't think that matters since all byte sequences are legal Latin1 and I think it's converting from suspected-Latin1 to UTF-8.

I suspect that some code that does dynamic allocation to fetch results is starting with a 2048 byte buffer and mis-handling the case where the Latin1->UTF-8 conversion causes the string to grow. In the VARCHAR(n) cases that same code probably does exact allocation and avoids the bug whenever n is big enough to contain the encoded result.

Expected behavior and actual behavior

Expected successful read of string data (possibly, in this case, with the 8-bit characters presumed to be Latin1 and actually turned into UTF-8 codepoints).

Actually got String data, right truncation

Repro code or steps to reproduce

Note that the 8 bit characters Ã± represent n-with-tilde-on-top as raw UTF-8-expressed-as-Latin1

 $connInfo = array(
    'Database' => 'Test',
    'UID' => 'test',
    'PWD' => 'test',
    'LoginTimeout' => 5,
    'CharacterSet' => SQLSRV_ENC_CHAR
);

$conn = sqlsrv_connect($host, $connInfo);

# leading string >= 2045 leading to result length > 2048
# column must be VARCHAR(MAX) or VARCHAR(2048) (starts working with bigger VARCHAR(n), e.g. 2060)
# 'CharacterSet' connInfo must not be set to SQLSRV_ENC_CHAR (works with UTF-8)
# COLLATE must not be %UTF8% (e.g. Latin1_General_100_CI_AS_SC_UTF8 works)
$sql = "DROP TABLE IF EXISTS #tmpTest;
		SET NOCOUNT ON;
        DECLARE @val VARCHAR(8000) =  REPLICATE('a', 2045) + 'Ã±';
        CREATE TABLE #tmpTest (testCol VARCHAR(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS);
        INSERT INTO #tmpTest (testCol) VALUES (@val);
        SELECT * from #tmpTest;";
$stmt = sqlsrv_query($conn, $sql);
$row = sqlsrv_fetch_array($stmt, SQLSRV_FETCH_ASSOC);
$errors = sqlsrv_errors(SQLSRV_ERR_ERRORS);
# String data, right truncation
var_dump($row, $errors);

The text was updated successfully, but these errors were encountered:

yitam · 2021-05-25T18:25:37Z

Thanks @bjj for bringing this to our attention. We can reproduce it in Linux Ubuntu 18.04 as well and will get back to you after some investigation.

yitam · 2021-05-26T22:33:57Z

Hi @bjj , when connecting with CharacterSet set to SQLSRV_ENC_CHAR, the sqlsrv driver will not do any conversion for the characters. In Ubuntu (or most recent Linux distros), the default is UTF-8. Therefore, your php script is likely in a UTF-8 format. You can easily check by running file <filename>.

In your test scenario, because of the 8 bit character Ã±, your php script should not be saved in UTF-8 format. In other words, make sure the output of the file command for your script looks something like this: sqlsrv_1261a.php: PHP script, ISO-8859 text.

Otherwise, if the output of file gives you PHP script, UTF-8 Unicode text, then you will get this string truncation issue. The reason is that the special A character is interpreted as a UTF-8 character that takes 4 bytes instead of 1 byte.

If you run this php script in command line, make sure your terminal uses the right character set and the correct locale has been generated and set as well (e.g. en_US.ISO-8859-1). If not, you might see a question mark instead of the special 8 bit character.

Once you make sure you have the right settings, you should not run into this truncation issue again.

bjj · 2021-05-27T00:49:35Z

Thanks for looking into this @yitam

when connecting with CharacterSet set to SQLSRV_ENC_CHAR, the sqlsrv driver will not do any conversion for the characters

I think that might be the root cause of the bug. It is doing conversion of the characters. You can easily observe (with a slightly modified version of that script) that there are 2 bytes of data in the table (which correspond to one desired UTF-8 codepoint) but because they get converted again to UTF-8 (as if they were something else, probably ISO-8859-1) they expand to 4 bytes (2 UTF-8 characters representing what the original binary happened to be). I have not found any settings that avoid this conversion on Linux. This same script works fine on Windows.

I created that minimum repro case just to report this problem. The encoding of that PHP script is not the cause of (or solution to) this problem. The original data that showed this issue was already in the database. It can be inserted into the database in other ways. You can verify that the correct bytes 195, 177 are at the end of the column with SELECT ASCII(RIGHT(testCol, 2)) as A, ASCII(RIGHT(testCol, 1)) as B FROM #tmpTest.

The reason is that the special A character is interpreted as a UTF-8 character that takes 4 bytes instead of 1 byte.

The special two binary bytes 195, 177 which are a valid UTF-8 string are interpreted as a valid ISO-8859-1 string and expanded to 4 bytes. That triggers some bug in sqlsrv. As I said, I think the root cause may well be that the change in encoding should not be happening (I can't find any way to turn it off on Linux, but on Windows it does not happen). I think something is assuming that there can be no change of encoding (due to SQLSRV_ENC_CHAR being selected) and allocating a fixed buffer length, but later something else erroneously changes the encoding, uses more bytes, runs out of space, and returns "right truncation"

yitam · 2021-05-27T16:14:46Z

Hi @bjj

I see your points, but because Linux uses UTF-8 as default please set your desired locale (this is not required in Windows) in your php script (must be done before you make the first connection):

$locale = "en_US.ISO-8859-1";       
setlocale(LC_ALL, $locale);

If you use both Windows and Linux, you may consider shifting to UTF-8.

bjj · 2021-05-27T21:23:35Z

Hi @yitam , sorry to let this comment thread get derailed by character encoding issues. While I do believe they are related, the core elements of this bug are:

There is C++ calling into the ODBC API with a buffer with length chosen by the C++ code
The ODBC API is returning an error to the C++ code based on that buffer length
The C++ code is then letting this internal error leak out to the PHP layer

So while I think the C++ code in (1) is quite possibly choosing the wrong buffer length due to encoding assumption mismatches, that is not relevant to (3). I'm calling a PHP function whose contract is "return an arbitrary length string". There's no reason for that to return an error that effectively says "sorry, for reasons outside your control, the internal buffer allocated by the C++ implementation of this PHP function was not large enough to contain the string you wanted".

(also I did try your suggested setlocale idea with no change in observed behavior)

yitam · 2021-05-27T22:38:31Z

@bjj I see your points, but using setlocale() (must be set before making the first connection) seems to solve the issue on my side. Anyway, let me do some more investigations and get back to you.

bjj · 2021-05-27T22:49:17Z

using setlocale() (must be set before making the first connection) seems to solve the issue on my side

My mistake -- I did the test in a container that did not have en_US.ISO-8859-1 configured. I added it and I see now and it does make a difference.

yitam · 2021-05-28T16:46:57Z

Glad to hear that using setlocale() makes a difference for you @bjj :)

After consulting with the ODBC team, you might want to read the documentation about autotranslation when using different codepages

bjj · 2021-05-28T18:11:05Z

I would love to turn off AutoTranslate in this case, but that feature of ODBC is not exposed to sqlsrv_connect().

yitam · 2021-05-28T18:29:42Z

FYI, @bjj , ISO-8859-1 is not exactly the same as Windows 1252

If you're using both Windows and Linux or have no control how the data gets inserted, the clients and SQL Server might be using different ACPs. With AutoTranslate turned off, extended characters may be misinterpreted, as explained in Using Connection String Keywords.

The recommendation is to deliberately use setlocale in your php scripts (which some of our users do) or shift to UTF-8 to ensure data consistency.

yitam · 2021-06-03T20:34:09Z

@bjj FYI, in light of ODBC driver's autotranslation when using different codepages, we have decided to fix this issue in PHP drivers to make it more flexible in handling the possible UTF-8 characters when fetching ASCII data.

yitam · 2021-09-08T18:49:12Z

Closing this issue as per 5.10.0-beta1

acelaya · 2021-11-04T20:28:04Z

I'm still getting this error when using 5.10.0-beta1 with PHP 8.1

https://github.com/shlinkio/shlink/runs/4110069594?check_suite_focus=true

yitam · 2021-11-04T21:33:57Z

Hi @acelaya can you provide repro steps or a script to help us investigate?

acelaya · 2021-11-05T06:09:34Z

Sure thing.

You can reproduce it on this repository: https://github.com/shlinkio/shlink

The workflow above is this one: https://github.com/shlinkio/shlink/blob/develop/.github/workflows/ci.yml#L179

It basically starts a SQL server instance via docker compose, and then runs some database integration tests (composer test:db:ms).

The jobs are run both with PHP 8 and pdo_sqlsrv 5.9, and with PHP 8.1 with pdo_sqlsrv 5.10 beta. In the second case they throw this error: Doctrine\DBAL\Exception\DriverException: An exception occurred while executing a query: SQLSTATE[22001]: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation.

The test failing is this one https://github.com/shlinkio/shlink/blob/develop/module/Core/test-db/Domain/Repository/DomainRepositoryTest.php, which covers this class https://github.com/shlinkio/shlink/blob/develop/module/Core/src/Domain/Repository/DomainRepository.php

I will try to isolate the failing query this afternoon after work, if the information above is not enough.

yitam · 2021-11-05T15:41:03Z

@acelaya, did you mean that with 5.10beta1 this exception is thrown but not with 5.9?
Please isolate the failing query and explain how to repro this by creating a new issue and filling in the questions.

acelaya · 2021-11-05T15:48:28Z

@acelaya, did you mean that with 5.10beta1 this exception is thrown but not with 5.9?

Yes, exactly.

Please isolate the failing query and explain how to repro this by creating a new issue and filling in the questions.

Perfect, will do.

acelaya · 2021-11-14T12:29:57Z

@yitam sorry for the delay, but I took me a lot of time to reproduce the issue on a small context, and I had to debug a lot on the DBAL library I use to figure out what was going on, but I have finally created the ticket #1329

I included a supre small repo with reproducible steps. I hope it simplifies debugging it.

yitam added Linux SQLSRV labels May 26, 2021

yitam mentioned this issue Jun 8, 2021

Simplified get_field_as_string and made it more robust #1265

Merged

yitam closed this as completed Sep 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"right truncation" error from sqlsrv_fetch_array of long VARCHAR(max) column containing 8-bit characters #1261

"right truncation" error from sqlsrv_fetch_array of long VARCHAR(max) column containing 8-bit characters #1261

bjj commented May 25, 2021

yitam commented May 25, 2021

yitam commented May 26, 2021

bjj commented May 27, 2021

yitam commented May 27, 2021 •

edited

Loading

bjj commented May 27, 2021

yitam commented May 27, 2021

bjj commented May 27, 2021

yitam commented May 28, 2021 •

edited

Loading

bjj commented May 28, 2021

yitam commented May 28, 2021 •

edited

Loading

yitam commented Jun 3, 2021 •

edited

Loading

yitam commented Sep 8, 2021

acelaya commented Nov 4, 2021

yitam commented Nov 4, 2021

acelaya commented Nov 5, 2021

yitam commented Nov 5, 2021

acelaya commented Nov 5, 2021

acelaya commented Nov 14, 2021

"right truncation" error from sqlsrv_fetch_array of long VARCHAR(max) column containing 8-bit characters #1261

"right truncation" error from sqlsrv_fetch_array of long VARCHAR(max) column containing 8-bit characters #1261

Comments

bjj commented May 25, 2021

PHP Driver version or file name

SQL Server version

Client operating system

PHP version

Microsoft ODBC Driver version

Problem description

Expected behavior and actual behavior

Repro code or steps to reproduce

yitam commented May 25, 2021

yitam commented May 26, 2021

bjj commented May 27, 2021

yitam commented May 27, 2021 • edited Loading

bjj commented May 27, 2021

yitam commented May 27, 2021

bjj commented May 27, 2021

yitam commented May 28, 2021 • edited Loading

bjj commented May 28, 2021

yitam commented May 28, 2021 • edited Loading

yitam commented Jun 3, 2021 • edited Loading

yitam commented Sep 8, 2021

acelaya commented Nov 4, 2021

yitam commented Nov 4, 2021

acelaya commented Nov 5, 2021

yitam commented Nov 5, 2021

acelaya commented Nov 5, 2021

acelaya commented Nov 14, 2021

yitam commented May 27, 2021 •

edited

Loading

yitam commented May 28, 2021 •

edited

Loading

yitam commented May 28, 2021 •

edited

Loading

yitam commented Jun 3, 2021 •

edited

Loading