Unicode conversion rework #278

traceon · 2020-03-13T22:49:00Z

Add ICU drop-in support (non-MSVC platforms only; system lib only)
Factored out to dedicated headers:
utils/amortized_istream_reader.h
utils/resize_without_initialization.h
utils/object_pool.h
folly switched to a repo with my custom changes that enable resize-without-initialization to work on all target platforms
Implemented:
utils/string_pool.h
utils/conversion_context.h
utils/unicode_converter.h
Separated utils/unicode_conv.h into:
driver/utils/conversion.h (common code)
driver/utils/conversion_std.h (default for MSVC)
driver/utils/conversion_icu.h (default for all non-MSVC, not implemented for MSVC)
The main conversion work is done in utils/unicode_converter.h. Conversions are performed via a pivot encoding, with some possible shortcuts (the pivot is hardcoded in ICU: UTF-16 placed in UChar arrays). Main directions:
application narrow-char <--(ICU pivot)--> driver internal narrow char
application wide-char <--(ICU pivot)--> driver internal narrow char
data source narrow-char <--(ICU pivot)--> driver internal narrow char
This is a first step in the transition to an ICU-only implementation, so some code may seem redundant, and some further changes may seem to be missing. Most of them will fall into place when the transition is completed in subsequent iterations.

Move ObjectPool to driver/utils/object_pool.h Move resize_without_initialization() to driver/utils/resize_without_initialization.h

Remove unused code Refactor some code to avoid duplication

Implemented (faster) StringLengthUTF8

Fix char size/converter dispatching code

Fix macOS cmake configure command line to include paths to openssl and icu brew installations

Fix target buffer growing calculation Implement implicit pivoting code that uses ucnv_convertEx() [slower?]

Enmk · 2020-04-03T15:12:01Z

Well, a few questions:

driver/utils/conversion_std.h (MSVC only)

Why not conversion_msvc.h ?

conversions are performed via pivot encoding

What is pivot encoding here? Do I get it right, that you usually convert via some intermediate encoding? If yes, why?

traceon · 2020-04-03T16:23:52Z

Why not conversion_msvc.h ?

Because it can still be used with any other compiler, via a single macro switch.

What is pivot encoding here? Do I get it right, that you usually convert via some intermediate encoding? If yes, why?

ICU's converter-of-X only converts from X to its pivot, and from its pivot to X. ICU uses a hardcoded pivot, which is UTF-16 in UChar. Changing it to UTF-8 may speed things up in our case, but will require a custom-built ICU. This is planned for the next change. Everything inside the driver is represented in UTF-8.
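To make the pivot idea concrete, here is a minimal sketch that does not use ICU at all. It assumes an ISO-8859-1 source and a UTF-8 destination, and the function names (latin1ToPivot, pivotToUtf8, convertViaPivot) are made up for illustration. Each "converter" only knows how to go to or from the UTF-16 pivot, and a full conversion chains the two halves:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for "converter of ISO-8859-1": source bytes -> UTF-16 pivot.
std::u16string latin1ToPivot(std::string_view src) {
    std::u16string pivot;
    pivot.reserve(src.size());
    for (unsigned char byte : src)
        pivot.push_back(static_cast<char16_t>(byte)); // Latin-1 maps 1:1 onto U+0000..U+00FF
    return pivot;
}

// Hypothetical stand-in for "converter of UTF-8": UTF-16 pivot -> destination bytes.
std::string pivotToUtf8(std::u16string_view pivot) {
    std::string out;
    for (std::size_t i = 0; i < pivot.size(); ++i) {
        char32_t cp = pivot[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < pivot.size() && pivot[i + 1] >= 0xDC00 && pivot[i + 1] <= 0xDFFF) {
            // Combine a valid surrogate pair into a single code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (pivot[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate: emit the replacement character
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Full conversion: source encoding -> pivot -> destination encoding.
std::string convertViaPivot(std::string_view latin1) {
    return pivotToUtf8(latin1ToPivot(latin1));
}
```

With N encodings this needs N converter pairs rather than N*N direct converters, at the cost of an intermediate buffer per conversion.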

Enmk

Well, having a dependency on a private fork of folly feels spooky and fragile.
Also, please try really hard not to abuse templates and inline functions. That leads to an overcomplicated, tightly coupled system. Not every piece is performance-critical, and not every function call should be inlined.

Please try really hard to keep it as simple as possible; simplicity is king. It makes it so much easier to review, add new features, and debug software.

Enmk · 2020-04-02T13:44:07Z

.gitmodules

@@ -16,5 +16,5 @@
 	branch = master
 [submodule "contrib/folly"]
 	path = contrib/folly
-	url = https://github.com/facebook/folly.git
-	branch = master
+	url = https://github.com/traceon/folly.git


Any chance that these fixes are going to land in the official repo?

Yes, hopefully they will be merged soon.

Enmk · 2020-04-03T15:15:31Z

CMakeLists.txt

@@ -137,6 +139,13 @@ if (NOT CH_ODBC_PREFER_BUNDLED_FOLLY)
 #   find_package (Folly)
 endif ()

+if (CH_ODBC_USE_ICU)


For some reason, cmake wasn't able to find icu installed via homebrew on my mac. Is there any workaround/fix for that?

You need to specify the path manually; it's a known issue of FindICU + homebrew.
Use this command (as per README.md):

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DOPENSSL_ROOT_DIR=$(brew --prefix)/opt/openssl -DICU_ROOT=$(brew --prefix)/opt/icu4c ..

driver/CMakeLists.txt

Enmk · 2020-04-03T19:01:35Z

driver/utils/conversion_context.h

+public:
+    StringPool string_pool{10};
+
+    UnicodeConverter application_wide_char_converter;


Could you please describe why you need 4 instances of converters?

As I wrote in the description, conversions take place between the following [independently configured/hardcoded] encodings:

application narrow-char <--(ICU pivot)--> driver internal narrow char
application wide-char <--(ICU pivot)--> driver internal narrow char
data source narrow-char <--(ICU pivot)--> driver internal narrow char

So, as you can see, 4 total converters are involved:

application narrow-char

application wide-char

data source narrow-char

driver internal narrow char

Each of these converters converts from the specified encoding to the ICU pivot, and in the opposite direction.

driver/utils/conversion_icu.h

Enmk · 2020-04-03T20:13:05Z

driver/utils/unicode_converter.h

+// Return the same string with signature prefix removed, if it matches the provided signature/BOM byte array.
+// Even though the signature is specified in bytes, it will be considered matching only if it matches up to some character boundary.
+template <typename CharType>
+inline auto consumeSignature(std::basic_string_view<CharType> str, const std::string_view & signature) {


why not just removeBOM ?

Do you mean some known removeBOM function, or are you suggesting a naming change?
"Signature" is a more correct term than "BOM". "Consume" because you usually "consume" from the beginning of a buffer/string/sequence, whereas "remove" could mean removing from anywhere.
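For illustration, a simplified sketch of the "consume from the front" behavior follows. It is not the PR's consumeSignature(); the name consumeSignatureSketch is hypothetical, and the character-boundary rule is my reading of the comment in the diff: a byte signature only matches if its length is a whole number of CharType units.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical sketch: returns str without the leading signature, if present.
// The signature is given in bytes, but it only matches when its length is a
// whole number of CharType units (assumed meaning of "character boundary").
template <typename CharType>
std::basic_string_view<CharType> consumeSignatureSketch(
    std::basic_string_view<CharType> str, std::string_view signature) {
    const std::size_t sig_chars = signature.size() / sizeof(CharType);
    const bool on_boundary = signature.size() % sizeof(CharType) == 0;

    if (!signature.empty() && on_boundary && str.size() >= sig_chars
        && std::memcmp(str.data(), signature.data(), signature.size()) == 0) {
        str.remove_prefix(sig_chars); // only the front of the view is ever touched
    }
    return str;
}
```

Because the function takes and returns a view, nothing is copied: the caller gets the same buffer with its start advanced past the signature, or the unchanged view when there is no match.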

Enmk · 2020-04-03T20:13:51Z

driver/utils/unicode_converter.h

+inline const std::string converter_pivot_wide_char_encoding = "UTF-16";
+
+template <typename CharType>
+inline auto make_raw_str(std::initializer_list<CharType> && list) {


why is that snake_case while almost every other function is lowerCamelCase ?

Yeah, that's intentional, kind of mimicking std convention here.

Enmk · 2020-04-03T20:15:29Z

driver/utils/unicode_converter.h

+    return convertEncoding(src_converter, make_string_view(src), pivot, dest_converter, dest, ensure_src_signature, trim_dest_signature);
+}
+
+class UnicodeConverter {


I believe there should be a class-level comment describing what it converts, and to what.

Enmk · 2020-04-03T20:17:06Z

driver/utils/unicode_converter.h

+
+class UnicodeConverter {
+public:
+    explicit inline UnicodeConverter(const std::string & encoding);


It is not clear what kind of encoding this is: source, destination, transit, or anything else.

It is the converter encoding. This is a wrapper around ICU's converter with "encoding" having the same meaning as the encoding of such converter.

Enmk · 2020-04-03T20:28:41Z

driver/utils/unicode_converter.h

+    std::size_t pivot_signatures_to_trim_max_size_ = 0;
+};
+
+inline UnicodeConverter::UnicodeConverter(const std::string & encoding) {


I really do not think that inlining a function this big is a good idea:

first of all, it is much less likely to actually be inlined, due to its size

even if it is inlined, it will generate a HUGE amount of extra code in many translation units.

if it is not inlined, it will add more work to the link stage.

even if it is inlined, there will be no practical performance benefit from inlining, since that would save you what? My estimate is approx. 0.1-0.2% of execution time.

it also makes implementation details like sameEncoding, make_raw_str, consumeSignature, etc. leak into the common namespace.

because UnicodeConverter is header-only, its implementation details are spilled all over the source base of the entire project. That makes the project so tightly coupled that (as it looks right now) any change in ICU (or an attempt to replace it with something else) will ripple through the whole project.

I will move everything that can be placed in .cpp files there. However, inlining here is more of an attempt to help the compiler do deeper static analysis during optimization. In fact, most modern compilers will still inline this code even if it is put in a .cpp file, thanks to LTO. So the size of the binary and other things will be the same. Compilation time may be slightly better, but there is also a slightly higher risk of slightly worse optimization if the compiler and linker are not fresh enough.

utils/unicode_converter.cpp utils/conversion_context.cpp utils/utils.h

driver/utils/unicode_converter.h

Enmk · 2020-04-09T17:37:31Z

driver/utils/unicode_converter.h

+#include <vector>
+
+using ConverterPivotWideCharType = UChar;
+inline const std::string converter_pivot_wide_char_encoding = "UTF-16";


should be moved to cpp

I think it's better to keep this in one place, that will make it easier to read.

Enmk

Lots of wizardry, too clever to be easily understood or fixed/updated by anybody else.

And it still leaks too many ICU implementation details to the rest of the code.
IMO, Converter should have a really simple interface:

// not including ICU's headers, but forward-declaring the necessary stuff
struct UConverter;

class UnicodeConverter
{
public:
	// the encoding dictates what is located inside, what the byte width of a character is, etc. Input data is just treated as bytes of the assumed encoding.
	void convert(Encoding from_encoding, std::basic_string<char> const & from, Encoding to_encoding, std::basic_string<char> & to);
	void convert(Encoding from_encoding, const char * from, size_t from_length, Encoding to_encoding, std::basic_string<char> & to);
	// or something similar...

private:
	UConverter * converter_;
	// other internal things
};
// the implementation in the .cpp can use any trickery necessary to achieve the desired goal, but in isolation, shielding the rest of the application from the complexity of ICU's inner workings

The way I see it, there are only two encodings: the one that is stored internally (presumably it matches the one sent by the server) and the one the client wants. There is no pivot in the domain.

I also don't get why the notion of a pivot should be spread outside of the UnicodeConverter implementation (right now it is in 5 files, while it should be in 1). Does that help in reducing some common expensive operation, like a needless conversion?

I do recall you saying that we are going to have a custom-built ICU soon and hence stop messing with the pivot encoding, so why not abstract it away now?

Another thing: I see that you have a nifty ObjectPool class, but you use it to cache strings and vectors. How heavy is the constructor of UnicodeConverter? I imagine it is ten to a hundred times more expensive than allocating (and filling) memory for a string or vector... Would it be a good idea to reuse converters via ObjectPools?
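To illustrate what reusing converters through a pool could look like, here is a minimal single-threaded sketch. It is not the PR's ObjectPool: SimplePool and ExpensiveConverter are hypothetical names, and the counter exists only to show that a second get() does not construct a new object.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an object with a costly constructor (e.g. a converter).
struct ExpensiveConverter {
    explicit ExpensiveConverter(std::string enc) : encoding(std::move(enc)) { ++constructed; }
    std::string encoding;
    static inline int constructed = 0; // counts constructor calls, for demonstration only
};

// Minimal pool: get() hands out a cached object if there is one, else constructs it.
// Constructor arguments are ignored on reuse, so one pool should serve one configuration.
template <typename T>
class SimplePool {
public:
    explicit SimplePool(std::size_t max_cached) : max_cached_(max_cached) {}

    template <typename... Args>
    std::unique_ptr<T> get(Args &&... args) {
        if (!cache_.empty()) {
            auto obj = std::move(cache_.back());
            cache_.pop_back();
            return obj;
        }
        return std::make_unique<T>(std::forward<Args>(args)...);
    }

    // Returns an object to the pool; it is destroyed instead if the cache is full.
    void put(std::unique_ptr<T> obj) {
        if (obj && cache_.size() < max_cached_)
            cache_.push_back(std::move(obj));
    }

private:
    std::size_t max_cached_;
    std::vector<std::unique_ptr<T>> cache_;
};
```

A stateful converter would also need to be reset before reuse, which this sketch leaves out.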

Enmk · 2020-04-10T12:24:11Z

driver/utils/unicode_converter.h

+    const bool ensure_encoded_signature,
+    const bool trim_pivot_signature
+) {
+    pivot.clear();


Please add a simplified plan/high-level overview of what is happening here in an overly simplified pseudocode or just plain English. Hope it would make the reader's life easier.

Enmk · 2020-04-10T12:24:44Z

driver/utils/unicode_converter.h

+    const bool ensure_pivot_signature,
+    const bool trim_encoded_signature
+) {
+    encoded.clear();


Please add a simplified plan/high-level overview of what is happening here in overly simplified pseudocode or just plain English. Hope it would make the reader's life easier.

Enmk · 2020-04-10T12:26:02Z

driver/utils/unicode_converter.h

+    src_converter.convertToPivot(src, pivot, ensure_src_signature, false);
+    dest_converter.convertFromPivot(make_string_view(pivot), dest, true, trim_dest_signature);
+#else
+    dest.clear();


Please add a simplified plan/high-level overview of what is happening here in overly simplified pseudocode or just plain English.
Plus a note why you need a separate function and how it is different from convertToPivot and convertFromPivot.

Add comments for central conversion functions

Enmk

Ok

traceon mentioned this pull request Mar 19, 2020

Store and pass stateful converters in conversion contexts #280

Merged

traceon added 14 commits March 20, 2020 18:59

Move AmortizedIStreamReader to driver/utils/amortized_istream_reader.h

db9a97c

Move ObjectPool to driver/utils/object_pool.h Move resize_without_initialization() to driver/utils/resize_without_initialization.h

WIP: detect and use system ICU

31484c9

Move unicode_conv.h to unicode_conv_std.hpp

0b25b4b

Add unicode_conv_icu.cpp stub

459d9d8

Smoother ICU integration into the build system

bbf54cd

Disable CH_ODBC_PREFER_BUNDLED_ICU by default

ef8a32b

ICU Unicode conversion drop-in implementation

9e3e339

Use fixed NTSBufferLength()

c95cfb2

Remove unused stuff

3371522

Fix naming

2582b10

Use ICU

bffba43

Add proper dispatching according to the expected character sizes

10b4ed2

Compilation fix

ea861e0

Fix bug

4cdc3e1

Remove unused code Refactor some code to avoid duplication

traceon force-pushed the unicode-perf-fix branch from 3890217 to 4cdc3e1 March 20, 2020 15:00

traceon added 9 commits March 21, 2020 18:03

Naming changes

b3443dc

Implemented (faster) StringLengthUTF8

Code cleanup

9562c0e

Fix char size/converter dispatching code

Choose default wide-char encoding according to platform specs

aee8fac

Change to typedefed char type

89e4495

WIP: debugging mac builds

dbc16b8

WIP: debugging mac builds

2ced750

WIP: debugging mac builds

3acf4fa

Move common code to parent header

f840b69

Always set ICU root path for macOS builds

a5f88bc

traceon closed this Mar 21, 2020

traceon reopened this Mar 21, 2020

traceon added 3 commits March 22, 2020 03:54

Fixed macOS builds

60af4c9

Delete folly

2b9cc1e

Add internal folly fork

2617708

traceon added 6 commits March 30, 2020 18:10

Add icu requirements

e32793e

Fix macOS cmake configure command line to include paths to openssl and icu brew installations

Fix signature detection

24eba04

Fix typo

b4a9f02

Fix typo

ea291fc

Fix typo

32ec45a

Fix duplicate BOM filtering

1395df5

traceon changed the title ~~WIP: Unicode conversion rework~~ Unicode conversion rework Mar 30, 2020

traceon requested a review from Enmk March 31, 2020 12:12

traceon added 4 commits March 31, 2020 19:08

Move convertEncoding to unicode_converter.h

1243651

WIP prepare for WORKAROUND_ICU_USE_EXPLICIT_PIVOTING

691ddee

Fix compilation

79df886

Fix target buffer growing calculation Implement implicit pivoting code that uses ucnv_convertEx() [slower?]

Adjust folly temporary branch

7b85aec

Enmk requested changes Apr 3, 2020

traceon added 5 commits April 6, 2020 20:06

Add comments

ed24f4b

Move code from original places to:

009ffdf

utils/unicode_converter.cpp utils/conversion_context.cpp utils/utils.h

Windows compilation fix

c80519d

Switch to official folly repo

eed69f8

Tabs to spaces

1eaf95c

Enmk reviewed Apr 9, 2020

Enmk reviewed Apr 9, 2020

Enmk reviewed Apr 10, 2020

Delete UnicodeConverter copy c-tor and assignment operator

bf4f2ba

Add comments for central conversion functions

Enmk approved these changes Apr 13, 2020

Enmk merged commit 1a2ac66 into ClickHouse:master Apr 13, 2020

traceon deleted the unicode-perf-fix branch April 13, 2020 12:15
