-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode conversion rework #278
Conversation
Move ObjectPool to driver/utils/object_pool.h Move resize_without_initialization() to driver/utils/resize_without_initialization.h
3890217
to
4cdc3e1
Compare
Implemented (faster) StringLengthUTF8
Fix char size/converter dispatching code
Fix macOS cmake configure command line to include paths to openssl and icu brew installations
Fix target buffer growing calculation Implement implicit pivoting code that uses ucnv_convertEx() [slower?]
well, few questins:
Why not
What is pivot encoding here? Do I get it right, that you usually convert via some intermediate encoding? If yes, why? |
Because it can be still used with any other compiler, with one single macro switch.
ICU's converter-of-X only converts from X to its pivot, and from its pivot to X. ICU uses a hardcoded pivot, which is UTF-16 in UChar. Changing it to UTF-8 may speed up things in our case, but will require custom built ICU. This is planned for the next change. Everything inside the driver is represented in UTF-8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, having a dependency on private fork of folly feels spooky and fragile.
Also, please try really hard not to abuse templates and inline functions. That leads to an overcomplicated tightly coupled system. Not every piece is performance-critical, and not every function call should be inlined.
Please try really hard to keep it as simple as possible, simplicity is a king. It makes it so much easier to review, add new features, and debug software.
.gitmodules
Outdated
@@ -16,5 +16,5 @@ | |||
branch = master | |||
[submodule "contrib/folly"] | |||
path = contrib/folly | |||
url = https://github.com/facebook/folly.git | |||
branch = master | |||
url = https://github.com/traceon/folly.git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance that these fixes are going to land to official repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, hopefully will be merged soon.
@@ -137,6 +139,13 @@ if (NOT CH_ODBC_PREFER_BUNDLED_FOLLY) | |||
# find_package (Folly) | |||
endif () | |||
|
|||
if (CH_ODBC_USE_ICU) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, cmake
wasn't able to find icu
installed via homebrew on my mac. Is there any workaround/fix for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to specify the path manually, its a known issue of FindICU
+ homebrew.
Use this command (as per README.md
):
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DOPENSSL_ROOT_DIR=$(brew --prefix)/opt/openssl -DICU_ROOT=$(brew --prefix)/opt/icu4c ..
public: | ||
StringPool string_pool{10}; | ||
|
||
UnicodeConverter application_wide_char_converter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please describe why you need 4 instances of converters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I wrote in the description, conversions took place between the following [independently configured/hardcoded] encodings:
application narrow-char <--(ICU pivot)--> driver internal narrow char
application wide-char <--(ICU pivot)--> driver internal narrow char
data source narrow-char <--(ICU pivot)--> driver internal narrow char
So, as you can see, 4 total converters are involved:
- application narrow-char
- application wide-char
- data source narrow-char
- driver internal narrow char
Each of these converters convert from the specified encoding to ICU pivot, and in opposite direction.
// Return the same string with signature prefix removed, if it matches the provided signature/BOM byte array. | ||
// Even though the signature is specified in bytes, it will be considered matching only if it matches up to some character boundary. | ||
template <typename CharType> | ||
inline auto consumeSignature(std::basic_string_view<CharType> str, const std::string_view & signature) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just removeBOM
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean some known removeBOM
function or you suggest naming change?
"Signature" is more correct term than "BOM." "Consume" because, you usually "consume" from the beginning of a buffer/string/sequence, whereas "remove" could mean remove from anywhere.
driver/utils/unicode_converter.h
Outdated
inline const std::string converter_pivot_wide_char_encoding = "UTF-16"; | ||
|
||
template <typename CharType> | ||
inline auto make_raw_str(std::initializer_list<CharType> && list) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is that snake_case
while almost every other function is lowerCamelCase
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's intentional, kind of mimicking std convention here.
return convertEncoding(src_converter, make_string_view(src), pivot, dest_converter, dest, ensure_src_signature, trim_dest_signature); | ||
} | ||
|
||
class UnicodeConverter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there should be a class-level comment, describing what does it converts and to what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/utils/unicode_converter.h
Outdated
|
||
class UnicodeConverter { | ||
public: | ||
explicit inline UnicodeConverter(const std::string & encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear what kind of encoding is: source, destination, transit, or anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the converter encoding. This is a wrapper around ICU's converter with "encoding" having the same meaning as the encoding of such converter.
driver/utils/unicode_converter.h
Outdated
std::size_t pivot_signatures_to_trim_max_size_ = 0; | ||
}; | ||
|
||
inline UnicodeConverter::UnicodeConverter(const std::string & encoding) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not think that inlining function this big is a good idea:
- first of all, it is so less likely to be actually inlined due to the size
- even if it is inlined, it will generate HUGE amount of extra code in many translation units.
- if it is not inlined it will add more work to the link stage.
- even if it is inlined, there will be no practical benefit performance-wise from inlining, since that would save you what? My estimate is approx 0.1-0.2% of execution time.
- that also leads to implementation details like
sameEncoding
,make_raw_str
,consumeSignature
, etc. leak to the common namespace. - due to
UnicodeConverter
being header-only, it's implementation details are spilled all over the source base of the entire project. That produces project so coupled so highly that (as it looks like right now) any change in ICU (or attempt to replace it with something else) will ripple through the whole project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will move everything that can be placed in .cpp files, there. However, inlining here is more of an attempt to help the compiler to do deeper static analysis during optimization. In fact, most of the modern compilers will still inline this code, even if it is put in .cpp, thanks to LTO. So, size of the binary, and other things will be the same. Maybe slightly better compilation time, however, also slightly higher risks of slightly worse optimizations, if compiler+linker are not fresh enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
utils/unicode_converter.cpp utils/conversion_context.cpp utils/utils.h
#include <vector> | ||
|
||
using ConverterPivotWideCharType = UChar; | ||
inline const std::string converter_pivot_wide_char_encoding = "UTF-16"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be moved to cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to keep this in one place, that will make it easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of wizzardy too clever to be easily understood or fixed/updated by anybody else.
And still leaks too much of ICU implementation details to the rest of the code.
IMO, Converter should have some really simple interface:
// not including ICU's headers, but forward-declaring necessary stuff
struct UConverter;
class UnicodeConverter
{
// where encoding dictates what is located inside, what is the byte-width of the character, etc. input data is just threated as bytes of assumed encoding.
void convert(Encoding from, std::basic_string<char> const & from, Encoding to, std::basic_string<char> & to);
void convert(Encoding from, const char * from, size_t from_length, Encoding to, std::basic_string<char> & to);
// or something similar...
private:
UConverter * converter_;
// other internal things
};
// implementation in cpp can use any trickery necessary to achieve the desired goal, but in an isolated mode, shielding rest of the application from the complexity of ICU inner workings
The way I see it there are only two encodings: one that is stored internally (presumably it matches one sent by the server) and the one client wants. There is no pivot in the domain.
And I don't get it why the notion of Pivot should be spread outside of UnicodeConverter implementation (right now it is in 5 files, while it should be in 1)? Does that help in reducing some common expensive operation, like needless conversion?
I do recall that you've said that we are going to have a custom-build-ICU soon and hence stop messing with Pivot encoding, why not abstract it away now?
Another thing: I see that you have a nifty ObjectPool
class, but you use it to cache strings and vectors. How heavy is the constructor of UnicodeConverter
? I imagine it is ten-to-hundred times more expensive than allocating (and filling) memory for string
or vector
... Would it be a good idea to reuse converters via ObjectPools
?
const bool ensure_encoded_signature, | ||
const bool trim_pivot_signature | ||
) { | ||
pivot.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a simplified plan/high-level overview of what is happening here in an overly simplified pseudocode or just plain English. Hope it would make the reader's life easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
const bool ensure_pivot_signature, | ||
const bool trim_encoded_signature | ||
) { | ||
encoded.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a simplified plan/high-level overview of what is happening here in overly simplified pseudocode or just plain English. Hope it would make the reader's life easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src_converter.convertToPivot(src, pivot, ensure_src_signature, false); | ||
dest_converter.convertFromPivot(make_string_view(pivot), dest, true, trim_dest_signature); | ||
#else | ||
dest.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a simplified plan/high-level overview of what is happening here in overly simplified pseudocode or just plain English.
Plus a note why you need a separate function and how it is different from convertToPivot
and convertFromPivot
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Add comments for central conversion functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
utils/amortized_istream_reader.h
utils/resize_without_initialization.h
utils/object_pool.h
utils/string_pool.h
utils/conversion_context.h
utils/unicode_converter.h
utils/unicode_conv.h
into:driver/utils/conversion.h
(common code)driver/utils/conversion_std.h
(default for MSVC)driver/utils/conversion_icu.h
(default for all non-MSVC, not implemented for MSVC)utils/unicode_converter.h
, conversions are performed via pivot encoding, with some possible shortcuts, (hardcoded in ICU, UTF-16 placed in UChar arrays), main directions:application narrow-char <--(ICU pivot)--> driver internal narrow char
application wide-char <--(ICU pivot)--> driver internal narrow char
data source narrow-char <--(ICU pivot)--> driver internal narrow char