Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode conversion rework #278

Merged
merged 64 commits into from
Apr 13, 2020
Merged

Unicode conversion rework #278

merged 64 commits into from
Apr 13, 2020

Conversation

traceon
Copy link
Collaborator

@traceon traceon commented Mar 13, 2020

  • Add ICU drop-in support (non-MSVC platforms only; system lib only)
  • Factored out to dedicated headers:
    utils/amortized_istream_reader.h
    utils/resize_without_initialization.h
    utils/object_pool.h
  • folly switched to a repo with my custom changes that enable resize-without-initialization to work on all target platforms
  • Implemented:
    utils/string_pool.h
    utils/conversion_context.h
    utils/unicode_converter.h
  • Separated utils/unicode_conv.h into:
    driver/utils/conversion.h (common code)
    driver/utils/conversion_std.h (default for MSVC)
    driver/utils/conversion_icu.h (default for all non-MSVC, not implemented for MSVC)
  • Main conversion job is done in utils/unicode_converter.h, conversions are performed via pivot encoding, with some possible shortcuts, (hardcoded in ICU, UTF-16 placed in UChar arrays), main directions:
    application narrow-char <--(ICU pivot)--> driver internal narrow char
    application wide-char <--(ICU pivot)--> driver internal narrow char
    data source narrow-char <--(ICU pivot)--> driver internal narrow char
  • This is a first step in transition to ICU-only implementation, so some code may seem redundant, or some further changes may seem like missing - most of them will take their places when the transition is done in subsequent iterations

@traceon traceon closed this Mar 21, 2020
@traceon traceon reopened this Mar 21, 2020
Fix macOS cmake configure command line to include paths to openssl and icu brew installations
@traceon traceon changed the title WIP: Unicode conversion rework Unicode conversion rework Mar 30, 2020
@traceon traceon requested a review from Enmk March 31, 2020 12:12
@Enmk
Copy link
Collaborator

Enmk commented Apr 3, 2020

well, few questins:

driver/utils/conversion_std.h (MSVC only)

Why not conversion_msvc.h ?

conversions are performed via pivot encoding

What is pivot encoding here? Do I get it right, that you usually convert via some intermediate encoding? If yes, why?

@traceon
Copy link
Collaborator Author

traceon commented Apr 3, 2020

Why not conversion_msvc.h ?

Because it can be still used with any other compiler, with one single macro switch.

What is pivot encoding here? Do I get it right, that you usually convert via some intermediate encoding? If yes, why?

ICU's converter-of-X only converts from X to its pivot, and from its pivot to X. ICU uses a hardcoded pivot, which is UTF-16 in UChar. Changing it to UTF-8 may speed up things in our case, but will require custom built ICU. This is planned for the next change. Everything inside the driver is represented in UTF-8.

Copy link
Collaborator

@Enmk Enmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, having a dependency on private fork of folly feels spooky and fragile.
Also, please try really hard not to abuse templates and inline functions. That leads to an overcomplicated tightly coupled system. Not every piece is performance-critical, and not every function call should be inlined.

Please try really hard to keep it as simple as possible, simplicity is a king. It makes it so much easier to review, add new features, and debug software.

.gitmodules Outdated
@@ -16,5 +16,5 @@
branch = master
[submodule "contrib/folly"]
path = contrib/folly
url = https://github.com/facebook/folly.git
branch = master
url = https://github.com/traceon/folly.git
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance that these fixes are going to land to official repo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, hopefully will be merged soon.

@@ -137,6 +139,13 @@ if (NOT CH_ODBC_PREFER_BUNDLED_FOLLY)
# find_package (Folly)
endif ()

if (CH_ODBC_USE_ICU)
Copy link
Collaborator

@Enmk Enmk Apr 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, cmake wasn't able to find icu installed via homebrew on my mac. Is there any workaround/fix for that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to specify the path manually, its a known issue of FindICU + homebrew.
Use this command (as per README.md):

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DOPENSSL_ROOT_DIR=$(brew --prefix)/opt/openssl -DICU_ROOT=$(brew --prefix)/opt/icu4c ..

driver/CMakeLists.txt Show resolved Hide resolved
public:
StringPool string_pool{10};

UnicodeConverter application_wide_char_converter;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please describe why you need 4 instances of converters?

Copy link
Collaborator Author

@traceon traceon Apr 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I wrote in the description, conversions took place between the following [independently configured/hardcoded] encodings:

application narrow-char <--(ICU pivot)--> driver internal narrow char
application wide-char <--(ICU pivot)--> driver internal narrow char
data source narrow-char <--(ICU pivot)--> driver internal narrow char

So, as you can see, 4 total converters are involved:

  • application narrow-char
  • application wide-char
  • data source narrow-char
  • driver internal narrow char

Each of these converters convert from the specified encoding to ICU pivot, and in opposite direction.

driver/utils/conversion_icu.h Show resolved Hide resolved
// Return the same string with signature prefix removed, if it matches the provided signature/BOM byte array.
// Even though the signature is specified in bytes, it will be considered matching only if it matches up to some character boundary.
template <typename CharType>
inline auto consumeSignature(std::basic_string_view<CharType> str, const std::string_view & signature) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just removeBOM ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean some known removeBOM function or you suggest naming change?
"Signature" is more correct term than "BOM." "Consume" because, you usually "consume" from the beginning of a buffer/string/sequence, whereas "remove" could mean remove from anywhere.

inline const std::string converter_pivot_wide_char_encoding = "UTF-16";

template <typename CharType>
inline auto make_raw_str(std::initializer_list<CharType> && list) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is that snake_case while almost every other function is lowerCamelCase ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's intentional, kind of mimicking std convention here.

return convertEncoding(src_converter, make_string_view(src), pivot, dest_converter, dest, ensure_src_signature, trim_dest_signature);
}

class UnicodeConverter {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there should be a class-level comment, describing what does it converts and to what.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


class UnicodeConverter {
public:
explicit inline UnicodeConverter(const std::string & encoding);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what kind of encoding is: source, destination, transit, or anything else.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the converter encoding. This is a wrapper around ICU's converter with "encoding" having the same meaning as the encoding of such converter.

std::size_t pivot_signatures_to_trim_max_size_ = 0;
};

inline UnicodeConverter::UnicodeConverter(const std::string & encoding) {
Copy link
Collaborator

@Enmk Enmk Apr 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do not think that inlining function this big is a good idea:

  • first of all, it is so less likely to be actually inlined due to the size
  • even if it is inlined, it will generate HUGE amount of extra code in many translation units.
  • if it is not inlined it will add more work to the link stage.
  • even if it is inlined, there will be no practical benefit performance-wise from inlining, since that would save you what? My estimate is approx 0.1-0.2% of execution time.
  • that also leads to implementation details like sameEncoding, make_raw_str, consumeSignature, etc. leak to the common namespace.
  • due to UnicodeConverter being header-only, it's implementation details are spilled all over the source base of the entire project. That produces project so coupled so highly that (as it looks like right now) any change in ICU (or attempt to replace it with something else) will ripple through the whole project.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move everything that can be placed in .cpp files, there. However, inlining here is more of an attempt to help the compiler to do deeper static analysis during optimization. In fact, most of the modern compilers will still inline this code, even if it is put in .cpp, thanks to LTO. So, size of the binary, and other things will be the same. Maybe slightly better compilation time, however, also slightly higher risks of slightly worse optimizations, if compiler+linker are not fresh enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include <vector>

using ConverterPivotWideCharType = UChar;
inline const std::string converter_pivot_wide_char_encoding = "UTF-16";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be moved to cpp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to keep this in one place, that will make it easier to read.

Copy link
Collaborator

@Enmk Enmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of wizzardy too clever to be easily understood or fixed/updated by anybody else.

And still leaks too much of ICU implementation details to the rest of the code.
IMO, Converter should have some really simple interface:

// not including ICU's headers, but forward-declaring necessary stuff
struct UConverter;

class UnicodeConverter
{
	// where encoding dictates what is located inside, what is the byte-width of the character, etc. input data is just threated as bytes of assumed encoding.
	void convert(Encoding from, std::basic_string<char> const & from, Encoding to, std::basic_string<char> & to);
	void convert(Encoding from, const char * from, size_t from_length, Encoding to, std::basic_string<char> & to);
	// or something similar...

private:
	UConverter * converter_;
	// other internal things
};
// implementation in cpp can use any trickery necessary to achieve the desired goal, but in an isolated mode, shielding rest of the application from the complexity of ICU inner workings

The way I see it there are only two encodings: one that is stored internally (presumably it matches one sent by the server) and the one client wants. There is no pivot in the domain.

And I don't get it why the notion of Pivot should be spread outside of UnicodeConverter implementation (right now it is in 5 files, while it should be in 1)? Does that help in reducing some common expensive operation, like needless conversion?

I do recall that you've said that we are going to have a custom-build-ICU soon and hence stop messing with Pivot encoding, why not abstract it away now?

Another thing: I see that you have a nifty ObjectPool class, but you use it to cache strings and vectors. How heavy is the constructor of UnicodeConverter? I imagine it is ten-to-hundred times more expensive than allocating (and filling) memory for string or vector... Would it be a good idea to reuse converters via ObjectPools?

const bool ensure_encoded_signature,
const bool trim_pivot_signature
) {
pivot.clear();
Copy link
Collaborator

@Enmk Enmk Apr 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a simplified plan/high-level overview of what is happening here in an overly simplified pseudocode or just plain English. Hope it would make the reader's life easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

const bool ensure_pivot_signature,
const bool trim_encoded_signature
) {
encoded.clear();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a simplified plan/high-level overview of what is happening here in overly simplified pseudocode or just plain English. Hope it would make the reader's life easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src_converter.convertToPivot(src, pivot, ensure_src_signature, false);
dest_converter.convertFromPivot(make_string_view(pivot), dest, true, trim_dest_signature);
#else
dest.clear();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a simplified plan/high-level overview of what is happening here in overly simplified pseudocode or just plain English.
Plus a note why you need a separate function and how it is different from convertToPivot and convertFromPivot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Add comments for central conversion functions
Copy link
Collaborator

@Enmk Enmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@Enmk Enmk merged commit 1a2ac66 into ClickHouse:master Apr 13, 2020
@traceon traceon deleted the unicode-perf-fix branch April 13, 2020 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants