C++ runtime changes for high warning levels #1902

jm-mikkelsen · 2017-06-10T08:06:51Z

These changes are for compiling the C++ runtime and code using the C++ runtime headers with high warning levels and -Werror. There are no functional changes in this commit. Compiled with gcc 5.4 and clang 3.8.

Implement IntervalSet::operator=() using the same semantics as the copy constructor, put the destructor into the .cpp file so the vtable doesn't get generated everywhere.

These changes are for compiling with high warning levels and -Werror. There are no functional changes in this commit. Compiled with gcc 5.4 and clang 3.8.

Summary:

- Put virtual destructors into the appropriate .cpp file instead of the inline version in the header to avoid many vtables.
- Change C-style casts to modern C++ casts.
- Add explicit casts in some signed to/from unsigned conversions.
- Remove unreached code in BufferedTokenStream.cpp and LexerATNSimulator.cpp.
- Remove shadowed variables by qualifying constructor arguments that have the same name as a member variable.
- Add explicitly defined copy constructors and assignment operators where required by gcc's -Weff-c++.
- Use std::numeric_limits<size_t>::max() instead of assigning a negative number.
- Remove semi-colons after function definitions.
- Remove unnecessary casts.
- In preprocessor statements "#if label > value" change to "#if defined(label) && label > value" to avoid warnings about the undefined symbol being seen as zero.
- Remove ANTLR4CPP_PUBLIC from "enum class" definitions.
- Change the FinalAction move constructor to move instead of copy the _cleanUp std::function object. (A side-effect of explicitly initialising member variables as required by gcc's -Weff-c++. I turned this one off because most constructors needed to be touched, especially the classes implemented with InitializeInstanceFields()).
- Mark hex digit conversion functions as file static in guid.cpp.

- Also remove static instance of std::wstring_convert. Access to std::wstring_convert<>::{from,to}_bytes() is not guaranteed to be thread safe.

jm-mikkelsen · 2017-06-10T08:54:39Z

Just added changes to remove the utf8_to_utf32 auto return type, which broke the CI builds. While I was there I removed the static utfConverter instance; access to its {from,to}_bytes methods is not guaranteed to be thread safe.

The Travis CI build is failing after an include of <cstddef> -- This is an attempt to work around that by including <stddef.h> instead. Problem not apparent in my FreeBSD environment.

ATN::nextTokens(ATNState* s) updates s->nextTokenWithinRule if the IntervalSet is empty, and then sets it to be read only. However, if the updated IntervalSet value was also empty, it becomes a read-only empty set, causing an exception on a second call on the same state. This was exposed by a change I made to make IntervalSet::operator=() respect the _readonly flag (which in turn was found by compiling with a high warning level). The approach in this update is to perform the update if the updated value is not empty or if the current value is not read only. This preserves the previous behaviour of creating a read-only empty set and working on subsequent calls. It will throw on an attempt to update a read-only value, where previously the read-only value would be silently discarded and set to updatable.

This is a proposed fix to bug antlr#1826 which removes a race condition where multiple threads could update ATNState::nextTokenWithinRule, leading to corrupted std::vector instances in an IntervalSet.

mike-lischke

That's quite a big patch and I haven't commented on all places where a change is needed. Let me summarize here instead; please consider changing all affected places:

- Consistency: your change from {} to = default is ok, but it should be used consistently. Often you added an empty body instead. These should all be removed and replaced by default.
- You added several times an empty d-tor (not even virtual), which makes no sense. It would make sense for virtual d-tors. The other d-tors are just noise.
- You added several times copy c-tor and copy assignment functions only to say they should use the default impl. Why, by all means, did you even add them if you rely on the default impl anyway? That looks so totally superfluous.
- The change from #if to #if defined also seems like useless cosmetic change, or is there anything that requires the latter (I haven't seen any problems or warnings resulting from the original code)?
- Code formatting: in the code you added (mostly the new d-tor impls, but also others) you did not consider the coding style in the files -> opening brace at the end of the line, not on its own line.

mike-lischke · 2017-06-11T08:05:56Z

runtime/Cpp/runtime/src/CommonTokenStream.cpp

-CommonTokenStream::CommonTokenStream(TokenSource *tokenSource, size_t channel)
-: BufferedTokenStream(tokenSource), channel(channel) {
+CommonTokenStream::CommonTokenStream(TokenSource *tokenSource, size_t channel_in)
+: BufferedTokenStream(tokenSource), channel(channel_in) {
 }


I had a number of similar changes already and use a simple trailing underscore. Having names like channel_in violates the camel case coding style. Can you please change that?

Will update

mike-lischke · 2017-06-11T08:06:51Z

runtime/Cpp/runtime/src/DiagnosticErrorListener.cpp

@@ -17,7 +17,7 @@ using namespace antlr4;
 DiagnosticErrorListener::DiagnosticErrorListener() : DiagnosticErrorListener(true) {
 }

-DiagnosticErrorListener::DiagnosticErrorListener(bool exactOnly) : exactOnly(exactOnly) {
+DiagnosticErrorListener::DiagnosticErrorListener(bool exactOnly_in) : exactOnly(exactOnly_in) {
 }


mike-lischke · 2017-06-11T08:11:45Z

runtime/Cpp/runtime/src/Exceptions.cpp

+UnsupportedOperationException::~UnsupportedOperationException() = default;
+EmptyStackException::~EmptyStackException() = default;
+CancellationException::~CancellationException() = default;
+ParseCancellationException::~ParseCancellationException() = default;


It would make more sense to specify the default behavior directly on the class declarations, instead of having a list like this somewhere. Also it violates the pattern used throughout the runtime:

class A { public: <variables> <c-tors> <d-tor> <other decls> protected: ... };

That means each d-tor definition has to be placed after the c-tor definition in the cpp file or in the declaration part. Please change.

The problem with putting the "= default" in the header file is that the vtable is not generated as part of the translation unit. I kept the same order as the pattern used in the header file changes, the difference is that the destructor doesn't have an inline definition.

Would you prefer an empty destructor or "= default" in a .cpp file?

OK, I see. In such cases make them empty d-tors in the cpp file - properly ordered with the rest of the code.

mike-lischke · 2017-06-11T08:13:02Z

runtime/Cpp/runtime/src/Exceptions.h

-    IllegalStateException(const std::string &msg = "") : RuntimeException(msg) {};
+    IllegalStateException(const std::string &msg = "") : RuntimeException(msg) {}
+    IllegalStateException(IllegalStateException const&) = default;
+    ~IllegalStateException();


~IllegalStateException() = default;

Similar for all other exceptions. However, I wonder why you introduced the d-tor on all the exceptions if you use the default implementation anyway?

As in the previous comment. To make sure that the vtable for the class with virtual functions is only generated in one translation unit.

Weird, I never had problems with v-table creation so far. Is this something that changed in a newer compiler? Neither clang nor msc seem to have a problem with that either. Can you point me to an online resource, so I can read up on the details?

The vtables are created OK, and will work. The problem is that there are too many of them.

That has been part of my coding standard for so long, I assumed everyone did it that way. However, finding an online reference that gives this advice has been surprisingly difficult in about half an hour of searching.

The best original reference I found with a quick look around the office is from John Lakos's "Large-Scale C++ Software Design" (from 1996), which has the "minor design rule":

"In every class that declares or is derived from a class that declares a virtual function, explicitly declare the destructor as the first virtual function in the class and define it out of line."

It goes on to describe the reasons, including an anecdote of thousands of instances of the same functions in different translation units caused by duplicated vtables. Locality of reference will probably also suffer (ie. different instances of the same function will lead to less code fitting into the CPU's instruction cache.)

(I read that in 1997 and pretty much adopted all of it, so I have probably been doing this for at least 20 years.)

Item 24 in Scott Meyers' "More Effective C++" also cautions against inline virtual function definitions with the quote "In large systems, this can lead to programs containing hundreds or thousands of copies of a class's vtbl!"

And, of course, Clang has a warning to let you know you're making your object too big:

-Wweak-vtables

Diagnostic text:

warning: A has no out-of-line virtual method definitions; its vtable will be emitted in every translation unit
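As a minimal sketch of the rule quoted above, the single-file example below declares the destructor first and defines every virtual function out of line. The Shape and Circle classes and the nameThroughBase() helper are hypothetical names used only for illustration, not part of the runtime. Under the Itanium C++ ABI used by gcc and clang on most Unix-like platforms, the vtable is then emitted only in the translation unit that defines the first non-inline, non-pure virtual function.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Would live in a header, e.g. a hypothetical Shape.h.
class Shape {
public:
  virtual ~Shape();  // declared only: no inline body in the header
  virtual std::string name() const;
};

class Circle : public Shape {
public:
  ~Circle() override;
  std::string name() const override;
};

// Would live in exactly one .cpp file, so the vtables for Shape and Circle
// have a single home instead of being emitted wherever the header is included.
Shape::~Shape() {
}

std::string Shape::name() const {
  return "Shape";
}

Circle::~Circle() {
}

std::string Circle::name() const {
  return "Circle";
}

// Virtual dispatch behaves the same wherever the bodies are defined.
std::string nameThroughBase() {
  std::unique_ptr<Shape> shape(new Circle());
  assert(shape != nullptr);
  return shape->name();
}
```

Calling nameThroughBase() goes through the base-class pointer and still reaches Circle::name(); only the placement of the definitions changes, not the behaviour.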

Makes a lot of sense. Something new learned. Thanks a lot.

mike-lischke · 2017-06-11T08:15:01Z

runtime/Cpp/runtime/src/InputMismatchException.cpp

@@ -13,3 +13,7 @@ InputMismatchException::InputMismatchException(Parser *recognizer)
  : RecognitionException(recognizer, recognizer->getInputStream(), recognizer->getContext(),
                         recognizer->getCurrentToken()) {
 }
+
+InputMismatchException::~InputMismatchException()


Consistency: should not be defined empty here but use default in the header.

As in the previous two comments. Happy to make a choice between empty destructors and "= default".

mike-lischke · 2017-06-11T09:17:59Z

runtime/Cpp/runtime/src/misc/IntervalSet.h

@@ -30,15 +31,17 @@ namespace misc {
  protected:
    /// The list of sorted, disjoint intervals.
    std::vector<Interval> _intervals;
-    bool _readonly;
+    std::atomic<bool> _readonly;


Why do we have a mutex in the ATN for protecting the read-only state change and make the _readonly flag atomic? That's not necessary - one of the two can be removed. I'd prefer atomic to stay.

I comment on this in the bug report. The atomic alone is not sufficient to remove the race condition around updates. This is a pattern we use a lot.

Hmm, now I get it. You have that check before the mutex (where atomic comes in) then the mutex lock. So this seems ok then. I'm just a bit nervous about performance impact. The interval set class is one of the lowest level classes and used everywhere in the antlr4 runtime.

The idea here is that you can check _readonly orders of magnitude more quickly than taking a lock on a mutex. That's pretty fast, and in the scheme of all the other stuff going on (eg. copying lots of IntervalSet) not too bad.
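The check-then-lock pattern described here can be sketched roughly as follows. This is not the runtime's code: CachedValue, compute() and runThreads() are hypothetical names, and the cached value is a plain int rather than an IntervalSet. The atomic flag is read first without any lock; the mutex is taken only when the flag says the value is not ready yet, and the flag is checked again under the lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a lazily computed, cached value.
class CachedValue {
public:
  // Returns the cached value, computing it at most once across all threads.
  int get() {
    // Fast path: one atomic load, no lock taken.
    if (!_ready.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(_mutex);
      // Re-check under the lock: another thread may have filled the cache.
      if (!_ready.load(std::memory_order_relaxed)) {
        _value = compute();
        _ready.store(true, std::memory_order_release);
      }
    }
    return _value;
  }

  int computeCount() const {
    return _computeCount.load();
  }

private:
  int compute() {
    ++_computeCount;
    return 42;
  }

  std::mutex _mutex;
  std::atomic<bool> _ready{false};
  std::atomic<int> _computeCount{0};
  int _value = 0;
};

// Calls get() from several threads and returns the sum of the results.
int runThreads(CachedValue &cache, int threadCount) {
  std::atomic<int> sum{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < threadCount; ++i) {
    threads.emplace_back([&cache, &sum] { sum += cache.get(); });
  }
  for (std::thread &thread : threads) {
    thread.join();
  }
  assert(cache.computeCount() <= 1);
  return sum.load();
}
```

The atomic alone would let two threads both see "not ready" and both write the value; the mutex alone would make every reader pay for a lock. Combining them keeps the common read path down to a single atomic load.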

For optimising IntervalSet, adding IntervalSet(IntervalSet&&) and IntervalSet& operator=(IntervalSet&&) and using std::move is likely to be a big win. The other question is whether the copy constructor should really call addAll() at all because the source instance will have already done all the checking in add(Interval const&).

IntervalSets seem to be copied a lot in the code, this is likely to improve things.
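A rough sketch of what adding move operations buys, using a hypothetical SimpleSet class that wraps a std::vector<int> (the real IntervalSet holds intervals and has more state, so this only illustrates the mechanism). With defaulted move operations, std::move hands over the vector's buffer instead of copying every element.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical value type, illustrative only.
class SimpleSet {
public:
  SimpleSet() = default;
  SimpleSet(const SimpleSet &) = default;             // element-by-element copy
  SimpleSet(SimpleSet &&) = default;                  // takes over the buffer
  SimpleSet &operator=(const SimpleSet &) = default;
  SimpleSet &operator=(SimpleSet &&) = default;

  void add(int value) {
    _values.push_back(value);
  }

  std::size_t size() const {
    return _values.size();
  }

private:
  std::vector<int> _values;
};

// Builds a set with `count` elements; the return value is moved or elided.
SimpleSet makeSet(int count) {
  SimpleSet result;
  for (int i = 0; i < count; ++i) {
    result.add(i);
  }
  return result;
}

// Moves `source` into a new set and returns the new set's size.
std::size_t moveAndMeasure(SimpleSet &source) {
  SimpleSet target(std::move(source));
  assert(target.size() > 0 || source.size() == 0);
  return target.size();
}
```

A moved-from object is still valid but should only be reassigned or destroyed in normal code; that is the price of skipping the copy.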

mike-lischke · 2017-06-11T09:20:38Z

runtime/Cpp/runtime/src/support/Any.cpp

+
+antlrcpp::Any::Base::~Base()
+{
+}


I'm usually fine with having the implementation in a cpp file, so that change is ok. But when you start such a clean up it should be done fully, i.e. move all definitions here from the header.

mike-lischke · 2017-06-11T09:22:02Z

runtime/Cpp/runtime/src/support/CPPUtils.h

-      _enabled = other._enabled;
+    FinalAction(FinalAction &&other) :
+	_cleanUp(std::move(other._cleanUp)), _enabled(other._enabled)
+    {


Coding style: opening brace on the end of the line opening the block (as it was originally).
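For context on what the move constructor in this hunk does, here is a minimal scope-guard sketch in the same spirit, with the brace placement the review asks for. ScopeGuard and countCleanUps() are hypothetical names and this is not the FinalAction class itself; disabling the moved-from guard is an assumption about the intended behaviour.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical guard that runs a callback when it goes out of scope.
class ScopeGuard {
public:
  explicit ScopeGuard(std::function<void()> cleanUp)
    : _cleanUp(std::move(cleanUp)), _enabled(true) {
  }

  // Move the std::function instead of copying it, then disable the source
  // so the callback runs only once.
  ScopeGuard(ScopeGuard &&other)
    : _cleanUp(std::move(other._cleanUp)), _enabled(other._enabled) {
    other._enabled = false;
  }

  ScopeGuard(const ScopeGuard &) = delete;
  ScopeGuard &operator=(const ScopeGuard &) = delete;

  ~ScopeGuard() {
    if (_enabled) {
      _cleanUp();
    }
  }

private:
  std::function<void()> _cleanUp;
  bool _enabled;
};

// Returns how many times the callback ran after moving a guard once.
int countCleanUps() {
  int calls = 0;
  {
    ScopeGuard first([&calls] { ++calls; });
    ScopeGuard second(std::move(first));
    assert(calls == 0);  // nothing runs until the guards leave scope
  }
  return calls;
}
```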

mike-lischke · 2017-06-11T09:26:31Z

runtime/Cpp/runtime/src/support/StringUtils.h

 #else
-  static std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utfConverter;
+  using UtfConverterType = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
+  using UtfConverterWide = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>::wide_string;


Have you tested this on Windows? UtfConverterWide is defined as std::u32string here, while for non-Win it's a std::wstring_convert.

This part has already been improved and it's better not to include such changes in this patch.

Merged your changes.

mike-lischke · 2017-06-11T09:42:16Z

runtime/Cpp/runtime/src/tree/xpath/XPathLexer.cpp

@@ -47,7 +47,7 @@ const atn::ATN& XPathLexer::getATN() const {

 void XPathLexer::action(RuleContext *context, size_t ruleIndex, size_t actionIndex) {
  switch (ruleIndex) {
-    case 4: IDAction(dynamic_cast<antlr4::RuleContext *>(context), actionIndex); break;
+    case 4: IDAction(context, actionIndex); break;


This is generated code. You shouldn't change that manually.

mike-lischke · 2017-06-11T09:43:40Z

And, btw. this patch of course has functional changes.

For the UTF string conversions pending to be merged see my code: https://github.com/mike-lischke/antlr4/blob/master/runtime/Cpp/runtime/src/support/StringUtils.h

jm-mikkelsen · 2017-06-11T22:56:48Z

Thanks for the review Mike -- I appreciate that it is a big patch and takes an effort!

I'll respond in detail to your comments inline.

jm-mikkelsen · 2017-06-11T23:54:20Z

> That's quite a big patch and I haven't commented on all places where a change is needed. Let me summarize here instead; please consider changing all affected places:

> Consistency: your change from {} to = default is ok, but it should be used consistently. Often you added an empty body instead. These should all be removed and replaced by default.

Is this how you would like things to look for destructors in .cpp files as well, or would you prefer empty bodies there?

> You added several times an empty d-tor (not even virtual), which makes no sense. It would make sense for virtual d-tors. The other d-tors are just noise.

Which destructor is not virtual?

There are two types of warnings leading to the destructor changes:

- A vtable for the class will be generated in each translation unit.
- The class has pointer members but has compiler generated functions, eg. destructor or operator=.

For the first case, the answer is explicitly declare the destructor and put it into a .cpp file.

For the second case, the answer is to explicitly define the function, even if it is empty (for a destructor) or "= default". The intent here is to signal that you have considered what is going on, and say it's OK.

Seeing the default constructed operator=() being significantly different to the copy constructor in IntervalSet came directly from dealing with this type of warning. Which in turn led to discovering that there were multiple assignments of cached interval sets in ATN, which led directly to a resolution of the race condition in issue #1826.

> You added several times copy c-tor and copy assignment functions only to say they should use the default impl. Why, by all means, did you even add them if you rely on the default impl anyway? That looks so totally superfluous.

Please see above.

> The change from #if to #if defined also seems like useless cosmetic change, or is there anything that requires the latter (I haven't seen any problems or warnings resulting from the original code)?

#if defined(...) is used throughout the code, I fixed a few cases where it was not used. The warning was around a construct like #if SYMBOL > 0 rather than #if defined(SYMBOL) && SYMBOL > 0.
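A small sketch of the construct in question, assuming a made-up macro HYPOTHETICAL_FEATURE_LEVEL that is deliberately left undefined. Written as a bare "#if HYPOTHETICAL_FEATURE_LEVEL > 2", the undefined name is silently evaluated as 0, which is what gcc and clang report under -Wundef; guarding it with defined() keeps the comparison from being evaluated when the macro is absent.

```cpp
#include <cassert>

// HYPOTHETICAL_FEATURE_LEVEL is an invented macro and is not defined here.
#if defined(HYPOTHETICAL_FEATURE_LEVEL) && HYPOTHETICAL_FEATURE_LEVEL > 2
constexpr bool kUseNewPath = true;
#else
constexpr bool kUseNewPath = false;
#endif

// Reports which branch the preprocessor selected.
bool usesNewPath() {
  assert(kUseNewPath == true || kUseNewPath == false);
  return kUseNewPath;
}
```

Both spellings select the same branch; the guarded form only makes the "not defined" case explicit instead of relying on the implicit zero.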

> Code formatting: in the code you added (mostly the new d-tor impls, but also others) you did not consider the coding style in the files -> opening brace at the end of the line, not on its own line.

Will bow to the local curly brace convention!

- Change qualifying suffix from "_in" to "_" to conform to conventions.

- Explicitly cast negative values to size_t instead of using an offset from std::numeric_limits<size_t>::max().

- Change "#if SYM > x" to "#if defined(SYM) && SYM > x" as a pattern. In StringUtils.h this follows the pattern used earlier in the file.
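The two spellings mentioned for the size_t values are equivalent, which the sketch below checks at compile time. The constant names are hypothetical and the value -2 is chosen only for illustration; conversion of a negative integer to an unsigned type is defined as wrapping modulo 2^N, so the explicit cast states the intent without changing the value.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical sentinel written as an explicit cast of a negative value.
constexpr std::size_t kSentinelByCast = static_cast<std::size_t>(-2);

// The same value written as an offset from the maximum.
constexpr std::size_t kSentinelByOffset =
    std::numeric_limits<std::size_t>::max() - 1;

static_assert(kSentinelByCast == kSentinelByOffset,
              "both spellings denote the same size_t value");

// Runtime version of the same check.
bool sentinelsAgree() {
  assert(kSentinelByCast > 0);
  return kSentinelByCast == kSentinelByOffset;
}
```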

mike-lischke

Looks pretty good. Only very few things left.

mike-lischke · 2017-06-12T06:56:39Z

runtime/Cpp/runtime/src/misc/Interval.h

-    virtual ~Interval() {};
+    Interval(Interval const&) = default;
+    virtual ~Interval();
+    Interval& operator=(Interval const&) = default;


Ah, confused Interval and IntervalSet, sorry.

mike-lischke · 2017-06-12T07:00:30Z

runtime/Cpp/runtime/src/misc/IntervalSet.cpp

+{
+  if (_readonly) {
+    throw IllegalStateException("can't alter read only IntervalSet");
+  }


Hmm, tricky. In fact there is no copy function like this in Java code, so there is no real copy operation there. When they use an assignment they replace a reference. Maybe we should just delete the assignment operator to avoid any unintended behavior and so stay closer to the Java code? If needed one can easily create a new interval set.

mike-lischke · 2017-06-12T07:04:03Z

runtime/Cpp/runtime/src/misc/IntervalSet.h

@@ -30,15 +31,17 @@ namespace misc {
  protected:
    /// The list of sorted, disjoint intervals.
    std::vector<Interval> _intervals;
-    bool _readonly;
+    std::atomic<bool> _readonly;


Hmm, now I get it. You have that check before the mutex (where atomic comes in) then the mutex lock. So this seems ok then. I'm just a bit nervous about performance impact. The interval set class is one of the lowest level classes and used everywhere in the antlr4 runtime.

mike-lischke · 2017-06-12T07:17:26Z

For your discussion about my summary, I believe that's all clarified in the code review now and we are almost there with the patch.

jm-mikkelsen · 2017-06-13T00:40:49Z

After going further into the Interval/IntervalSet/readonly and the C++ pass-by-value vs. Java pass reference by value differences, I now see many issues in the code where classes with virtual functions are passed by value and assigned to instances of the base class.

For example, ParseTreeMatcher::split() creates an instance of a TextChunk and assigns it to a Chunk(). This is a bug.

https://en.wikipedia.org/wiki/Object_slicing
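A minimal illustration of the slicing problem, using simplified stand-ins that borrow the names Chunk and TextChunk but are not the runtime's classes; the kind() function and both helpers are invented for this sketch. Copying a derived object into a base-class value keeps only the base part, so the virtual call no longer reaches the derived override.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-ins, illustrative only.
class Chunk {
public:
  virtual ~Chunk() {
  }

  virtual std::string kind() const {
    return "chunk";
  }
};

class TextChunk : public Chunk {
public:
  explicit TextChunk(std::string text) : _text(std::move(text)) {
  }

  std::string kind() const override {
    return "text:" + _text;
  }

private:
  std::string _text;
};

// Copies a TextChunk into a Chunk by value: the TextChunk part is sliced off.
std::string kindBySlicedCopy() {
  TextChunk text("abc");
  Chunk sliced = text;
  return sliced.kind();
}

// Holds the TextChunk through a base-class pointer: the override is kept.
std::string kindByPointer() {
  std::unique_ptr<Chunk> chunk(new TextChunk("abc"));
  assert(chunk != nullptr);
  return chunk->kind();
}
```

Java code that assigns such objects only copies references, which is why a line-by-line port to C++ value types can change behaviour without any compiler diagnostic.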

I'm looking at fixing this, at least to an extent. Not sure whether to add to this PR, or just start another branch ...

mike-lischke · 2017-06-13T06:44:07Z

This is a separate issue and should be handled in its own patch. To be honest the parse tree matcher has not been tested so far and I'm not even sure there are runtime tests for it. So, everything below pattern/ is not fully complete, I'm afraid (because it's rarely used).

jm-mikkelsen · 2017-06-14T01:33:07Z

OK, thanks Mike. After picking up Antlr again on Friday afternoon (after a few years), I've gone a lot further than I expected in looking at the C++ runtime.

You've answered my internal "how does this work for anyone" question about the pattern/ stuff with "it doesn't". I'll submit my patch once this patch is accepted.

Other pieces of work I haven't committed yet:

- A rework of Interval, IntervalSet, ATN and ATNState, removing the readonly status from IntervalSet, removing virtual functions from IntervalSet and Interval, moving the atomic into ATNState as a "now cached" flag, returning a const reference from ATN::nextStates(ATNState*) so the readonly status is enforced by the compiler not at runtime in the code and value semantics using std::move to reduce the number of copies performed.
- Fixes in tree/xpath for object slicing problems. Is this code expected to work?
- Remove virtual functions from Vocabulary, ATNDeserializationOptions and ParseInfo and make them final. They are copied by value throughout the code so virtual functions are meaningless. Also remove the unused readonly flag on ATNDeserializationOptions.
- Explicitly delete copy constructor and assignment operator on various classes with virtual functions where actually using them would be an error (as with the actual usage of XPathElement and Chunk). There are more instances but I'm not going to go after them all now. I've turned down warnings a lot for the Antlr C++ runtime code now.
- CMakeLists.txt update to support FreeBSD.

Unit tests pass for me with all changes.

Of these, the first one is the one that is (in my view) possibly worth adding to this pull request.

I hope the changes are worthwhile to you -- I think they help improve things, happy to update in response to feedback.

(I'm in Sydney so I'm a little out of sync with your timezone.)

mike-lischke · 2017-06-14T06:47:33Z

Ah, that sounds promising. Nobody else (other than the runtime authors) has put as much detailed work into the C++ runtime as you have. Please open PRs as you progress and we will review and merge them one by one.

mike-lischke · 2017-06-26T08:03:20Z

@parrt This patch applies to the C++ runtime only, improves several aspects of it and fixes a threading issue. The runtime tests succeed, so I think it's worth merging.

jm-mikkelsen added 4 commits June 10, 2017 16:47

Implement IntervalSet::operator=() using the same semantics as the copy

9845382

constructor, put the destructor into the .cpp file so the vtable doesn't get generated everywhere.

Add myself to contributors.txt.

e030163

Remove C++14 auto return type on utf8_to_utf32

2d011c8

- Also remove static instance of std::wstring_convert. Access to std::wstring_convert<>::{from,to}_bytes() is not guaranteed to be thread safe.

jm-mikkelsen added 3 commits June 10, 2017 19:56

Possible fix for max_align_t breakage in Travis CI

eb02a05

The Travis CI build is failing after an include of <cstddef> -- This is an attempt to work around that by including <stddef.h> instead. Problem not apparent in my FreeBSD environment.

ATN: Remove race condition in addState(ATNState*)

70402f8

This is a proposed fix to bug antlr#1826 which removes a race condition where multiple threads could update ATNState::nextTokenWithinRule, leading to corrupted std::vector instances in an IntervalSet.

jm-mikkelsen mentioned this pull request Jun 11, 2017

The antlr4 c++ runtime crashes in multithreaded programs #1826

Open

mike-lischke suggested changes Jun 11, 2017

View reviewed changes

jm-mikkelsen added 7 commits June 12, 2017 10:07

Merge https://github.com/antlr/antlr4

dde893d

Naming convention fix for qualifying shadowed args

6e46b16

- Change qualifying suffix from "_in" to "_" to conform to conventions.

Change Lexer::{MORE,SKIP} def back to negative

274e3c2

- Explicitly cast negative values to size_t instead of using an offset from std::numeric_limits<size_t>::max().

Add defined() before #if SYM > val evaluations

8fd4bcf

- Change "#if SYM > x" to "#if defined(SYM) && SYM > x" as a pattern. In StringUtils.h this follows the pattern used earlier in the file.

Comply with curly brace conventions.

577b1d6

Undo remove cast to same type in generated code

8c45d71

Fix missed curly brace convention fix.

aab2c04

mike-lischke suggested changes Jun 12, 2017

View reviewed changes

jm-mikkelsen added 3 commits June 12, 2017 19:42

LexerActionType.h: Use antlr4-common.h for size_t

63fc7cb

SemanticContext::Operator: explicit virtual dtor

fad0488

Convention: Change virtual dtor to empty bodies

fdcfefa

mike-lischke approved these changes Jun 12, 2017

View reviewed changes

Merge branch 'optimizations' of https://github.com/mike-lischke/antlr4

cdfe310

Merge https://github.com/antlr/antlr4

0c4473e

parrt added the target:cpp label Jun 26, 2017

parrt added this to the 4.7.1 milestone Jun 26, 2017

parrt merged commit 990d484 into antlr:master Jun 26, 2017

jm-mikkelsen mentioned this pull request Jul 1, 2017

access violation exception in C++ runtime environment #1933

Open
