Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ runtime changes for high warning levels #1902

Merged
merged 19 commits into from
Jun 26, 2017
Merged

Conversation

jm-mikkelsen
Copy link
Contributor

These changes are for compiling the C++ runtime and code using the C++ runtime headers with high warning levels and -Werror. There are no functional changes in this commit. Compiled with gcc 5.4 and clang 3.8.

constructor, put the destructor into the .cpp file so the vtable doesn't
get generated everywhere.
These changes are for compiling with high warning levels and -Werror.
There are no functional changes in this commit. Compiled with gcc 5.4
and clang 3.8.

Summary:

- Put virtual destructors into the appropriate .cpp file instead
  of the inline version in the header to avoid many vtables.
- Change C-style casts to modern C++ casts.
- Add explicit casts in some signed to/from unsigned conversions.
- Remove unreached code in BufferedTokenStream.cpp and
  LexerATNSimulator.cpp.
- Remove shadowed variables by qualifying constructor arguments with
  the name name as a member variable.
- Add explicitly defined copy constructors and assignment operators
  where required by gcc's -Weff-c++.
- Use std::numeric_limits<size_t>::max() instead of assigning a negative
  number.
- Remove semi-colons after function definitions.
- Remove unneccessary casts.
- In preprocessor statements "#if label > value" change to
  "#if defined(label) && label > value" to avoid warnings about the
  undefined symbol being seen as zero.
- Remove ANTLR4CPP_PUBLIC from "enum class" definitions.
- Change the FinalAction move constructor to move instead of copy the
  _cleanUp std::function object. (A side-effect of explicitly
  initialising member variables as required by gcc's -Weff-c++. I turned
  this one off because most constructors needed to be touched,
  especially the classes implemented with InitializeInstanceFields()).
- Mark hex digit conversion functions as file static in guid.cpp.
- Also remove static instance of std::wstring_convert. Access to
  std::wstring_convert<>::{from,to}_bytes() is not guaranteed to be thread
  safe.
@jm-mikkelsen
Copy link
Contributor Author

Just added changes to remove the fix the utf8_to_utf32 auto return type which broke the CI builds. While I was there I removed the static utfConverter instance; access to its {from,to}_bytes methods are not guaranteed thread safe.

The Travis CI build is failing after an include of <cstddef> -- This is
an attempt to work around that by including <stddef.h> instead. Problem
not apparent in my FreeBSD environment.
ATN::nextTokens(ATNState* s) updates s->nextTokenWithinRule if the
IntervalSet is empty, and then sets it to be read only. However, if the
updated IntervalSet value was also empty, it becomes a read-only empty
set, causing an exception on a second call on the same state.

This was exposed a change I made to make IntervalSet::operator=()
respect the _readonly flag. (Which in turn was found by compiling with a
high warningly level.)

The approach in this update is to perform the update if the updated
value is not empty or if the current value is not read only. This
preserves the previous behaviour of creating a read-only empty set and
working on subsequent calls. It will throw on an attempt to update a
read-only value, where previously the read-only value would be silently
discarded and set to updatable.
This is a proposed fix to bug antlr#1826 which removes a race condition where
multiple threads could update ATNState::nextTokenWithinRule, leading to
corrupted std::vector instances in an InstanceSet.
Copy link
Member

@mike-lischke mike-lischke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's quite a big patch and I haven't commented on all places where a change is needed. Let me rather summarize here and consider changing all places that are affected:

  1. Consistency: your change from {} to = default is ok, but it should be used consequently. Often you added an empty body instead. These should all be removed and replaced by default.
  2. You added several times an empty d-tor (not even virtual), which makes no sense. It would make sense for virtual d-tors. The other d-tors are just noise.
  3. You added several times copy c-tor and copy assignment functions only to say they should use the default impl. Why, by all means, did you even add them if you rely on the default impl anyway. That looks so totally superfluous.
  4. The change from #if to #if defined also seems like useless cosmetic change, or is there anything that requires the latter (I haven't seen any problems or warnings resulting from the original code)?
  5. Code formatting: in the code you added (mostly the new d-tor impls, but also others) you did not consider the coding style in the files -> opening brace at the end of the line, not an own line.

CommonTokenStream::CommonTokenStream(TokenSource *tokenSource, size_t channel)
: BufferedTokenStream(tokenSource), channel(channel) {
CommonTokenStream::CommonTokenStream(TokenSource *tokenSource, size_t channel_in)
: BufferedTokenStream(tokenSource), channel(channel_in) {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a number of similar changes already and use a simple trailing underscore. Having names like channel_in violates the camel case coding style. Can you please change that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

@@ -17,7 +17,7 @@ using namespace antlr4;
DiagnosticErrorListener::DiagnosticErrorListener() : DiagnosticErrorListener(true) {
}

DiagnosticErrorListener::DiagnosticErrorListener(bool exactOnly) : exactOnly(exactOnly) {
DiagnosticErrorListener::DiagnosticErrorListener(bool exactOnly_in) : exactOnly(exactOnly_in) {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

UnsupportedOperationException::~UnsupportedOperationException() = default;
EmptyStackException::~EmptyStackException() = default;
CancellationException::~CancellationException() = default;
ParseCancellationException::~ParseCancellationException() = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make more sense to specify the default behavior directly on the class declarations, instead of having a list like this somewhere. Also it violates the pattern used throughout the runtime:

class A {
public:
  <variables>
  <c-tors>
  <d-tor>
  <other decls>
protected:
...
};

That means each d-tor definition has to be placed after the c-tor defintion in the cpp file or in the declaration part. Please change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with putting the "= default" in the header file is that the vtable is not generated as part of the translation unit. I kept the same order as the pattern used in the header file changes, the difference is that the destructor doesn't have an inline definition.

Would you prefer an empty destructor or "= default" in a .cpp file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. In such cases make them empty d-tors in the cpp file - properly ordered with the rest of the code.

IllegalStateException(const std::string &msg = "") : RuntimeException(msg) {};
IllegalStateException(const std::string &msg = "") : RuntimeException(msg) {}
IllegalStateException(IllegalStateException const&) = default;
~IllegalStateException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~IllegalStateException() = default;

Similar for all other exceptions. However, I wonder why you introduced the d-tor on all the exceptions if you use the default implementation anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the previous comment. To make sure that the vtable for the class with virtual functions is only generated in one translation unit.

Copy link
Member

@mike-lischke mike-lischke Jun 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, I never had problems with v-table creation so far. Is this something that changed in a newer compiler? Neither clang nor msc seem to have a problem with that either. Can you point me to an online resource, so I can read up the details?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vtables are created OK, and will work. The problem is that there are too many of them.

That has been part of my coding standard for so long, I assumed everyone did it that way. However, finding an online reference that gives this advice has been surprisingly difficult in about half an hour of searching.

The best origin reference I found with a quick look around the office from John Lakos's "Large-Scale C++ Software Design" (from 1996), which has the "minor design rule":

"In every class that declares or is derived from a class that declares a virtual function, explicitly declare the destructor as the first virtual function in the class and define it out of line."

It goes on to describe the reasons, including an anecdote of thousands of instances of the same functions in different translation units caused by duplicated vtables. Locality of reference will probably also suffer (ie. different instances of the same function will lead to less code fitting into the CPU's instruction cache.)

(I read that in 1997 and pretty much adopted all of it, so I have probably been doing this for at least 20 years.)

Item 24 in Scott Meyers' "More Effective C++" also cautions against inline virtual function definitions with the quote "In large systems, this can lead to programs containing hundreds or thousands of copies of a class's vtbl!"

And, of course, Clang has a warning to let you know you're making your object too big:

-Wweak-vtables

Diagnostic text:

warning: A has no out-of-line virtual method definitions; its vtable will be emitted in every translation unit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes a lot of sense. Something new learned. Thanks a lot.

@@ -13,3 +13,7 @@ InputMismatchException::InputMismatchException(Parser *recognizer)
: RecognitionException(recognizer, recognizer->getInputStream(), recognizer->getContext(),
recognizer->getCurrentToken()) {
}

InputMismatchException::~InputMismatchException()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency: should not be defined empty here but use default in the header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the previous two comments. Happy make a choice between empty destructors and "= default"

@@ -30,15 +31,17 @@ namespace misc {
protected:
/// The list of sorted, disjoint intervals.
std::vector<Interval> _intervals;
bool _readonly;
std::atomic<bool> _readonly;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have a mutex in the ATN for protecting the read-only state change and make the _readonly flag atomic? That's not necessary - one of the two can be removed. I'd prefer atomic to stay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I comment on this in the bug report. The atomic alone is not sufficient to remove the race condition around updates. This is a pattern we use a lot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, now I get it. You have that check before the mutex (where atomic comes in) then the mutex lock. So this seems ok then. I'm just a bit nervous about performance impact. The interval set class is one of the lowest level classes and used everywhere in the antlr4 runtime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that you can check _readonly orders of magnitude more quickly than taking a lock on a mutex. That's pretty fast, and in the scheme of all the other stuff going on (eg. copying lots of IntervalSet) not too bad.

For optimising IntervalSet, adding IntervalSet(IntervalSet&&) and IntervalSet& operator=(IntervalSet&&) and using std::move is likely to be a big win. The other question is whether the copy constructor should really call addAll() at all because the source instance will have already done all the checking in add(Interval const&).

IntervalSets seem to be copied a lot in the code, this is likely to improve things.


antlrcpp::Any::Base::~Base()
{
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm usually fine with having the implementation in a cpp file, so that change is ok. But when you start such a clean up it should be done fully, i.e. move all definitions here from the header.

_enabled = other._enabled;
FinalAction(FinalAction &&other) :
_cleanUp(std::move(other._cleanUp)), _enabled(other._enabled)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: opening brace on the end of the line opening the block (as it was originally).

#else
static std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utfConverter;
using UtfConverterType = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
using UtfConverterWide = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>::wide_string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this on Windows? UtfConverterWide is defined as std::u32string here, while for non-Win it's a std::wstring_convert.

This part has already been improved and it's better not to include such changes in this patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged your changes.

@@ -47,7 +47,7 @@ const atn::ATN& XPathLexer::getATN() const {

void XPathLexer::action(RuleContext *context, size_t ruleIndex, size_t actionIndex) {
switch (ruleIndex) {
case 4: IDAction(dynamic_cast<antlr4::RuleContext *>(context), actionIndex); break;
case 4: IDAction(context, actionIndex); break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generated code. You shouldn't change that manually.

@mike-lischke
Copy link
Member

mike-lischke commented Jun 11, 2017

And, btw. this patch of course has functional changes.

For the UTF string conversions pending to be merged see my code: https://github.com/mike-lischke/antlr4/blob/master/runtime/Cpp/runtime/src/support/StringUtils.h

@jm-mikkelsen
Copy link
Contributor Author

Thanks for the review Mike -- I appreciate that it is a big patch and takes an effort!

I'll respond in detail to your comments inline.

@jm-mikkelsen
Copy link
Contributor Author

That's quite a big patch and I haven't commented on all places where a change is needed. Let me rather summarize here and consider changing all places that are affected:

  1. Consistency: your change from {} to = default is ok, but it should be used consequently. Often you added an empty body instead. These should all be removed and replaced by default.

Is this how you would like things to look for destructors in .cpp files as well, or would you prefer empty bodies there?

  1. You added several times an empty d-tor (not even virtual), which makes no sense. It would make sense for virtual d-tors. The other d-tors are just noise.

Which destructor is not virtual?

There are two types of warnings leading to the destructor changes:

  1. A vtable for the class will be generated in each translation unit.
  2. The class has pointer members but has compiler generated functions, eg. destructor or operator=.

For the first case, the answer is explicitly declare the destructor and put it into a .cpp file.

For the second case, the answer is to explicitly define the function, even if it is empty (for a destructor) or "= default". The intent here is to signal that you have considered what is going on, and say it's OK.

Seeing the default constructed operator=() being significantly different to the copy constructor in IntervalSet came directly from dealing with this type of warning. Which in turn led to discovering that there were multiple assignments of cached interval sets in ATN, which led directly to a resolution of the race condition in issue #1826.

  1. You added several times copy c-tor and copy assignment functions only to say they should use the default impl. Why, by all means, did you even add them if you rely on the default impl anyway. That looks so totally superfluous.

Please see above.

  1. The change from #if to #if defined also seems like useless cosmetic change, or is there anything that requires the latter (I haven't seen any problems or warnings resulting from the original code)?

#if defined(...) is used throughout the code, I fixed a few cases where it was not used. The warning was around a construct like #if SYMBOL > 0 rather than #if defined(SYMBOL) && SYMBOL > 0.

  1. Code formatting: in the code you added (mostly the new d-tor impls, but also others) you did not consider the coding style in the files -> opening brace at the end of the line, not an own line.

Will bow to the local curly brace convention!

- Change qualifying suffix from "_in" to "_" to confirm with
  conventions.
- Explicitly cast negative values to size_t instead of using an offset
  from std::numeric_limts<size_t>::max().
- Change "#if SYM > x" to "#if defined(SYM) && SYM > x" as a pattern. In
  StringUtils.h this follows the pattern used earlier in the file.
Copy link
Member

@mike-lischke mike-lischke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Only very few things left.

virtual ~Interval() {};
Interval(Interval const&) = default;
virtual ~Interval();
Interval& operator=(Interval const&) = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, confused Interval and IntervalSet, sorry.

{
if (_readonly) {
throw IllegalStateException("can't alter read only IntervalSet");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, tricky. In fact there is no copy function like this in Java code, so there is no real copy operation there. When they use an assignment they replace a reference. Maybe we should just delete the assignment operator to avoid any unintended behavior and so stay closer to the Java code? If needed one can easily create a new interval set.

@@ -30,15 +31,17 @@ namespace misc {
protected:
/// The list of sorted, disjoint intervals.
std::vector<Interval> _intervals;
bool _readonly;
std::atomic<bool> _readonly;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, now I get it. You have that check before the mutex (where atomic comes in) then the mutex lock. So this seems ok then. I'm just a bit nervous about performance impact. The interval set class is one of the lowest level classes and used everywhere in the antlr4 runtime.

@mike-lischke
Copy link
Member

For your discussion about my summary, I believe that's all clarified in the code review now and we are almost there with the patch.

@jm-mikkelsen
Copy link
Contributor Author

After going further into the Interval/IntervalSet/readonly and the C++ pass-by-value vs. Java pass reference by value differences, I now see many issues in the code where classes with virtual functions are passed by value and assigned to instances of the base class.

For example, ParseTreeMatcher::split() creates an instance of a TextChunk and assigns it to a Chunk(). This is a bug.

https://en.wikipedia.org/wiki/Object_slicing

I'm looking at fixing this, at least to an extent. Not sure whether to add to this PR, or just start another branch ...

@mike-lischke
Copy link
Member

This is a separate issue and should be handled in an own patch. To be honest the parse tree matcher has not been tested so far and I'm not even sure there are runtime tests for it. So, everything below pattern/ is not fully complete, I'm afraid (because it's rarely used).

@jm-mikkelsen
Copy link
Contributor Author

OK, thanks Mike. After picking up Antlr again on Friday afternoon (after a few years), I've gone a lot further than I expected in looking at the C++ runtime.

You've answered my internal "how does this work for anyone" question about the pattern/ stuff with "it doesn't". I'll submit my patch once this patch is accepted.

Other pieces of work I haven't committed yet:

  • A rework of Interval, IntervalSet, ATN and ATNState, removing the readonly status from IntervalSet, removing virtual functions from IntervalSet and Interval, moving the atomic into ATNState as a "now cached" flag, returning a const reference from ATN::nextStates(ATNState*) so the readonly status is enforced by the compiler not at runtime in the code and value semantics using std::move to reduce the number of copies performed.

  • Fixes in tree/xpath for object slicing problems. Is this code expected to work?

  • Remove virtual functions from Vocabulary, ATNDeserializationOptions and ParseInfo and make them final. They are copied by value throughout the code so virtual functions are meaningless. Also remove the unused readonly flag on ATNDeserializationOptions.

  • Explicitly delete copy constructor and assignment operator on various classes with virtual functions where actually using them would be an error (as with the actual usage of XPathElement and Chunk). There are more instances but I'm not going to go after them all now. I've turned down warnings a lot for the Antlr C++ runtime code now.

  • CMakeLists.txt update to support FreeBSD.

Unit tests pass for me with all changes.

Of these, the first one is the one that is (in my view) possibly worth adding to this pull request.

I hope the changes are worthwhile to you -- I think they help improve things, happy to update in response to feedback.

(I'm in Sydney so I'm a little out of sync with your timezone.)

@mike-lischke
Copy link
Member

Ah, that sounds promising. Nobody else (other than the runtime authors) has put so much detail work into the C++ runtime yet like you. Please open PRs as you progress and we will review and merge them one by one.

@mike-lischke
Copy link
Member

@parrt This patch applies to the C++ runtime only and improves several aspects in it and fixes a threading issue. The runtime tests succeed, so I think it's worth to be merged.

@parrt parrt added this to the 4.7.1 milestone Jun 26, 2017
@parrt parrt merged commit 990d484 into antlr:master Jun 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants