Replies: 3 comments 1 reply
-
cc @mbasmanova |
Beta Was this translation helpful? Give feedback.
-
A similar bug was fixed in MapVector::prepareForReuse in #3291 |
Beta Was this translation helpful? Give feedback.
-
@kagamiori Wei, thank you for putting together a comprehensive description of the problem. To add a bit more context, the only flags being used in Velox are SimpleVector::isAllAscii_, asciiSetRows_ and MapVector:: sortedKeys_. Other flags are not being used and not being maintained. However, Saber may be using these. Let's check with them to see if the flags can be removed. This problem applies not only to expression evaluation but to any code that modifies vector data. It would be nice to update the description to make that clear (otherwise, currently it sounds like the problem is specific to expression evaluation). As a general rule, we say that the code that modifies the vector needs to ensure that vector is mutable (e.g. singly referenced and contains singly-referenced buffers and inner vectors). This is usually done by calling either BaseVector::prepareForReuse or ensureWritable APIs. We need to make sure these APIs reset data-dependent flags. |
Beta Was this translation helpful? Give feedback.
-
Background
Recently, a bug in CoalesceExpr and SwitchExpr was discovered that these two special forms compose a result vector of values from multiple of their input vectors, but do not maintain data-dependent flags of the result vectors properly, such as MapVector::sortedKeys_ that indicates whether keys in all rows of a map vector are sorted in descending order. Not maintaining this flag properly could lead to incorrect evaluation results when the results of these two special forms are consumed by functions that use that flag, such as the
map_zip_with
function. (#4116)Looking at this problem from a broader perspective, we found that there are other data-dependent flags besides MapVector::sortedKeys_ that should be taken care of as well. We need to make sure all such flags are updated correctly when data in the vector are updated.
Data-dependent flags
Data-dependent flags of vectors include the followings:
std::optional<vector_size_t> BaseVector::nullCount_
std::optional<vector_size_t> BaseVector::distinctValueCount_
std::optional<ByteCount> BaseVector::representedByteCount_
std::optional<ByteCount> BaseVector::storageByteCount_
std::optional<bool> SimpleVector::isSorted_
bool SimpleVector::isAllAscii_
SelectivityVector SimpleVector::asciiSetRows_
SimpleVector::stats_
that containsstd::optional<T> min
andstd::optional<T> max
bool MapVector::sortedKeys_
Data-changing APIs
Functions that change the data in a vector are the followings:
ArrayVector::setElements(VectorPtr elements)(Not affecting the aforementioned flags)Brainstorm
To make the data-dependent flags consistent with values in the vector, we need to make sure all the APIs that could change the data in a vector update all the relevant flags properly. To make maintenance easier, it would be helpful to get ride of flags that are not used at all and won't be used in the near future (e.g., BaseVector::representedByteCount_ and BaseVector::storageByteCount_).
To update the data-dependent flags properly, the most naive approach is to set them to the most conservative values (e.g., std::nullopt meaning this property is unknown), but this may make some flags overly conservative and hence affect the performance of functions that rely on these flags (e.g., isAllAscii_). On the other hand, some of the data-changing APIs may be called on a per-row basis, so it may be better to update the flags after the calls on all rows in the caller (e.g., SimpleVector::stats_ and BaseVector::setNull()) Also, data changing through BaseVector::mutableNulls(), FlatVector::mutableValues(), and DictionaryVector::mutableIndices() cannot be captured inside these calls, so the update of flags have to happen in their callers, or we can only reset the flags to their conservative default values.
So I suggest the followings.
Beta Was this translation helpful? Give feedback.
All reactions