Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function bitSlice for String and FixedString #33360

Merged
merged 6 commits into from
Jan 20, 2022

Conversation

rogeryk
Copy link
Contributor

@rogeryk rogeryk commented Jan 2, 2022

Changelog category:

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Add function bitSlice

bitSlice(s, offset, length)

Returns a substring starting with the bit from the ‘offset’ index that is ‘length’ bits long. bit indexing starts from 1

This closes #28535

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Jan 2, 2022
@CLAassistant
Copy link

CLAassistant commented Jan 2, 2022

CLA assistant check
All committers have signed the CLA.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jan 2, 2022
@vdimir vdimir self-assigned this Jan 10, 2022
"Illegal type " + arguments[2]->getName() + " of second argument of function " + getName(),
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);

return std::make_shared<DataTypeString>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why It's always String and not FixedString? Is it because length can be not a constant and each string can have different lenght?

Copy link
Contributor Author

@rogeryk rogeryk Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, length may not be a constant, it will make the implementation simple

source, StringSink(*col_res, input_rows_count), static_cast<size_t>(start_value - 1));
else if (start_value < 0)
bitSliceFromRightConstantOffsetUnbounded(
source, StringSink(*col_res, input_rows_count), -static_cast<size_t>(start_value));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we negate first not to have overflow?

Suggested change
source, StringSink(*col_res, input_rows_count), -static_cast<size_t>(start_value));
source, StringSink(*col_res, input_rows_count), static_cast<size_t>(-start_value));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test both ways are perimt, even if the negate is removed , there will be no overflow problem. I refer the code of subString,I don't know what is the best way to deal with this problem。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the negate also work on size_t, so result is right. I will change this.

Copy link
Contributor Author

@rogeryk rogeryk Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find negate first is a undefined-behavior when value is INT64_MIN.
I test

#include <iostream>

int main() {
    ssize_t t = INT64_MIN;
    size_t tt = static_cast<size_t>(-t);
    std::cout << tt << std::endl;
    return 0;
}

compile and run

 $ clang++ -fsanitize=undefined   -Wall main.cpp -o main && ./main
main.cpp:5:37: runtime error: negation of -9223372036854775808 cannot be represented in type 'ssize_t' (aka 'long'); cast to an unsigned type to negate this value to itself
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior main.cpp:5:37 in 
9223372036854775808

Comment on lines 147 to 150
const ColumnConst * column_start_const,
const ColumnConst * column_length_const,
Int64 start_value,
Int64 length_value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass two arguments of type optional<Int64> ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I'll change it here

@@ -0,0 +1,111 @@
SELECT 'Const Offset';
select 1, subString(bin('Hello'), 1), bin(bitSlice('Hello', 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to add tests with non const and nullable first argument. Also with nulls in start and length

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK


``` text
┌─bin('Hello')─────────────────────────────┬─bin(bitSlice('Hello', 1, 8))─┐
│ 0100100001100101011011000110110001101111 │ 01001000 │
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear what happen if slice length isn't multiple of 8. We add zero pad at right, why right, not left?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what we will do when slice out of bound.


SELECT
    number,
    bin('' AS e),
    bin(bitSlice(e, number, number + 1))
FROM numbers(10)

┌─number─┬─bin('')──────────┬─bin(bitSlice('', number, plus(number, 1)))─┐
│      0 │ 1111111111111111 │                                            │
│      1 │ 1111111111111111 │ 11000000                                   │
│      2 │ 1111111111111111 │ 11100000                                   │
│      3 │ 1111111111111111 │ 11110000                                   │
│      4 │ 1111111111111111 │ 11111000                                   │
│      5 │ 1111111111111111 │ 11111100                                   │
│      6 │ 1111111111111111 │ 11111110                                   │
│      7 │ 1111111111111111 │ 11111111                                   │
│      8 │ 1111111111111111 │ 1111111110000000                           │
│      9 │ 1111111111111111 │ 11111111                                   │
└────────┴──────────────────┴────────────────────────────────────────────┘

Looks unnatural, maybe let's make result length always is a length argument padded to 8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum unit of string is byte, so we add 0 to the right and fill it with 1byte. I don't know if there is a better way.

why right, not left?

I think most usually expectations for return are valid data at the beginning
for example

 select bin(bitSlice(unbin('1111111111111111'), 1, 9));

//right
┌─bin(bitSlice(unbin('1111111111111111'), 1, 9))─┐
│ 1111111110000000                               │
└────────────────────────────────────────────────┘


//left
┌─bin(bitSlice(unbin('1111111111111111'), 1, 9))─┐
│ 0000000111111111                               │
└────────────────────────────────────────────────┘


I think add zero pad at right will reduce its effect.

Or we specify that the length must be a multiple of 8, But it will bring some restrictions on its use.

Also what we will do when slice out of bound.

SELECT
    number,
    bin('' AS e),
    bin(bitSlice(e, number, number + 1))
FROM numbers(10)

┌─number─┬─bin('')─┬─bin(bitSlice('', number, plus(number, 1)))─┐
│      0 │         │                                            │
│      1 │         │                                            │
│      2 │         │                                            │
│      3 │         │                                            │
│      4 │         │                                            │
│      5 │         │                                            │
│      6 │         │                                            │
│      7 │         │                                            │
│      8 │         │                                            │
│      9 │         │                                            │
└────────┴─────────┴────────────────────────────────────────────┘

The test is normal here. I try to find out why

{
bool left_offset = start > 0;
size_t offset = left_offset ? start - 1 : -start;
size_t offset_byte;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we initialize with some value here (e.g. with result of first branch)? It's dificult to keep that all wariables will be initalized in any branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if else can clearly compare the differences between the two situations. I separate offset and length, now It looks clear.

ColumnPtr column_length;

if (number_of_arguments == 3)
column_length = arguments[2].column;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assign column_length and column_length_const under same if if (number_of_arguments == 3) below, because this variables connected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

void bitSliceDynamicOffsetBounded(Source && src, StringSink && sink, const IColumn & offset_column, const IColumn & length_column) const
{
const bool is_offset_null = offset_column.onlyNull();
const auto * offset_nullable = typeid_cast<const ColumnNullable *>(&offset_column);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to handle nulls? If yes in what way? According to code below we replace nulls with some defaults. But useDefaultImplementationForNulls is not set, so nulls rows with nulls in args will result to null.
Also same question for useDefaultImplementationForConstants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bitSlice(s, offset, lenght), I think null lenght and offset values should not make result to null. The behavior here is consistent with substring and arrayslice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, There is no need to handle nulls. I will remove these code.

@vdimir
Copy link
Member

vdimir commented Jan 10, 2022

Good work!

I left some comments. Generally fine, just need to pay attention for corner cases (like when/where we add padding) and add more tests (including some corner-cases like nullable, non-constant arguments at different positions etc.)

@rogeryk
Copy link
Contributor Author

rogeryk commented Jan 13, 2022

I can't reproduce it on macos and ubuntu.

SELECT
    number,
    bin('' AS e),
    bin(bitSlice(e, number, number + 1))
FROM numbers(10)

┌─number─┬─bin('')──────────┬─bin(bitSlice('', number, plus(number, 1)))─┐
│      0 │ 1111111111111111 │                                            │
│      1 │ 1111111111111111 │ 11000000                                   │
│      2 │ 1111111111111111 │ 11100000                                   │
│      3 │ 1111111111111111 │ 11110000                                   │
│      4 │ 1111111111111111 │ 11111000                                   │
│      5 │ 1111111111111111 │ 11111100                                   │
│      6 │ 1111111111111111 │ 11111110                                   │
│      7 │ 1111111111111111 │ 11111111                                   │
│      8 │ 1111111111111111 │ 1111111110000000                           │
│      9 │ 1111111111111111 │ 11111111                                   │

please check it and add how to reproduce it.

@rogeryk rogeryk requested a review from vdimir January 13, 2022 05:09
@vdimir
Copy link
Member

vdimir commented Jan 13, 2022

I can't reproduce it on macos and ubuntu.

Oh, sorry, I copied query wrong.

I meant query select number, bin('\xff\xff' as e), bin(bitSlice(e, number, number+1)) from numbers(10).
And my point was, that maybe we need to add padding not depending on content, but just depending on lenght.
In this example for lenght = 8 we got 8 bit, for 9 we got 16 and for 10 we got again 8 bit result.
I thought about making result length always not less than lenght and multiple of 8.
What do you think?

@rogeryk
Copy link
Contributor Author

rogeryk commented Jan 13, 2022

@vdimir So you mean it's confusing to have padding and truncate at the same time. but if we don't care truncate, it will return useless data

select number, bin('\xff\xff' as e), bin(bitSlice(e, 17, 8))
┌─bin('��')──────────┬─bin(bitSlice('��', 17, 8))─┐
│ 1111111111111111 │00000000                  │
└──────────────────┴──────────────────────────┘

the function will truncate, only because I refer to substring and arraySlice implement. If you don't think need it, I can remove it.

@rogeryk
Copy link
Contributor Author

rogeryk commented Jan 17, 2022

@vdimir What do you think? should we make result length depend only on the param length?

@vdimir
Copy link
Member

vdimir commented Jan 17, 2022

What do you think? should we make result length depend only on the param length?

Actually it's the case only when we ran outside of bit range, so I suppose current solution is correct, we do not need add extra zeros from outside.

Let's try to mention @asurovenko, maybe he as author of request of feature (#8922) can give advice what is better.

Also is there examples how it works in other programming languages?

@rogeryk
Copy link
Contributor Author

rogeryk commented Jan 18, 2022

I think BitSet is very similar to the scene here, I tried some languages.

C++
start from right, padding 0.

#include <bitset>
#include <iostream>

using namespace std;

int main() {
    std::bitset<4> s;
    s.set(3);
    cout << s.to_ulong() << endl; // 8
    return 0;
}

Java
start from right, padding 0.

import  java.util.BitSet;

class T {
    public static void main(String[ ] args) {
        BitSet s = new BitSet(4);
        s.set(3);
        System.out.print(s.toByteArray()[0]);  // 8
    }
}

Rust
start from left, padding 0.

use bit_vec::BitVec;


fn main() {
let mut bv = BitVec::from_elem(4, false);


bv.set(3, true);
println!("{:?}", bv.to_bytes()[0]); // 16
}

Copy link
Member

@vdimir vdimir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work!

@vdimir vdimir merged commit 7156e64 into ClickHouse:master Jan 20, 2022
@rogeryk rogeryk deleted the feat/bitSlice branch January 20, 2022 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bitSlice function
5 participants