Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes constructor of SubString from SubString #22511

Merged
merged 3 commits into from
Oct 13, 2017

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 24, 2017

@nalimilan
Copy link
Member

I think this should rather be an error, just like when trying to index a string with an invalid index range. See how getindex(s::String, r::UnitRange{Int}) works. Also, the check should happen in the inner constructor so that it cannot be bypassed, except by using an explicit @inbounds (this will require the check to be annotated with @boundscheck).

@kshyatt kshyatt added the strings "Strings!" label Jun 24, 2017
@davidanthoff
Copy link
Contributor

I agree with @nalimilan, I think this should throw an error.

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2017

Before fixing this I have one question. Do you want also:

julia> x=""
julia> SubString(x, 5, 10)

to throw an error except when annotated with @inbounds?

I want to make sure as I believe that in the initial design of SubString it was intentional to allow this.

@nalimilan
Copy link
Member

I would say yes, but where did you read that this was supposed to work?

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2017

  1. @nalimilan Tests imply that. Everything starting from the line:
    https://github.com/JuliaLang/julia/blob/master/test/strings/types.jl#L32
    to the line:
    https://github.com/JuliaLang/julia/blob/master/test/strings/types.jl#L45
    should throw an error if we change this.

  2. Additionally if we want to be consistent the following should be also fixed:

julia> x="∀"
"∀"

julia> x[2:3]
ERROR: UnicodeError: invalid character index
Stacktrace:
 [1] getindex(::String, ::UnitRange{Int64}) at .\strings\string.jl:239

julia> SubString(x, 2, 3)
""

In general I would recommend to have SubString work exactly like function getindex(s::String, r::UnitRange{Int}) except that it does not perform a copy. If this is what is what we agree on I can go on and implement it.

  1. The only question is about @inbounds. Because now @inbounds in getindex does not disable bounds checking - I feel that it would be safer to always check bounds in SubString, as with Unicode it is very error prone to assume that you guarantee that you give proper bounds.

  2. Finally I would add documentation to SubString.

@davidanthoff
Copy link
Contributor

I think that all makes sense. You could do a git blame and figure out who wrote these tests and ping that person, maybe there was a deeper rational for it.

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2017

I have pushed an initial proposal how SubString could be improved.
@ScottPJones could you please review the assumptions how SubString should work, as it changes the initial design of the functionality.

@nalimilan
Copy link
Member

Actually @ScottPJones didn't write these tests, he split tests into several files. These tests were added in #7145, a PR which was about introducing a more efficient implementation of length rather than about changing its behavior. So that doesn't say anything about the justification for that behavior.

@StefanKarpinski and @stevengj would probably know better, but copying the behavior of getindex sounds like a good idea to me. Regarding @inbounds, it's fine not to implement it for now, but I don't buy the argument that it shouldn't be added because Unicode is complex: if people use @inbounds, they explicitly state that they take care of using valid indices.

j > endof(s) && throw(BoundsError(s, j))

if !isvalid(s,i)
throw(ArgumentError("invalid SubString index i"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to print "$i" rather than just "i".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant symbol i as it is a name of the first argument. Invalid second argument j is allowed (like in getinex).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rethinking I have hanget it to i=$i.

@@ -19,10 +19,10 @@ struct SubString{T<:AbstractString} <: AbstractString
function SubString{T}(s::T, i::Int, j::Int) where T<:AbstractString
i > j && return new(s, 0, 0) # allow i > j as it is consistent with getindex
i < 1 && throw(BoundsError(s, i))
j > endof(s) && throw(BoundsError(s, j))
j > s.len && throw(BoundsError(s, j))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not guaranteed that all AbstractString subtypes will have a .len field - that's even going to be removed from String soon

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkelman Changed to sizeof (I understand it will stay).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof isn't correct here. For example, when indexing "café":

julia> sizeof("café")
5

julia> endof("café")
4

@@ -42,7 +42,7 @@ SubString(s::AbstractString, r::UnitRange{<:Integer}) = SubString(s, Int(first(r
function SubString(s::SubString, i::Int, j::Int)
i > j && SubString(s.string, 1, 0) # allow i > j as it is consistent with getindex
i < 1 && throw(BoundsError(s, i))
j > endof(s) && throw(BoundsError(s, j))
j >= nextind(s, endof(s)) && throw(BoundsError(s, j))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this always return endof(s) + 1? Why is this needed?

@@ -4,32 +4,47 @@

## substrings reference original strings ##

"""
SubString(s::AbstractString, i::Integer, j::Integer)
SubString(s::AbstractString, r::UnitRange{Integer})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<:Integer

@@ -4,32 +4,47 @@

## substrings reference original strings ##

"""
SubString(s::AbstractString, i::Integer, j::Integer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

j::Integer=endof(s).

function SubString(s::SubString, i::Int, j::Int)
i > j && SubString(s.string, 1, 0) # allow i > j as it is consistent with getindex
i < 1 && throw(BoundsError(s, i))
j > sizeof(s) && throw(BoundsError(s, j))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@@ -110,8 +125,6 @@ let s="lorem ipsum",
SubString(s,1,6)=>"lorem ",
SubString(s,1,0)=>"",
SubString(s,2,4)=>"ore",
SubString(s,2,16)=>"orem ipsum",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep these examples, but with valid indices that still give the expected result?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2017

@nalimilan Thank you for the code review. I will wait with fixing it till #22548 is decided, as apart from obvious corrections your comments relate to the fact that I want to make SubString work like String with getindex.

Eg. currently:

julia> x="café"
"café"

julia> x[1:5]
"café"

goes through without error and that is why I have used sizeof.

And in general for me having consistent behavior of indexing for any AbstractString.

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2017

Now SubString is defined following #22548.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments. Looks like you'll have to dig into the bootstrap process to fix the CI failures. That's a relatively tedious work since it's hard to find out where the failures come from. Here it seems to be in the string interpolation code. You'll probably have to review all places using SubString and check for invalid indices.

i > j && return new(s, 0, 0) # allow i > j as it is consistent with getindex
isvalid(s, i) || throw(BoundsError(s, i))
isvalid(s, j) || throw(BoundsError(s, j))
o = i-1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o isn't really useful in the new code.

SubString(s::AbstractString, r::UnitRange{<:Integer}) = SubString(s, first(r), last(r))

function SubString(s::SubString, i::Int, j::Int)
i > j && SubString(s.string, 1, 0) # allow i > j as it is consistent with getindex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to check this here, as the inner constructor will do the same via s.offset+i > s.offset+j.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required as we allow any i>j (not necessarily valid) and return empty SubString then. This is required for consistency with getindex for String.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that you call the inner constructor below, which already checks the indices, so this is redundant (just try without it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test (which should pass) fails if I remove this line:
@test SubString(SubString("123", 1, 2), -10, -20) == ""
But there was an error as the line missed return before SubString so thank you for a detailed review 😄.


function SubString(s::SubString, i::Int, j::Int)
i > j && SubString(s.string, 1, 0) # allow i > j as it is consistent with getindex
isvalid(s, i) || throw(BoundsError(s, i))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isvalid is relatively expensive, and will be checked by the inner constructor too. So better drop it in favor of a less costly check that i and j are within the bounds of the sub string. Then you can delegate to the inner constructor the question of whether they are also valid for the underlying string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I have to retain earlier i > j check, as SubString(SubString("fsfsdf", 1, 2), -10, -20) should go through and return empty substring.

ss=SubString(str,1,length(str)) #match source string
ss=SubString(str,1,endof(str)) #match source string
@test length(ss)==length(str)
ss=SubString(str,1:length(str)) # works as str is all ASCII
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove this test, it's not valid to index with length(str), and we don't really care that it works for ASCII strings.

@test length(ss)==length(str)

ss=SubString(str,1,0) #empty SubString
@test length(ss)==0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep blank line for consistency.

@test length(ss)==0

ss=SubString(str,10,16) #end indexed beyond source string length
@test length(ss)==3
str = "∀"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to store the string in str, it's clearer to repeat it (especially since you change the contents of str). Also, it would be useful to add a short comment explaining what is tested to new tests.

@test_throws BoundsError SubString(str, 4, idx) == str[4:idx]
end

@test_throws BoundsError SubString(str,14,20) #start indexed beyond source string length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"start indexed" -> "start index". Same below.

@@ -72,20 +98,23 @@ b = IOBuffer()
write(b, u)
@test String(take!(b)) == ""

str = "føøbar"
u = SubString(str, 10, 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep this test, just changing the indices. It's not clear what this tested, but it must have triggered a bug in a corner case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment still seems to apply.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is retained but moved to line 95.

@test SubString(u, 2:idx) == u[2:idx]
end
@test_throws BoundsError SubString(u, 1, 10)
@test_throws BoundsError SubString(u, 1:10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test with start index out of bounds (both positive and negative).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done below.

for i in -1:length(s)+2
if isvalid(s, i)
ss=SubString(s,1,i)
@test isvalid(ss,i)==isvalid(s,i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

==isvalid(s,i) is redundant since that's the condition. Also check that a BoundsError is raised when !isvalid(s,i).

@bkamins
Copy link
Member Author

bkamins commented Jul 2, 2017

@nalimilan included your comments and working through bootstrap.

@test length(ss)==0

ss=SubString(str,10,16) #end indexed beyond source string length
@test length(ss)==3
# tests for SubString of as single multibyte `Char` string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of a single

for i in -1:length(s)+2
if isvalid(s, i)
ss=SubString(s,1,i)
# make sure isvalid give equivalent resutls for SubString and String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gives equivalent results

@bkamins
Copy link
Member Author

bkamins commented Jul 2, 2017

Fixed typos. Seems that some errors in Documenter.jl have to be fixed before this one is merged.

o = i-1
new(s, o, max(0, j-o))
end
i > j && return new(s, 0, 0) # allow i > j as it is consistent with getindex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could continue to return new(s, i-1, 0) unless you've found a reason not to. I don't know what was the original justification for this behavior.

@test length(ss)==0
# tests for SubString of more than one multibyte `Char` string
# we are consistent with `getindex` for `String`
for idx in 0:1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also include idx = 4 here rather than in a separate line.


@test SubString("∀∀", 4, 4) == "∀∀"[4:4]

for idx in 5:8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have a comment explaining what is tested here. Same below, where some lines are commented and not others (a global comment for multiple lines is fine).

@test_throws BoundsError SubString("∀∀", 4, idx)
end

@test_throws BoundsError SubString(str,14,20) #start indexing beyond source string length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird to reuse str here after tests on a different string. Maybe put these tests above those on "∀"?


@test_throws BoundsError SubString(str,2:4)

str2=""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be clearer to get rid of str2 and use "" everywhere.

@bkamins
Copy link
Member Author

bkamins commented Sep 20, 2017

Merge conflicts were introduced by #22572 - I will resolve them.

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

I would like to ask for a merge or discard this PR before I finish implementation of #23765 as it will be using SubString.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, there are two details I hadn't seen (or were introduced in the rebase).

@StefanKarpinski @stevengj Good to merge after this?

base/regex.jl Outdated
cap = Union{Void,SubString{String}}[
ovec[2i+1] == PCRE.UNSET ? nothing : SubString(str, ovec[2i+1]+1, ovec[2i+2]) for i=1:n ]
ovec[2i+1] == PCRE.UNSET ? nothing : SubString(str, ovec[2i+1]+1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation should be kept as this is inside an array comprehension.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

SubString(s::AbstractString, i::Integer, j::Integer=endof(s))
SubString(s::AbstractString, r::UnitRange{<:Integer})

Like [`getindex`](@ref), but returns a view into the parent AbstractString `s`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backquotes around AbstractString, or better say "string".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

@nalimilan Thank you for the review as the rebase was not very simple.

base/regex.jl Outdated
ovec[2i+1] == PCRE.UNSET ? nothing : SubString(str, ovec[2i+1]+1,
prevind(str, ovec[2i+2]+1)) for i=1:n ]
ovec[2i+1] == PCRE.UNSET ? nothing : SubString(str, ovec[2i+1]+1,
prevind(str, ovec[2i+2]+1)) for i=1:n ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one needs to be aligned with str. Or maybe better break after : or ?.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope now it is OK (I thought I have seen the layout I have used earlier somewhere in the code base).

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2017

CI failures seem to be unrelated

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2017

I have performed another rebase of conflicts introduced in the meantime.
Is it good to merge?

@KristofferC
Copy link
Sponsor Member

Does this need a NEWS entry?

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2017

@KristofferC I have proposed a news entry.

NEWS.md Outdated
@@ -238,6 +238,9 @@ Library improvements

* The functions `strip`, `lstrip` and `rstrip` now return `SubString` ([#22496]).

* The constructor of `SubString` now checks if the requsted view range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be listed under "breaking changes", since it can raise an error for previously working cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved, although if someone used an index outside a valid range it was probably a bug

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but that can still break code since no error was thrown before. Anyway, looks good to me now.

@@ -65,7 +65,7 @@ function shell_parse(str::AbstractString, interpolate::Bool=true;
j = k
end
elseif interpolate && !in_single_quotes && c == '$'
update_arg(s[i:j-1]); i = k; j = k
update_arg(s[i:prevind(s, j)]); i = k; j = k
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these changes are being made as a part of this PR. What does this have to do with SubString?

If j is the index of c, then j-1 is actually correct for s::String because c=='$' is ASCII, by the way.

Copy link
Member Author

@bkamins bkamins Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two reasons:

  1. if j is the index of c if c=='$' then j-1 does not have to be correct. Consider s="∀\$", then j=4 and j-1 is an incorrect index (s[j-1] throws UnicodeError even under stable 0.6 before the changes; the changes only make it also throw an error when range indexing is used e.g. s[i:j-1]).
  2. It is implemented in SubString PR as s is SubString because it is defined as s = lstrip(str) in line 13 and lstrip returns SubString.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevengj Is this justification OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was thinking of j+1, not j-1.

@nalimilan
Copy link
Member

@stevengj Merge now?

@nalimilan nalimilan merged commit 39f668c into JuliaLang:master Oct 13, 2017
@bkamins bkamins deleted the patch-9 branch October 13, 2017 07:50
@nalimilan
Copy link
Member

Thanks @bkamins!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants