fix #10958: buggy handling of embedded NUL chars
(cherry picked from commit 1d90e97)
ref PR #10991

Conflicts:
	base/string.jl
	base/utf8.jl
	base/utf8proc.jl
	test/unicode.jl
stevengj authored and tkelman committed Apr 26, 2015
1 parent 2cc7c9d commit b192bf0
Showing 4 changed files with 11 additions and 6 deletions.
2 changes: 0 additions & 2 deletions base/string.jl
@@ -538,8 +538,6 @@ beginswith(a::Array{Uint8,1}, b::Array{Uint8,1}) =

 charwidth(c::Char) = max(0,int(ccall(:wcwidth, Int32, (Uint32,), c)))
 strwidth(s::String) = (w=0; for c in s; w += charwidth(c); end; w)
-strwidth(s::ByteString) = int(ccall(:u8_strwidth, Csize_t, (Ptr{Uint8},), s.data))
-# TODO: implement and use u8_strnwidth that takes a length argument

 ## libc character class predicates ##

3 changes: 2 additions & 1 deletion base/utf8.jl
@@ -37,7 +37,8 @@ function endof(s::UTF8String)
     end
     i
 end
-length(s::UTF8String) = int(ccall(:u8_strlen, Csize_t, (Ptr{Uint8},), s.data))
+length(s::UTF8String) = int(ccall(:u8_charnum, Csize_t, (Ptr{Uint8}, Csize_t),
+                                  s.data, length(s.data)))

 function next(s::UTF8String, i::Int)
     # potentially faster version
7 changes: 4 additions & 3 deletions base/utf8proc.jl
@@ -41,7 +41,6 @@ const UTF8PROC_CATEGORY_CS = 28
 const UTF8PROC_CATEGORY_CO = 29
 const UTF8PROC_CATEGORY_CN = 30

-const UTF8PROC_NULLTERM = (1<<0)
 const UTF8PROC_STABLE = (1<<1)
 const UTF8PROC_COMPAT = (1<<2)
 const UTF8PROC_COMPOSE = (1<<3)
@@ -60,10 +59,10 @@ const UTF8PROC_STRIPMARK = (1<<13)
 let
     const p = Array(Ptr{Uint8}, 1)
     global utf8proc_map
-    function utf8proc_map(s::String, flags::Integer)
+    function utf8proc_map(s::ByteString, flags::Integer)
         result = ccall(:utf8proc_map, Cssize_t,
                        (Ptr{Uint8}, Cssize_t, Ptr{Ptr{Uint8}}, Cint),
-                       s, 0, p, flags | UTF8PROC_NULLTERM)
+                       s, sizeof(s), p, flags)
         result < 0 && error(bytestring(ccall(:utf8proc_errmsg, Ptr{Uint8},
                                              (Cssize_t,), result)))
         a = ccall(:jl_ptr_to_array_1d, Vector{Uint8},
@@ -73,6 +72,8 @@ let
     end
 end

+utf8proc_map(s::String, flags::Integer) = utf8proc_map(bytestring(s), flags)
+
 function normalize_string(s::String; stable::Bool=false, compat::Bool=false, compose::Bool=true, decompose::Bool=false, stripignore::Bool=false, rejectna::Bool=false, newline2ls::Bool=false, newline2ps::Bool=false, newline2lf::Bool=false, stripcc::Bool=false, casefold::Bool=false, lump::Bool=false, stripmark::Bool=false)
     flags = 0
     stable && (flags = flags | UTF8PROC_STABLE)
5 changes: 5 additions & 0 deletions test/unicode.jl
@@ -99,3 +99,8 @@ let c_ll = 'β', c_cn = '\u038B'
     # check codepoint with category code CN
     @test Base.UTF8proc.category_code(c_cn) == Base.UTF8proc.UTF8PROC_CATEGORY_CN
 end
+
+# handling of embedded NUL chars (#10958)
+@test length("\0w") == length("\0α") == 2
+@test strwidth("\0w") == strwidth("\0α") == 1
+@test normalize_string("\0W", casefold=true) == "\0w"
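
The kind of bug this diff addresses can be illustrated with a minimal sketch. It is not the code from this commit or from the underlying C routines: the two helper functions below are hypothetical, written for Julia 1.x syntax rather than the 0.3-era syntax in the diff, and they only mimic the difference between a NUL-terminated scan and a scan that is given an explicit length.

```julia
# Hypothetical helpers for illustration only; not part of Base or of this commit.
# Both count UTF-8 characters by counting bytes that are not
# continuation bytes (bit pattern 10xxxxxx).

# Mimics a NUL-terminated C routine: stops at the first 0x00 byte.
function count_chars_nulterm(data::Vector{UInt8})
    n = 0
    for b in data
        b == 0x00 && break
        (b & 0xc0) != 0x80 && (n += 1)
    end
    return n
end

# Length-aware variant: scans every byte, so an embedded NUL
# is counted like any other character.
function count_chars_len(data::Vector{UInt8})
    n = 0
    for b in data
        (b & 0xc0) != 0x80 && (n += 1)
    end
    return n
end

data = collect(codeunits("\0α"))   # bytes 0x00 0xce 0xb1

println(count_chars_nulterm(data)) # stops immediately at the leading NUL
println(count_chars_len(data))     # counts the NUL and the two-byte α
```

With a leading NUL, the terminated scan reports zero characters while the length-aware scan reports two, which matches the value the new `length` test expects.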

11 comments on commit b192bf0

@staticfloat
Member

This is failing for some reason on the Launchpad buildd servers on v0.3.8; I'm investigating.

@tkelman
Contributor

@tkelman tkelman commented on b192bf0 May 3, 2015

buggy system wcwidth maybe?

@staticfloat
Member

Possibly. I'm trying to submit a build on trusty instead of precise to see if that makes a difference. Building the debs locally on my trusty desktop doesn't hit this issue.

@ScottPJones
Contributor

Have you benchmarked the change to strwidth? That looks like it might make things a lot slower, and from what I learned recently, I think you might do better to get rid of u8_strwidth entirely, and rewrite charwidth in Julia, instead of calling wcwidth.

@tkelman
Contributor

@tkelman tkelman commented on b192bf0 May 3, 2015

Note that this is a commit on release-0.3, strictly backporting the bugfix. On master, charwidth no longer calls wcwidth; the functionality has been added to utf8proc. If this causes a performance regression that is fixable in a non-disruptive way, we might consider PRs against release-0.3, but we'd rather err on the conservative side with the release branch.

@staticfloat
Member

It's still happening on trusty, so I'm not sure what's going on here.

@staticfloat
Member

@stevengj I haven't been able to narrow down why this is happening. Do you have any ideas?

@stevengj
Member Author

Maybe change the test to

@test strwidth("\0w") == charwidth('\0') + charwidth('w')
@test strwidth("\0α") == charwidth('\0') + charwidth('α')

in order to work around buggy wcwidths?

@staticfloat
Member

Great, that worked. Thanks!

@tkelman
Contributor

Should we be committing that to release-0.3?

@staticfloat
Member

@staticfloat staticfloat commented on b192bf0 May 14, 2015 via email