remove_corrupt_utf8() not working #41

Closed
abieler opened this issue Sep 2, 2016 · 5 comments

@abieler
Contributor

abieler commented Sep 2, 2016

The function remove_corrupt_utf8() does not work under Julia v0.4.6.
The problem is the line zeros(Char, endof(s)+1), which fails with the complaint that
zero is not defined for type Char. Using UInt8 instead, I could make it
run without error, but please check whether it still does what it is supposed to do.

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 0
    for chr in s
        i += 1
        r[i] = (chr != 0xfffd) ? chr : ' '
    end
    return utf8(r)
end

Note that I also got rid of the CharString() call in the return statement.

If this is ok I can make another pull request.

Cheers,
Andre

@aviks
Member

aviks commented Sep 3, 2016

Sure, thanks. Looks OK. Note that utf8 is deprecated in 0.5; you'll need to use Compat.UTF8String. I've just fixed all the other deprecations on 0.5.

@abieler
Contributor Author

abieler commented Oct 8, 2016

In 0.5 I had to adapt it further, because chr != 0xfffd is deprecated.
However, UInt8(chr) != 0xfffd throws an InexactError() if the
character does not fit in UInt8, so I wrapped it in a try-catch.

I was also not sure whether stepping the index with i+1 was OK before,
so I switched to nextind(s,i).

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 1
    for chr in s
        try
          r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
        catch
          r[i] = ' '
        end
        i = nextind(s,i)
    end
    return Compat.UTF8String(r)
end

Seems reasonable?

@aviks
Member

aviks commented Oct 9, 2016

r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '

Not all Unicode characters will fit in a UInt8. The line above will lose all non-ASCII characters from the string, I think.

I'd use something like this:

function remove_corrupt_utf8(s::AbstractString)
    r = IOBuffer()
    for chr in s
        if chr != Char(0xfffd)
            write(r, chr)
        end
    end
    return takebuf_string(r)
end
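
For anyone reading this on a current Julia, here is a rough sketch of the same IOBuffer idea. It assumes Julia 1.x, where takebuf_string no longer exists and String(take!(io)) is the replacement. The name strip_replacement_chars is hypothetical, chosen only so it does not clash with the package function, and the sample string is made up. The last line shows why the UInt8 conversion cannot work in general: the code point of '€' is larger than any UInt8.

```julia
# Hypothetical helper (not the package's function): drops U+FFFD, the
# Unicode replacement character, and keeps every other character intact.
function strip_replacement_chars(s::AbstractString)
    io = IOBuffer()
    for chr in s
        # Compare Char against Char rather than against an integer literal.
        if chr != '\ufffd'
            write(io, chr)   # writes the character's full UTF-8 encoding
        end
    end
    # Julia 1.x replacement for takebuf_string(io)
    return String(take!(io))
end

# Illustrative input: non-ASCII characters survive, U+FFFD is removed.
println(strip_replacement_chars("caf\ufffdé €"))

# The code point of '€' (U+20AC) does not fit in a UInt8.
println(UInt32('€') > typemax(UInt8))
```

Writing each Char to the buffer sidesteps both the byte indexing and the conversion to UInt8, so no try-catch is needed.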

Are there any tests for this?

@mirestrepo

Are there any updates/resolutions on this?

@rssdev10
Collaborator

This should work on Julia > 1.0 with an implementation like:

function remove_corrupt_utf8(s::AbstractString)
    return map(x->isvalid(x) ? x : ' ', s)
end
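
A quick way to try this out is sketched below. It assumes Julia 1.x, where invalid bytes in a String are yielded as malformed Char values during iteration. The input bytes are made up for illustration. Note that this variant replaces invalid characters with a space rather than dropping them, and a literal U+FFFD already present in the input counts as valid and is left alone.

```julia
# Sketch for Julia 1.x; the sample bytes below are illustrative only.
function remove_corrupt_utf8(s::AbstractString)
    # map over a string returns a String when the function returns a Char
    return map(x -> isvalid(x) ? x : ' ', s)
end

# The byte 0xff never occurs in well-formed UTF-8, so it iterates as an
# invalid Char between 'a' and 'b'.
bad = String(UInt8[0x61, 0xff, 0x62])
cleaned = remove_corrupt_utf8(bad)

println(isvalid(bad))       # false
println(isvalid(cleaned))   # true
println(cleaned == "a b")   # true
```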
