UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte #41

Open
songyuc opened this issue Aug 26, 2020 · 7 comments


@songyuc

songyuc commented Aug 26, 2020

Hi everyone,
I am trying to use the scripts in this repo to preprocess the im2latex dataset, but I get this error:

2020-08-26 17:16:23,199 root INFO Script being executed: scripts/preprocessing/preprocess_formulas.py
Traceback (most recent call last):
  File "scripts/preprocessing/preprocess_formulas.py", line 87, in <module>
    main(sys.argv[1:])
  File "scripts/preprocessing/preprocess_formulas.py", line 65, in main
    for line in fin:
  File "/home/songyuc/software/python/anaconda/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte

So, how can I solve this?
Any answer or idea will be appreciated!
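
For readers hitting the same traceback: the message names the byte value (0xe7) and its offset in the data handed to the decoder, which may be a chunk rather than the whole file. A minimal sketch for locating such a byte, assuming the data fits in memory; `find_undecodable` is a hypothetical helper, not part of this repo, and the sample bytes are made up:

```python
def find_undecodable(data, encoding="utf-8", context=10):
    """Return (offset, byte_value, surrounding_bytes) for the first byte that
    fails to decode, or None if the whole buffer decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        start = exc.start  # index of the first offending byte
        lo = max(0, start - context)
        return start, data[start], data[lo:start + context]
    return None


# Made-up sample: 0xe7 followed by an ASCII letter is not valid UTF-8.
sample = b"fa\xe7ade"
print(find_undecodable(sample))
print(find_undecodable(b"plain ascii"))
```

On a real file, the bytes would come from `open(path, "rb").read()`; the surrounding bytes often hint at what the real encoding is.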

@da03
Collaborator

da03 commented Aug 26, 2020

Hmm, I think using Python 2.7 will solve this, or

try with io.open(file_path_dest,"r",encoding='ascii')?
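
Some background on why Python 2.7 avoids the error: Python 2's built-in open() returns byte strings and never decodes, whereas Python 3's text mode decodes while reading (with UTF-8 in the traceback above). A rough Python 3 sketch of the same behavior uses binary mode; the temporary file and its contents are made up for illustration:

```python
import os
import tempfile

# Made-up file content containing the byte 0xe7, which is not valid UTF-8 here.
raw = b"\\alpha \xe7 \\beta\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode with UTF-8 decodes while reading and fails on the stray byte.
    try:
        with open(path, "r", encoding="utf-8") as fin:
            fin.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode returns the bytes untouched, much like Python 2's open().
    with open(path, "rb") as fin:
        lines = [line for line in fin]
finally:
    os.remove(path)

print(text_mode_failed)
print(lines)
```

Note that code downstream of a binary read receives bytes, so any string operations or regular expressions would need bytes patterns or an explicit decode step.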

@songyuc
Author

songyuc commented Aug 27, 2020

@da03, oh, it worked!
Thanks a lot!

@songyuc
Author

songyuc commented Aug 27, 2020

Hi @da03, I want to confirm: is the preprocessing in this repo the same as the one used in the paper Image-to-Markup Generation with Coarse-to-Fine Attention?

@da03
Collaborator

da03 commented Aug 27, 2020

Yes, it's the same. You can also find the processed data at http://lstm.seas.harvard.edu/latex/data/

@songyuc
Author

songyuc commented Aug 27, 2020

Wow, that's great. I hope to build on your work in my own research.
Also, I assume these two .gz files are the same, am I right?
[Screenshot attached: 2020-08-27 15-17-33]

@TITC

TITC commented Apr 23, 2022

with io.open(file_path_dest,"r",encoding='ascii')

still does not work in a Python 3.7 environment.

Before the change:

    with open(temp_file, 'w') as fout:
        prepre = open(output_file, 'r').read().replace('\r', ' ')  # delete \r
        # replace split, align with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

After the change:

    with open(temp_file, 'w') as fout:
        # prepre = open(output_file, 'r').read().replace('\r', ' ')  # delete \r
        prepre = io.open(output_file, 'r', encoding='ascii').read().replace(
            '\r', ' ')  # delete \r
        # replace split, align with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

Resulting error:

2022-04-23 16:52:56,976 root  INFO     Script being executed: preprocess_formulas.py
2022-04-23 16:52:56,976 root  INFO     Script being executed: preprocess_formulas.py
Traceback (most recent call last):
  File "preprocess_formulas.py", line 103, in <module>
    main(sys.argv[1:])
  File "preprocess_formulas.py", line 66, in main
    prepre = io.open(output_file, 'r', encoding='ascii').read().replace(
  File "/home/yhtao/anaconda3/envs/latex_ocr/lib/python3.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 854136: ordinal not in range(128)
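
The ASCII codec only accepts byte values 0 through 127, and 0xe7 is 231, so this failure is expected under Python 3. A small sketch comparing how three codecs treat the same byte; the sample bytes are made up for illustration and `try_decode` is a hypothetical helper:

```python
sample = b"fa\xe7ade"  # made-up bytes; 0xe7 == 231, outside the ASCII range


def try_decode(data, encoding):
    """Return the decoded text, or the exception's reason if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


results = {enc: try_decode(sample, enc) for enc in ("ascii", "utf-8", "latin-1")}
for enc, outcome in results.items():
    print(enc, "->", outcome)
```

Only latin-1 decodes the sample; the other two report the same reasons that appear in the tracebacks in this thread.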

@Yuxiang1995

@TITC this works for me: io.open(output_file, 'r', encoding='latin-1')
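
latin-1 works because it maps each of the 256 possible byte values to a code point, so decoding cannot fail. Whether the decoded characters are the intended ones depends on the file's real encoding, which this thread does not establish. A minimal sketch, assuming the same encoding is used for reading and writing so the original bytes are preserved; `normalize_formulas`, the file names, and the sample data are hypothetical, and the regex is only modeled on the snippet quoted above:

```python
import io
import os
import re
import shutil
import tempfile


def normalize_formulas(src_path, dst_path, encoding="latin-1"):
    """Read src_path, replace carriage returns with spaces, rewrite a few
    alignment environments to 'aligned', and write the result to dst_path."""
    # newline="" disables newline translation so bytes round-trip unchanged.
    with io.open(src_path, "r", encoding=encoding, newline="") as fin:
        text = fin.read().replace("\r", " ")
    text = re.sub(
        r"\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}",
        r"\\begin{aligned}\2\\end{aligned}",
        text,
        flags=re.S,
    )
    with io.open(dst_path, "w", encoding=encoding, newline="") as fout:
        fout.write(text)


# Made-up input containing the byte 0xe7 that broke the utf-8 and ascii reads.
raw = b"\\begin{align}a \xe7 b\\end{align}\n"
workdir = tempfile.mkdtemp()
try:
    src = os.path.join(workdir, "formulas.in")
    dst = os.path.join(workdir, "formulas.out")
    with open(src, "wb") as f:
        f.write(raw)

    normalize_formulas(src, dst)

    with open(dst, "rb") as f:
        result = f.read()
finally:
    shutil.rmtree(workdir)

print(result)
```

Because reading and writing use the same single-byte codec, the 0xe7 byte passes through unchanged. If the output were written as UTF-8 instead, that byte would become a two-byte sequence.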
