UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte #41

songyuc · 2020-08-26T09:20:12Z

Hi, guys,
I am trying to use the scripts in this repo to preprocess the im2latex dataset, but I ran into this error:

2020-08-26 17:16:23,199 root INFO Script being executed: scripts/preprocessing/preprocess_formulas.py
Traceback (most recent call last):
  File "scripts/preprocessing/preprocess_formulas.py", line 87, in <module>
    main(sys.argv[1:])
  File "scripts/preprocessing/preprocess_formulas.py", line 65, in main
    for line in fin:
  File "/home/songyuc/software/python/anaconda/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte

So, how can I solve this?
Any answer or idea will be appreciated!

da03 · 2020-08-26T14:16:28Z

Hmm, I think using Python 2.7 will solve this, or

try with io.open(file_path_dest, "r", encoding='ascii')?

songyuc · 2020-08-27T03:51:53Z

@da03 , oh, it worked!
Thanks a lot!

songyuc · 2020-08-27T03:57:44Z

Hi @da03, I want to confirm whether the preprocessing in this repo is the same as the process used in the paper, Image-to-Markup Generation with Coarse-to-Fine Attention?

da03 · 2020-08-27T03:59:08Z

Yes, it's the same. You can also find the processed data at http://lstm.seas.harvard.edu/latex/data/

songyuc · 2020-08-27T07:23:08Z

Wow, that is great. I hope to follow your work and do some research of my own.
And I guess these two .gz files are the same, am I right?

TITC · 2022-04-23T08:54:48Z

with io.open(file_path_dest,"r",encoding='ascii')

still does not work in a Python 3.7 environment.

Before the change:

    with open(temp_file, 'w') as fout:
        prepre = open(output_file, 'r').read().replace('\r', ' ')  # delete \r
        # replace split, align with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

After the change:

    with open(temp_file, 'w') as fout:
        # prepre = open(output_file, 'r').read().replace('\r', ' ')  # delete \r
        prepre = io.open(output_file, 'r', encoding='ascii').read().replace(
            '\r', ' ')  # delete \r
        # replace split, align with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}',
                        r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

The error it shows:

2022-04-23 16:52:56,976 root  INFO     Script being executed: preprocess_formulas.py
2022-04-23 16:52:56,976 root  INFO     Script being executed: preprocess_formulas.py
Traceback (most recent call last):
  File "preprocess_formulas.py", line 103, in <module>
    main(sys.argv[1:])
  File "preprocess_formulas.py", line 66, in main
    prepre = io.open(output_file, 'r', encoding='ascii').read().replace(
  File "/home/yhtao/anaconda3/envs/latex_ocr/lib/python3.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 854136: ordinal not in range(128)
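
To see what the decoder is failing on, it can help to look at the raw bytes around the reported position. The following is a minimal sketch, assuming the file fits in memory; `find_bad_byte` and the sample bytes are hypothetical and not part of the repo.

```python
def find_bad_byte(data, encoding="utf-8", context=10):
    """Return (offset, byte value, nearby bytes) for the first byte that
    fails to decode with `encoding`, or None if everything decodes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        lo = max(err.start - context, 0)
        return err.start, data[err.start], data[lo:err.end + context]
    return None


# Made-up sample: ASCII LaTeX with a single Latin-1 byte (0xe7) in it.
sample = b"\\alpha fran\xe7ais \\beta"

for enc in ("utf-8", "ascii", "latin-1"):
    print(enc, find_bad_byte(sample, enc))

# On a real file the bytes would be read in binary mode first, e.g.
#   with open(output_file, "rb") as f:
#       print(find_bad_byte(f.read(), "ascii"))
```

A lone 0xe7 byte is outside the ASCII range and, when followed by an ordinary ASCII character, is not valid UTF-8 either, which would explain why both codecs fail on the same byte.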

Yuxiang1995 · 2023-01-04T07:19:31Z

@TITC this works for me: io.open(output_file, 'r', encoding='latin-1')
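
A short sketch of why 'latin-1' gets past the error: Latin-1 assigns a character to every byte value from 0 to 255, so decoding cannot fail, and byte 0xe7 becomes 'ç'. The helper name `read_text_lenient` and the temporary file are hypothetical, used only for illustration. Whether Latin-1 is the file's true encoding is an assumption; if it is not, non-ASCII characters may come out wrong, but the ASCII LaTeX around them is unaffected.

```python
import io
import os
import tempfile


def read_text_lenient(path, encoding="latin-1"):
    """Read a whole file as text; latin-1 accepts any byte sequence."""
    with io.open(path, "r", encoding=encoding) as fin:
        return fin.read()


# Write a made-up formula line containing the problematic byte 0xe7.
raw = b"\\frac{a}{b} % gar\xe7on\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".lst") as tmp:
    tmp.write(raw)
    demo_path = tmp.name

try:
    text = read_text_lenient(demo_path)
    print(repr(text))  # the 0xe7 byte is now the character U+00E7
    try:
        read_text_lenient(demo_path, encoding="utf-8")
    except UnicodeDecodeError as err:
        print("utf-8 fails:", err.reason)
finally:
    os.remove(demo_path)
```

The `with` block also closes the input file, which the bare `io.open(...).read()` call in the script leaves to the garbage collector.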

This was referenced Apr 23, 2022

Preprocessing missing a file lukas-blecher/LaTeX-OCR#134

Closed

preprocessing file missing lukas-blecher/LaTeX-OCR#135

Merged
