Found some data label inconsistencies #23

Open
Zhang-O opened this issue Oct 10, 2019 · 5 comments

Comments


Zhang-O commented Oct 10, 2019

The entry "51238 1a00a76d4e basic" appears in im2latex_train.lst.
The LaTeX formulas around line 51238 in im2latex_formulas.lst do not match the content of image 1a00a76d4e.
1a00a76d4e should instead point to line 51729 in im2latex_formulas.lst.
I have found several cases like this, but I am not sure how many there are.
I downloaded the data from https://zenodo.org/record/56198#.XZ7yK_n_yHt.
Is something wrong?

Collaborator

Miffyli commented Oct 10, 2019

Hey, did you open the files correctly? See this quote from the Zenodo webpage:

Newlines used in formulas_im2latex.lst are UNIX-style newlines (\n). Reading file with other type of newlines results to slightly wrong amount of lines (104563 instead of 103558), and thus breaks the structure used by this dataset. Python 3.x reads files using newlines of the running system by default, and to avoid this file must be opened with newlines="\n" (eg. open("formulas_im2latex.lst", newline="\n")).
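A minimal sketch of the effect described in the quote, assuming the extra lines come from carriage-return characters (\r) inside some formulas. The helper name count_lines and the tiny stand-in file are made up for illustration; only open() and its newline parameter come from the quote.

```python
import os
import tempfile


def count_lines(path, newline=None, encoding="ISO-8859-1"):
    """Count the lines Python sees in a text file under a given newline mode."""
    with open(path, encoding=encoding, newline=newline) as f:
        return sum(1 for _ in f)


# Stand-in for the formulas file: three formulas, the second one
# contains a stray "\r" (an assumption about what the real file holds).
sample = b"a + b\nx \r y\nc = d\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".lst") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# newline=None (the default) enables universal newlines: "\r" also ends a line.
default_count = count_lines(sample_path)
# newline="\n" ends lines on "\n" only, so the stray "\r" stays inside the formula.
unix_count = count_lines(sample_path, newline="\n")
os.remove(sample_path)

print(default_count, unix_count)  # prints "4 3"
```

With the default mode the stand-in file appears to have one line too many, which would shift every later line index in the same way the mismatched labels suggest.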

Author

Zhang-O commented Oct 10, 2019

Sorry to waste your time. I read the web page again and checked what you said.
I found that formulas_im2latex.lst has 104564 lines. I opened it in Notepad++ with the line ending set to \n.
What is wrong?
Thanks very much.

Zhang-O closed this as completed Oct 10, 2019
Zhang-O reopened this Oct 10, 2019
Author

Zhang-O commented Oct 10, 2019

f = open("./im2latex_formulas.lst", encoding="ISO-8859-1", newline="\n")
len(f.readlines())  # 103359
When opening the file with Notepad++, changing the encoding does not change the number of lines in the file.
It took me almost an hour to figure this out.
Thanks again.

Collaborator

Miffyli commented Oct 10, 2019

Hmm, that is peculiar: I downloaded im2latex_formulas.lst from Zenodo and ran the following (Windows 10, Python 3.6):

f = open("./im2latex_formulas.lst", newline="\n")
len(f.readlines())
Out[11]: 103559

f = open("./im2latex_formulas.lst", encoding="ISO-8859-1",newline="\n")
len(f.readlines())
Out[13]: 103559

I do not think changing the encoding helps; the difference comes from how newlines are handled on different OSes and by different tools.
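One way to check this independently of the OS, the editor, and the text encoding is to count the line-break bytes directly. The following is only a sketch: newline_stats is a hypothetical helper, and the sample bytes stand in for the real file.

```python
def newline_stats(data: bytes) -> dict:
    """Count each kind of line break in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # "\n" not preceded by "\r"
    cr = data.count(b"\r") - crlf  # "\r" not followed by "\n"
    return {"lf": lf, "crlf": crlf, "cr": cr}


# Stand-in bytes with one break of each non-UNIX kind mixed in.
stats = newline_stats(b"a\nb\r\nc\rd\n")
print(stats)  # {'lf': 2, 'crlf': 1, 'cr': 1}

# For the real file, read it in binary mode so nothing is translated:
# with open("./im2latex_formulas.lst", "rb") as f:
#     print(newline_stats(f.read()))
```

If the file ends with a line break, lf + crlf is the line count that newline="\n" reports, and lf + crlf + cr is the count that Python's default universal-newline mode reports. A nonzero cr or crlf value would explain why different tools disagree on the number of lines.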

@kim-yhow

The entry "51238 1a00a76d4e basic" appears in im2latex_train.lst.
The LaTeX formulas around line 51238 in im2latex_formulas.lst do not match the content of image 1a00a76d4e.
1a00a76d4e should instead point to line 51729 in im2latex_formulas.lst.
I have found several cases like this, but I am not sure how many there are.
I downloaded the data from https://zenodo.org/record/56198#.XZ7yK_n_yHt.
Is something wrong?

Excuse me, I am also interested in this project. Are you still working on formula recognition? Have you successfully reproduced the EM results from the paper?
