
[regarding real dataset] Please respond #39

vyaslkv opened this issue Jul 16, 2020 · 18 comments


vyaslkv commented Jul 16, 2020

Hello,

I understand that we can't generalize unless we have real images of different types and their OCR. We can provide that dataset, to reach accuracy comparable to Mathpix. I don't have the hardware to train, so I need a little help from you for that. Can you share your email ID for that, if possible?


da03 commented Jul 16, 2020

Thanks for your interest in our work; my email is [email protected]. One thing to note is that we only work on public datasets (or datasets that can be released later), so that the public can benefit from our research.

Alternatively, if you want to keep the dataset private, you can also consider cloud computing services such as Amazon EC2, Google GCE, or Microsoft Azure, which provide GPU instances billed by the hour.


vyaslkv commented Jul 16, 2020

Thanks @da03 for the quick reply. I am sending you an email for further discussion.


vyaslkv commented Jul 17, 2020

@da03 What machine configuration (RAM, disk, GPU) is required for 20k training images? How many hours will it take if we train on CPU, and what will the difference be when using a GPU?


vyaslkv commented Jul 17, 2020

And how many training examples are required to get a decent result, like the results you have shown on your website?


vyaslkv commented Jul 17, 2020

Or can you tell me what configuration I should at least use to train the model on roughly 20k images? I am asking so that I can pick that configuration directly on AWS; otherwise I will end up choosing one that is either too small or too large and wasting money (because of the hourly charge).


da03 commented Jul 17, 2020

Regarding hardware, I think it's almost impossible to train on CPU; it would probably take forever. On GPU, training would take less than a day even with 100k images. On AWS, any GPU configuration is probably fine, since your dataset of 20k images is small.

Regarding dataset size, I think 20k is a bit small; combining it with im2latex-100k might give some reasonable results, but ideally you would need around 100k real images to train on. Also, are your images of roughly the same font size? If not, standard image normalization techniques (such as denoising and resizing to the same font size) might produce better results.
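As a concrete illustration of the "resizing to the same font size" idea, here is a minimal Pillow sketch that rescales every image to a fixed height; the target height and file paths are illustrative assumptions, not settings from this repo:

```python
# Minimal normalization sketch: rescale each image to a common height so that
# glyphs end up at roughly the same size. TARGET_HEIGHT is a hypothetical value.
from PIL import Image

TARGET_HEIGHT = 64  # illustrative; tune for your data

def normalize_height(in_path, out_path, target_height=TARGET_HEIGHT):
    img = Image.open(in_path).convert("L")   # grayscale
    w, h = img.size
    scale = target_height / h
    img = img.resize((max(1, int(w * scale)), target_height), Image.LANCZOS)
    img.save(out_path)

normalize_height("formula_raw.png", "formula_norm.png")
```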


da03 commented Jul 17, 2020

By the way, if you get a GPU instance, I would recommend using this Dockerfile to save you the trouble of installing LuaTorch: https://github.com/OpenNMT/OpenNMT/blob/master/Dockerfile


vyaslkv commented Jul 18, 2020

Thanks a lot @da03 for helping me out


vyaslkv commented Jul 18, 2020

@da03 One last question: I don't have the LaTeX, only the OCR of the images, like this: (5+2sqrt3)/(7+4sqrt3) = a-b sqrt3. I have 150k such images (and even more). Will that work, or do I need LaTeX?


da03 commented Jul 18, 2020

Cool, that will work if you do proper tokenization: the label should be something like "( 5 + 2 sqrt 3 ) / ( 7 + 4 sqrt 3 ) = a - b sqrt 3" (tokens separated by spaces). The algorithm should work with whatever output format you choose.
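For example, a minimal tokenizer along these lines (just a sketch, not the preprocessing script shipped with this repo) could split the OCR strings on symbol boundaries and rejoin them with spaces:

```python
# Sketch of a whitespace tokenizer for OCR labels: multi-letter names (sqrt),
# numbers, and single symbols become separate space-delimited tokens.
import re

TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|\S")

def tokenize(label: str) -> str:
    return " ".join(TOKEN_RE.findall(label))

print(tokenize("(5+2sqrt3)/(7+4sqrt3) = a-b sqrt3"))
# ( 5 + 2 sqrt 3 ) / ( 7 + 4 sqrt 3 ) = a - b sqrt 3
```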


vyaslkv commented Jul 18, 2020

OK, thanks @da03, you are helping a lot.


vyaslkv commented Jul 19, 2020

Hello @da03 ,

I have one quick question: how much disk space will 150k training examples require? I allocated 250 GB, but it filled up while creating demo.train.1.pt and similar files (during onmt_preprocess) with the default parameters given in the docs.


da03 commented Jul 19, 2020

That's surprising. What are the sizes of those images?


vyaslkv commented Jul 20, 2020

(187, 720, 3)
(2448, 3264, 3)
(2209, 1752, 3)
(1275, 4160, 3)
(3456, 4608, 3)
(1821, 4657, 3)
(226, 1080, 3)
(388, 2458, 3)
(3264, 2448, 3)
(625, 4100, 3)
(379, 2640, 3)
(1011, 4110, 3) like this @da03


vyaslkv commented Jul 20, 2020

How much disk space do I need, @da03? Any rough idea?


vyaslkv commented Jul 20, 2020

I am using OpenNMT-py to do this. Should I use the main repo, which uses Lua?

beevabeeva commented

@vyaslkv Have you had any progress on your data?

I agree that working towards a public model is important.


da03 commented Feb 8, 2021

@vyaslkv Sorry for the delay. The images you are using seem to be huge: for example, an image of resolution 3264 x 2448 has ~8M pixels, and a dataset containing 10k such images (we need at least thousands of training instances to learn a reasonable model) would take roughly 320 GB (8M pixels x 10k images x 4 bytes). The dataset used in this repo, im2latex-100k, is much smaller, since the images themselves are much smaller (they are mostly single math formulas), and we downsample them further during preprocessing to make them even smaller.

I think you need to crop your images to contain ONLY the useful parts, cutting off any padding, and downsample them as much as you can (while humans can still identify the formulas at the reduced resolution).
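For instance, here is a rough Pillow sketch of that crop-then-downsample step; the binarization threshold and downsampling ratio are illustrative assumptions, and this is not the preprocessing code from this repo:

```python
# Sketch: crop to the tight bounding box of the (dark) ink, then downsample.
from PIL import Image

def crop_and_downsample(in_path, out_path, threshold=200, ratio=0.25):
    img = Image.open(in_path).convert("L")                    # grayscale
    mask = img.point(lambda p: 255 if p < threshold else 0)   # ink mask
    bbox = mask.getbbox()                                      # box around non-zero pixels
    if bbox:
        img = img.crop(bbox)                                   # cut off padding
    w, h = img.size
    img = img.resize((max(1, int(w * ratio)), max(1, int(h * ratio))), Image.LANCZOS)
    img.save(out_path)

crop_and_downsample("page_photo.jpg", "formula_small.png")
```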
