Additional Directions #6

SciutoAlex · 2017-02-19T04:50:18Z

Would you be willing to include some additional directions or information for using the project with other datasets? I have a large set of sentences I would like to find similarity scores for. I'm not sure if this project is appropriate or not. Seems like it should be. Thanks!

lukemunn · 2017-02-21T02:24:47Z

+1
I've got halfway but beyond that things get very vague indeed.

Install Torch (already had it)
Clone the repo from Git
Use 'sh fetch_and_preprocess.sh' to get the Stanford GloVe vectors
??? How do I train using my own sentences? Is there a parameter to pass in to the lua script to specify which data to use? And what are the requirements for formatting this text?

jinfengr · 2017-02-21T19:46:20Z

@SciutoAlex @lukemunn You need to generate a new dataset based on your own sentences. The format should follow our sample dataset, e.g. data/msrvid/train. That folder is rather messy, and I'll suggest that Hua clean it up. But basically, you only need 4 files in each train/dev/test subfolder:

a.toks: one sentence per line
b.toks: one sentence per line
id.txt: the index of the corresponding line
sim.txt: the similarity score of sentences a and b. It can be a binary score (0 for irrelevant, 1 for relevant) or any other range.
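For anyone starting from raw sentence pairs, here is a minimal Python sketch that writes the four files for one split. The function name `write_split`, the whitespace tokenization, and the 1-based ids are assumptions made for illustration, not something taken from the repo; compare against data/msrvid/train before relying on it.

```python
import os


def write_split(out_dir, pairs):
    """Write a.toks, b.toks, id.txt and sim.txt for one split.

    pairs: list of (sentence_a, sentence_b, score) tuples.
    Line i of every file describes the same sentence pair.
    """
    os.makedirs(out_dir, exist_ok=True)

    def path(name):
        return os.path.join(out_dir, name)

    with open(path("a.toks"), "w", encoding="utf-8") as fa, \
         open(path("b.toks"), "w", encoding="utf-8") as fb, \
         open(path("id.txt"), "w", encoding="utf-8") as fid, \
         open(path("sim.txt"), "w", encoding="utf-8") as fsim:
        # Assumption: ids are simply the 1-based line numbers.
        for i, (a, b, score) in enumerate(pairs, start=1):
            # Assumption: tokens are separated by single spaces.
            # Splitting on whitespace also removes stray newlines,
            # so each sentence stays on exactly one line.
            fa.write(" ".join(a.split()) + "\n")
            fb.write(" ".join(b.split()) + "\n")
            fid.write(str(i) + "\n")
            fsim.write(str(score) + "\n")


if __name__ == "__main__":
    # Hypothetical output folder and toy data.
    write_split("data/mydata/train", [
        ("A man is playing a guitar.", "A person plays an instrument.", 4.0),
        ("A dog runs in a field.", "The stock market fell today.", 0.0),
    ])
```

Call it once per split (train, dev, test) with the pairs that belong to that split.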

After doing that, the dataset is ready. The remaining step is to generate the vocabulary using the following script (change line 16 to point to your own dataset):
https://github.com/Jeffyrao/pairwise-neural-network/blob/master/scripts/build_vocab.py
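The linked script is the one to use. Purely to show what a vocabulary-building step generally does, here is a rough Python sketch: collect every distinct whitespace-separated token from the .toks files and write one token per line. The function names, the output format, and the lack of lowercasing are all assumptions here and may differ from what build_vocab.py actually does.

```python
def build_vocab(toks_paths):
    """Collect the distinct whitespace-separated tokens in the given files."""
    vocab = set()
    for toks_path in toks_paths:
        with open(toks_path, encoding="utf-8") as f:
            for line in f:
                vocab.update(line.split())
    # Sorted so the output is the same on every run.
    return sorted(vocab)


def write_vocab(vocab, out_path):
    """Write one token per line (assumed format, not confirmed by the repo)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for token in vocab:
            f.write(token + "\n")
```

In this sketch you would pass the a.toks and b.toks paths of all three splits to `build_vocab`, so that tokens appearing only in dev or test are included.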

Then you should be ready to go. :)

jinfengr · 2017-02-21T19:55:08Z

As a supplement, here is a detailed introduction to adapting the code to run on your own dataset:
hohoCode#1

lukemunn · 2017-02-21T19:56:29Z

Thanks for the assistance Jeffy, the library is on my other machine so will take a look tonight.

From glancing at your response (admittedly without the code in front of me), I'm still confused by step 4 (sim.txt), providing a similarity score.

Isn't providing a similarity score the whole point of the algorithm/library? Why would that be part of the initial dataset? Or is this an empty value which gets populated?

jinfengr · 2017-02-21T20:00:47Z

@lukemunn The similarity score is the ground truth for a sentence pair. It's created by the data owner: you can label each pair 0 or 1 (binary classification), or use a 5-star scale, in which case any score in the range [0, 5] is valid.

The model generates a prediction score for each sentence pair and is trained so that this score matches the ground-truth label as closely as possible.
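To make the split between labels and predictions concrete: sim.txt holds the labels you supply, and after training you compare the model's predicted scores against them. Below is a small Python sketch using Pearson correlation, a common metric for sentence-similarity tasks. The prediction values are made up for illustration and are not output from this model, and the sketch does not claim to reproduce the repo's own evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    gold = [4.0, 0.5, 2.5, 5.0]        # what you wrote into sim.txt
    predicted = [3.6, 1.0, 2.9, 4.7]   # hypothetical model output
    print(pearson(gold, predicted))
```

A value close to 1 means the predictions rank and space the pairs much like your labels do.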

hohoCode closed this as completed Apr 17, 2017
