Additional Directions #6
Comments
+1
@SciutoAlex @lukemunn You need to generate a new dataset from your own sentences. The format should follow our sample dataset, for example data/msrvid/train. That folder is rather messy, and I would suggest that Hua clean it up. Basically, though, you only need 4 files in each train/dev/test subfolder:
Once that is done, the dataset is ready. The remaining step is to generate the vocabulary using the following code (change line 16 to point to your own dataset): Then you should be ready to go. :)
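To give a rough idea of what generating a vocabulary involves, here is a minimal Python sketch. It is not the project's own script: the function name `build_vocab`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration.

```python
from typing import Iterable, List


def build_vocab(sentences: Iterable[str], lowercase: bool = True) -> List[str]:
    """Collect the unique whitespace-separated tokens, sorted alphabetically.

    Hypothetical helper: the real project may tokenize and normalize differently.
    """
    tokens = set()
    for sentence in sentences:
        text = sentence.lower() if lowercase else sentence
        tokens.update(text.split())
    return sorted(tokens)


# Illustrative sentences only; replace them with the sentences of your own dataset.
sentences = [
    "A man is playing a guitar",
    "A woman is playing the violin",
]
vocab = build_vocab(sentences)
print(vocab)
```

The vocabulary should be built over every sentence in the train, dev, and test splits, so that no token encountered later is missing from it.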
As supplementary information, here is a detailed introduction to adapting the code to run on your own dataset:
Thanks for the assistance Jeffy, the library is on my other machine so will take a look tonight. From glancing at your response (admittedly without the code in front of me), I'm still confused by step 4 (sim.txt), providing a similarity score. Isn't providing a similarity score the whole point of the algorithm/library? Why would that be part of the initial dataset? Or is this an empty value which gets populated?
@lukemunn The similarity score is the ground truth for a sentence pair. It is created by the data owner: you can label each pair as 0 or 1 (binary classification), or use a 5-star scale (any score in the range [0, 5] is valid). The model generates a prediction score for each sentence pair and is trained to match the ground-truth label as closely as possible.
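As an illustration of the labeling step, here is a minimal Python sketch that checks hand-assigned scores against the [0, 5] range and writes them to sim.txt. The one-score-per-line layout, the helper name `format_scores`, and the example sentences are assumptions; check the sample dataset for the exact format before relying on it.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# (sentence A, sentence B, ground-truth similarity score assigned by the data owner)
LabeledPair = Tuple[str, str, float]


def format_scores(pairs: List[LabeledPair], low: float = 0.0, high: float = 5.0) -> str:
    """Return the scores one per line, rejecting any label outside [low, high].

    Hypothetical helper: the one-score-per-line layout is an assumption.
    """
    lines = []
    for index, (_, _, score) in enumerate(pairs):
        if not low <= score <= high:
            raise ValueError(f"pair {index}: score {score} is outside [{low}, {high}]")
        lines.append(str(score))
    return "\n".join(lines) + "\n"


# Illustrative labels only; these are not taken from any real dataset.
pairs = [
    ("A man is playing a guitar", "A man plays the guitar", 4.5),
    ("A man is playing a guitar", "A dog runs on the beach", 0.0),
]

# Written to a temporary folder here; in practice this would be your train/dev/test subfolder.
with tempfile.TemporaryDirectory() as folder:
    sim_path = Path(folder) / "sim.txt"
    sim_path.write_text(format_scores(pairs), encoding="utf-8")
    print(sim_path.read_text(encoding="utf-8"))
```

For binary labels, the same check works with `high=1.0`.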
Would you be willing to include some additional directions or information for using the project with other datasets? I have a large set of sentences I would like to find similarity scores for. I'm not sure if this project is appropriate or not. Seems like it should be. Thanks!