Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is often used in information retrieval and text mining as a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
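As a rough illustration of the weighting described above, here is a minimal sketch in Java. It is not the code in App.java: the class name `TfIdfSketch` and its methods are hypothetical, and it assumes one common variant of the formula (term count divided by document length, multiplied by the natural logarithm of the number of documents over the number of documents containing the term). Other variants use different normalization or smoothing.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the project's App.java.
public class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: ln(N / number of documents containing the term).
    // Assumption: returns 0 when no document contains the term, to avoid dividing by zero.
    static double idf(String term, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) {
            return 0.0;
        }
        return Math.log((double) corpus.size() / containing);
    }

    // The tf-idf weight is the product of the two factors.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        // Toy corpus of three already-tokenized documents (made up for this example).
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "barked"),
                Arrays.asList("the", "cat", "purred"));

        for (String term : Arrays.asList("the", "cat", "dog")) {
            for (int i = 0; i < corpus.size(); i++) {
                System.out.printf("tf-idf(%s, doc %d) = %.4f%n",
                        term, i, tfIdf(term, corpus.get(i), corpus));
            }
        }
    }
}
```

In this toy corpus, "the" appears in every document, so its idf is ln(3/3) = 0 and its weight is zero everywhere, while "dog", which appears in only one document, gets the highest weight in that document. This is the offsetting effect the paragraph above describes.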
- To run this software, a database of files is required (use the archives folder for this).
- Add to the file "forRead.txt" the links to all the files you want to read. To do this, run the script "read.py".
- Modify the parameters so that the links are generated correctly.
- Open the code in a Java IDE as a Maven project.
- Run the file App.java in the path src/main/java/bigdata/TFidF as a Java application.