Natural Language Processing (NLP) strategies will be used to analyze Yelp review data. Yelp is an app that provides a crowd-sourced review forum for businesses and services. The number of 'stars' indicates the business rating given by a customer, ranging from 1 to 5. 'Cool', 'Useful' and 'Funny' indicate the number of cool, useful and funny votes given to a review by other Yelp users.
TF-IDF stands for "Term Frequency–Inverse Document Frequency". It is a numerical statistic that reflects how important a word is to a document in a collection or corpus of documents. TF-IDF is used as a weighting factor in text search and text mining.
The intuition behind TF-IDF is as follows:
- If a word appears several times in a given document, it might be more meaningful (more important) than other words that appear fewer times in the same document.
- However, if a word appears several times in a given document but also appears many times in other documents, it is probably just a common word such as 'I', 'am', etc. (not really important or meaningful).
TF:
- Term Frequency is used to measure the frequency of term occurrence in a document:
- TF(word) = Number of times the 'word' appears in a document / Total number of terms in the document
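The TF formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `term_frequency` and the lowercase-and-split-on-whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return TF for each word: occurrences of the word / total number of terms."""
    # Assumption: naive tokenization (lowercase, split on whitespace).
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# A made-up review with 9 tokens, in which 'great' appears twice.
tf = term_frequency("the food was great and the service was great")
print(tf["great"])  # 2 occurrences out of 9 tokens
```

Because every count is divided by the same total, the TF values of all words in one document sum to 1.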
IDF:
- Inverse Document Frequency is used to measure how important a term is:
- IDF(word) = log_e(Total number of documents / Number of documents with the term 'word' in it).
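The IDF formula above can be sketched the same way. The function name `inverse_document_frequency`, the toy `reviews` corpus, and the whitespace tokenization are all assumptions for illustration; raising an error for a word that appears in no document is one possible way to avoid dividing by zero, not the only one.

```python
import math


def inverse_document_frequency(word: str, corpus: list[str]) -> float:
    """IDF(word) = log_e(total documents / documents containing the word)."""
    n_docs = len(corpus)
    # Count the documents in which the word occurs at least once.
    n_containing = sum(1 for doc in corpus if word in doc.lower().split())
    if n_containing == 0:
        # Assumption: treat an unseen word as an error rather than smoothing.
        raise ValueError(f"{word!r} does not appear in the corpus")
    return math.log(n_docs / n_containing)


# Made-up corpus of four short reviews.
reviews = [
    "i loved the pizza",
    "i am not coming back",
    "the pizza was cold",
    "i am happy with the service",
]
idf_i = inverse_document_frequency("i", reviews)        # in 3 of 4 documents
idf_cold = inverse_document_frequency("cold", reviews)  # in 1 of 4 documents
print(idf_i < idf_cold)  # the common word gets the lower weight
```

A word that occurs in every document gets an IDF of log_e(1) = 0, which is how common words are pushed toward zero weight.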
Example:
Let's assume we have a document that contains 1000 words, in which the term 'John' appears 20 times. The term frequency for the word 'John' can be calculated as follows: TF|john = 20/1000 = 0.02
Let's calculate the IDF (inverse document frequency) of the word 'John', assuming that it appears in 50,000 documents out of a corpus of 1,000,000 documents. IDF|john = log_e(1,000,000/50,000) = log_e(20) ≈ 3.0
Therefore, the overall weight of the word 'John' is as follows: TF-IDF|john = 0.02 * 3.0 = 0.06
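The arithmetic in this example can be checked with a short sketch. It evaluates the IDF with both the natural logarithm (the log_e used in the IDF definition) and the base-10 logarithm, since the choice of base changes the resulting weight. The variable names are illustrative only.

```python
import math

tf_john = 20 / 1000               # term frequency: 0.02
ratio = 1_000_000 / 50_000        # total documents / documents containing 'John' = 20

idf_natural = math.log(ratio)     # log_e(20), about 3.0
idf_base10 = math.log10(ratio)    # log_10(20), about 1.3

print(round(tf_john * idf_natural, 3))  # 0.06
print(round(tf_john * idf_base10, 3))   # 0.026
```

Either base gives the same ranking of words within a corpus, because the two logarithms differ only by a constant factor. What matters is using one base consistently.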