feat: refactor idf module, implementing tfidf & bm25 in TagExtracter by strategy pattern #183
base: master
Just push the full edition; this small change doesn't need a PR and review.
ok, PTAL.
feat: add tfidf & bm25 in TagExtracter
update at 2023-11-17. PTAL @vcaesar

Please provide Issues links to:

I found that the origin dict in path only has a tf value and a position value, but no idf value to calculate tfidf, and no origin corpus to calculate the average document length. So I added two new dict files in path.

## Description

### 1. Two new dict files

in

in

### 2. The detail of TFIDF's implementation

#### 2.1 Loading the TFIDF dict file

Just like idf, we implement some interface functions via the strategy pattern. For example, the `LoadDict` function:

```go
// LoadDict load dict for TFIDF seg
func (t *TFIDF) LoadDict(files ...string) error {
	if len(files) <= 0 {
		files = t.Seg.GetTfIdfPath(files...)
	}

	dictFiles := make([]*types.LoadDictFile, len(files))
	for i, v := range files {
		dictFiles[i] = &types.LoadDictFile{
			FilePath: v,
			FileType: consts.LoadDictTypeTFIDF,
		}
	}

	return t.Seg.LoadTFIDFDict(dictFiles)
}
```

Differing from idf, in order to distinguish between the different dict files, we define a `FileType` to restrict what gets loaded:

```go
const (
	// dict file type to loading
	// LoadDictTypeIDF dict of IDF to loading
	LoadDictTypeIDF = iota + 1
	// LoadDictTypeTFIDF dict of TFIDF to loading
	LoadDictTypeTFIDF
	// LoadDictTypeBM25 dict of BM25 to loading
	LoadDictTypeBM25
	// LoadDictTypeWithPos dict of with position to loading
	LoadDictTypeWithPos
	// LoadDictCorpus dict of corpus to loading
	LoadDictCorpus
)
```
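These type constants let a single loader dispatch per dict kind. Below is a minimal, self-contained sketch of that idea; `loadDictFile` and `describeLoad` are illustrative stand-ins for this PR's types, not gse's actual API:

```go
package main

import "fmt"

// Dict file types, mirroring the constants above.
const (
	LoadDictTypeIDF = iota + 1
	LoadDictTypeTFIDF
	LoadDictTypeBM25
)

// loadDictFile is a hypothetical stand-in for types.LoadDictFile.
type loadDictFile struct {
	FilePath string
	FileType int
}

// describeLoad dispatches on FileType, as a real loader would.
func describeLoad(f loadDictFile) string {
	switch f.FileType {
	case LoadDictTypeIDF:
		return "load IDF dict: " + f.FilePath
	case LoadDictTypeTFIDF:
		return "load TFIDF dict: " + f.FilePath
	case LoadDictTypeBM25:
		return "load BM25 dict: " + f.FilePath
	}
	return "unknown dict type"
}

func main() {
	fmt.Println(describeLoad(loadDictFile{FilePath: "tfidf.txt", FileType: LoadDictTypeTFIDF}))
	// prints: load TFIDF dict: tfidf.txt
}
```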
`LoadTFIDFDict` then reads each dict file in turn:

```go
for i := 0; i < len(dictFiles); i++ {
	err := seg.ReadTFIDF(dictFiles[i])
	if err != nil {
		return err
	}
}
```

and `ReadTFIDF` opens and reads the file:

```go
// ReadTFIDF read the dict file
func (seg *Segmenter) ReadTFIDF(file string) error {
	if !seg.SkipLog {
		log.Printf("Load the gse dictionary: \"%s\" ", file)
	}

	dictFile, err := os.Open(file)
	if err != nil {
		log.Printf("Could not load dictionaries: \"%s\", %v \n", file, err)
		return err
	}
	defer dictFile.Close()

	reader := bufio.NewReader(dictFile)
	return seg.ReaderTFIDF(reader, file)
}
```

and while reading, each line's tf and idf values are parsed and added to the dictionary:

```go
freq = seg.Size(size, text, freqText)
inverseFreq = seg.Size(size, text, idfText)
if freq == 0.0 || inverseFreq == 0.0 {
	continue
}

// Add participle tokens to the dictionary
words := seg.SplitTextToWords([]byte(text))
token := Token{text: words, freq: freq, inverseFreq: inverseFreq}
seg.Dict.AddToken(token)
```

#### 2.2 The TFIDF calculation

In the `Freq` function we define:

```go
// Freq return the TFIDF of the word
func (t *TFIDF) Freq(key string) (float64, interface{}, bool) {
	return t.Seg.FindTFIDF(key)
}
```

The reason we return an `interface{}` is compatibility with the idf function: idf returns position info, which is a `string`, while tfidf & bm25 need the idf value, which is a `float64`.

The tfidf calculation process is as follows:

```go
// calculateWeight calculate the word's weight by TFIDF
func (t *TFIDF) calculateWeight(term string) float64 {
	tf, idf, _ := t.Freq(term)
	return tf * idf.(float64)
}
```
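`calculateWeight` is plain `tf * idf`. A tiny self-contained sketch of that arithmetic, using made-up sample values rather than gse's dict data:

```go
package main

import "fmt"

// Hypothetical sample values: term frequency and inverse document frequency.
var tf = map[string]float64{"汽车": 3, "消费": 2}
var idf = map[string]float64{"汽车": 1.5, "消费": 0.5}

// weight returns tf * idf, mirroring calculateWeight's formula.
func weight(term string) float64 {
	return tf[term] * idf[term]
}

func main() {
	fmt.Println(weight("汽车")) // 3 * 1.5 = 4.5
}
```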
```go
// ConstructSeg construct segment with weight
func (t *TFIDF) ConstructSeg(text string) segment.Segments {
	// make segment list by total freq num
	ws := make([]segment.Segment, 0)
	for k := range t.FreqMap(text) {
		ws = append(ws, segment.Segment{Text: k, Weight: t.calculateWeight(k)})
	}

	return ws
}
```

#### 2.3 Result

The output:

```go
// output:
// segments: 5 [{消费者 135.35978394451678} {汽车 132.5431762274668} {消费 99.74972568967256} {增强 96.4479152517576} {下跌 62.99878978351253}]
// results: [{消费 1} {刺激 0.5486451492724487} {下跌 0.4311204839551169} {汽车 0.4095437392771989} {购车 0.4064546007671519}]
```

That's all for TFIDF.

### 3. The detail of BM25's implementation

Just like TFIDF, so I won't repeat the similar parts.

#### 3.1 Corpus and constants

In the BM25 calculation we need the average document length, so we have to load a corpus. We implement:

```go
// LoadCorpus for calculate the average length of corpus
func (bm25 *BM25) LoadCorpus(path ...string) (err error) {
	averLength, err := bm25.Seg.LoadCorpusAverLen(path...)
	if err != nil {
		return
	}

	bm25.AverageDocLength = averLength
	return
}
```

and the detail of how the average length is calculated is as follows:

```go
func (seg *Segmenter) ReadCorpus(file string) (corpusAverLen float64, err error) {
	if !seg.SkipLog {
		log.Printf("Load the gse dictionary: \"%s\" ", file)
	}

	var corpusNumber float64 = 0
	var corpusLength float64 = 0
	dictFile, err := os.Open(file)
	if err != nil {
		log.Printf("Could not load dictionaries: \"%s\", %v \n", file, err)
		return
	}
	defer dictFile.Close()

	// new the Scanner to read file content
	scanner := bufio.NewScanner(dictFile)
	// read file content by line
	for scanner.Scan() {
		corpusNumber++
		line := scanner.Text()
		corpusLength += float64(utf8.RuneCountInString(line))
	}

	corpusAverLen = corpusLength / corpusNumber
	return
}
```

What's more, we define the default constants:

```go
const (
	// BM25DefaultK1 default k1 value for calculate bm25
	BM25DefaultK1 = 1.25
	// BM25DefaultB default B value for calculate bm25
	BM25DefaultB = 0.75
)
```

#### 3.2 Result

The output:

```go
// output:
// segments: 5 [{想象 13.489829905298084} {活力 12.86320693643856} {充满 12.480977334559475} {这里 9.56153393671824} {历史 8.738605467373437}]
// results: [{积淀 1} {活力 0.7380261680439799} {有 0.6602549059736358} {历史 0.6573229314364966} {想象 0.39804353825110805}]
```

That's all. Thanks for reading and reviewing; it's my first time submitting such a large PR.
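For context, `BM25DefaultK1` and `BM25DefaultB` are the `k1` and `b` parameters of the standard BM25 term-weight formula, which also uses the average document length computed by `ReadCorpus`. A self-contained sketch with made-up tf/idf numbers — an illustration of the standard formula, not gse's actual code:

```go
package main

import "fmt"

const (
	k1 = 1.25 // BM25DefaultK1
	b  = 0.75 // BM25DefaultB
)

// bm25Weight computes the standard BM25 weight for one term:
// idf * tf * (k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
// Longer-than-average documents are penalized via the b term.
func bm25Weight(tf, idf, docLen, avgDocLen float64) float64 {
	return idf * tf * (k1 + 1) / (tf + k1*(1-b+b*docLen/avgDocLen))
}

func main() {
	// A term appearing twice in a document of exactly average length:
	fmt.Println(bm25Weight(2, 1.2, 100, 100))
}
```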
Ok, I will review and test it.
@vcaesar hey, is there any problem with this PR? I will fix it if there is. 🫡
Please provide Issues links to:
Provide test code:
I added the stopwords test code in `hmm/idf/idf_test.go`, and I have run and passed all the test code about stopwords (including `hmm/idf/idf_test.go` and `examples/hmm/main.go`).

## Description

### 1. stopwords

The stopwords module is a standalone module. It would be better if we extracted the stopwords module out of the idf file path. I also found that the stopwords in `TagExtracter` are only used when cutting words, to ignore stopwords.

### 2. extracter

Extract the extracter module so that we can implement more relevance algorithms based on the extracter module.
before:
after:
### 3. relevance

Refactor the idf module and extract the relevance module via the strategy pattern, to support more relevance algorithms such as idf, tfidf, bm25 and so on.
before:
after:
Then I implement the Relevance by the strategy pattern. For example, the default Idf implements the interface functions.

All these changes have been run and have passed the test code.
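The strategy-pattern shape described above can be sketched as follows; the interface and names here are illustrative stand-ins, not gse's exact signatures:

```go
package main

import "fmt"

// Relevance is the strategy interface; each algorithm is one strategy.
type Relevance interface {
	Weight(term string) float64
}

// idfOnly weights a term by its idf value alone.
type idfOnly struct{ idf map[string]float64 }

func (r idfOnly) Weight(term string) float64 { return r.idf[term] }

// tfidf weights a term by tf * idf.
type tfidf struct{ tf, idf map[string]float64 }

func (r tfidf) Weight(term string) float64 { return r.tf[term] * r.idf[term] }

// extract uses whichever strategy it is given, so new algorithms
// (e.g. bm25) plug in without changing the extractor code.
func extract(r Relevance, term string) float64 { return r.Weight(term) }

func main() {
	idf := map[string]float64{"历史": 2.0}
	tf := map[string]float64{"历史": 3.0}
	fmt.Println(extract(idfOnly{idf}, "历史"), extract(tfidf{tf, idf}, "历史"))
	// prints: 2 6
}
```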