feature: any plan on implementing the bm25 algorithm ? #181
Comments
Emm, a lot of features are planned, but my time is limited, so any contributions are welcome.
Hey @vcaesar, here is my plan for the BM25 algorithm. 🧑🏻‍💻
First, in the file path:
type Segment struct {
	text   string
	weight float64
}
because the Segment struct needs to be callable at all times. If it stays in tag_extracker, there may be import-cycle trouble. 🥲
Secondly, still in the same file, before:
type TagExtracter struct {
	seg      gse.Segmenter
	Idf      *Idf
	stopWord *StopWord
}
after:
type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance (including IDF, TF-IDF, BM25 and so on)
	Relevance relevance.Relevance
	stopWord  *StopWord
}
What's more, the stopWord field should live in the Relevance struct. Since this field is used during word splitting, it belongs with the relevance implementation.
before:
type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance (including IDF, TF-IDF, BM25 and so on)
	Relevance relevance.Relevance
	stopWord  *StopWord
}
after:
type IDF struct {
	median float64
	freqs  []float64
	Base
}
type BM25 struct {
	K1 float64
	N  float64
	Base
}
type Base struct {
	// loads the stop words
	StopWord *stop_word.StopWord
	// segmenter used to cut words
	Seg gse.Segmenter
}
Then I want to implement Relevance with the Strategy pattern, such as:
// Relevance makes relevance calculations easy to extend (IDF, TF-IDF, BM25 and so on)
type Relevance interface {
type Relevance interface {
// AddToken add text, frequency, position on obj
AddToken(text string, freq float64, pos ...string) error
// LoadDict load file from incoming parameters,
// if incoming params no exist, will load file from default file path
LoadDict(files ...string) error
// LoadDictStr loading dict file by file path
LoadDictStr(pathStr string) error
// LoadStopWord loading word file by filename
LoadStopWord(fileName ...string) error
// Freq find the frequency, position, existence information of the key
Freq(key string) (float64, string, bool)
// TotalFreq the total number of tokens in the dictionary
TotalFreq() float64
// FreqMap get frequency map
// key: word, value: frequency
FreqMap(text string) map[string]float64
// ConstructSeg return the segment with weight
ConstructSeg(text string) segment.Segments
} default IDF: func NewIDF() Relevance {
	idf := &IDF{
		freqs: make([]float64, 0),
	}
	idf.StopWord = stop_word.NewStopWord()
	return idf
}
Implement the interface functions:
// AddToken adds a new word with its IDF to the dictionary.
func (i *IDF) AddToken(text string, freq float64, pos ...string) error {
	err := i.Seg.AddToken(text, freq, pos...)
	i.freqs = append(i.freqs, freq)
	sort.Float64s(i.freqs)
	i.median = i.freqs[len(i.freqs)/2]
	return err
}
// LoadDict loads the IDF dictionary
func (i *IDF) LoadDict(files ...string) error {
	if len(files) == 0 {
		files = i.Seg.GetIdfPath(files...)
	}
	return i.Seg.LoadDict(files...)
}
// Freq returns the IDF of the word
func (i *IDF) Freq(key string) (float64, string, bool) {
	return i.Seg.Find(key)
}
....
Any problems with that? If there is no big problem, I'll send PRs bit by bit to make it happen. 🫡
Description
I see there is a bm25.go file at hmm/bm25/bm25.go, so I wanted to ask the author: is there any plan for BM25? 😃 If the author plans to implement BM25, I would like to do it. 🫡