Add inverted index for mongodb #945
Signed-off-by: mymusise <[email protected]>
It seems the build failed due to a network issue. I have restarted the job; hopefully it will pass.
A useful reference on inverted indexes: https://stackoverflow.com/a/8360932/358458
@corcassia Thank you for your PR. Could you please clarify the difference between MongoDB text indexing and the inverted indexing in this PR?
@vkosuri OK, my pleasure. Before this PR, I found that all samples are compared on every lookup. For example:
and the inverted index will look like:
Each keyword corresponds to the IDs of the samples that contain it. When a response is requested, we first use the inverted index to find which samples contain those keywords. Finally, we compare only those samples with the input statement, just as before. I hope that makes it clear.
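The lookup described above can be sketched in plain Python. This is only an illustration of the general inverted-index idea, not the code in this PR: the `tokenize`, `build_inverted_index`, and `candidate_ids` helpers and the sample IDs are hypothetical, and the storage here is an in-memory dict rather than a MongoDB collection.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase word tokens (not the PR's segmentation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_inverted_index(samples):
    """Map each token to the set of sample IDs whose text contains it."""
    index = defaultdict(set)
    for sample_id, text in samples.items():
        for token in tokenize(text):
            index[token].add(sample_id)
    return index


def candidate_ids(index, statement):
    """Collect the IDs of samples sharing at least one token with the statement."""
    ids = set()
    for token in tokenize(statement):
        ids |= index.get(token, set())
    return ids


# Illustrative samples keyed by made-up IDs.
samples = {
    1: "How are you doing?",
    2: "I'm doing great.",
    3: "That is good to hear",
    4: "Thank you.",
}
index = build_inverted_index(samples)
candidates = candidate_ids(index, "Are you doing well?")
print(sorted(candidates))
```

Only the candidate samples would then be passed to the usual comparison step, so sample 3, which shares no token with the input, is never compared.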
@corcassia Thank you. I am not sure about the performance. To improve it, @gunthercox came up with the idea of tag filtering to speed up the retrieval process (#925).
@vkosuri Yes, I think tag filtering is a good idea.
@corcassia Thank you, I'm really impressed with the performance improvements after this change. I'm going to pull these changes down locally and test out a few things. I might have further questions.
@corcassia Apologies for my delay, I will be reviewing this pull request as soon as possible.
chatterbot/storage/mongodb.py
# Just filter the statement in need.
statement_ids = []
tokens = self.statement_segmentation(statement.text)
word_dicts = self.inverted.find({'word': {'$in': tokens}})
I might be misunderstanding this, so feel free to correct me if I'm wrong. Given this change, won't only statements that contain words from statement be returned?
Yes, you're right. So this change will lead to no response if no matching statement exists.
@gunthercox It's all right. I have also found some problems with this change over the past few days.
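One possible way to avoid an empty result, sketched here purely as an assumption and not as what this PR implements, is to fall back to the full set of statements whenever the index lookup matches nothing. The `select_candidates` helper and its arguments are hypothetical.

```python
def select_candidates(all_ids, matched_ids):
    """Return the matched IDs, or every ID when the index lookup matched nothing.

    Hypothetical fallback: it trades speed for always having something to compare.
    """
    matched = set(matched_ids)
    return matched if matched else set(all_ids)


# No token matched: fall back to comparing against everything.
fallback = select_candidates([1, 2, 3], set())
# Some tokens matched: compare only against those samples.
narrowed = select_candidates([1, 2, 3], {2})
```

The fallback path costs as much as the original full scan, so it only helps if unmatched inputs are rare.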
@vkosuri I'm sorry about my incorrect explanation.
@mymusise It seems the change hasn't been merged into master? Can I apply the patch to the master branch to resolve the performance issue?
@zxsimple Yeah, it seems the master branch still does not create a full-text index when setting up the bot (chatterbot/storage/mongodb.py), but you can create it manually, with a small change like this. I think a custom word filter, as in the demo here, would improve performance considerably.

Solution:

    import re

    from chatterbot import ChatBot
    from chatterbot.trainers import ListTrainer

    source_corpus = [
        "Hello",
        "Hi there!",
        "How are you doing?",
        "I'm doing great.",
        "That is good to hear",
        "Thank you.",
        "You're welcome."
    ]
    input_statements = [
        "Such a good day!",
        "Here we are."
    ]

    def words_filter(statements):
        # A word filter demo: replace common stop words with a space.
        bad_words = ['the', 'a', 'an', 'is', 'are']
        # Match whole words only, ignoring case.
        rule = re.compile(r"\b(?:" + "|".join(bad_words) + r")\b", re.IGNORECASE)
        return [rule.sub(' ', statement) for statement in statements]

    # Filter both the training corpus and the inputs the same way.
    corpus = words_filter(source_corpus)
    input_statements = words_filter(input_statements)

    chatbot = ChatBot("Ron Obvious")
    chatbot.set_trainer(ListTrainer)
    chatbot.train(corpus)

    for statement in input_statements:
        response = chatbot.get_response(statement)
        print(response)
Signed-off-by: corcassia [email protected]
Hi, I added an inverted index for MongoDB.
What's new:
BestMatch
Performance improvement:
Tested with a 28,000-row dataset on Ubuntu 14.04 + Core i5 4670 + 12 GB memory.
Hope it helps.