Add inverted index for mongodb #945

Closed
wants to merge 10 commits into from

Conversation

@mymusise mymusise commented Aug 18, 2017

Signed-off-by: corcassia [email protected]
Hi, I added an inverted index for MongoDB. What's new:

  • The inverted index is created while training on a dataset
  • When getting a response, only similar statements are compared (BestMatch only)

Improving performance:

| Version | Response time (s) |
|---------|-------------------|
| Before  | 3.1               |
| After   | 0.1               |

Tested with a 28,000-row dataset on Ubuntu 14.04 with a Core i5-4670 and 12 GB of memory.

Hope it helps.

@mymusise mymusise changed the title add inverted index for mongodb Add inverted index for mongodb Aug 18, 2017
Signed-off-by: mymusise <[email protected]>
Signed-off-by: mymusise <[email protected]>
Signed-off-by: mymusise <[email protected]>
Signed-off-by: mymusise <[email protected]>
@vkosuri
Collaborator

vkosuri commented Aug 18, 2017

It seems the build failed due to a network issue. I have restarted the job and hope it will pass.

@vkosuri
Collaborator

vkosuri commented Aug 18, 2017

A useful reference on inverted indexes: https://stackoverflow.com/a/8360932/358458

@vkosuri
Collaborator

vkosuri commented Aug 18, 2017

@corcassia Thank you for your PR. Could you please clarify the difference between MongoDB text indexing and the inverted indexing in this PR?

@mymusise
Author

mymusise commented Aug 18, 2017

@vkosuri Ok, my pleasure.

Before this PR, I found that calling ChatBot.get_response compares the input against every sample. So if we have trained on a large dataset, it takes a long time to respond.
MongoDB text indexing is a forward index. With this change, I hope to select only the similar samples from the database by adding an inverted index, rather than selecting all of them.

For example:
Suppose there are some samples in our database:

- - 1. Hello, how are you.
-   2. I'm fine, thank you~
- - 3. Mongodb is faster than sqlite
-   4. Yes, of course.
- - 5. Hello, would you do me a favor?
-   6. Yes~
....

and the inverted index will look like:

hello: [1, 5, ...]
Mongodb: [3, ...]
sqlite: [3, ...]
favor: [5, ...]
...

Each keyword maps to the IDs of the samples that contain it.

If we call bot.get_response('The difference between mongodb and sqlite'), the keywords in this statement are:
difference, mongodb, sqlite

Then we use the inverted index to find which samples contain those keywords.

Finally, we compare only those samples with the input statement, just like before.

I hope that makes it clear.
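
The lookup described in this comment can be sketched in plain Python. This is a minimal in-memory illustration, not the code from this PR: the `tokenize`, `build_inverted_index`, and `candidate_ids` helpers are hypothetical names, the tokenizer is a simple lowercase word split, and a dict stands in for the MongoDB collection.

```python
import re
from collections import defaultdict

# Sample statements keyed by ID, mirroring the example in this comment.
samples = {
    1: "Hello, how are you.",
    2: "I'm fine, thank you~",
    3: "Mongodb is faster than sqlite",
    4: "Yes, of course.",
    5: "Hello, would you do me a favor?",
    6: "Yes~",
}


def tokenize(text):
    """Hypothetical tokenizer: lowercase word tokens, no stop-word removal."""
    return re.findall(r"[a-z']+", text.lower())


def build_inverted_index(statements):
    """Map each word to the set of statement IDs that contain it."""
    index = defaultdict(set)
    for statement_id, text in statements.items():
        for word in tokenize(text):
            index[word].add(statement_id)
    return index


def candidate_ids(index, text):
    """Return the IDs of statements sharing at least one word with `text`."""
    ids = set()
    for word in tokenize(text):
        # .get() avoids creating empty entries for unknown words.
        ids |= index.get(word, set())
    return sorted(ids)


index = build_inverted_index(samples)
candidates = candidate_ids(index, "The difference between mongodb and sqlite")
print(candidates)
```

Only the statements returned by `candidate_ids` would then be passed to the comparison step, instead of the whole dataset.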

@vkosuri
Collaborator

vkosuri commented Aug 18, 2017

@corcassia Thank you. I am not sure about the performance.

To improve it, @gunthercox came up with the idea of tag filtering to speed up the retrieval process (#925).

@mymusise
Author

mymusise commented Aug 18, 2017

@vkosuri Yes, I think tag filtering is a good idea.
To get the labels of an input statement, will it iterate over all samples or run the input through a classification model? @gunthercox

@gunthercox
Owner

@corcassia Thank you, I'm really impressed with the performance improvements after this change. I'm going to pull these changes down locally and test out a few things. I might have further questions.

@gunthercox
Owner

@corcassia Apologies for my delay, I will be reviewing this pull request as soon as possible.

# Just filter the statement in need.
statement_ids = []
tokens = self.statement_segmentation(statement.text)
word_dicts = self.inverted.find({'word': {'$in': tokens}})
Owner

I might be misunderstanding this, so feel free to correct me if I'm wrong. Given this change, won't only statements that contain words from the input statement be returned?

Author

Yes, you're right. So with this change there will be no response if no stored statement contains any of those words.
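
The limitation raised in this thread can be shown with a small in-memory sketch. The index contents, `matching_ids`, and `candidates_with_fallback` are hypothetical and are not part of this PR. The fallback to all statement IDs is only one possible mitigation, shown as an assumption.

```python
# Hypothetical inverted index: word -> set of statement IDs.
index = {
    "hello": {1, 5},
    "mongodb": {3},
    "sqlite": {3},
    "favor": {5},
}
all_ids = {1, 2, 3, 4, 5, 6}


def matching_ids(index, tokens):
    """Union of the statement IDs indexed under any of the tokens."""
    ids = set()
    for token in tokens:
        ids |= index.get(token, set())
    return ids


def candidates_with_fallback(index, tokens, all_ids):
    """Assumed mitigation: compare against everything when nothing matches."""
    matched = matching_ids(index, tokens)
    return matched if matched else set(all_ids)


# No indexed word matches, so the filtered candidate set is empty.
no_match = matching_ids(index, ["goodbye"])
# With the fallback, the full (slow) comparison is used instead.
fallback = candidates_with_fallback(index, ["goodbye"], all_ids)
```

The fallback trades speed for coverage: inputs with no indexed words cost as much as before the change, while all other inputs keep the faster filtered path.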

@mymusise
Author

@gunthercox That's all right. I have also found some problems with this change over the past few days.

Signed-off-by: mymusise <[email protected]>
Signed-off-by: mymusise <[email protected]>
@mymusise
Author

mymusise commented Aug 25, 2017

@vkosuri I'm sorry about my incorrect explanation. MongoDB text indexing is itself an implementation of inverted indexing. After a few tests, I found that MongoDB text indexing behaves better.
I removed my inverted index and changed _statement_query instead. It runs faster than before.

@zxsimple

zxsimple commented Jan 5, 2018

@mymusise It seems the change hasn't been merged into master? Can I apply the patch to the master branch to resolve the performance issue?

@mymusise
Author

mymusise commented Jan 5, 2018

@zxsimple Yes, it seems the master branch still does not create a full-text index when setting up the bot (chatterbot/storage/mongodb.py), but you can create it manually with a small change like this.
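
A minimal sketch of creating and querying a MongoDB text index by hand, assuming a PyMongo collection whose documents store the statement in a `text` field. The helper names `ensure_text_index` and `find_similar`, the field name, and the database and collection names in the usage comment are assumptions for illustration, not the change referred to here.

```python
def ensure_text_index(collection, field="text"):
    """Create a MongoDB text index on `field` if it does not exist yet.

    `collection` is expected to be a pymongo Collection. The index type
    string "text" is the value of the pymongo.TEXT constant.
    """
    return collection.create_index([(field, "text")])


def find_similar(collection, input_text, limit=50):
    """Return a cursor over statements sharing words with `input_text`.

    Uses MongoDB's $text query operator, which requires a text index.
    """
    cursor = collection.find({"$text": {"$search": input_text}})
    return cursor.limit(limit)


# Usage against a running MongoDB server (names below are assumptions):
# from pymongo import MongoClient
# statements = MongoClient()["chatterbot-database"]["statements"]
# ensure_text_index(statements)
# for doc in find_similar(statements, "mongodb sqlite"):
#     print(doc["text"])
```

Only the documents returned by the text search would then need to be compared with the input statement.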

I think a custom adapter can solve the problem. Alternatively, there is another way to improve performance.

I think performance will improve considerably if statement_list is smaller, because the comparison costs a lot of CPU time. I found it may help to remove low-information words such as ('the', 'a', 'an', 'with', 'to') from the input statement and the corpus.

Solution:

import re

from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

source_corpus = [
    "Hello",
    "Hi there!",
    "How are you doing?",
    "I'm doing great.",
    "That is good to hear",
    "Thank you.",
    "You're welcome."
]
input_statements = [
    "Such a good day!",
    "Here we are."
]

def words_filter(statements):
    # A word-filter demo: strip common low-information words.
    bad_words = ['the', 'a', 'an', 'is', 'are']
    # \b limits matches to whole words, so "that" or "hear" are left intact.
    rule = re.compile(r"\b(?:" + "|".join(bad_words) + r")\b\s*", re.IGNORECASE)
    return [rule.sub('', statement).strip() for statement in statements]

# Filter both the training corpus and the inputs so they stay comparable.
corpus = words_filter(source_corpus)
input_statements = words_filter(input_statements)

chatbot = ChatBot("Ron Obvious")

# Trainer API as used in this thread (older ChatterBot releases).
chatbot.set_trainer(ListTrainer)
chatbot.train(corpus)

for statement in input_statements:
    response = chatbot.get_response(statement)
    print(response)

@mymusise mymusise closed this Apr 15, 2018
4 participants