Add inverted index for mongodb #945

mymusise · 2017-08-18T06:28:12Z

Signed-off-by: corcassia [email protected]
Hi, I added an inverted index for MongoDB. What's new:

- The inverted index is created when training on a dataset.
- When getting a response, only similar statements are compared (BestMatch only).

Performance improvement:

| Version | Response time (s) |
| --- | --- |
| Before | 3.1 |
| After | 0.1 |

Tested with a 28,000-row dataset on Ubuntu 14.04 + Core i5 4670 + 12 GB memory.

Hope it helps.

Signed-off-by: mymusise <[email protected]>

vkosuri · 2017-08-18T08:57:49Z

It seems to me the build failed due to a network issue. I have restarted the job and hope it will pass.

vkosuri · 2017-08-18T09:18:53Z

A useful reference about inverted indexes: https://stackoverflow.com/a/8360932/358458

vkosuri · 2017-08-18T09:21:29Z

@corcassia Thank you for your PR. Could you please clarify the difference between MongoDB text indexing and the inverted indexing in this PR?

mymusise · 2017-08-18T09:43:20Z

@vkosuri Ok, my pleasure.

Before this PR, I found that calling ChatBot.get_response compares the input against all samples. So if we have trained on a large dataset, it takes a long time to respond.
~~mongodb text indexing is a forward index~~. With this change, I want it to select only the similar samples from the database by adding an inverted index, rather than selecting all of them.

For example:
There are some samples in our database:

- - 1. Hello, how are you.
-   2. I'm fine, thank you~
- - 3. Mongodb is faster than sqlite
-   4. Yes, of course.
- - 5. Hello, would you do me a favor?
-   6. Yes~
....

and the inverted index will look like:

hello: [1, 5, ...]
Mongodb: [3, ...]
sqlite: [3, ...]
favor: [5, ...]
...

Each keyword corresponds to the IDs of the samples that contain it.
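
To make the keyword-to-ID mapping concrete, here is a minimal pure-Python sketch that builds such an index from the six samples in this example. The `tokenize` and `build_inverted_index` helpers are hypothetical names for illustration, not code from this PR. The sketch also indexes every word in lowercase rather than picking out keywords, which is a simplifying assumption.

```python
import re
from collections import defaultdict

# Sample statements keyed by ID, copied from the example in this comment.
samples = {
    1: "Hello, how are you.",
    2: "I'm fine, thank you~",
    3: "Mongodb is faster than sqlite",
    4: "Yes, of course.",
    5: "Hello, would you do me a favor?",
    6: "Yes~",
}


def tokenize(text):
    """Lowercase the text and split it into words (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


def build_inverted_index(statements):
    """Map each word to the sorted list of statement IDs that contain it."""
    index = defaultdict(set)
    for statement_id, text in statements.items():
        for word in tokenize(text):
            index[word].add(statement_id)
    return {word: sorted(ids) for word, ids in index.items()}


inverted_index = build_inverted_index(samples)
print(inverted_index["hello"])    # [1, 5]
print(inverted_index["mongodb"])  # [3]
print(inverted_index["favor"])    # [5]
```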

If we call bot.get_response('The different between mongodb and sqlite'), there are some keywords in this statement:
different, mongodb, sqlite

Then we use the inverted index to find out which samples contain those keywords.

Finally, we compare those samples with the input statement, just as before.
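
The lookup-then-compare flow described here could be sketched as follows. This is an illustration under stated assumptions: the index is written out by hand from the example, `candidate_ids` and `best_match` are hypothetical helpers, and `difflib.SequenceMatcher` stands in for whatever comparison function the bot is configured with.

```python
from difflib import SequenceMatcher

# Statements and index entries taken from the example in this comment.
samples = {
    1: "Hello, how are you.",
    3: "Mongodb is faster than sqlite",
    5: "Hello, would you do me a favor?",
}
inverted_index = {
    "hello": [1, 5],
    "mongodb": [3],
    "sqlite": [3],
    "favor": [5],
}


def candidate_ids(keywords, index):
    """Collect the IDs of every sample that contains at least one keyword."""
    ids = set()
    for word in keywords:
        ids.update(index.get(word, []))
    return sorted(ids)


def best_match(input_text, keywords, index, statements):
    """Compare the input only against the candidates, not the whole dataset."""
    best_id, best_score = None, 0.0
    for statement_id in candidate_ids(keywords, index):
        score = SequenceMatcher(
            None, input_text.lower(), statements[statement_id].lower()
        ).ratio()
        if best_id is None or score > best_score:
            best_id, best_score = statement_id, score
    return best_id


keywords = ["different", "mongodb", "sqlite"]
print(candidate_ids(keywords, inverted_index))  # [3]
print(best_match("The different between mongodb and sqlite",
                 keywords, inverted_index, samples))  # 3
```

Only sample 3 is compared against the input here, which is where the time saving on a large dataset would come from.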

I hope I have made it clear.

vkosuri · 2017-08-18T09:56:42Z

@corcassia Thank you. I am not sure about the performance.

To improve it, @gunthercox came up with the idea of tag filtering to speed up the retrieval process: #925.

mymusise · 2017-08-18T10:33:58Z

@vkosuri Yes, I think tag filtering is a good idea.
To get the labels of the input statement, will it iterate over all samples or pass the statement through a classification model? @gunthercox

gunthercox · 2017-08-19T18:23:30Z

@corcassia Thank you, I'm really impressed with the performance improvements after this change. I'm going to pull these changes down locally and test out a few things. I might have further questions.

gunthercox · 2017-08-24T11:15:17Z

@corcassia Apologies for my delay, I will be reviewing this pull request as soon as possible.

gunthercox · 2017-08-24T23:42:55Z

chatterbot/storage/mongodb.py

+            # Just filter the statement in need.
+            statement_ids = []
+            tokens = self.statement_segmentation(statement.text)
+            word_dicts = self.inverted.find({'word': {'$in': tokens}})


I might be misunderstanding this, so feel free to correct me if I'm wrong. Given this change, won't only statements that contain words from the input statement be returned?

Yes, you're right. So this change will lead to no response if no matching statement exists.
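
One possible way to avoid an empty result, shown only as a sketch and not as what this PR implements, is to fall back to the full set of statement IDs when no indexed word matches. The `select_candidates` helper and the sample index below are hypothetical.

```python
def select_candidates(tokens, index, all_ids):
    """Return IDs of statements sharing a word with the input.

    Falls back to every known ID when nothing matches, so the caller
    still has statements to compare against (assumed fallback policy).
    """
    matched = set()
    for token in tokens:
        matched.update(index.get(token, []))
    if not matched:
        return sorted(all_ids)
    return sorted(matched)


index = {"hello": [1, 5], "mongodb": [3], "sqlite": [3], "favor": [5]}
all_ids = [1, 2, 3, 4, 5, 6]

print(select_candidates(["hello"], index, all_ids))    # [1, 5]
print(select_candidates(["weather"], index, all_ids))  # [1, 2, 3, 4, 5, 6]
```

The fallback path costs as much as the original full comparison, so it only helps when most inputs share at least one word with the trained data.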

mymusise · 2017-08-25T03:24:44Z

@gunthercox That's all right. I have also found some problems with this change over the past few days.

Signed-off-by: mymusise <[email protected]>

mymusise · 2017-08-25T07:00:47Z

@vkosuri I'm sorry about my wrong clarification. MongoDB text indexing is itself an implementation of inverted indexing. After a few tests, I found that MongoDB text indexing behaves better.
I removed my inverted index and changed _statement_query; it runs faster than before.
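
For readers unfamiliar with MongoDB text indexes, the approach can be sketched as below. This is not the code from this PR: `ensure_text_index` and `find_similar` are hypothetical helpers, the field name `text` is an assumption, and `collection` is expected to be a pymongo `Collection` (or any object with the same `create_index` and `find` methods).

```python
def ensure_text_index(collection, field="text"):
    """Create a MongoDB text index on `field` (assumed field name)."""
    # "text" is the index type string that pymongo exposes as pymongo.TEXT.
    return collection.create_index([(field, "text")])


def find_similar(collection, input_text):
    """Query the text index for statements matching any term in input_text."""
    # $text/$search treats space-separated terms as a logical OR.
    return collection.find({"$text": {"$search": input_text}})


# Usage, assuming pymongo is installed and a MongoDB server is running
# (the database and collection names here are placeholders):
#
#     from pymongo import MongoClient
#     statements = MongoClient()["my_database"]["statements"]
#     ensure_text_index(statements)
#     for doc in find_similar(statements, "mongodb sqlite"):
#         print(doc["text"])
```

Letting the database maintain the index avoids keeping a separate word-to-ID collection in sync during training.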

Signed-off-by: mymusise <[email protected]>

zxsimple · 2018-01-05T10:32:10Z

@mymusise It seems the change hasn't been merged into master? Can I apply the patch to the master branch to resolve the performance issue?

mymusise · 2018-01-05T13:57:36Z

@zxsimple Yeah, it seems the master branch still does not create a full-text index when setting up the bot (chatterbot/storage/mongodb.py), but you can create it manually with a small change like this.

I think a custom adapter can solve the problem, or there is another way to improve the performance.

I think the performance will improve greatly by shrinking the statement_list, because the comparison costs a lot of CPU time. I found it may help to remove some meaningless words such as ('the', 'a', 'an', 'with', 'to') from the input statement and the corpus.

Solution:

import re

from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

source_corpus = [
    "Hello",
    "Hi there!",
    "How are you doing?",
    "I'm doing great.",
    "That is good to hear",
    "Thank you.",
    "You're welcome."
]
input_statements = [
    "Such a good day!",
    "Here we are."
]

def words_filter(statements):
    # A word filter demo: drop common stop words before training or matching.
    bad_words = ['the', 'a', 'an', 'is', 'are']
    # \b matches whole words only, so 'a' is not removed from inside 'day'.
    rule = re.compile(r"\b(?:" + "|".join(bad_words) + r")\b", re.IGNORECASE)
    # Remove the stop words, then collapse the leftover runs of spaces.
    return [" ".join(rule.sub(" ", statement).split()) for statement in statements]

corpus = words_filter(source_corpus)
input_statements = words_filter(input_statements)

chatbot = ChatBot("Ron Obvious")

# set_trainer is the trainer API in use at the time of this thread.
chatbot.set_trainer(ListTrainer)
chatbot.train(corpus)

for statement in input_statements:
    response = chatbot.get_response(statement)
    print(response)

add inverted index for mongodb

632172a

Signed-off-by: mymusise <[email protected]>

mymusise changed the title ~~add inverted index for mongodb~~ Add inverted index for mongodb Aug 18, 2017

mymusise added 4 commits August 18, 2017 14:48

Compat

b30de11

Signed-off-by: mymusise <[email protected]>

fix words bug

05378ff

Signed-off-by: mymusise <[email protected]>

pep8

4e9ca8f

Signed-off-by: mymusise <[email protected]>

fix flake8

272c890

Signed-off-by: mymusise <[email protected]>

gunthercox reviewed Aug 24, 2017

mymusise added 2 commits August 25, 2017 11:31

remove index

6df8d8f

Signed-off-by: mymusise <[email protected]>

fix bug

ae36c49

Signed-off-by: mymusise <[email protected]>

mymusise added 2 commits August 25, 2017 17:13

create text index

a8f07d1

Signed-off-by: mymusise <[email protected]>

Merge https://github.com/gunthercox/ChatterBot

fb13458

Signed-off-by: mymusise <[email protected]>

calmzealA mentioned this pull request Nov 15, 2017

Jieba #1074

Closed

zxsimple mentioned this pull request Jan 5, 2018

Chatterbot data to 2M is very slow to reflect why? #679

Closed

Merge remote-tracking branch 'upstream/master'

f97681b

mymusise closed this Apr 15, 2018
