
Bad performance on big amount of training data #164

Closed
sntp opened this issue May 9, 2016 · 17 comments

@sntp

sntp commented May 9, 2016

I trained the bot with ~17k conversations and now it takes a long time to respond. Are there ways to avoid this?
Training data: https://gist.github.com/sntp/221f53c48bec929ac36d0951b496fcbd

@gunthercox
Owner

Commit e5a9869 makes one small change to start addressing this by reducing the number of read and write transactions made to the database. I will continue to post updates on this ticket to track performance improvement changes.

gunthercox added the bug label May 24, 2016
@gunthercox
Copy link
Owner

Pull request #173 allows storage adapters to override an expensive method with a more efficient implementation. The get_response_statements method has been overridden on the MongoDB storage adapter, which should yield a significant improvement in performance.

@Nixellion

Nixellion commented Aug 19, 2016

What about using SQLite? Would that speed up the process? Is there an SQLite adapter?

I tried to do the same: I fed in a 3.5 MB training file with conversations from a social network, curious what kind of answers I'd get from that :D

At first it took about 40 minutes to train, and now it just gets stuck trying to answer.
I tried using MongoDB but got an error: pymongo.errors.ServerSelectionTimeoutError: localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it

I guess that's because I need to download it and run the server, huh?

@sntp
Author

sntp commented Aug 19, 2016

Btw, why not use a SQL database?

@Nixellion

Oh, okay, MongoDB works fine now. Much faster. But that bulk insert error is annoying.

And I still think having SQLite as a standard option would be nice: it's much faster than JSON, but it's also just a single file and doesn't require installing anything but Python. Just a thought. No rush though.

@gunthercox
Owner

@Nixellion I'm glad you are getting better results with the MongoDB adapter. The JSON file adapter is really just meant for testing and development because it is limited by the fact that it has to write to the hard disk each time it needs to save.

Still looking into the bulk insert error, and I've opened a ticket to track the addition of a new SQLite storage adapter, #241.
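
For anyone who finds this thread later: once a SQLite-backed adapter exists, configuring it should look roughly like the sketch below. The adapter name (SQLStorageAdapter) and the database_uri parameter are taken from later ChatterBot releases and are an assumption here, not part of this discussion.

```python
# Sketch: a SQLite-backed bot, which keeps everything in a single local
# file without requiring a separate database server. The adapter name and
# the database_uri parameter may differ between ChatterBot versions.
from chatterbot import ChatBot

bot = ChatBot(
    "Example Bot",
    storage_adapter="chatterbot.storage.SQLStorageAdapter",
    database_uri="sqlite:///chatterbot.sqlite3",
)
```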

@Nixellion

Nixellion commented Aug 20, 2016

Cool, thanks!

Also, I found some old discussions back from 2014–2015 about making this bot smart enough to pass at least some Turing test questions, building sentences from words, etc. I hope you're still on it :)

@chenjun0210

Does it support parallel training?

@gunthercox
Owner

Parallel training is only supported if the database being used supports concurrent writes. The default file database that ChatterBot uses does not support concurrent writes, but if you use MongoDB it will.
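
As a point of reference, switching to the MongoDB adapter looks roughly like the sketch below. It assumes a local mongod instance is running on the default port; the adapter import path and parameter names vary between ChatterBot versions, so treat this as an illustration rather than an exact recipe.

```python
# Sketch: point ChatterBot at MongoDB so that several training processes
# can write to the same database concurrently. Assumes a local mongod on
# the default port; adjust the adapter path and parameters to your version.
from chatterbot import ChatBot

bot = ChatBot(
    "Example Bot",
    storage_adapter="chatterbot.storage.MongoDatabaseAdapter",
    database_uri="mongodb://localhost:27017/chatterbot-database",
)
```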

@chenjun0210

My data size is about...
I use MongoDB, but I don't know how to set the training parameters. Or is parallel training the default when I use MongoDB? Thanks a lot.

@chenjun0210

My data size is about 2 GB.

@gunthercox
Owner

You will probably need to do a bit of work to get the import process ready to bring in 2 GB of data in parallel. I would recommend breaking it up, if possible, into a few files of manageable size. You will then have to use Python's multiprocessing capabilities to start training processes on each subset of the data. This functionality isn't built into ChatterBot at the moment; if you are unsure how to accomplish this, feel free to ask any questions. Otherwise, I have opened a ticket to get support for this functionality added to ChatterBot (#354).
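
A rough sketch of that approach is below. It assumes the corpus has been split into line-delimited chunk files (the file names are hypothetical), a MongoDB-backed bot so concurrent writes are possible, and a trainer API that may differ between ChatterBot versions.

```python
# Sketch of the approach described above: train on several chunk files in
# parallel, one process per file. Assumes a one-statement-per-line format;
# adjust the parsing and the trainer calls to your ChatterBot version.
from multiprocessing import Process

from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer


def train_chunk(path):
    bot = ChatBot(
        "Example Bot",
        storage_adapter="chatterbot.storage.MongoDatabaseAdapter",
        database_uri="mongodb://localhost:27017/chatterbot-database",
    )
    with open(path, encoding="utf-8") as f:
        conversation = [line.strip() for line in f if line.strip()]
    ListTrainer(bot).train(conversation)


if __name__ == "__main__":
    # Hypothetical chunk files produced by splitting the large corpus.
    chunks = ["corpus_part_01.txt", "corpus_part_02.txt", "corpus_part_03.txt"]
    workers = [Process(target=train_chunk, args=(path,)) for path in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```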

@Martmists-GH
Contributor

I've noticed that #597, which uses ujson, has sped up processing a lot, though my training data is only ~300 MB in size. I recommend trying it out to see how much faster it goes.
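
For context, ujson exposes the same basic loads/dumps calls as the stdlib json module, so it can usually be swapped in as a drop-in replacement on I/O-heavy code paths. The snippet below just illustrates that idea; it is not the actual change made in #597.

```python
# ujson is a faster C implementation of the basic json loads/dumps calls,
# so it can often replace the stdlib module where only those two functions
# are used. Fall back to the stdlib module if ujson is not installed.
try:
    import ujson as json
except ImportError:
    import json

record = json.loads('{"text": "Hello, how are you?"}')
print(json.dumps(record))
```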

@jxfruit

jxfruit commented Oct 9, 2017

@Martmists Hi, have you solved the efficiency problems with the bot's training and testing? Can you share some thoughts on improving efficiency? Thanks.

@Martmists-GH
Contributor

One thing to note is to NOT use the default JSON storage. It's slow due to constant I/O, it's relatively unoptimized, and it uses the stdlib JSON module. I recommend writing your own or trying to find one online.

@jxfruit

jxfruit commented Oct 10, 2017

@Martmists I have used MongoDB as the storage adapter. However, responses are still very slow: with about 70k entries of data, a response takes 41 seconds. I am working on finding other ways to improve efficiency. How about you?

@gunthercox
Owner

I'm going to close this issue off; I don't believe there are any remaining actionable items here. Tickets have been created to implement changes that will help to improve response times. See #925 and its related tickets for further details.
