Fast auto-completion server.
This auto-complete service suggests relevant sentence completions based on a few characters or words entered by a user. The service is fast enough to be queried on every new keystroke. The auto-completion model is trained on a corpus of conversations.
The corpus contains 16,508 sentences written by customer service representatives. This notebook quickly explores the dataset and attempts to build a small language model prototype.
auto-complete-server is made of 3 main components:

- A language model. Based on a corpus of sentences, the language model assigns probabilities to sequences of words. This component ensures that the completions returned by the server are meaningful and relevant for the end user. In the current version, a very naive language model is implemented: it is character-based and returns the most likely full-sentence completions based on sentence frequencies in the corpus. It does not use smoothing, backoff, interpolation, or a word-based n-gram model. It relies on the nltk tokenizer to split sentences and uses pandas to manipulate corpus data.
- The AutoCompleteModel object, currently implemented in the MostPopularCompletionModel class. This object uses the probabilities (or counts) generated by the language model and stores them in an efficient data structure that enables fast completion queries through the generate_completions method. It relies on a trie data structure implemented by the datrie library, which is fast and memory-efficient. Currently, MostPopularCompletionModel couples the language model and the completion component.
- The Tornado web server. The Tornado web framework is fast and non-blocking on network I/O. It is lightweight and therefore well suited to our auto-complete microservice. It exposes the generate_completions function of the AutoCompleteModel object through a REST API. Completions are returned as JSON in response to an HTTP GET request on the autocomplete resource.
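The core idea of MostPopularCompletionModel — rank full sentences by corpus frequency and retrieve them by prefix — can be sketched with a dependency-free stand-in. The class and method names below mirror the description above, but the implementation is illustrative: a linear scan over a Counter replaces the actual datrie trie.

```python
from collections import Counter

class MostPopularCompletionSketch:
    """Toy stand-in for MostPopularCompletionModel: ranks full
    sentences by corpus frequency and returns the most frequent
    ones matching a given character prefix."""

    def __init__(self, sentences, max_completions=3):
        self.counts = Counter(sentences)
        self.max_completions = max_completions

    def generate_completions(self, prefix):
        # A real trie (e.g. datrie) makes this prefix lookup fast and
        # memory-efficient; a linear scan keeps the sketch self-contained.
        matches = [(s, n) for s, n in self.counts.items()
                   if s.startswith(prefix)]
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [s for s, _ in matches[:self.max_completions]]

corpus = [
    "How may I help you today?",
    "How may I help you today?",
    "How may I help you?",
    "How does that sound?",
]
model = MostPopularCompletionSketch(corpus)
print(model.generate_completions("How may"))
# → ['How may I help you today?', 'How may I help you?']
```

The sort key breaks frequency ties alphabetically so that results are deterministic.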
- Python 3.6
- Pip
- Git
git clone https://github.com/ynouri/auto-complete-server.git
cd auto-complete-server
python -m setup install
$ ./process_corpus.py data/sample_conversations.json models/sample_conversations.trie
09-24 19:41:04 INFO Start reading corpus from data/sample_conversations.json
09-24 19:41:04 INFO Finished reading corpus.
09-24 19:41:04 INFO Start split messages into sentences...
09-24 19:42:06 INFO Finished split messages into sentences.
09-24 19:42:06 INFO Start inserting into trie...
09-24 19:42:07 INFO Finished inserting into trie.
09-24 19:42:07 INFO Saving trie to models/sample_conversations.trie
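The processing steps in the log above — read the corpus, split messages into sentences, count them — can be sketched as follows. A crude regex splitter stands in for the nltk sentence tokenizer, and the resulting counts are what would then be inserted into the trie.

```python
import re
from collections import Counter

def split_sentences(message):
    # Crude stand-in for nltk's sentence tokenizer: split on
    # terminal punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", message) if s.strip()]

def build_counts(messages):
    """Count sentence frequencies across all messages in the corpus."""
    counts = Counter()
    for message in messages:
        counts.update(split_sentences(message))
    return counts

messages = [
    "Hello! How may I help you today?",
    "How may I help you today?",
]
counts = build_counts(messages)
print(counts["How may I help you today?"])  # → 2
```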
$ python -m auto_complete_server.web.app --trie-file=models/sample_conversations.trie --port=13000
[I 180924 19:45:24 mpc:28] Loading trie from models/sample_conversations.trie
[I 180924 19:45:24 app:55] Web server listening on 13000
The auto-complete service can be queried either directly on its REST API, available at /autocomplete?q=prefix:
$ curl http://18.234.152.60:13000/autocomplete?q=How
{"completions": ["How does that sound?", "How may I help you today?", "How may I help you?"]}
Or a simplified client page can be used by accessing the root of the host:port used (e.g. http://18.234.152.60:13000/). The client takes the REST API address as input and suggests completions for any text entered in a text box. Request timings are also displayed for information.
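Querying the API from Python only requires the standard library. The sketch below builds the request URL (the host and port are the example values from this section) and decodes a response payload of the JSON shape shown above:

```python
import json
from urllib.parse import urlencode

BASE = "http://18.234.152.60:13000/autocomplete"

def completion_url(prefix):
    # urlencode handles spaces and special characters in the prefix.
    return BASE + "?" + urlencode({"q": prefix})

url = completion_url("How can")
print(url)  # → http://18.234.152.60:13000/autocomplete?q=How+can

# The server replies with a JSON object holding a "completions" list:
payload = '{"completions": ["How does that sound?", "How may I help you?"]}'
completions = json.loads(payload)["completions"]
print(completions[0])  # → How does that sound?
```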
The Docker image can be built with the following command.
docker build -t auto-complete-server .
Docker images are automatically built on Docker Hub on every push.
The Docker image corresponding to the latest commit on master can be pulled and run easily. The following commands download the image and start the web server on port 80.
docker pull nouri/auto-complete-server:latest
docker run -p 80:13000 nouri/auto-complete-server:latest
- Deploy an AWS EC2 instance:
- t2.medium
- Ubuntu Bionic 18.04 AMI
- Open port 13000 in security group
- Install Docker
- Pull most recent auto-complete-server image:
docker pull nouri/auto-complete-server:latest
- Run Docker container:
docker run -p 13000:13000 nouri/auto-complete-server:latest
The following load test is run from a New York-based laptop and sends requests to the auto-complete server hosted on a us-east AWS instance. The Docker image ID used is e89150f7061b
. The Node-based loadtest utility is used to send a high number of requests with different concurrency settings.
Concurrency=1
The average response time is a decent 53 ms, without any errors.
$ loadtest -n 1000 -c 1 http://54.146.138.202:13000/autocomplete?q=How+can
Completed requests: 1000
Total errors: 0
Total time: 53.15666733 s
Requests per second: 19
Mean latency: 53.1 ms
Concurrency=100
A single Tornado process can handle a load 100x higher without any errors. The mean latency more than doubles to 120 ms.
$ loadtest -n 100000 -c 100 http://54.146.138.202:13000/autocomplete?q=How+can
Completed requests: 100000
Total errors: 0
Total time: 119.900251101 s
Requests per second: 834
Mean latency: 119.7 ms
Concurrency=500
With a concurrency of 500, the error rate surges to almost 10% of requests, and the mean latency increases to an unacceptable 600 ms. The number of successful requests per second still hovers around 800, which seems to be the maximum our current implementation can handle. To increase throughput further, a load balancer could be introduced to distribute requests across several Tornado processes (potentially hosted on different VMs).
$ loadtest -n 50000 -c 500 http://54.146.138.202:13000/autocomplete?q=How+can
Completed requests: 50000
Total errors: 4760
Total time: 60.870930902 s
Requests per second: 821
Mean latency: 584.3 ms
The following points could be considered for next steps / improvements.
- Add an interpolated Kneser-Ney language model as an alternative backend for the auto-complete. KenLM, SRILM, or the recent nltk.lm module could be considered for this. Implement serialization/deserialization of the model.
- Add a neural language model
- Compare quality of different models
- Split real corpus in a training and test set. Measure perplexity / cross-entropy scores of our language model on the test set.
- Implement pruning of low-probability scores to keep the dimensionality low.
- Decouple language model & completion function
- Language model results could be stored using an intermediary ARPA format (e.g. using the Python arpa package). The trie data structure would then be built from this ARPA file.
- Set up a Sphinx / readthedoc documentation.
- Test scaling of the auto-complete server. What load / requests per second / concurrency requires introducing load balancing?
- Add nginx as a reverse proxy and load balancer for several auto-complete services, that could use different CPU cores of the same VM and also different VMs.
- Look into scaling solutions using Kubernetes for easy cloud deployment.
- Automate performance tests. Perf tests could for example be compared by using a dedicated server (therefore physical hardware at constant load).
- Add a heavier test that builds an auto-complete model from the full sample corpus (not just the “unit test” corpus). Exclude this test from the pytest defaults and include it only in CI.
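For the perplexity / cross-entropy measurement mentioned above, the computation on a held-out test set could be sketched like this. A unigram model with add-one smoothing is assumed purely for illustration; the repository's current model is sentence-based.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on test data.

    perplexity = exp(cross-entropy), where cross-entropy is the
    average negative log-probability per test token.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    log_prob = 0.0
    for token in test_tokens:
        p = (counts[token] + 1) / (total + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))

train = "how may i help you today".split()
test = "how may i help".split()
print(round(unigram_perplexity(train, test), 2))  # → 6.5
```

A lower perplexity on the test split would indicate a better language model, giving an objective way to compare the current model against Kneser-Ney or neural alternatives.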