Add MemoryMapping example to Annoy tutorial #891

tmylk · 2016-09-27T05:47:31Z

Add example to AnnoyTutorial where 2 parallel processes load the same model from disk and mmap the same index file.

harshuljain13 · 2016-09-28T16:39:19Z

I have prepared the code to save and fetch the model from disk using 2 parallel processes. What do we mean by "mmap the same index file"? What is the use case for this scenario? I will add a description too.

tmylk · 2016-09-29T03:30:43Z

@harshul1610 Thanks for taking this up. Could you please move annoy_index.load('index') into the thread and also output memory used? Memory should not increase much as the index stays on disk.

Also, it would be more professional to choose a less controversial word example instead of 'army'.

harshuljain13 · 2016-09-29T05:57:03Z

Sure, I will do it.

added snippet for parallel processes to load the saved model and sharing the memory mapped index piskvorky#891

harshuljain13 · 2016-09-29T14:15:21Z

Is it good?

tmylk · 2016-09-29T16:15:19Z

To be more explicit: each process should have its own indexer.

harshuljain13 · 2016-10-01T03:20:26Z

Is it good?

tmylk · 2016-10-02T06:02:53Z

Hi @harshul1610, the code looks good.
Could you please add some text above cell 10 to provide motivation for the code?
It would be best to add a new cell where 2 separate indices are created and used without saving/loading. It should use much more memory than cell 10.
Your text will explain that memory mapping saves RAM.
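
As a rough illustration of why a memory-mapped file saves RAM, here is a minimal sketch using only the Python standard library. It is not Annoy's file format or API: the flat float32 layout and the names `write_vectors` and `read_vector` are assumptions made up for this example. The point is that a reader maps the file read-only and decodes only the bytes it needs, and the operating system can back every such mapping of the same file with the same cached pages.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per little-endian 32-bit float


def write_vectors(path, vectors):
    """Write equal-length float vectors to a flat binary file (hypothetical layout)."""
    with open(path, "wb") as fh:
        for vec in vectors:
            fh.write(struct.pack("<%df" % len(vec), *vec))


def read_vector(path, row, dim):
    """Map the file read-only and decode a single row without reading the rest."""
    with open(path, "rb") as fh:
        mapped = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            offset = row * dim * FLOAT_SIZE
            return struct.unpack_from("<%df" % dim, mapped, offset)
        finally:
            mapped.close()


# Build a tiny "index" on disk, then read one row back through the mapping.
demo_path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(demo_path, [(1.0, 2.0, 3.0), (4.0, 5.0, 6.5)])
print(read_vector(demo_path, 1, 3))
```

Each process that calls `read_vector` on the same path creates its own mapping object, which mirrors the request that each process have its own indexer while the file on disk is shared.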

tmylk · 2016-10-02T06:03:18Z

Also, there is no need to create a new PR to update the code. You can just keep pushing to the existing branch.

harshuljain13 · 2016-10-02T08:08:10Z

@tmylk It is the running time that clearly increases when we don't use the memory-mapped index. The memory used by each process, however, is approximately the same.

When the index file is not memory-mapped:

%%time
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

model.save('/tmp/mymodel')

def f(process_id):
    print 'Process Id: ', os.getpid()
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    annoy_index = AnnoyIndexer(new_model,100)
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print neighbor
    print 'Memory used by process '+str(os.getpid())+'=', process.memory_info()

p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()

Process Id: 9681
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9681= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9700
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9700= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 168 ms, sys: 16 ms, total: 184 ms
Wall time: 6.84 s

When the index file is memory-mapped:

%%time
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

model.save('/tmp/mymodel')

def f(process_id):
    print 'Process Id: ', os.getpid()
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    annoy_index = AnnoyIndexer()
    annoy_index.load('index')
    annoy_index.model = new_model
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print neighbor
    print 'Memory used by process '+str(os.getpid())+'=', process.memory_info()

p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()

Results:
Process Id: 9648
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9648= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9663
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9663= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 104 ms, sys: 28 ms, total: 132 ms
Wall time: 471 ms

One thing is certain: when the index file is memory-mapped, there is a drastic increase in the shared memory reported for each process (shared=26886144, versus shared=8687616 when the index is not memory-mapped).

tmylk · 2016-10-02T08:13:56Z

Let me check that I understand correctly what you are saying:
"cumulative RAM used by 2 processes with memory mapping from disk" = "RAM used by 2 processes, each of which created its own index"

What matters is the RAM used by the 2 processes together, not separately by each one. I don't see this statistic in your log above.

That means the mmap claim made by Annoy is not true: "It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data."

harshuljain13 · 2016-10-02T08:59:38Z

@tmylk updated the statistics

tmylk · 2016-10-02T09:13:28Z

Thanks, that is more clear now. So it is using less memory in total, as it is shared. Shared memory is included in RSS so it was confusing before.

Please add some text and this output to the notebook.
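
Because shared memory is included in RSS, adding up the RSS of both processes counts the shared pages twice. A minimal sketch of the corrected arithmetic follows, using the `rss` and `shared` values logged for the memory-mapped run above. The helper names are hypothetical, and the estimate assumes that both processes share exactly the same pages; the `shared` field only reports file-backed pages that may be shared, so treat the result as an approximation.

```python
from collections import namedtuple

# Only the two pmem fields needed for the estimate.
Pmem = namedtuple("Pmem", ["rss", "shared"])


def naive_total(samples):
    """Sum RSS across processes; shared pages are counted once per process."""
    return sum(s.rss for s in samples)


def combined_footprint(samples):
    """Sum private memory per process and count the shared pages only once.

    Assumption: every process shares the same set of pages.
    """
    private = sum(s.rss - s.shared for s in samples)
    shared_once = max(s.shared for s in samples)
    return private + shared_once


# Values copied from the memory-mapped run (processes 9648 and 9663).
mmap_run = [
    Pmem(rss=242716672, shared=26886144),
    Pmem(rss=242716672, shared=26886144),
]

print(naive_total(mmap_run))         # 485433344
print(combined_footprint(mmap_run))  # 458547200
```

The difference between the two totals equals one copy of the shared pages, which is the amount that a per-process RSS comparison hides.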

tmylk · 2016-10-18T14:35:14Z

Thanks for the PR!

tmylk added the documentation and difficulty easy labels Sep 27, 2016

harshuljain13 added a commit to harshuljain13/gensim that referenced this issue Sep 28, 2016

solves issue piskvorky#891. added parallel processing example

34ab52c

harshuljain13 added a commit to harshuljain13/gensim that referenced this issue Sep 29, 2016

solves issue piskvorky#891. added parallel processing example

f958d46

added snippet for parallel processes to load the saved model and sharing the memory mapped index piskvorky#891

harshuljain13 mentioned this issue Sep 30, 2016

solves issue #891. added parallel processing example #899

Closed

harshuljain13 added a commit to harshuljain13/gensim that referenced this issue Oct 2, 2016

added description, memory statistics and code comparison. piskvorky#891

7bf31e1

tmylk closed this as completed Oct 18, 2016
