Add MemoryMapping example to Annoy tutorial #891

Closed
tmylk opened this issue Sep 27, 2016 · 13 comments
Labels
difficulty easy (Easy issue: required small fix), documentation (Current issue related to documentation)

Comments

tmylk commented Sep 27, 2016

Add example to AnnoyTutorial where 2 parallel processes load the same model from disk and mmap the same index file.
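
As a rough illustration of what "mmap the same index file" means, here is a minimal standard-library sketch. It is not the Annoy or gensim implementation. It assumes a throwaway file of raw bytes standing in for the index; the helper name `read_slice_via_mmap` and the file contents are hypothetical. Each call opens its own read-only mapping of the same file, the way each worker process would, and the operating system can back all such mappings with the same cached pages instead of a private copy per reader.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Map the file read-only and return `length` bytes starting at `offset`."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Hypothetical stand-in for an index file: 1 MiB of repeating bytes.
payload = bytes(range(256)) * 4096
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    index_path = tmp.name

# Two independent mappings of the same file, as two processes would create.
first = read_slice_via_mmap(index_path, 1024, 16)
second = read_slice_via_mmap(index_path, 1024, 16)
print(first == second)

os.remove(index_path)
```

Only the pages that are actually touched need to be resident, which is why loading a memory-mapped index should add little private memory to each process.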

tmylk added the documentation and difficulty easy labels Sep 27, 2016
@harshuljain13

I have prepared the code to save the model and load it from disk in 2 parallel processes. What do we mean by mmap the same index file? What is the use case for this scenario? I will add a description too.

harshuljain13 added a commit to harshuljain13/gensim that referenced this issue Sep 28, 2016
tmylk commented Sep 29, 2016

@harshul1610 Thanks for taking this up. Could you please move annoy_index.load('index') into the thread and also output memory used? Memory should not increase much as the index stays on disk.

Also, it would be more professional to choose a less controversial word example instead of 'army'.

harshuljain13 commented Sep 29, 2016

Sure, I will do it.

harshuljain13 added a commit to harshuljain13/gensim that referenced this issue Sep 29, 2016
added snippet for parallel processes to load the saved model and sharing the memory mapped index piskvorky#891
@harshuljain13

Is it good?

tmylk commented Sep 29, 2016

To be more explicit: each process should have its own indexer.

@harshuljain13

Is it good?

tmylk commented Oct 2, 2016

Hi @harshul1610, the code looks good.
Could you please add some text above cell 10 to provide motivation for the code?
It would be best to add a new cell where 2 separate indices are created and used without saving/loading. It should use much more memory than cell 10.
Your text will explain that memory mapping saves RAM.

tmylk commented Oct 2, 2016

Also, there is no need to create a new PR to update the code. You can just keep pushing to the existing branch.

harshuljain13 commented Oct 2, 2016

@tmylk What clearly increases when we don't use a memory-mapped index is the run time. The memory used by each process is approximately the same.

Index file not memory mapped:

%%time
# Baseline: each process builds its own Annoy index in memory (no memory mapping).
# `model` is assumed to be the Word2Vec model from an earlier notebook cell.
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id: {}'.format(os.getpid()))
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    # Build a fresh index with 100 trees inside this process.
    annoy_index = AnnoyIndexer(new_model, 100)
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print(neighbor)
    print('Memory used by process {}= {}'.format(os.getpid(), process.memory_info()))

# The second process starts only after the first has been joined.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()

Process Id: 9681
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9681= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9700
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9700= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 168 ms, sys: 16 ms, total: 184 ms
Wall time: 6.84 s

Index file memory mapped:

%%time
# Memory-mapped variant: each process loads the index file saved to disk.
# `model` is assumed to be the Word2Vec model from an earlier notebook cell,
# and the file 'index' is assumed to have been saved there as well.
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id: {}'.format(os.getpid()))
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    # Each process creates its own indexer and loads the shared index file.
    annoy_index = AnnoyIndexer()
    annoy_index.load('index')
    annoy_index.model = new_model
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print(neighbor)
    print('Memory used by process {}= {}'.format(os.getpid(), process.memory_info()))

# The second process starts only after the first has been joined.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()

Results:
Process Id: 9648
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9648= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9663
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9663= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 104 ms, sys: 28 ms, total: 132 ms
Wall time: 471 ms

One thing is for sure: when the index file is memory mapped from disk, the shared memory reported for each process increases drastically (shared=26886144 versus shared=8687616 in the logs above).

tmylk commented Oct 2, 2016

Let me check that I understand what you are saying:
"cumulative RAM used by 2 processes with memory mapping from disk" = "RAM used by 2 processes, each of which created its own index"

What matters is the RAM used by the 2 processes together, not separately by each one. I don't see this statistic in your log above.

That would mean the mmapping claim by Annoy is not true: "It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data."

@harshuljain13

@tmylk updated the statistics

tmylk commented Oct 2, 2016

Thanks, that is more clear now. So it is using less memory in total, as it is shared. Shared memory is included in RSS so it was confusing before.

Please add some text and this output to the notebook.
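
The double counting can be made concrete with a small sketch. It assumes, as a simplification, that the `shared` pages reported for each process are the same physical pages, so they are counted once. The helper name `combined_footprint` is hypothetical, and the figures are the `rss` and `shared` values from the memory-mapped log above.

```python
def combined_footprint(per_process):
    """Estimate total RAM for several processes, counting shared pages once.

    `per_process` is a list of (rss, shared) pairs in bytes. Assumption: every
    process shares the same pages, so the largest `shared` value is added once.
    """
    private = sum(rss - shared for rss, shared in per_process)
    shared_once = max(shared for _, shared in per_process)
    return private + shared_once


# (rss, shared) for processes 9648 and 9663 in the memory-mapped run.
mmap_run = [(242716672, 26886144), (242716672, 26886144)]

naive_total = sum(rss for rss, _ in mmap_run)
print(naive_total)                   # 485433344
print(combined_footprint(mmap_run))  # 458547200
```

Adding up `rss` per process overstates the total by one copy of the shared pages for every extra process. Where available, psutil's `memory_full_info()` reports `uss` (memory unique to a process), which avoids this arithmetic.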

tmylk commented Oct 18, 2016

Thanks for the pr!

@tmylk tmylk closed this as completed Oct 18, 2016