Time-out issue when running native erlang views in 2.x on #1008

Closed
sklassen opened this issue Nov 18, 2017 · 12 comments

Comments

@sklassen
Contributor

I am seeing a timeout when indexing a view written in native Erlang. The Erlang view works with 1.6.x for databases of any size; the view also works on 2.x for databases with fewer records or smaller documents. When run against a database with many large(ish) documents, I see the following error:

[error] 2017-11-17T03:00:11.072015Z couchdb@localhost <0.12.1196> 19a93c5b89 rexi_server throw:{timeout,{gen_server,call,[<0.9106.1195>,{prompt,[...]}}]}]}]}} [{couch_mrview_util,get_view,4,[{file,"src/couch_mrview_util.erl"},{line,56}]},{couch_mrview,query_view,6,[{file,"src/couch_mrview.erl"},{line,244}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]

If you later try to call a view that failed due to the timeout, you get a second error:

[error] 2017-11-17T05:01:27.670419Z couchdb@localhost <0.26156.1198> d00b01bc7d rexi_server exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,503}]},{couch_mrview_util,fold_fun,4,[{file,"src/couch_mrview_util.erl"},{line,360}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,217}]}]

I suspect the gen_server timeout needs to be extended. With multiple nodes, some indexing tasks might be preempted and thus time out.

I used the Ubuntu package couchdb 2.1.1-1 on xenial; I also replicated the same error on an earlier 2.0 version running under snap.

@wohali wohali added the dbcore label Jan 16, 2018
@gregoryjgarcia0

I have that same error message in my log. Is your CPU usage getting spiked really hard by the erlang process too? That's the problem I'm trying to solve. I'm using 2.1.1

@dc0d

dc0d commented Mar 1, 2018

Same error (on verify installation, installed via snap on Ubuntu 16.04). Also it installs 2.0.0 instead of 2.1.1.

@janl janl added this to the 2.2.0 milestone Mar 5, 2018
@davisp
Member

davisp commented Mar 9, 2018

Anyone have a way to duplicate this? The view engine changed between 1.6 and 2.x but the way Erlang functions are invoked shouldn't be any different so that's a bit odd. I'd also be interested in which version of Erlang is used as well.

@janl
Member

janl commented Mar 9, 2018

@davisp see #1142 for more context

@davisp
Member

davisp commented Mar 9, 2018

Aha, updated there, but this also seems like two different errors to me.

@ghost

ghost commented Apr 10, 2018

Same error, Erlang 6.2

[error] 2018-04-10T10:10:59.860288Z nonode@nohost <0.14711.9> bcb49f8b6b rexi_server: from: nonode@nohost(<0.32062.6>) mfa: fabric_rpc:reduce_view/4 throw:{timeout,{gen_server,call,[couch_proc_manager,{get_proc,<<"javascript">>},5000]}} [{couch_mrview_util,get_view_index_state,5,[{file,"src/couch_mrview_util.erl"},{line,101}]},{couch_mrview_util,get_view,4,[{file,"src/couch_mrview_util.erl"},{line,45}]},{couch_mrview,query_view,6,[{file,"src/couch_mrview.erl"},{line,244}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]

@janl
Member

janl commented Jul 8, 2018

Closing until a reproducible test case is provided.

@janl janl closed this as completed Jul 8, 2018
@sklassen
Contributor Author

sklassen commented Nov 7, 2018

Hi @janl , @wohali , @davisp

I have created a test suite that generates the timeout mentioned above.

https://github.com/sklassen/couchapp-erlang-example.git

There is a Python script that generates a large number of (fairly big) documents. There is a JavaScript and an Erlang version of each view. Try the script with 5 documents to see it function. Then try 500 and you should see timeouts on the Erlang view. The JavaScript view may crash too, restarting the server.
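For readers without the repository at hand, a generator of this kind might look roughly like the sketch below. It is not the script from the repository: the function names, field names, and sizes are hypothetical, chosen only to produce large, deeply nested JSON documents.

```python
import json
import random
import string


def make_nested(depth, width, rng):
    """Build a nested dict `depth` levels deep with `width` keys per level."""
    if depth == 0:
        # Leaf: a random 64-character string as payload.
        return "".join(rng.choices(string.ascii_letters, k=64))
    return {f"k{i}": make_nested(depth - 1, width, rng) for i in range(width)}


def make_docs(count, depth=4, width=6, seed=0):
    """Return `count` documents suitable for a _bulk_docs style payload."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    return [
        {"_id": f"doc-{n:06d}", "payload": make_nested(depth, width, rng)}
        for n in range(count)
    ]


if __name__ == "__main__":
    docs = make_docs(5)
    body = json.dumps({"docs": docs})
    print(len(docs), "documents,", len(body), "bytes")
```

Raising `depth` or `width` grows each document quickly, which makes it easy to vary document size and nesting independently of the document count.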

I last ran it on CouchDB 2.2 on Ubuntu 18.04, from the http://apache.bintray.com/couchdb-deb bionic package. I ran it on a NUC 7i with 4 cores and 15G of memory (n=1,q=8). I've also seen it on (n=1,q=1) and over three NUCs (n=3,q=8). The same Erlang views ran on 1.6.x without issue.

Perhaps there is a configurable timeout that needs tweaking?
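One candidate to experiment with is the `os_process_timeout` setting in the `[couchdb]` config section, whose 5000 ms default matches the `5000` in the `couch_proc_manager` log line above. Whether it governs the timeouts seen with native Erlang views is an assumption that nothing in this thread confirms. The sketch below only builds the config API request and sends nothing; the helper name, host, port, and the 60000 ms value are hypothetical.

```python
import json
from urllib.parse import quote


def timeout_config_request(base_url, node, timeout_ms):
    """Return (method, url, body) for setting [couchdb] os_process_timeout.

    Nothing is sent; hand the result to an HTTP client of your choice.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    url = (
        f"{base_url.rstrip('/')}/_node/{quote(node, safe='@')}"
        "/_config/couchdb/os_process_timeout"
    )
    # The config API takes the new value as a JSON string, not a number.
    return "PUT", url, json.dumps(str(timeout_ms))


if __name__ == "__main__":
    # Node name as it appears in the logs; host and port are assumptions.
    print(timeout_config_request("http://127.0.0.1:5984", "couchdb@localhost", 60000))
```

The request needs admin credentials when sent, and the setting is per node, so a cluster would need it applied to each node.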

@wohali wohali reopened this Nov 7, 2018
@wohali wohali modified the milestones: 2.2.0, 3.0.0 Jul 11, 2019
@Caesar305

Are there any known workarounds for this issue?

@sklassen
Contributor Author

I can confirm I still see the same issue with the test suite above with a doc count of 500. When I run it there is no longer an error message; the process runs for some time until it quietly runs out of memory and restarts. (I am using the snap installation on Ubuntu 19.04, version 2.3.1; erts-8.3.5.4; n=1; q=8 on a NUC with 8 cores and 15GB).

I don't see the problem with larger databases with smaller documents. I suspect it also isn't only the size of the documents, but also the depth of nested structures. Memory management between erlang and the NIF is the likely culprit.

In my real-life database, as a workaround, I did a bit of everything: i) increased memory; ii) increased nodes n=5 (shared the problem around); iii) decreased the document size; iv) re-ran indexing multiple times. In my case, it now works on the second or third attempt of a full index. Incremental indexing is fine.
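The "re-run indexing until it succeeds" part of that workaround can be sketched as a small retry loop. This is an illustration only, not something taken from the thread: `trigger` is a hypothetical callable that queries the view (forcing an index build) and raises on a timeout or a dropped connection, and the attempt count and delay are arbitrary.

```python
import time


def retry_until_built(trigger, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call `trigger()` until it succeeds or `attempts` are used up.

    Returns the 1-based number of the attempt that succeeded.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            trigger()
            return attempt
        except Exception as err:  # broad on purpose: any failure means retry
            last_error = err
            if attempt < attempts:
                sleep(delay_s)  # give the node time to recover or restart
    raise RuntimeError(f"view not built after {attempts} attempts") from last_error
```

Because later attempts resume from whatever index progress was already saved, a retry may succeed even when the first full build did not.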

@Caesar305

Our installation is not big; it just has over 100 databases, each maybe 200 MB in size with a few thousand documents. This issue occurs randomly for us: one of the nodes will simply start responding to requests very slowly (over 20 second delay). When looking at the processes, I see 2 couchdb processes pegging a few CPUs. The logs are showing similar messages to the OP's. Running 32GB RAM, 16 processors, 1TB SSD drives. Not sure what I can tweak to help remedy this.

@sklassen
Contributor Author

Hi @janl , @wohali , @davisp

This problem disappeared after I rebuilt couchdb using jiffy 1.04 (see davisp/jiffy@0ba322e). Thanks @davisp for the fix.

@wohali wohali closed this as completed Mar 26, 2020
7 participants