ES hangs after reinstall (previously discussed on discuss.elastic.co) #21902

Closed
tianchao-haohan opened this issue Dec 1, 2016 · 7 comments
Labels
:Core/Infra/Core (Core issues without another label), feedback_needed

Comments

@tianchao-haohan

https://discuss.elastic.co/t/elasticsearch-hang-after-reinstall/67425

Elasticsearch version: 5.0.0-alpha1

Plugins installed: []

JVM version: 1.8.0_60

OS version: SUSE Enterprise 12

Description of the problem including expected versus actual behavior:
I removed 5.0.0-alpha1 with rpm -ev and then installed elasticsearch-2.4.1. Expected: Elasticsearch starts. Actual: Elasticsearch fails to start.

Steps to reproduce:

  1. Install elasticsearch-5.0.0-alpha1.
  2. Uninstall elasticsearch-5.0.0-alpha1.
  3. Install elasticsearch-2.4.1.
  4. Start Elasticsearch with sudo systemctl start elasticsearch.
  5. Run ps -ef | grep elastic; the process is listed:
    elastic+ 6400 1 0 03:17 pts/1 00:00:00 /usr/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/elasticsearch-2.4.1.jar:/usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -d -p /var/run/elasticsearch.pid --default.config=/etc/elasticsearch/elasticsearch.yml --default.path.home=/usr/share/elasticsearch --default.path.logs=/var/log/elasticsearch --default.path.data=/var/lib/elasticsearch --default.path.work=/tmp/elasticsearch --default.path.conf=/etc/elasticsearch
  6. /bin/netstat -nap | grep 9200 finds nothing.
    curl -XGET localhost:9200 fails, even though network.host in elasticsearch.yml is set to 0.0.0.0.

Provide logs (if relevant): There are no logs under /var/log/elasticsearch.
jstack output for the pid:
"Signal Dispatcher" #5 daemon prio=9 os_prio=0 tid=0x00007fcfc0172000 nid=0x4dd0 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Surrogate Locker Thread (Concurrent GC)" #4 daemon prio=9 os_prio=0 tid=0x00007fcfc0170800 nid=0x4dcf waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007fcfc0134000 nid=0x4dce in Object.wait() [0x00007fcf734fb000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000c00070b8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
    - locked <0x00000000c00070b8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007fcfc0132000 nid=0x4dcd in Object.wait() [0x00007fcf735fc000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000c0006af8> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:157)
    - locked <0x00000000c0006af8> (a java.lang.ref.Reference$Lock)

"main" #1 prio=5 os_prio=0 tid=0x00007fcfc0009800 nid=0x4db9 runnable [0x00007fcfc7955000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)
    at sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)
    at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)
    at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:55)
    at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:70)
    at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:48)
    at sun.nio.fs.LinuxFileSystem.getFileStore(LinuxFileSystem.java:112)
    at sun.nio.fs.UnixFileSystem$FileStoreIterator.readNext(UnixFileSystem.java:213)
    at sun.nio.fs.UnixFileSystem$FileStoreIterator.hasNext(UnixFileSystem.java:224)
    - locked <0x00000000c0b74e48> (a sun.nio.fs.UnixFileSystem$FileStoreIterator)
    at org.apache.lucene.util.IOUtils.getFileStore(IOUtils.java:515)
    at org.apache.lucene.util.IOUtils.spinsLinux(IOUtils.java:459)
    at org.apache.lucene.util.IOUtils.spins(IOUtils.java:448)
    at org.elasticsearch.env.ESFileStore.<init>(ESFileStore.java:57)
    at org.elasticsearch.env.Environment.<clinit>(Environment.java:90)
    at org.elasticsearch.node.internal.InternalSettingsPreparer.prepareEnvironment(InternalSettingsPreparer.java:81)
    at org.elasticsearch.common.cli.CliTool.<init>(CliTool.java:107)
    at org.elasticsearch.common.cli.CliTool.<init>(CliTool.java:100)

I'm sure that I have removed all the related Elasticsearch files, and ps -ef shows no other Java process.
I also tried ES 5.x, and it hangs in the same way.
I also tried starting ES directly with this command:
"/usr/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/elasticsearch-2.4.2.jar:/usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.conf=/etc/elasticsearch"
The ES process hung again.
No logs were created:
c4dev@si-portal-server:~> ls /var/log/elasticsearch/
c4dev@si-portal-server:~>

@jasontedor
Member

This:

"main" #1 prio=5 os_prio=0 tid=0x00007fcfc0009800 nid=0x4db9 runnable [0x00007fcfc7955000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)
    at sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)
    at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)
    at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:55)
    at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:70)
    at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:48)
    at sun.nio.fs.LinuxFileSystem.getFileStore(LinuxFileSystem.java:112)
    at sun.nio.fs.UnixFileSystem$FileStoreIterator.readNext(UnixFileSystem.java:213)
    at sun.nio.fs.UnixFileSystem$FileStoreIterator.hasNext(UnixFileSystem.java:224)
    - locked <0x00000000c0b74e48> (a sun.nio.fs.UnixFileSystem$FileStoreIterator)
    at org.apache.lucene.util.IOUtils.getFileStore(IOUtils.java:515)
    at org.apache.lucene.util.IOUtils.spinsLinux(IOUtils.java:459)
    at org.apache.lucene.util.IOUtils.spins(IOUtils.java:448)
    at org.elasticsearch.env.ESFileStore.<init>(ESFileStore.java:57)
    at org.elasticsearch.env.Environment.<clinit>(Environment.java:90)
    at org.elasticsearch.node.internal.InternalSettingsPreparer.prepareEnvironment(InternalSettingsPreparer.java:81)
    at org.elasticsearch.common.cli.CliTool.<init>(CliTool.java:107)
    at org.elasticsearch.common.cli.CliTool.<init>(CliTool.java:100)

What filesystem and disks are you on?

@ANorwell

I seem to be hitting this as well in the Java ES client on a very large ZFS installation (many disks and thousands of datasets). Ideally, neither the ES client nor the server should care about these or need to iterate over them.

@ANorwell

ANorwell commented Dec 27, 2016

To add some more detail here:

The initialization of org.elasticsearch.env.Environment is quadratic in the number of FileStores returned by java.nio.file.FileSystems.getDefault().getFileStores(). For me, this can be 30k+ filestores, because the underlying file system is ZFS.

The quadratic behavior arises because:
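
To check how many filestores the JVM sees on a given machine, something like the following minimal sketch could be run. It only uses the standard java.nio.file API; the class name FileStoreCount and the timing output are made up for illustration and are not part of Elasticsearch or Lucene.

```java
import java.nio.file.FileStore;
import java.nio.file.FileSystems;

/** Hypothetical helper: counts the file stores visible to this JVM. */
final class FileStoreCount {

    /** Walks the default file system's stores once and returns how many there are. */
    static int countFileStores() {
        int count = 0;
        for (FileStore ignored : FileSystems.getDefault().getFileStores()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int count = countFileStores();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(count + " file stores enumerated in " + elapsedMillis + " ms");
    }
}
```

If a single pass is already slow or returns tens of thousands of entries, repeating it once per filestore would plausibly explain a startup that appears to hang.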

  1. The static initialization of org.elasticsearch.env.Environment loops over all filestores, creating an ESFileStore object for each.
  2. The initialization of each ESFileStore invokes org.apache.lucene.util.IOUtils.spins, which calls IOUtils.getFileStore, which itself loops over every filestore to find the one matching the provided path. (Additionally, the initialization scans /proc/self/mountinfo for the matching entry, which is also quadratic in the number of mounted filestores. In my case, most filestores are not mounted, so mountinfo has only a few entries.)
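
The two steps above can be pictured with a toy model. This is only a sketch of the nested-loop shape, not the actual Elasticsearch or Lucene code: it touches no real filesystem, the class and method names are hypothetical, and it assumes that each filestore inspected in the inner loop costs one stat call.

```java
/** Toy model of the nested scan; names are illustrative, no real I/O happens. */
final class FileStoreScanModel {

    /**
     * Counts simulated stat calls when every file store triggers a fresh
     * pass over all file stores.
     */
    static long nestedScanStatCalls(int storeCount) {
        long statCalls = 0;
        // Outer loop: one ESFileStore is created per file store.
        for (int target = 0; target < storeCount; target++) {
            // Inner loop: the search for the matching store walks every store again.
            for (int candidate = 0; candidate < storeCount; candidate++) {
                statCalls++; // assumed cost: one stat per inspected candidate
            }
        }
        return statCalls;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 1_000, 30_000}) {
            System.out.println(n + " stores: " + nestedScanStatCalls(n) + " simulated stat calls");
        }
    }
}
```

Under these assumptions, 30,000 filestores come out to 900,000,000 simulated stat calls, which is consistent with the main thread sitting in UnixNativeDispatcher.stat0 for a very long time.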

Some possible inefficiencies I see here:

  1. The filestore is already known. Either IOUtils could accept it as a parameter in its interface or, if Lucene is hard to change, the spins heuristic could be re-implemented in Elasticsearch, since it is really just a few lines of code.
  2. /proc/self/mountinfo could be read only once and preprocessed into a mount point => device number lookup.
  3. Ideally the Java Elasticsearch client should not care about what disks exist. This seems specific to the server.
  4. Is it really necessary for the server to know every filestore that exists? In some environments the vast majority are unrelated to ES data storage.
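
As a rough illustration of the second point, here is a minimal sketch of reading mountinfo once and building a lookup. It is not taken from Lucene or Elasticsearch; the class MountInfoIndex and its methods are hypothetical. It relies on the documented /proc/self/mountinfo layout, where the third space-separated field is the major:minor device number and the fifth is the mount point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical one-pass index over /proc/self/mountinfo. */
final class MountInfoIndex {

    /** Maps mount point to "major:minor" for lines in mountinfo format. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> deviceByMountPoint = new HashMap<>();
        for (String line : lines) {
            // Spaces inside paths are escaped as \040, so splitting on ' ' is safe.
            String[] fields = line.split(" ");
            if (fields.length < 5) {
                continue; // skip lines that do not look like mountinfo entries
            }
            // fields[2] is major:minor, fields[4] is the mount point.
            // A later line for the same mount point replaces the earlier one.
            deviceByMountPoint.put(fields[4], fields[2]);
        }
        return deviceByMountPoint;
    }

    /** Reads the file once; every later lookup is a constant-time map access. */
    static Map<String, String> load() throws IOException {
        return parse(Files.readAllLines(Paths.get("/proc/self/mountinfo"), StandardCharsets.UTF_8));
    }
}
```

With such a map built once at startup, each per-filestore lookup no longer needs to rescan the file, which removes the second quadratic factor mentioned above.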

@colings86 added the :Core/Infra/Core (Core issues without another label) label on Mar 21, 2017
@colings86
Contributor

@jasontedor is this still an issue?

@colings86
Contributor

@ANorwell are you still seeing this problem on the released versions of 5.x?

@jasontedor
Member

No additional feedback, closing.

@jasontedor
Member

@ANorwell See #24402.
