Chinese characters / words can't be searched #713

Open
sunnipaul opened this issue Sep 6, 2015 · 15 comments

Comments

@sunnipaul
Contributor

For example, searching for "汉字" returns no results even though I've sent a message containing it in a chat room.

@sunnipaul
Contributor Author

Similarly, I can't create a room with a Chinese name.

@marceloschmidt
Member

Related to #640

@robhawkins

TL;DR: Elastic Search or Lucene might be able to help here. This blog post gives a good explanation of the problems with searching Chinese and how Elastic can help.

I see two issues with searching non-Latin text in RChat. One is that the current RChat can't search for Chinese characters (or anything much outside of [a-z0-9]). The other is that Chinese has no spaces, so the words need some segmentation before they become searchable in an indexed full-text search engine. For the second issue, I investigated a little to see whether Mongo has implemented a segmenter for various languages yet. It turns out Chinese text search is in the Mongo Enterprise Edition but not yet available in free versions. It might be possible to add your own tokenizer. For instance, googling for "mongo full text custom tokenizer" turned up this blog post from 2010-11-14: Full text search with MongoDB and Lucene analyzers
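
To illustrate the segmentation problem described above, here is a minimal Python sketch (not Rocket.Chat or MongoDB code). It contrasts whitespace tokenization with overlapping character bigrams, a common fallback for unsegmented CJK text. The function names and the restriction to the basic CJK Unified Ideographs block are assumptions made for illustration.

```python
import re

# Runs of characters from the basic CJK Unified Ideographs block.
# Limiting the range to this one block is an assumption for brevity.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def whitespace_tokens(text):
    """Split on whitespace only, as a naive tokenizer would."""
    return text.lower().split()


def cjk_bigrams(text):
    """Return overlapping two-character tokens for every CJK run in text."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)  # a lone character is its own token
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def index_tokens(text):
    """Hypothetical combined tokenizer: Latin words plus CJK bigrams."""
    latin = re.findall(r"[a-z0-9]+", text.lower())
    return latin + cjk_bigrams(text)
```

Whitespace splitting leaves a sentence without spaces as a single token, so a search for one word inside it finds nothing. With bigrams, a two-character word such as 汉字 becomes an index token of its own. A dictionary-based segmenter would produce fewer and more meaningful tokens, at the cost of needing a word list.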

Elastic Search, on the other hand, has a tokenizer that claims to work with several languages. This blog post from 2014-12-18 gives a rundown on performance with Chinese: Efficient Chinese Search with Elastic Search. They also describe how to get a better tokenizer, Paoding, to work with Elastic.
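
As a rough idea of what the Elastic side could look like, the sketch below builds an index definition that applies Elasticsearch's built-in `cjk` analyzer to a message field. The field name `msg`, the helper name, and the typeless mapping layout (as used by recent Elasticsearch versions) are assumptions, not something taken from Rocket.Chat. A plugin-provided analyzer would be referenced by its own name instead of `cjk`.

```python
import json


def build_index_body(field="msg", analyzer="cjk"):
    """Hypothetical helper: index definition that analyzes one text field.

    `field` is an assumed name for the message body; `cjk` is the name of
    Elasticsearch's built-in analyzer for Chinese, Japanese and Korean text.
    """
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer},
            }
        }
    }


# JSON request body that would be sent when creating the index.
index_body_json = json.dumps(build_index_body(), indent=2)
```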

@rodrigok
Member

Yes, we need to offer a better search engine as an option. We implemented search using MongoDB's internal search engine to keep RC easy to install while still letting users search their messages, so we need a way for users to plug in other search engines that solve their problems better.

We need help with this issue because this isn't our main focus now and we don't know much about other search engines.

@steedos

steedos commented Sep 23, 2015

Check this: Real-Time Search With MongoDB and Solr

http://geniuscarrier.com/real-time-search-with-mongodb-and-solr/

@robhawkins

@rodrigok - I'd love to help. Implementing searchable text for other languages would be really fun and interesting. Do you all have other full time jobs? Or are you just focused on this?

@steedos - Interesting. I wonder what solution would be the simplest to implement and maintain. Would the Mongo+Solr option require rewriting the application to work with Solr? If so then I don't know if using Mongo+Solr or just Elastic would be simpler.

@rodrigok
Member

Hi @robhawkins, we are moving our main focus to Rocket.Chat

@steedos

steedos commented Oct 10, 2015

I just tried Elastic Search. It's good at indexing and searching Chinese documents, so we just need an admin setting to enable Elastic Search and configure the server URL.

With Elastic Search, we can also index Office documents and PDF files. I think it's important to be able to attach Office documents in chat rooms.

@sunnipaul
Contributor Author

@steedos It would be great. Anyone want to do it?

@FerminYang

> @steedos It would be great. Anyone want to do it?

+1

@TwizzyDizzy

@rocket-cat close

Closing, since this can no longer be reproduced on 0.61.1 (trying what was described in the first post).

Cheers
Thomas

rocket-cat bot closed this as completed Feb 16, 2018

@robhawkins

@TwizzyDizzy are Chinese characters searchable now? It's been a while since I looked and I would be interested to know. An example test would be to type in a sentence like 我的爸爸是最餓的 and then search for any of those characters individually, or for a word such as 爸爸.

Thanks!

@TwizzyDizzy

TwizzyDizzy commented Feb 16, 2018

What you described is not possible, at least not by simply typing it into the search box without a regular expression. But the same goes for ASCII words: for example, send the message "autocarauto" and then search for "car". That doesn't work without a regex either.

On the other hand: typing 我的 爸爸 是最餓的 (including spaces) and searching for 爸爸 works.
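
Both observations are what one would expect from whole-token matching. The toy inverted index below, written in Python, reproduces them. It is a simplified model with hypothetical names, not MongoDB's actual text index or Rocket.Chat's search code.

```python
from collections import defaultdict


def build_index(messages):
    """Map each whitespace-delimited token to the ids of messages containing it."""
    index = defaultdict(set)
    for msg_id, text in enumerate(messages):
        for token in text.lower().split():
            index[token].add(msg_id)
    return index


def search(index, term):
    """Return ids of messages that contain `term` as a whole token."""
    return sorted(index.get(term.lower(), set()))


messages = [
    "autocarauto",          # id 0: "car" is only a substring
    "我的爸爸是最餓的",        # id 1: no spaces, so the sentence is one token
    "我的 爸爸 是最餓的",      # id 2: spaces make 爸爸 a token
]
index = build_index(messages)
```

Here `search(index, "car")` finds nothing, and `search(index, "爸爸")` finds only the message that was typed with spaces.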

Cheers
Thomas

@robhawkins

robhawkins commented Feb 16, 2018

Chinese speakers do not use spaces in their writing (see zh wikipedia)

To make Rocket Chat friendly for Chinese speakers, I see two options:

  1. If regex is fast enough across a database full of messages, maybe just wrap Chinese character searches with * on either end
  2. If regex does not scale, perhaps this issue should remain open to track extending Rocket Chat for use by Chinese speakers. Without a fast (presumably indexed) history search, Rocket Chat isn't as useful.
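
Option 1 could be sketched roughly as follows. In a regular expression an unanchored pattern already matches anywhere in the message, so no wildcard is needed on either end; the important part is escaping the user's input. This is a hedged Python sketch with hypothetical names, not Rocket.Chat's search code, and it only detects characters in the basic CJK Unified Ideographs block.

```python
import re

# Basic CJK Unified Ideographs block only (an assumption for brevity).
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def build_search_pattern(query):
    """Hypothetical helper: turn a user's search term into a compiled regex."""
    escaped = re.escape(query)  # treat the query as literal text, not as a regex
    if CJK_CHAR.search(query):
        # Unsegmented text: match the query anywhere inside the message.
        return re.compile(escaped)
    # Space-delimited text: keep whole-word matching.
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)


def matches(query, message):
    """True if `message` satisfies the pattern built for `query`."""
    return build_search_pattern(query).search(message) is not None
```

Whether this scales is the open question in option 2: a substring regex generally has to scan every message rather than use an index.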

@TwizzyDizzy

@rocket-cat open

> Chinese speakers do not use spaces in their writing (see zh wikipedia)

I see! That makes things difficult indeed. I'll reopen then. Thanks for your feedback!

Cheers
Thomas
