Chinese characters / words can't be searched #713
Comments
Similarly, I can't create a room with a Chinese name. |
Related to #640 |
TL;DR: Elastic Search or Lucene might be able to help here. This blog post gives a good explanation of the problems with searching Chinese and how Elastic can help. I see two issues with searching non-Latin text in RChat. One is that RChat currently can't search for Chinese characters (or much of anything outside [a-z0-9]). The other is that Chinese has no spaces, so the text must be segmented into words before it becomes searchable in an indexed full-text search engine. For the second issue, I looked into whether Mongo has implemented a segmenter for various languages yet. It turns out Chinese text search is in the Mongo Enterprise Edition but not yet available in the free versions. It might be possible to add your own tokenizer. For instance, googling for "mongo full text custom tokenizer" turned up this blog post from 2010-11-14: Full text search with MongoDB and Lucene analyzers. Elastic Search, on the other hand, has a tokenizer that claims to work with several languages. This blog post from 2014-12-18 gives a rundown on performance with Chinese: Efficient Chinese Search with Elastic Search. They also describe how to get a better tokenizer, Paoding, to work with Elastic. |
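To illustrate the segmentation problem described above, here is a minimal sketch in Python. It is not how Rocket.Chat, Mongo, or Elastic Search tokenize text; the function names `whitespace_tokens` and `char_ngrams` are made up for this example, and character bigrams are used only as a simple stand-in for a real Chinese segmenter such as Paoding.

```python
def whitespace_tokens(text):
    """Split on whitespace, as a naive Latin-oriented indexer would."""
    return set(text.split())


def char_ngrams(text, n=2):
    """Index every single character plus every run of n adjacent characters.

    This is a crude substitute for real word segmentation: it needs no
    dictionary, at the cost of indexing some character pairs that are not words.
    """
    chars = [c for c in text if not c.isspace()]
    grams = set(chars)  # single characters
    grams.update("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    return grams


message = "我的爸爸是最餓的"

# The whole sentence is one whitespace-delimited token, so the word is not found.
print("爸爸" in whitespace_tokens(message))

# With character n-grams both the word and its individual characters are indexed.
print("爸爸" in char_ngrams(message))
print("爸" in char_ngrams(message))
```

The trade-off is index size and precision: n-grams find every adjacent pair of characters, including pairs that straddle two words, whereas a dictionary-based segmenter indexes only real words.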
Yes, we need to offer a better search engine as an option. We implemented search on top of MongoDB's internal search engine to keep RC easy to install while still letting users search their messages, so we need a way for users to plug in other search engines that solve their problems better. We need help with this issue because it isn't our main focus right now and we don't know much about other search engines. |
Check this: Real-Time Search With MongoDB and Solr http://geniuscarrier.com/real-time-search-with-mongodb-and-solr/ |
@rodrigok - I'd love to help. Implementing searchable text for other languages would be really fun and interesting. Do you all have other full time jobs? Or are you just focused on this? @steedos - Interesting. I wonder what solution would be the simplest to implement and maintain. Would the Mongo+Solr option require rewriting the application to work with Solr? If so then I don't know if using Mongo+Solr or just Elastic would be simpler. |
Hi @robhawkins, we are moving our main focus to Rocket.Chat |
I just tried Elastic Search. It's good at indexing and searching Chinese documents. So we just need an admin setting to enable Elastic Search and set the server URL. With Elastic Search we can also index office documents and PDF files. I think it's important to be able to attach office documents in chat rooms. |
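As a rough sketch of what such a configuration could send to Elastic Search, the Python snippet below only builds the JSON request bodies and does not talk to a server. It assumes a recent Elastic Search version with typeless mappings and uses the built-in `cjk` analyzer; the field name `msg` and the helper names `build_index_body` and `build_search_body` are assumptions made for this example, not part of Rocket.Chat.

```python
import json


def build_index_body(field="msg", analyzer="cjk"):
    """Body for creating an index whose message field uses a CJK-aware analyzer.

    `field` is a hypothetical field name; `cjk` is Elastic Search's built-in
    analyzer for Chinese, Japanese, and Korean text.
    """
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer},
            }
        }
    }


def build_search_body(query, field="msg"):
    """Body for a full-text `match` query against the message field."""
    return {"query": {"match": {field: query}}}


# ensure_ascii=False keeps the Chinese query readable in the printed JSON.
print(json.dumps(build_index_body(), ensure_ascii=False, indent=2))
print(json.dumps(build_search_body("爸爸"), ensure_ascii=False, indent=2))
```

An admin setting as proposed above would then only need to supply the server URL these bodies are sent to.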
@steedos That would be great. Does anyone want to do it? |
+1 |
@rocket-cat close Closing, since this can no longer be reproduced on 0.61.1 (trying what was described in the first post). Cheers |
@TwizzyDizzy are Chinese characters searchable now? It's been a while since I looked and I would be interested to know. An example test would be to type in a sentence like 我的爸爸是最餓的 and then search for any of those characters individually, or for a word such as 爸爸. Thanks! |
The thing you described is not possible. At least not by simply putting that into the search box without applying a regular expression. But this goes for ASCII words as well: for example, send a message "autocarauto" and then search for "car". That doesn't work without a regex either. On the other hand: typing Cheers |
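The difference between whole-word matching and regex substring matching can be sketched in a few lines of Python. This is an illustration only, not Rocket.Chat's actual search code; `word_match` and `substring_match` are hypothetical names.

```python
import re


def word_match(query, message):
    """Match the query only where it stands as a whole word."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, message) is not None


def substring_match(query, message):
    """Match the query anywhere in the message, like an unanchored regex search."""
    return re.search(re.escape(query), message) is not None


# ASCII case from the comment above: "car" is buried inside a longer word.
print(word_match("car", "autocarauto"))
print(substring_match("car", "autocarauto"))

# Chinese case: the sentence has no spaces, so it behaves like one long word.
print(word_match("爸爸", "我的爸爸是最餓的"))
print(substring_match("爸爸", "我的爸爸是最餓的"))
```

In both cases the whole-word search misses and the substring search hits. The practical difference is that ASCII text rarely runs words together, while unspaced Chinese text always does.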
Chinese speakers do not use spaces in their writing (see zh wikipedia). To make Rocket Chat friendly for Chinese speakers, I see two options,
|
@rocket-cat open
I see! That makes things difficult indeed. I'll reopen then. Thanks for your feedback! Cheers |
For example, searching for "汉字" gives no result even though I've sent a message containing it in a chat room.