CRK - Chinese Region Kit

crk is an open source library for extracting region tags from Chinese natural language text.

Motivation

To create a library that generates region tags from Chinese natural language text:
[arbitrary Chinese text] -> [processed by CRK] -> [organized region tags]

"魅族集团位于广东省珠海市,我们的员工有来自河北,上海,四川成都的。"

              |
            [CRK]
              |
{
1."xxxxxx":"广东省", "xxxxxx":"珠海市"
2."xxxxxx":"四川省", "xxxxxx":"成都市"
3."xxxxxx":"上海市"
4."xxxxxx":"河北省"
}

crk relies on LTP and 中国行政区划信息 (China administrative division data).

Build

crk is built with CMake.

git clone https://github.com/gfxcc/crk
cd crk
cmake .
make

Models

LTP requires model files, which can be downloaded from:

  1. Google Drive
  2. Baidu

Usage

  • Add this to your code:
#include "crk.h"
using namespace crk;
  • Download the model files and fill in the options with their absolute paths.
  • Link the crk library (-lcrk) in your Makefile or CMakeLists.txt:
target_link_libraries(sample
	crk
	)

Code Example

...
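Engine* engine;
Options options;
// Paths to the LTP word segmentation and part-of-speech tagging models.
options.model_segmentor = "/home/.../thirdparty/ltp/models/cws.model";
options.model_postagger = "/home/.../thirdparty/ltp/models/pos.model";
// Directory that holds the region data.
options.region_data = "../region_data/";

Status s = Engine::CreateEngine(options, &engine);

// regions receives the matched <region_code, region_name> pairs for line.
vector<vector<pair<string, string>>> regions;
engine->MatchRegion(line, regions);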
...

See example/sample.cc for details.

API Reference

  // On return, regions holds the matched <region_code, region_name> pairs.
  // Multiple regions may be matched, so the results are organized
  // as a two-dimensional vector.
  int MatchRegion(const std::string& input,
      std::vector<std::vector<std::pair<std::string, std::string>>>& regions);
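A minimal sketch of how the two-dimensional result might be consumed. It does not call crk itself: FormatRegions, RegionPair, and RegionMatches are hypothetical names introduced here for illustration, and the only thing taken from the API above is the shape of the regions out-parameter.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Same shape as the `regions` out-parameter of MatchRegion.
using RegionPair = std::pair<std::string, std::string>;  // <region_code, region_name>
using RegionMatches = std::vector<std::vector<RegionPair>>;

// Hypothetical helper (not part of crk): renders one line per inner vector,
// with each pair written as "code":"name" and pairs separated by ", ".
std::string FormatRegions(const RegionMatches& regions) {
  std::ostringstream out;
  for (const auto& group : regions) {
    for (std::size_t i = 0; i < group.size(); ++i) {
      if (i > 0) out << ", ";
      out << '"' << group[i].first << "\":\"" << group[i].second << '"';
    }
    out << '\n';
  }
  return out.str();
}
```

After a call to MatchRegion, passing regions to a helper like this would give one printable line per inner vector. How crk groups pairs into inner vectors is not specified above, so check example/sample.cc before relying on that grouping.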

Tests

"善领汽电集团旗下专门负责4S店集团及特殊渠道运营的销售服务公司,代理众易畅、任我通、艾酷等多个知名品牌,经营项目:DVD导航、大屏机、原车屏升级、360全景、智能云镜、记录仪、脚踏板、行李架、前后大包围、电动踏板、电动尾门;美容镀晶等",
"440300": "深圳市"

"新疆特产:葡萄干、若羌大枣、和田大枣、巴旦木、纸皮核桃、无花果、精河枸杞、木垒鹰嘴豆、昆仑山胎菊、大漠肉苁蓉、伊犁薰衣草。绿色、天然、健康。"
"652722": "精河县",
"652824": "若羌县",
"653201": "和田市",
"653221": "和田县"


"学校发布信息,教育咨询交流 无锡市长安中心小学"
"130102": "长安区",
"320200": "无锡市",
"610116": "长安区"

Projects built on CRK

TODO

Analyze multiple region codes and generate organized tags:

魅族集团位于广东省珠海市,我们的员工有来自河北,上海,四川成都的。
{
1."xxxxxx":"广东省", "xxxxxx":"珠海市"
2."xxxxxx":"四川省", "xxxxxx":"成都市"
3."xxxxxx":"上海市"
4."xxxxxx":"河北省"
}
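One possible starting point for this TODO, offered as a sketch rather than as crk's design: assuming the region codes follow the standard six-digit administrative division layout (two digits each for province, prefecture, and county), flat matches can be bucketed by their province-level code. ProvinceCodeOf and GroupByProvince are hypothetical helpers, not part of crk.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using RegionPair = std::pair<std::string, std::string>;  // <region_code, region_name>

// Hypothetical helper (not part of crk): province-level code for a
// six-digit region code, e.g. "440300" -> "440000".
// Assumes the first two digits identify the province.
std::string ProvinceCodeOf(const std::string& code) {
  return code.substr(0, 2) + "0000";
}

// Hypothetical helper (not part of crk): buckets flat matches by
// province-level code, keeping the original order inside each bucket.
std::map<std::string, std::vector<RegionPair>> GroupByProvince(
    const std::vector<RegionPair>& matches) {
  std::map<std::string, std::vector<RegionPair>> groups;
  for (const auto& match : matches) {
    if (match.first.size() != 6) continue;  // skip codes that are not six digits
    groups[ProvinceCodeOf(match.first)].push_back(match);
  }
  return groups;
}
```

With the codes from the Tests section, 精河县, 若羌县, 和田市, and 和田县 would all fall under 650000, while the two 长安区 matches would land in separate buckets (130000 and 610000).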

License

Limited by the LTP license.
