================================================================
Bayon : a simple and fast clustering tool
Copyright(C) Mizuki Fujisawa <[email protected]>
================================================================
Overview:
Bayon is a simple and fast hard-clustering tool.
Bayon supports repeated-bisection clustering and k-means clustering.
Install:
% ./configure
% make
% make check
% sudo make install
Usage:
* Clustering input data
% bayon -n num [options] file
% bayon -l limit [options] file
-n, --number=num          the number of clusters
-l, --limit=lim           limit value of cluster bisection
-p, --point               output similarity points
-c, --clvector=file       save vectors of cluster centroids
    --clvector-size=num   max size of output vectors of
                          cluster centroids (default: 50)
    --method=method       clustering method (rb, kmeans), default: rb
    --seed=seed           set a seed for random number generator
* Get the similar clusters for each input document
% bayon -C file [options] file
-C, --classify=file       target vectors
    --inv-keys=num        max size of the keys of each vector to be
                          looked up in inverted index (default: 20)
    --inv-size=num        max size of the inverted index of each key
                          (default: 100)
    --classify-size=num   max size of output similar groups
                          (default: 20)
* Common options
    --idf                 apply idf to input vectors
-h, --help                show help messages
-v, --version             show the version and exit
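This README does not say which IDF formula the --idf option uses. Purely as an illustration of the idea, the Python sketch below applies one common definition, idf(key) = log(N / df(key)), where N is the number of documents and df(key) is the number of documents containing the key. The function name apply_idf and the formula are assumptions, not taken from the bayon source.

```python
import math


def apply_idf(docs):
    """Scale each value by log(N / df), one common IDF definition.

    Assumption: illustrative only; bayon's --idf may use a different formula.
    docs maps document_id -> {key: value}.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each key appear?
    df = {}
    for vec in docs.values():
        for key in vec:
            df[key] = df.get(key, 0) + 1

    # Keys that occur in every document get weight log(1) = 0.
    return {
        doc_id: {
            key: value * math.log(n_docs / df[key])
            for key, value in vec.items()
        }
        for doc_id, vec in docs.items()
    }
```

With this definition, a key that appears in every document contributes nothing after weighting, while rare keys keep most of their value.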
Example:
* clustering (number_of_output_clusters = 100)
% bayon -n 100 input.tsv > cluster.tsv
* clustering (save the vectors of cluster centroids)
% bayon -n 100 -c centroid.tsv input.tsv > cluster.tsv
* classification (get similar clusters using centroid vectors)
% bayon -C centroid.tsv input.tsv > classify.tsv
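The clustering invocations above could also be assembled from a script. The Python sketch below only builds the argument list; the helper name bayon_cluster_argv is hypothetical, it uses only flags documented in this README, and it assumes (following the two usage forms shown above) that exactly one of -n or -l is given.

```python
def bayon_cluster_argv(input_path, number=None, limit=None,
                       centroid_path=None, point=False):
    """Build an argument list for a bayon clustering run.

    Hypothetical helper: it assembles documented flags and runs nothing.
    Assumption: exactly one of number (-n) or limit (-l) is supplied,
    mirroring the two usage lines in this README.
    """
    if (number is None) == (limit is None):
        raise ValueError("give exactly one of number (-n) or limit (-l)")

    argv = ["bayon"]
    if number is not None:
        argv += ["-n", str(number)]
    else:
        argv += ["-l", str(limit)]
    if point:
        argv.append("-p")                    # output similarity points
    if centroid_path is not None:
        argv += ["-c", centroid_path]        # save centroid vectors
    argv.append(input_path)
    return argv
```

The resulting list can be passed to subprocess.run with stdout redirected to a file such as cluster.tsv, assuming bayon is installed and on PATH.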
Format of Input Data:
* list of the vectors of input documents for clustering and classification
document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
* document_id : string
* key : string
* value : double
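An input file in this format can be produced from in-memory sparse vectors. The Python sketch below is one way to do it; the function name write_vectors and the dict-of-dicts layout are assumptions made for illustration.

```python
import io


def write_vectors(docs, out):
    """Write {document_id: {key: value}} as one tab-separated line per document.

    Each line is: document_id, then alternating key and value fields.
    """
    for doc_id, vec in docs.items():
        fields = [str(doc_id)]
        for key, value in vec.items():
            key = str(key)
            # Tabs or newlines inside a field would corrupt the format.
            if "\t" in key or "\n" in key:
                raise ValueError("key must not contain tab or newline: %r" % key)
            fields.append(key)
            fields.append(repr(float(value)))
        out.write("\t".join(fields) + "\n")


# Example: build the text in memory; a real run would write to input.tsv.
buf = io.StringIO()
write_vectors({"doc1": {"apple": 2, "pear": 0.5}}, buf)
```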
* list of the vectors of cluster centroids
cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
* cluster_id : string
* key : string
* value : double
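Because document vectors and centroid vectors share the same layout, one reader can handle both. The Python sketch below is a minimal parser; the function name read_vectors is an assumption, and ids are kept as strings as described above.

```python
def read_vectors(lines):
    """Parse id/key/value lines into {id: {key: float}}.

    Works for both document vectors and cluster centroid vectors,
    since both use: id, then alternating key and value fields.
    """
    vectors = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        pairs = fields[1:]
        if len(pairs) % 2 != 0:
            raise ValueError("key without a value in line: %r" % line)
        vectors[fields[0]] = {
            pairs[i]: float(pairs[i + 1]) for i in range(0, len(pairs), 2)
        }
    return vectors
```

Any iterable of lines works as input, including an open file object.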
Format of Output Data:
* list of clusters (output of clustering)
cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
...
* cluster_id : integer (>= 1)
* document_id : string
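The cluster list can be loaded back for further processing. The Python sketch below parses it and also builds the reverse mapping from document to cluster, which is well defined because Bayon does hard clustering; both function names are assumptions.

```python
def read_clusters(lines):
    """Parse clustering output into {cluster_id: [document_id, ...]}."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cluster_id, *doc_ids = line.split("\t")
        clusters[int(cluster_id)] = doc_ids
    return clusters


def document_to_cluster(clusters):
    """Invert {cluster_id: [document_id, ...]} into {document_id: cluster_id}."""
    return {doc: cid for cid, docs in clusters.items() for doc in docs}
```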
* list of clusters with similarity values between documents and clusters
(when clustering with the [-p] option)
cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
...
* cluster_id : integer (>= 1)
* document_id : string
* point : double
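With the -p option each document id is followed by its point, so the fields after the cluster id come in pairs. A minimal Python sketch of a parser for this variant follows; the function name read_clusters_with_points is an assumption.

```python
def read_clusters_with_points(lines):
    """Parse -p output into {cluster_id: [(document_id, point), ...]}."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cluster_id, *rest = line.split("\t")
        if len(rest) % 2 != 0:
            raise ValueError("document_id without a point in line: %r" % line)
        clusters[int(cluster_id)] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return clusters
```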
* list of the vectors of cluster centroids
(when clustering with the [-c file] option)
cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
* cluster_id : integer (>= 1)
* key : string
* value : double
* list of similar clusters for each input document (output of classification)
document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
...
* document_id : string
* cluster_id : string
* point : double
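A common follow-up to classification is picking a single cluster per document. The Python sketch below parses the classification output and selects the cluster with the largest point. Both function names are assumptions, as is the reading that a larger point means a more similar cluster.

```python
def read_classification(lines):
    """Parse classification output into {document_id: [(cluster_id, point), ...]}."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, *rest = line.split("\t")
        if len(rest) % 2 != 0:
            raise ValueError("cluster_id without a point in line: %r" % line)
        result[doc_id] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return result


def best_cluster(candidates):
    """Return the cluster_id with the largest point, or None if there are none.

    Assumption: a larger point means a more similar cluster.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]
```

Cluster ids are kept as strings here, matching the format description above.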
Requirements:
* C++ compiler with STL (Standard Template Library)
* google-sparsehash <http://code.google.com/p/google-sparsehash/>
(If google-sparsehash is not installed,
the clustering tool uses __gnu_cxx::hash_map or std::map instead.)
License:
GPL2 (GNU General Public License Version 2)