Title: A comparative study on representing units in Chinese text clustering
Authors: Wang, Hongjun; Yu, Shiwen; Lv, Xueqiang; Shi, Shuicai; Xiao, Shibin
Affiliation: Institute of Computing Linguistics, Peking University, Beijing 100080; Chinese Information Processing Center, Beijing Information Technology Institute, Beijing 100101
Issue Date: 2006
Citation: 1st International Conference on Knowledge Science, Engineering and Management, KSEM 2006. Guilin, China, LNAI 4092, pp. 466-476.
Abstract: Words and n-grams are commonly used representing units for Chinese text and have proved to be good features for Chinese Text Categorization and Information Retrieval. However, the effectiveness of these representing units for Chinese Text Clustering has not yet been established. This paper is a comparative study of representing units in Chinese Text Clustering. Using the K-means algorithm, we evaluated several representing units, including Chinese character n-gram features, word features, and their combinations. We found Chinese word features, Chinese character unigram features, and bi-gram features to be the most effective in our experiments. Combining features did not improve the results. Detailed experimental results on several public Chinese Text Categorization datasets are provided in the paper. © Springer-Verlag Berlin Heidelberg 2006.
Appears in Collections: Unclaimed

Files in This Work
There are no files associated with this item.

License: See PKU IR operational policies.