Title | Chinese-English similar document retrieval |
Authors | Wang, Hongjun; Shi, Shuicai; Yu, Shiwen; Xiao, Shibin; Lu, Xueqiang |
Affiliation | Institute of Computing Linguistics, Peking University, Beijing 100080, China; Chinese Information Processing Center, Beijing Information Technology Institute, Beijing 100101, China |
Issue Date | 2006 |
Publisher | Journal of Computational Information Systems |
Citation | Journal of Computational Information Systems, 2006, 2(3), 1153-1160. |
Abstract | Retrieving similar documents written in different languages is a necessary step in constructing parallel documents. Chinese-English document pairs share fewer translation pairs than document pairs in European languages because Chinese word segmentation is difficult, so retrieving similar Chinese-English documents is harder. This paper describes an improved algorithm for retrieving similar Chinese-English document pairs, which uses a statistical translation model to match bilingual word pairs. It introduces TF-IDF to weight the word pairs and uses a new Dice-based method to compute cross-language document similarity. The algorithm is evaluated by counting the documents whose translation equivalents appear among the top N similar documents. Although two 'noise' datasets are used in the experiment, nearly 90% of the translations are identified among the top 5 similar documents. The results show that the algorithm can effectively find the translation equivalent of a document. |
URI | http://hdl.handle.net/20.500.11897/410048 |
ISSN | 1553-9105 |
Indexed | EI |
Appears in Collections: | Unclaimed |
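
The abstract names three ingredients: matching bilingual word pairs, TF-IDF weighting, and a Dice-based cross-language similarity. The record does not give the paper's formulas, so the following is only a minimal sketch under stated assumptions: a plain dictionary `lexicon` stands in for the statistical translation model, the IDF is a common smoothed variant, and the weighted Dice score is one plausible generalization of the Dice coefficient. The names `tfidf`, `dice_similarity`, and `rank_candidates` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tfidf(doc_tokens, collection):
    """Return {term: tf * idf} for one tokenized document in its collection.

    Assumption: smoothed idf = log((1 + N) / (1 + df)) + 1, not the paper's formula.
    """
    n_docs = len(collection)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for d in collection if term in d)
        weights[term] = count * (math.log((1 + n_docs) / (1 + df)) + 1.0)
    return weights


def dice_similarity(zh_weights, en_weights, lexicon):
    """Weighted Dice score: weight of matched terms over total weight.

    `lexicon` maps a Chinese term to candidate English translations; it is a
    hypothetical stand-in for a statistical translation model.
    """
    used = set()   # each English term may be matched at most once
    matched = 0.0
    for zh_term, zh_w in zh_weights.items():
        candidates = [e for e in lexicon.get(zh_term, ())
                      if e in en_weights and e not in used]
        if candidates:
            best = max(candidates, key=lambda e: en_weights[e])
            used.add(best)
            matched += zh_w + en_weights[best]
    total = sum(zh_weights.values()) + sum(en_weights.values())
    return matched / total if total else 0.0


def rank_candidates(zh_doc, zh_docs, en_docs, lexicon):
    """Rank English documents by similarity to one Chinese document."""
    zh_w = tfidf(zh_doc, zh_docs)
    scored = [(i, dice_similarity(zh_w, tfidf(en, en_docs), lexicon))
              for i, en in enumerate(en_docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy, already-segmented documents (illustrative data only).
    zh_docs = [["经济", "增长"], ["足球", "比赛"]]
    en_docs = [["economy", "growth"], ["football", "match"]]
    lexicon = {"经济": {"economy"}, "增长": {"growth"},
               "足球": {"football"}, "比赛": {"match"}}
    print(rank_candidates(zh_docs[0], zh_docs, en_docs, lexicon))
```

Evaluating in the way the abstract describes would then amount to checking, for each Chinese document, whether its known translation appears among the first N entries returned by `rank_candidates`.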