Title: Chinese-English similar document retrieval
Authors: Wang, Hongjun
Shi, Shuicai
Yu, Shiwen
Xiao, Shibin
Lu, Xueqiang
Affiliation: Institute of Computing Linguistics, Peking University, Beijing 100080, China
Chinese Information Processing Center, Beijing Information Technology Institute, Beijing 100101, China
Issue Date: 2006
Publisher: Journal of Computational Information Systems
Citation: Journal of Computational Information Systems. 2006, 2(3), 1153-1160.
Abstract: Retrieving documents written in different languages is necessary for constructing parallel documents. Chinese-English document pairs share fewer translation pairs than document pairs in European languages because of the difficulty of Chinese word segmentation, which makes retrieving similar Chinese-English documents more difficult. This paper describes an improved algorithm for retrieving similar Chinese-English document pairs, which uses a statistical translation model to match bilingual word pairs. It introduces TF-IDF to weight the word pairs and uses a new Dice-based method to compute cross-language document similarity. The algorithm is evaluated by counting the documents whose translation equivalents appear among the top N similar documents. Although two 'noise' datasets are used in the experiment, nearly 90% of the translations are identified in the top 5 similar documents. The results show that the algorithm can effectively find the translation equivalent of a document.
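
The abstract names the ingredients (a translation lexicon, TF-IDF weights, a Dice-style score) but not the exact formula, so the following is only a minimal sketch of how such a score might be assembled, not the authors' method. Everything in it is an assumption made for illustration: the function names `idf`, `weighted_dice` and `rank_candidates`, the smoothed IDF formula, the greedy one-to-one matching of word pairs, and the `lexicon` layout (Chinese word mapped to English words with translation probabilities). Documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed inverse document frequency over a list of token lists (assumed formula)."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def weighted_dice(zh_tokens, en_tokens, lexicon, zh_docs, en_docs):
    """Dice-style similarity in [0, 1] over TF-IDF weighted translation word pairs.

    lexicon is a hypothetical dict: zh word -> {en word: translation probability}.
    """
    zh_w = {t: c * idf(t, zh_docs) for t, c in Counter(zh_tokens).items()}
    en_w = {t: c * idf(t, en_docs) for t, c in Counter(en_tokens).items()}

    matched = 0.0
    used = set()  # each English term may be paired at most once
    for zh, w in zh_w.items():
        candidates = [
            (p, en)
            for en, p in lexicon.get(zh, {}).items()
            if en in en_w and en not in used
        ]
        if candidates:
            p, en = max(candidates)  # most probable translation present in the document
            used.add(en)
            matched += p * (w + en_w[en])

    total = sum(zh_w.values()) + sum(en_w.values())
    return matched / total if total else 0.0


def rank_candidates(zh_tokens, lexicon, zh_docs, en_docs):
    """Return indices of en_docs sorted from most to least similar."""
    scores = [
        weighted_dice(zh_tokens, en, lexicon, zh_docs, en_docs) for en in en_docs
    ]
    return sorted(range(len(en_docs)), key=lambda i: scores[i], reverse=True)
```

With a score of this kind, the evaluation described in the abstract would amount to checking whether the index of the true translation appears among the first N entries returned by `rank_candidates`.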
