SCIENTIFIC ARTICLES
Near-duplicates detection for Vietnamese documents in large database

Near-duplicate documents exacerbate the problem of information overload, and detecting them has attracted considerable attention from both industry and academia. In this paper, we address this problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most current algorithms were designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose combining Charikar's algorithm [2] with a "weighting scheme" and Vietnamese-specific features to handle the intricacies of the language. Experimental results indicate that our scheme is effective at detecting near-duplicates in a corpus of Vietnamese documents. © 2008 IEEE.
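
As a rough illustration of the kind of technique the abstract refers to, the following is a minimal sketch of a Charikar-style weighted fingerprint (often called simhash). It is not the authors' implementation: the abstract does not specify the weighting scheme or the Vietnamese-specific features, so the term-frequency weights, whitespace tokenization, MD5-based token hash, 64-bit fingerprint width, and all function names below are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter
from typing import Mapping


def token_hash(token: str, bits: int = 64) -> int:
    """Deterministic hash of a token, truncated to `bits` bits (MD5 is an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights: Mapping[str, float], bits: int = 64) -> int:
    """Charikar-style fingerprint: each token votes +w or -w on every bit position."""
    totals = [0.0] * bits
    for token, w in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            totals[i] += w if (h >> i) & 1 else -w
    # A bit of the fingerprint is set when the weighted vote for it is positive.
    fp = 0
    for i, total in enumerate(totals):
        if total > 0:
            fp |= 1 << i
    return fp


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint(text: str, bits: int = 64) -> int:
    """Fingerprint of a text using whitespace tokens and term-frequency weights.

    Both choices are placeholders: the paper's weighting scheme and its
    Vietnamese-specific features are not described in the abstract.
    """
    return simhash(Counter(text.lower().split()), bits)


if __name__ == "__main__":
    a = fingerprint("Hà Nội là thủ đô của Việt Nam")
    b = fingerprint("Hà Nội là thủ đô của nước Việt Nam")
    print(hamming(a, b))  # a small distance suggests near-duplicates
```

In a sketch like this, two documents would be flagged as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold value itself would have to be tuned on real data.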


 Cong T.T., The D.B., Bao S.P.
  Keywords: Information technology; Technology; Charikar; Hash scheme; Information overloading; International conferences; Language processing; Large databases; LSH; Near-duplicate Vietnamese detection; Web information; Weighting scheme; Weighting schemes; Linguistics