Vietnam National University, Hanoi

Near-duplicates detection for Vietnamese documents in large database

Near-duplicate documents exacerbate the problem of information overload, and detecting them has attracted considerable attention from both industry and academia. In this paper, we address the problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most current algorithms were designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose combining Charikar's algorithm [2] with a weighting scheme and Vietnamese-specific features to handle the intricacies of the language. Experimental results indicate that our scheme is effective at detecting near-duplicates in a corpus of Vietnamese documents. © 2008 IEEE.
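For readers unfamiliar with Charikar's algorithm, the following is a minimal sketch of a weighted similarity hash of the kind the abstract refers to. It is not the authors' implementation: the abstract does not specify the weighting scheme or the Vietnamese-specific features, so the term-frequency weights, the whitespace tokenization, the MD5-based token hash, and the names `token_hash`, `simhash`, `tf_weights`, and `hamming` are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Mapping


def token_hash(token: str, bits: int = 64) -> int:
    """Map a token to a `bits`-wide integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights: Mapping[str, float], bits: int = 64) -> int:
    """Charikar-style fingerprint: each token votes +w or -w on every bit."""
    acc = [0.0] * bits
    for token, w in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:  # a bit is set only where the weighted vote is positive
            fingerprint |= 1 << i
    return fingerprint


def tf_weights(text: str) -> dict:
    """ASSUMPTION: plain term frequency over whitespace-separated tokens.

    The paper's actual weighting scheme and Vietnamese-specific features
    are not described in the abstract and are not reproduced here.
    """
    return dict(Counter(text.lower().split()))


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "Đại học Quốc gia Hà Nội công bố kết quả tuyển sinh"
    doc_b = "Đại học Quốc gia Hà Nội đã công bố kết quả tuyển sinh"
    distance = hamming(simhash(tf_weights(doc_a)), simhash(tf_weights(doc_b)))
    print("Hamming distance:", distance)
```

Two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; that threshold is a tuning parameter and is not given in the abstract.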

Cong T.T., The D.B., Bao S.P.

453.pdf

Keywords: Information technology; Technology; Charikar; Hash scheme; Information overloading; International conferences; Language processing; Large databases; LSH; Near-duplicate Vietnamese detection; Web information; Weighting scheme; Weighting schemes; Linguistics
