Vietnam National University, Hanoi (Đại học Quốc gia Hà Nội)


A hybrid approach to word segmentation of Vietnamese texts

We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata techniques, regular-expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying a maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
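The graph and ambiguity-resolution steps described in the abstract can be illustrated with a minimal sketch. This is not vnTokenizer's code. It rests on several assumptions: a plain Python set stands in for the minimal finite-state automaton, "maximal matching" is read as keeping the paths with the fewest words through the linear graph, add-one smoothing is swapped in because the abstract does not say how the bigram model is smoothed, and the regular-expression pre-parsing step is omitted. The lexicon, the function names, and the sample phrase are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy lexicon; the paper stores its lexicon in a minimal automaton.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}


def build_graph(syllables, lexicon):
    """Linear graph over syllable boundaries: edge (i, j) exists when
    syllables[i:j] is a single syllable or a multi-syllable lexicon word."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon:
                edges[i].append(j)
    return edges


def shortest_segmentations(syllables, lexicon):
    """Return every path from node 0 to node n that uses the fewest words."""
    n = len(syllables)
    edges = build_graph(syllables, lexicon)
    dist = [0] * (n + 1)  # dist[i]: fewest words needed to cover syllables[i:]
    for i in range(n - 1, -1, -1):
        dist[i] = 1 + min(dist[j] for j in edges[i])

    def expand(i):
        if i == n:
            return [[]]
        paths = []
        for j in edges[i]:
            if dist[j] + 1 == dist[i]:  # stay on a shortest path
                word = " ".join(syllables[i:j])
                paths.extend([word] + rest for rest in expand(j))
        return paths

    return expand(0)


def train_bigram(corpus):
    """Count unigrams and bigrams over pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(words, unigrams, bigrams):
    """Add-one smoothed bigram log-probability (an assumed smoothing scheme)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + words
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def segment(text, lexicon, unigrams, bigrams):
    """Pick the most probable candidate segmentation of a phrase."""
    candidates = shortest_segmentations(text.split(), lexicon)
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))
```

Trained on a small segmented corpus, `segment("học sinh học sinh học", LEXICON, unigrams, bigrams)` would choose among the three fewest-word candidates the graph produces for that phrase. Restricting the resolver to those candidates keeps the number of segmentations it has to score small.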

Phuong L.H., Huyen N.T.M., Roussanaly A., Vinh H.T.



Keywords: Applications; Computational linguistics; Finite automata; Graph theory; Linguistics; Natural language processing systems; Robots; Semantics; Statistical methods; Translation (languages); Bigram language models; Hybrid approaches; Linear graphs; Regular expressions; Tokenizer; Word segmentations; Automata theory
