SCIENTIFIC ARTICLES
A hybrid approach to word segmentation of Vietnamese texts

We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
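To make the pipeline in the abstract concrete, the sketch below follows its steps in Python: a lexicon lookup (approximated here by a plain set rather than the paper's minimal finite-state automaton), a linear segmentation graph over the syllables of a phrase, enumeration of all candidate segmentations as paths through that graph, and a bigram resolver picking the most probable one. This is not the authors' vnTokenizer code; the toy lexicon, the MAX_WORD_SYLLABLES bound, the unigram/bigram counts, and the add-one (Laplace) smoothing are all illustrative assumptions standing in for a model trained on a corpus.

from math import log

# Toy lexicon; the paper stores the real one in a minimal finite-state automaton.
LEXICON = {"học", "sinh", "học sinh", "sinh học", "ra", "đi"}
MAX_WORD_SYLLABLES = 3  # assumed bound on word length in syllables

def build_graph(syllables):
    """Linear graph: edge (i, j) exists iff syllables[i:j] form a lexicon word."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(i + MAX_WORD_SYLLABLES, n) + 1):
            if " ".join(syllables[i:j]) in LEXICON:
                edges[i].append(j)
        if not edges[i]:          # unknown syllable: fall back to a one-syllable edge
            edges[i].append(i + 1)
    return edges

def all_segmentations(syllables, edges, i=0):
    """Enumerate every path through the graph, i.e. every candidate segmentation."""
    if i == len(syllables):
        yield []
        return
    for j in edges[i]:
        word = " ".join(syllables[i:j])
        for rest in all_segmentations(syllables, edges, j):
            yield [word] + rest

# Illustrative counts; a real bigram model is estimated from a segmented corpus.
UNI = {"học sinh": 50, "sinh học": 30, "ra": 200, "đi": 180}
BI = {("học sinh", "ra"): 10}
V = 10_000  # assumed vocabulary size for add-one smoothing

def bigram_logprob(seg):
    """Score a candidate segmentation with a Laplace-smoothed bigram model."""
    score = 0.0
    for w1, w2 in zip(seg, seg[1:]):
        score += log((BI.get((w1, w2), 0) + 1) / (UNI.get(w1, 0) + V))
    return score

syllables = "học sinh ra đi".split()
candidates = list(all_segmentations(syllables, build_graph(syllables)))
best = max(candidates, key=bigram_logprob)
print(candidates)  # [['học', 'sinh', 'ra', 'đi'], ['học sinh', 'ra', 'đi']]
print(best)        # ['học sinh', 'ra', 'đi']

The phrase "học sinh ra đi" is genuinely ambiguous under maximal matching, since both "học" + "sinh" and the compound "học sinh" are lexicon words; here the bigram resolver prefers the compound reading because the ("học sinh", "ra") bigram has been observed, which mirrors the role of the statistical disambiguation step in the abstract.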


 Phuong L.H., Huyen N.T.M., Roussanaly A., Vinh H.T.
428.pdf
Keywords: Applications; Computational linguistics; Finite automata; Graph theory; Linguistics; Natural language processing systems; Robots; Semantics; Statistical methods; Translation (languages); Bigram language models; Hybrid approaches; Linear graphs; Regular expressions; Tokenizer; Word segmentations; Automata theory