Vietnamese spam detection based on language classification

Language classification is the process of identifying the disposition of a presented text, such as classifying an email or a text document into a particular category. Classifying text can involve determining the genre of a book, categorizing a document, or, in our case, deciding whether an email is spam. The idea behind language classification is to teach the computer to be a filing clerk. Spam filters that employ language classification use a Bayesian combination of the spam probabilities of individual words; they read and filter your email by learning your personal email behavior (what you think is and isn't spam). Many spam filters based on this technology have been applied effectively to English and other languages, but they perform poorly when applied directly to Vietnamese spam because the token segmentation of the Bayesian filters is not suited to the specific characteristics of Vietnamese. We therefore propose a Vietnamese segmentation method for token selection and use it to build a Vietnamese spam filter, based on language classification and Bayesian combination, that adequately supports Vietnamese. The results are very satisfactory: with this technique, our filter for Vietnamese spam is 9% more accurate than other filters that use other segmentation techniques. ©2008 IEEE.
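
As an illustration of the Bayesian combination of per-word spam probabilities mentioned above, the following is a minimal sketch and not the filter described in this paper. It assumes the common naive-Bayes form, in which the per-token probabilities p_i are combined as prod(p_i) / (prod(p_i) + prod(1 - p_i)). The function names, the clamping bounds, the default probability for unknown tokens, and the sample probabilities are all hypothetical.

```python
from math import prod  # requires Python 3.8+


def combine_spam_probabilities(token_probs, floor=0.01, ceil=0.99):
    """Combine per-token spam probabilities into one message-level score.

    Assumes the tokens are independent (the naive-Bayes assumption).
    Probabilities are clamped to [floor, ceil] (hypothetical bounds) so that
    a single 0.0 or 1.0 cannot make the denominator zero.
    """
    clamped = [min(max(p, floor), ceil) for p in token_probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


def score_message(tokens, spam_prob_by_token, unknown=0.5):
    """Score a message that has already been segmented into tokens.

    `spam_prob_by_token` maps a token to its learned spam probability;
    tokens never seen in training get the neutral value `unknown`.
    """
    probs = [spam_prob_by_token.get(token, unknown) for token in tokens]
    return combine_spam_probabilities(probs)


# Hypothetical learned probabilities, for illustration only.
example_probs = {"khuyến mãi": 0.9, "miễn phí": 0.9, "cuộc họp": 0.1}
spam_score = score_message(["khuyến mãi", "miễn phí"], example_probs)
ham_score = score_message(["cuộc họp"], example_probs)
```

The sketch takes tokens as given, so the choice of segmentation decides which entries are looked up: a multi-syllable token such as "khuyến mãi" is only found if the segmenter keeps it whole instead of splitting it at the space.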

Nguyen T.A., Tran Q.A., Nguyen N.B.
