TY - JOUR
T1 - T-Test feature selection approach based on term frequency for text categorization
AU - Wang, Deqing
AU - Zhang, Hui
AU - Liu, Rui
AU - Lv, Weifeng
AU - Wang, Datao
PY - 2014/8/1
Y1 - 2014/8/1
N2 - Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known Chi-square statistic and information gain. These document-frequency-based methods, however, have two shortcomings: (1) they are not reliable for low-frequency terms, that is, low-frequency terms will be filtered out because of their smaller weights; and (2) they only count whether a term occurs in a document and ignore its term frequency. In practice, a high-frequency term (excluding stop words) that occurs in few documents is often regarded as a discriminator in real-life corpora. To address these drawbacks, this paper focuses on how to construct a feature selection function based on term frequency, and proposes a new approach using Student's t-test. The t-test function measures the divergence between the distributions of a term's frequency in a specific category and in the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. On micro-F1 in particular, our method achieves slightly better performance on Reuters with kNN and SVM classifiers than χ2 and IG.
KW - Feature selection
KW - Student t-test
KW - Term frequency
KW - Text classification
UR - https://www.scopus.com/pages/publications/84896987950
U2 - 10.1016/j.patrec.2014.02.013
DO - 10.1016/j.patrec.2014.02.013
M3 - Article
AN - SCOPUS:84896987950
SN - 0167-8655
VL - 45
SP - 1
EP - 10
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
IS - 1
ER -