TY - JOUR
T1 - Term frequency - function of document frequency
T2 - A new term weighting scheme for enterprise information retrieval
AU - Zhang, Hui
AU - Wang, Deqing
AU - Wu, Wenjun
AU - Hu, Hongping
PY - 2011/12
Y1 - 2011/12
AB - In today's business environment, enterprises are increasingly under pressure to process the vast amount of data produced every day within their organisations. One approach is to focus on business intelligence (BI) applications and to increase commercial added value through such business analytics activities. Term weighting, which converts documents into vectors in the term space, is a vital task in enterprise Information Retrieval (IR), text categorisation, text analytics, etc. When determining the weight of a term in a document, the traditional TF-IDF scheme considers only the term's occurrence frequency within the document and across the entire set of documents, so some meaningful terms do not receive an appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency - Function of Document Frequency (TF-FDF) to address this issue. Instead of using a monotonically decreasing function such as Inverse Document Frequency, FDF uses a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words that semantically represent the document set. Our experiments show that TF-FDF achieves higher Normalised Discounted Cumulative Gain in IR than TF-IDF and its variants, improving the accuracy of relevance ranking of the IR results.
KW - enterprise information retrieval
KW - function of document frequency
KW - relevance ranking
KW - term frequency
KW - term weighting scheme
UR - https://www.scopus.com/pages/publications/84864690379
U2 - 10.1080/17517575.2012.665945
DO - 10.1080/17517575.2012.665945
M3 - Article
AN - SCOPUS:84864690379
SN - 1751-7575
VL - 6
SP - 433
EP - 444
JO - Enterprise Information Systems
JF - Enterprise Information Systems
IS - 4
ER -