
Topic Detection from Short Text: A Term-based Consensus Clustering method

  • National Computer Network Emergency Response Technical Team
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Topic detection from Short Text Systems (SMS) aims to extract the distinct topics hidden inside short text collections, such as Twitter, Weibo, and instant messages. With the recent emergence of large-volume user-generated content collections enabled by online social media, topic detection from SMS has become a challenging yet promising means of online public opinion analysis. Many forms and methods of topic detection have been proposed in the literature, but meaningful and coherent results remain difficult to obtain reliably because of the extreme sparsity of SMS data. To this end, we developed a Term-based Consensus Clustering (TCC) topic detection framework, which provides an unsupervised methodology for finding distinct topics within SMS collections. Specifically, we adopt a consensus clustering technique called K-means-based Consensus Clustering to handle SMS clustering, owing to its low computational complexity and robust clustering performance. To further enrich the features of the sparse SMS data, we conduct term clustering in the highly dense term space instead of the conventionally targeted sparse document space. More specifically, we first use a feature space transfer technique to represent a short text collection as a pseudo-document matrix, in which rows (instances) correspond to terms and columns (features) correspond to adjacent terms. Basic partitions are generated from the pseudo-document matrix for term clustering, and consensus clustering then yields the final term clustering result. Finally, in a document classification step, each document is assigned to the cluster in which most of its terms occur. Extensive experiments on real-world data sets demonstrate that TCC is comparable to several widely used methods in terms of topic detection quality. In particular, we demonstrate that TCC obtains the best clustering performance when the short text collections contain a large number of predefined topics.
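As a rough illustration of two steps named in the abstract, the sketch below builds a term-by-adjacent-term pseudo-document matrix and then assigns each document to the cluster that holds most of its terms. It is not the authors' implementation. The function names, the one-term adjacency window, the toy documents, and the hand-written `term_cluster` mapping are all assumptions made for this example; in TCC the term clusters would come from the basic partitions and the consensus clustering step, which are omitted here.

```python
from collections import Counter


def build_term_matrix(docs, window=1):
    """Rows are terms; columns count terms seen within `window` positions.

    The window size is an assumption: the abstract only says that the
    columns correspond to "adjacent terms".
    """
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for pos, term in enumerate(doc):
            lo = max(0, pos - window)
            hi = min(len(doc), pos + window + 1)
            for other in range(lo, hi):
                if other != pos:
                    matrix[index[term]][index[doc[other]]] += 1
    return vocab, matrix


def assign_documents(docs, term_cluster):
    """Assign each document to the cluster containing most of its terms."""
    labels = []
    for doc in docs:
        votes = Counter(term_cluster[t] for t in doc if t in term_cluster)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


# Toy tokenized short texts (hypothetical data).
docs = [
    ["rain", "flood", "city"],
    ["flood", "rain", "alert"],
    ["match", "goal", "team"],
    ["team", "goal", "win"],
]
vocab, matrix = build_term_matrix(docs)

# Hypothetical term clustering result, standing in for the output of
# the consensus clustering step.
term_cluster = {
    "rain": 0, "flood": 0, "city": 0, "alert": 0,
    "match": 1, "goal": 1, "team": 1, "win": 1,
}
doc_labels = assign_documents(docs, term_cluster)
```

Because every term becomes a row described by all of its neighbours across the whole collection, the rows are denser than individual short documents would be, which is the motivation the abstract gives for clustering in the term space.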

Original language: English
Title of host publication: 2016 13th International Conference on Service Systems and Service Management, ICSSSM 2016
Editors: Jian Chen, Xiaoqiang Cai, Changchun Zhou, Kaida Qin, Baojian Yang
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509028429
DOI
Publication status: Published - 9 Aug 2016
Event: 13th International Conference on Service Systems and Service Management, ICSSSM 2016 - Kunming, China
Duration: 24 Jun 2016 to 26 Jun 2016

Publication series

Name: 2016 13th International Conference on Service Systems and Service Management, ICSSSM 2016

Conference

Conference: 13th International Conference on Service Systems and Service Management, ICSSSM 2016
Country/Territory: China
Kunming
Period: 24/06/16 to 26/06/16
