Online Detection of Domain-Specific New Words in Text Streams

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

With the rapid development of the Internet, many domain-specific new words appear in media text streams such as forums, Sina Weibo, and WeChat. These new words form a group of important words in their specific domains and are significant for NLP tasks. Most existing models are either time-consuming or cannot handle out-of-vocabulary (OOV) words in streaming and online settings. In this paper, we propose an unsupervised method, D-TopWords with Gaussian LDA, to perform online detection of domain-specific new words effectively. Unlike traditional new-word detection models, our method is a joint statistical model based on a finite word dictionary and requires no handcrafted features. By further introducing Gaussian LDA into our model, we properly handle OOV words from new text streams. Experimental results show that our method successfully extracts domain-specific new words and performs better on the online detection task than some state-of-the-art methods.

Original language: English
Title of host publication: 2018 15th International Conference on Service Systems and Service Management, ICSSSM 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Print): 9781538651780
State: Published - 13 Sep 2018
Event: 15th International Conference on Service Systems and Service Management, ICSSSM 2018 - Hangzhou, China
Duration: 21 Jul 2018 → 22 Jul 2018

Publication series

Name: 2018 15th International Conference on Service Systems and Service Management, ICSSSM 2018

Conference

Conference: 15th International Conference on Service Systems and Service Management, ICSSSM 2018
Country/Territory: China
City: Hangzhou
Period: 21/07/18 → 22/07/18

Keywords

  • Gaussian LDA
  • new words detection
  • text streams
  • word dictionary model
