ACL 2016 Tutorial:

Understanding Short Texts

Introduction

Billions of short texts are produced every day, in the form of search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Unlike documents, short texts have some unique characteristics which make them difficult to handle.

First, short texts, especially search queries, do not always observe the syntax of a written language. This means traditional NLP techniques, such as syntactic parsing, do not always apply to short texts.

Second, short texts contain limited context. An analysis based on Bing's search logs shows that more than 97% of queries contain 1 to 8 words, and over 63% of queries only contain 1 or 2 words.

For these reasons, short texts give rise to a significant amount of ambiguity, which makes them extremely difficult to handle. On the other hand, many applications, including search engines, automatic question answering, online advertising, and recommendation systems, rely on short text understanding. In all these applications, the necessary first step is to transform an input text into a machine-interpretable representation, namely to "understand" the short text.

A growing number of approaches leverage external knowledge to compensate for the inadequate contextual information that accompanies short texts. These approaches can be classified into two categories: Explicit Representation Model (ERM) and Implicit Representation Model (IRM).

In this tutorial, we will present a comprehensive overview of short text understanding based on explicit semantics (knowledge graph representation, acquisition, and reasoning) and implicit semantics (embedding and deep learning). Specifically, we will go over various techniques in knowledge acquisition, representation, and inferencing that have been proposed for text understanding, and we will describe the massive structured and semi-structured data made available in the recent decade that directly or indirectly encode human knowledge, turning the knowledge representation problem into a computational grand challenge with feasible solutions in sight.

Tutorial Overview

This tutorial aims to present a comprehensive overview of short text understanding based on explicit semantics (knowledge graph representation, acquisition, and reasoning) and implicit semantics (embedding and deep learning). We note that no tutorial on the topic yet exists across NLP, web, IR, or database conferences, and we believe that this tutorial is timely both for surveying the field and for educating application developers and aspiring researchers.

Central theme

The tutorial will survey many applications that may benefit from short text understanding, including search engines, automatic question answering, online advertising, and recommendation systems.
The central theme of the tutorial is representation, as in all these applications, the necessary first step is to transform an input text into a machine-interpretable representation, namely to “understand” the short text. We will go over various techniques in knowledge acquisition, representation, and inferencing that have been proposed for text understanding, and we will describe the massive structured and semi-structured data made available in the recent decade that directly or indirectly encode human knowledge, turning the knowledge representation problem into a computational grand challenge with feasible solutions in sight.

Tutorial outline

Following is the outline of the tutorial. The total length is about 3 hours.

Part I. Introduction (20 min) We will introduce the challenge of short text understanding and its various applications, in order to motivate the audience and draw interest to this problem area. This section will also provide a quick overview of the rest of the tutorial.

Part II. Explicit short text understanding (80 min) We will introduce the currently popular knowledge base systems that are used for building explicit models. Then we will introduce explicit representations, such as conceptualization, and their use in segmentation, labeling, syntactic structure analysis, and applications.

Part III. Implicit short text understanding (60 min) We will introduce the major approaches used for building word embeddings, phrase embeddings, and sentence embeddings. Then we will introduce how deep neural networks are built on top of these embeddings for short-text-related applications.

Part IV. Conclusion (10 min) We will summarize the tutorial.


Slides download

Please feel free to download our slides for more details (and cite the following paper in your work):


Download the slides of Part I (Introduction)

Download the slides of Part II (Explicit short text understanding)

Download the slides of Part III (Implicit short text understanding)

Download the slides of Part IV (Conclusion)

Paper

Please cite the following paper in your work:

Presenters


Zhongyuan Wang is a Lead Researcher at Microsoft Research. He leads two projects at MSR: Enterprise Dictionary (knowledge mining from Enterprise) and Probase (knowledge mining from Web). He received his Ph.D. degree in computer science from Renmin University of China, and his Ph.D. thesis is titled “Short Text Understanding”. Zhongyuan Wang has published 20+ papers (including the ICDE 2015 Best Paper Award winner on short text understanding) in leading international conferences, such as VLDB, ICDE, IJCAI, and CIKM. He is also a co-author of the book "Web Data Management: Concepts and Techniques" (published in 2014), and the author of the book "Short Text Understanding" (to be published in Sept. 2016). His research interests include knowledge bases, natural language processing, semantic networks, machine learning, and web data mining.

Haixun Wang

Facebook Inc.

http://haixun.olidu.com/

Haixun Wang is a research scientist and engineering manager at Facebook. Before Facebook, he was with Google Research, working on natural language processing. He led research in semantic search, graph data processing systems, and distributed query processing at Microsoft Research Asia. He was a research staff member at IBM T. J. Watson Research Center from 2000 to 2009. He was Technical Assistant to Stuart Feldman (Vice President of Computer Science of IBM Research) from 2006 to 2007, and Technical Assistant to Mark Wegman (Head of Computer Science of IBM Research) from 2007 to 2009. He received his Ph.D. degree in computer science from the University of California, Los Angeles in 2000. He has published more than 150 research papers in refereed international journals and conference proceedings. He has served as PC Chair of conferences such as CIKM 2012, and he is on the editorial boards of IEEE Transactions on Knowledge and Data Engineering (TKDE) and Journal of Computer Science and Technology (JCST). He won the best paper award at ICDE 2015, the 10-year best paper award at ICDM 2013, and the best paper award at ER 2009.

Related Websites

More detailed information is available on our websites:

Data Mining and Enterprise Intelligence, Microsoft Research:
https://www.microsoft.com/en-us/research/group/data-mining-enterprise-intelligence/
