
Second IEEE International Conference on Semantic Computing
Santa Clara, CA, USA - August 4-7, 2008
Text Analytics for Semantic Computing - the good, the bad and the ugly

Instructors: Meenakshi Nagarajan, Cartic Ramakrishnan and Amit Sheth

The Web of today contains a broad variety of text, ranging from highly edited news articles, to well-structured scientific literature, to community-authored content and casual text from social software. Semantic computing applications have an opportunity to mine all these kinds of text. The goal of this tutorial is to provide an overview of the text analytics area as it pertains to semantic computing applications that use such textual data. We compare and contrast text in Wikipedia (the good), biomedical literature (the bad), and blogs and posts on social networking sites (the ugly) in terms of their characteristics, the techniques for mining them, and the applications that use these types of text. The intention is to familiarize the audience with the challenges involved in effectively gleaning semantics from these three types of text, and to present a survey of enabling tasks such as entity recognition, disambiguation, and attribute or relation extraction. In doing so, we also present variations of some of these enabling techniques that make explicit use of domain knowledge. Going beyond the extraction of semantics, we survey applications that utilize the extracted semantics to support knowledge discovery and search-like operations in a variety of domains and settings.
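
To make these enabling tasks concrete, the sketch below runs a sentence through an off-the-shelf tokenizer, part-of-speech tagger, and named-entity chunker from NLTK. NLTK is our illustrative choice, not a tool prescribed by the tutorial, and the helper name extract_entities is ours:

    import nltk

    # One-time model downloads the chunker needs (skip if already installed):
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    def extract_entities(sentence):
        """Return (entity text, entity type) pairs found by NLTK's chunker."""
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)
        tree = nltk.ne_chunk(tagged)
        entities = []
        for node in tree:
            if hasattr(node, "label"):  # named-entity subtrees carry a label
                text = " ".join(word for word, tag in node.leaves())
                entities.append((text, node.label()))
        return entities

    print(extract_entities("Tim Berners-Lee proposed the Semantic Web at CERN."))
    # e.g. [('Tim Berners-Lee', 'PERSON'), ('CERN', 'ORGANIZATION')];
    # exact output depends on the bundled model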

Underlying the use of the aforementioned corpora for semantic computing is the fundamental problem of gleaning semantics from text. We believe that the challenges involved in this task differ across types of text. Some kinds of text are rather well-formed and simple for machines to process, e.g., Wikipedia text. This can be attributed to the fact that Wikipedia is an encyclopedia of factual knowledge intended for use by any lay person. Consequently, factual information written in very simple sentential forms makes up the majority of Wikipedia text. This, coupled with the loose hierarchical organization of Wikipedia and the availability of resources like DBpedia, makes extraction of information somewhat more manageable.
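
As an illustration of that last point, the sketch below queries DBpedia's public SPARQL endpoint for facts harvested from a Wikipedia article, using the SPARQLWrapper library. The specific resource and property are illustrative; the point is that Wikipedia's structure is already exposed as machine-readable triples:

    from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

    # Ask DBpedia for the English abstract extracted from the Wikipedia
    # article on the Semantic Web.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
          <http://dbpedia.org/resource/Semantic_Web> dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["abstract"]["value"][:200])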

Biomedical literature, however, contains text that describes complex scientific investigations and does not always contain explicit factual assertions. Instead, there is often a series of arguments, opinions, and experiments supported by evidence that collectively corroborate or refute a hypothesis that may never be stated explicitly in a single simple sentence. Sentences tend to be rather long and convoluted. Furthermore, domain-specific terms, abbreviations, number ranges, and symbols often make sentences hard even for a human reader to parse, further complicating automated information extraction. These factors make the task of mining biomedical text substantially more complex than mining Wikipedia-like text.
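
Even the seemingly simple subtask of pairing abbreviations with their long forms illustrates the difficulty. The sketch below is a deliberately simplified heuristic in the spirit of Schwartz and Hearst's abbreviation-pairing algorithm, written for this page rather than taken from the tutorial; real biomedical text defeats it quickly (nested parentheses, symbols, Greek letters):

    import re

    def find_abbreviations(sentence):
        """Pair each parenthesized short form with a candidate long form.

        Simplified heuristic: take the words immediately before '(ABBR)'
        and keep a span whose first initial matches the abbreviation's
        first letter. Real biomedical text needs far more care.
        """
        pairs = []
        for match in re.finditer(r"\(([A-Za-z][\w-]{1,9})\)", sentence):
            short = match.group(1)
            words = sentence[:match.start()].rstrip().split()
            # Candidate long form: as many preceding words as the short
            # form has characters (a common upper bound in the heuristic).
            candidate = words[-len(short):]
            if candidate and candidate[0][0].lower() == short[0].lower():
                pairs.append((" ".join(candidate), short))
        return pairs

    print(find_abbreviations(
        "Polymerase chain reaction (PCR) amplifies tumor necrosis factor (TNF)."))
    # [('Polymerase chain reaction', 'PCR'), ('tumor necrosis factor', 'TNF')]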

Text found in blogs and social networks, on the other hand, is casual and often written in broken English. While biomedical literature adheres to an agreed-upon vocabulary, casual text is beset with constantly evolving demographic- and domain-specific slang that also depends on a rich background of shared knowledge. Text in blogs is also heavily shaped by the social and cultural norms prevalent in micro-communities within the blogosphere. Owing to the constant productivity of language, exhaustive enumeration of such vocabularies is impossible, as the sketch below demonstrates. These factors further complicate attempts to extract semantics from this kind of text.
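
To see why enumeration fails, consider a trivial dictionary-based normalizer. The tiny slang map below is purely illustrative; the interesting behavior is what happens to tokens it has never seen:

    # A minimal normalization sketch for casual text (illustrative only).
    SLANG = {"u": "you", "gr8": "great", "b4": "before",
             "imho": "in my humble opinion"}

    def normalize(post):
        """Replace known slang tokens; pass unknown tokens through."""
        tokens = post.lower().split()
        return " ".join(SLANG.get(tok, tok) for tok in tokens)

    print(normalize("IMHO u will luv this b4 long"))
    # 'in my humble opinion you will luv this before long'
    # Note that 'luv' survives untouched: it is not in our list, and
    # tomorrow's slang will not be either -- the core difficulty here.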

In each of the three cases, a different combination of factors, pertaining to the intended use of the text, the intended consumer of the information therein, linguistic variation, and the availability (or lack thereof) of domain knowledge, conspires to make automated extraction of semantics a problem of varying difficulty. Consequently, no single solution fits all types of text, and techniques that enable semantic computing must be designed with the fundamental characteristics of the text in mind.


===============================================

Short Bios:

Meenakshi Nagarajan's research interests are in the statistical and natural language processing of data originating from social software such as blogs, social networks, and chats. She has collaborated with researchers at HP and at IBM Research at Almaden. She has 10 publications and has served on or is serving on 5 program committees.

Cartic Ramakrishnan's expertise is in complex entity and relationship extraction from text (especially biomedical literature) and in knowledge discovery techniques utilizing such extracted knowledge. He has collaborated with researchers at the National Library of Medicine at the National Institutes of Health, and at IBM Research at Almaden. He has 15 publications and has served on or is serving on 7 program committees.

Amit Sheth (http://knoesis.wright.edu/amit/) is an educator, researcher, and entrepreneur. He is the LexisNexis Ohio Eminent Scholar, an IEEE Fellow, and the director of the Kno.e.sis Center in the Computer Science and Engineering Department of Wright State University. Earlier he was at the University of Georgia, where he started the LSDIS lab in 1994, and he served in R&D groups at Bellcore, Unisys, and Honeywell.

His research has led to two commercial companies, which he founded and led; several enterprise and Web-based products; and many deployed applications in industry, health care, and scientific research. He is one of the most-cited authors in computer science (22 publications with 100+ citations each, an h-index of 50, and 11,000+ total citations), has given 200 invited talks and colloquia including 30 keynotes, has (co-)organized or chaired 40 conferences and workshops, and has served on over 125 program committees. He is on several journal editorial boards, is the Editor-in-Chief of the International Journal on Semantic Web and Information Systems (IJSWIS) and joint Editor-in-Chief of Distributed and Parallel Databases (DAPD), and is an editor of two Springer book series. Prof. Sheth has offered well over 20 tutorials at most major international conferences in his areas of work, including WWW, SIGMOD, VLDB, ICDE, ICWS/SCC, and CAiSE. He has also offered professional courses and short courses at various international events and institutions, and he is one of the first to introduce courses on Enterprise Integration (since 1995), Web-based Information Systems (since 1996), the Semantic Web (since 2002), and Semantic Web Services (since 2003).

