Large Scale Topic Modeling Using Search Queries: An Information-Theoric Approach

Hdl Handle:
http://hdl.handle.net/11285/572556
Title:
Large Scale Topic Modeling Using Search Queries: An Information-Theoric Approach
Authors:
Ramírez Rangel, Eduardo H.
Issue Date:
01/12/2010
Abstract:
Creating topic models of text collections is an important step towards more adaptive information access and retrieval applications. Such models encode knowledge of the topics discussed on a collection, the documents that belong to each topic and the semantic similarity of a given pair of topics. Among other things, they can be used to focus or disambiguate search queries and construct visualizations to navigate across the collection. So far, the dominant paradigm to topic modeling has been the Probabilistic Topic Modeling approach in which topics are represented as probability distributions of terms, and documents are assumed to be generated from a mixture of random topics. Although such models are theoretically sound, their high computational complexity makes them difficult to use in very large scale collections. In this work we propose an alternative topic modeling paradigm based on a simpler representation of topics as freely overlapping clusters of semantically similar documents, that is able to take advantage of highly-scalable clustering algorithms. Then, we propose the Querybased Topic Modeling framework (QTM), an information-theoretic method that assumes the existence of a "golden" set of queries that can capture most of the semantic information of the collection and produce models with máximum semantic coherence. The QTM method uses information-theoretic heuristics to find a set of "topical-queries" which are then co-clustered along with the documents of the collection and transformed to produce overlapping document clusters. The QTM framework was designed with scalability in mind and is able to be executed in parallel over commodity-class machines using the Map-Reduce approach. Then, in order to compare the QTM results with models generated by other methods we have developed metrics that formalize the notion of semantic coherence using probabilistic concepts and the familiar notions of recall and precisión. In contrast to traditional clustering metrics, the proposed metrics have been generalized to validate overlapping and potentially incomplete clustering solutions using multi-labeled corpora. We use them to experimentally validate our query-based approach, showing that models produced using selected queries outperform the ones produced using the collection vocabulary. Also, we explore the heuristics and settings that determine the performance of QTM and show that the proposed method can produce models of comparable, or even superior quality, than those produced with state of the art probabilistic methods.
Keywords:
Queries; Qtm
Degree Program:
Graduate Programs in Mechatronics and Information Technologies
Advisors:
Dr. Ramón F. Brena
Committee Member / Sinodal:
Dra. Alma Delia Cuevas; Dr. José Ignacio Leaza; Dr. Leonardo Garrido; Dr. Randy Goebel
Degree Level:
Doctor of Philosophy
School:
School Of Engineering
Campus Program:
Campus Monterrey
Discipline:
Ingeniería y Ciencias Aplicadas / Engineering & Applied Sciences
Appears in Collections:
Ciencias Exactas

Full metadata record

DC FieldValue Language
dc.contributor.advisorDr. Ramón F. Brenaes
dc.contributor.authorRamírez Rangel, Eduardo H.es
dc.date.accessioned2015-08-17T11:35:09Zen
dc.date.available2015-08-17T11:35:09Zen
dc.date.issued01/12/2010-
dc.identifier.urihttp://hdl.handle.net/11285/572556en
dc.description.abstractCreating topic models of text collections is an important step towards more adaptive information access and retrieval applications. Such models encode knowledge of the topics discussed on a collection, the documents that belong to each topic and the semantic similarity of a given pair of topics. Among other things, they can be used to focus or disambiguate search queries and construct visualizations to navigate across the collection. So far, the dominant paradigm to topic modeling has been the Probabilistic Topic Modeling approach in which topics are represented as probability distributions of terms, and documents are assumed to be generated from a mixture of random topics. Although such models are theoretically sound, their high computational complexity makes them difficult to use in very large scale collections. In this work we propose an alternative topic modeling paradigm based on a simpler representation of topics as freely overlapping clusters of semantically similar documents, that is able to take advantage of highly-scalable clustering algorithms. Then, we propose the Querybased Topic Modeling framework (QTM), an information-theoretic method that assumes the existence of a "golden" set of queries that can capture most of the semantic information of the collection and produce models with máximum semantic coherence. The QTM method uses information-theoretic heuristics to find a set of "topical-queries" which are then co-clustered along with the documents of the collection and transformed to produce overlapping document clusters. The QTM framework was designed with scalability in mind and is able to be executed in parallel over commodity-class machines using the Map-Reduce approach. Then, in order to compare the QTM results with models generated by other methods we have developed metrics that formalize the notion of semantic coherence using probabilistic concepts and the familiar notions of recall and precisión. In contrast to traditional clustering metrics, the proposed metrics have been generalized to validate overlapping and potentially incomplete clustering solutions using multi-labeled corpora. We use them to experimentally validate our query-based approach, showing that models produced using selected queries outperform the ones produced using the collection vocabulary. Also, we explore the heuristics and settings that determine the performance of QTM and show that the proposed method can produce models of comparable, or even superior quality, than those produced with state of the art probabilistic methods.en
dc.language.isoenen
dc.rightsOpen Accessen
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/*
dc.titleLarge Scale Topic Modeling Using Search Queries: An Information-Theoric Approachen
dc.typeTesis de Doctoradoes
thesis.degree.grantorInstituto Tecnológico y de Estudios Superiores de Monterreyes
thesis.degree.levelDoctor of Philosophyen
dc.contributor.committeememberDra. Alma Delia Cuevases
dc.contributor.committeememberDr. José Ignacio Leazaes
dc.contributor.committeememberDr. Leonardo Garridoes
dc.contributor.committeememberDr. Randy Goebeles
thesis.degree.disciplineSchool Of Engineeringen
thesis.degree.nameGraduate Programs in Mechatronics and Information Technologiesen
dc.subject.keywordQueriesen
dc.subject.keywordQtmen
thesis.degree.programCampus Monterreyes
dc.subject.disciplineIngeniería y Ciencias Aplicadas / Engineering & Applied Scienceses
All Items in REPOSITORIO DEL TECNOLOGICO DE MONTERREY are protected by copyright, with all rights reserved, unless otherwise indicated.