Archive for the ‘Paper Abstract’ category

Copulas for Information Retrieval

April 4th, 2013

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce the use of copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusion of results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora, compared to their non-copula counterparts. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.

This work, written together with Arjen P. de Vries and Kevyn Collins-Thompson, has been accepted for full oral presentation at the 36th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) in Dublin, Ireland.
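
To make the idea of combining multidimensional relevance estimates concrete, here is a minimal sketch of copula-based score fusion. It is an illustration only: the abstract does not say which copula families or estimation procedures the paper uses, so the Gumbel copula, the fixed `theta` parameter, the empirical-CDF normalisation, and the function names `empirical_cdf`, `gumbel_copula`, and `fuse` are all assumptions made for this example.

```python
import math
from bisect import bisect_right


def empirical_cdf(scores):
    """Map each raw score to an empirical CDF value strictly inside (0, 1)."""
    ordered = sorted(scores)
    n = len(ordered)
    # Dividing by n + 1 keeps values below 1, so log(u) is never zero.
    return [bisect_right(ordered, s) / (n + 1) for s in scores]


def gumbel_copula(u, v, theta):
    """Bivariate Gumbel copula C(u, v); theta = 1 reduces to the product u * v."""
    if theta < 1:
        raise ValueError("theta must be >= 1 for the Gumbel copula")
    a = (-math.log(u)) ** theta
    b = (-math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))


def fuse(scores_a, scores_b, theta=2.0):
    """Fuse two per-document score lists into one score per document.

    Assumption: the joint CDF value is used directly as the fused score.
    """
    us = empirical_cdf(scores_a)
    vs = empirical_cdf(scores_b)
    return [gumbel_copula(u, v, theta) for u, v in zip(us, vs)]


if __name__ == "__main__":
    topical = [0.9, 0.2, 0.6]   # hypothetical topical relevance estimates
    quality = [0.8, 0.3, 0.1]   # hypothetical quality estimates
    fused = fuse(topical, quality)
    ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    print(ranking, fused)
```

With `theta = 1` the fused score is simply the product of the two marginal values, which corresponds to treating the dimensions as independent; larger values of `theta` model stronger dependence between the dimensions and move the fused score towards the smaller of the two marginal values.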

Exploiting User Comments for Audio-visual Content Indexing and Retrieval

December 4th, 2012

State-of-the-art content sharing platforms often require users to assign tags to pieces of media in order to make them easily retrievable. Since this task is sometimes perceived as tedious or boring, annotations can be sparse. Commenting, on the other hand, is a frequently used means of expressing user opinion towards shared media items. We propose the use of time series analyses in order to infer potential tags and indexing terms for audio-visual content from user comments. In this way, we mitigate the vocabulary gap between queries and document descriptors. Additionally, we show how large-scale encyclopedias such as Wikipedia can aid the task of tag prediction by serving as surrogates for high-coverage natural language vocabulary lists. Our evaluation is conducted on a corpus of several million real-world user comments from the popular video sharing platform YouTube, and demonstrates significant improvements in retrieval performance.

This work, written together with Wen Li and Arjen P. de Vries, has been accepted for full oral presentation at the 35th European Conference on Information Retrieval (ECIR) in Moscow, Russia.

Designing Human-Readable User Profiles for Search Evaluation

December 4th, 2012

Forming an accurate mental model of a user is crucial for the qualitative design and evaluation steps of many information-centric applications such as web search, content recommendation, or advertising. This process can often be time-consuming as search and interaction histories become verbose. We present and analyze the usefulness of concise human-readable user profiles in order to enhance system tuning and evaluation by means of user studies.

This work, written together with Kevyn Collins-Thompson, Paul Bennett and Susan Dumais, has been accepted for poster presentation at the 35th European Conference on Information Retrieval (ECIR) in Moscow, Russia.

Personalizing Atypical Web Search Sessions

November 12th, 2012

State-of-the-art web search personalization treats users as static or slowly evolving entities with a given set of preferences defined by their past behavior. However, recent publications as well as empirical evidence suggest that there is a significant number of search sessions in which users diverge from their regular search profiles in order to satisfy atypical, non-recurring information needs. In this work, we conduct a large-scale inspection of real-life search sessions to further the understanding of this problem. Subsequently, we design an automatic means of detecting and supporting such atypical sessions. We demonstrate significant improvements over state-of-the-art web search personalization techniques by accounting for the typicality of search sessions. The merit of the proposed method is evaluated based on web-scale search session data spanning several months of user activity.

This work, written together with Kevyn Collins-Thompson, Paul Bennett and Susan Dumais, has been accepted for full oral presentation at the ACM International Conference on Web Search and Data Mining (WSDM) in Rome, Italy.

The Downside of Markup: Examining the Harmful Effects of CSS and Javascript on Indexing Today’s Web

July 16th, 2012

The continued development and maturation of advanced HTML features such as Cascading Style Sheets (CSS), JavaScript, and AJAX, as well as their widespread adoption by browsers, has enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page’s representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but which are nonetheless still contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. Finally, we conduct a retrieval experiment which shows that this divergence is already influencing web retrieval in a negative manner, and that we can improve performance by making use of properties that are only available via pages’ rendered forms. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web.

This paper has been accepted for publication at CIKM’12, Maui, USA.
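
The gap between raw HTML and what a user actually sees can be illustrated with a small sketch. It is a rough approximation only: the study compares raw HTML against fully rendered pages, whereas this example merely treats elements with an inline `display: none` style as invisible. The class `TextExtractor`, the helpers `extract_words` and `hidden_fraction`, and the sample markup are hypothetical names and data invented for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collect words from HTML, optionally skipping inline-hidden subtrees."""

    def __init__(self, skip_hidden):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden element
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or (self.skip_hidden and "display:none" in style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.words.extend(data.split())


def extract_words(html, skip_hidden):
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    parser.close()
    return parser.words


def hidden_fraction(html):
    """Share of words present in the raw HTML but hidden by inline styles."""
    raw = extract_words(html, skip_hidden=False)
    visible = extract_words(html, skip_hidden=True)
    return 1 - len(visible) / len(raw) if raw else 0.0


SAMPLE = ('<div><p>Overview text</p>'
          '<div style="display: none"><p>Hidden tab content</p></div></div>')

if __name__ == "__main__":
    print(extract_words(SAMPLE, skip_hidden=True))
    print(hidden_fraction(SAMPLE))
```

A real comparison would need a rendering engine, since visibility also depends on external style sheets and scripts that this heuristic ignores.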

Supporting Children’s Web Search in School Environments

May 29th, 2012

Nowadays, the Internet represents a ubiquitous source of information and communication. Its central role in everyday life is reflected in the curricula of modern schools. Already in early grades, children are encouraged to search for information on-line. However, the way in which they interact with state-of-the-art search interfaces and how they explore and interpret the presented information differ greatly from adult user behaviour. Our work describes a qualitative user study in which the Web search behaviour of Dutch elementary school children was observed and classified into roles motivated by prior research in cognitive science. Building on the findings of this study, we propose an automatic method of identifying struggling searchers in order to enable teaching personnel to provide appropriate and targeted guidance where needed.

This article was accepted for publication at IIiX 2012.

EmSe: Initial Evaluation of a Child-friendly Medical Search System

May 28th, 2012

When undergoing medical treatment in combination with extended stays in hospitals, children have been frequently found to develop an interest in their condition and the course of treatment. A natural means of searching for related information would be to use a web search engine. The medical domain, however, imposes several key challenges on young and inexperienced searchers, such as difficult terminology, potentially frightening topics or non-objective information offered by lobbyists or pharmaceutical companies. To address these problems, we present the design and usability study of EmSe, a search service for children in a hospital environment.

This article was accepted for presentation as a poster at IIiX 2012.

Quality through Flow and Immersion: Gamifying Crowdsourced Relevance Assessments

May 1st, 2012

Crowdsourcing is a market of steadily growing importance upon which both academia and industry increasingly rely. However, this market appears to be inherently infested with a significant share of malicious workers who try to maximise their profits through cheating or sloppiness. This serves to undermine the very merits crowdsourcing has come to represent. Based on previous experience as well as psychological insights, we propose the use of a game in order to attract and retain a larger share of reliable workers to frequently requested crowdsourcing tasks such as relevance assessments and clustering. In a large-scale comparative study conducted using recent TREC data, we investigate the performance of traditional HIT designs and a game-based alternative that is able to achieve high quality at significantly lower pay rates, facing fewer malicious submissions.

This article was accepted for publication in SIGIR 2012.

The BladeMistress Corpus: From Talk to Action in Virtual Worlds

February 2nd, 2012

Virtual worlds (VWs) are quickly emerging as a new channel for social interaction. They are at once very similar to, and very different from, the real world. These worlds are populated by the same people we interact with at work, and offer many of the activities we are used to: shopping, entertainment, socializing. The inhabitants take on the familiar roles of leaders, educators, craftsmen and salesmen. In addition, virtual worlds offer many activities that the participants cannot regularly experience in real life, such as taking part in a military raid or coordinating the economy of a city-state. Virtual worlds also offer something unique and very attractive to many of us: a clean slate, a chance to dramatically change anything and everything about ourselves, including our appearance, our social class, our circle of friends and foes. The opportunity for self-expression in virtual worlds is much greater than in the real world, where we are constrained by finances, social commitments, health conditions and physical forces like gravity.
We believe virtual worlds present a unique environment for studying the relation between human communications and actions in a natural, task-oriented environment. Observations from a virtual world present a nearly complete picture of the behavior of large crowds: we can observe the exact location of every individual, who they are talking to, what they are saying, but also what they are doing at any particular moment. It is this last factor that makes virtual world observations particularly useful: it allows us to explore the connections between words spoken by one individual and actions performed by another. We can study how words can influence crowds, how a swarm can self-organize into an efficient structure, how individuals negotiate the role they play in a particular task and how they react to outcomes of virtual-world events.

EmSe: Supporting Children’s Information Finding Needs within a Hospital Environment

January 25th, 2012

For children, illness and other medical conditions can be very confusing and frightening. Children faced with these will often express an interest in learning about their medical conditions, what is happening, and what to expect. However, finding information related to medical conditions is often a difficult and sensitive task, so designing and developing search services for children presents a number of challenges, including: children’s problems expressing information needs, finding and, crucially, identifying relevant information, and ensuring that information is understandable, appropriate, and sensitive to the child’s physical and emotional state. To address these, we developed the Emma Search (EmSe) engine for Emma Kinderziekenhuis (EKZ) at the Amsterdam Medical Centre (AMC).

The full article describing our demonstrator will appear in the proceedings of ECIR 2012.