Reinforcement Learning (RL) enables agents to learn optimal decision making through interaction within a dynamic environment. Recent advances in deep learning and RL have allowed intelligent agents to exhibit unprecedented success in various domains and achieve super-human performance in many tasks. RL and deep learning are impacting almost all areas of academia and industry today, and there is a growing interest in applying them to Information Retrieval (IR). Companies like Google and Alibaba have already started to use RL-based search and recommendation engines to personalize their services and enhance user experiences on their ecosystem.
Current online resources for learning RL either focus on theory at the expense of hands-on practice or are limited to implementations without sufficient intuition and theoretical background. This full-day tutorial has been carefully tailored for IR researchers and practitioners to gain both theoretical knowledge and hands-on experience with the most popular RL methods using PyTorch and Python Jupyter notebooks on Google Colab. We aim to equip the participants with a working knowledge of RL, which will help them better understand the latest IR publications involving RL and enable them to tackle their own IR problems using RL.
Our tutorial does not require any previous knowledge of the topic and starts with fundamental concepts and algorithms such as Markov Decision Processes, Exploration vs Exploitation, Q-Learning, Policy Gradient and Actor-Critic algorithms. We focus particularly on the combination of Reinforcement Learning and Deep Learning, with algorithms such as Deep Q-Network (DQN). Lastly, we describe how these techniques can be utilized to address representative IR problems like “learning to rank” and discuss recent developments as well as an outlook for future research.
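To make the Q-Learning and Exploration vs Exploitation concepts mentioned above concrete, the following is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. It is not taken from the tutorial material: the five-state chain environment, the function name `q_learning`, and all hyperparameter values are assumptions chosen purely for illustration, and plain Python is used instead of PyTorch to keep the example dependency-free.

```python
import random


def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain environment.

    Action 1 moves one state to the right, action 0 one state to the left.
    Reaching the last state ends the episode with reward 1.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    terminal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            # Exploration vs exploitation: random action with probability epsilon.
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                best = max(q[state])
                # Break ties at random so the untrained agent does not get stuck.
                action = rng.choice(
                    [a for a in range(n_actions) if q[state][a] == best])
            if action == 1:
                next_state = min(state + 1, terminal)
            else:
                next_state = max(state - 1, 0)
            reward = 1.0 if next_state == terminal else 0.0
            # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(4)]
print(greedy_policy)
```

After training, the greedy policy should prefer moving right in every non-terminal state, since that is the only way to reach the reward in this toy environment.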
IR From Bag-of-words to BERT and Beyond through Practical Experiments – An ECIR 2021 tutorial with PyTerrier and OpenNIR
Advances from the natural language processing community have recently sparked a renaissance in the task of ad hoc search. In particular, large contextualized language modeling techniques, such as BERT, have equipped ranking models with a far deeper understanding of language than previous bag-of-words (BoW) models could offer. Applying these techniques to a new task is tricky, requiring knowledge of deep learning frameworks as well as significant scripting and data munging. In this full-day tutorial, we build up from foundational retrieval principles to the latest neural ranking techniques. We provide background on classical (e.g., BoW), modern (e.g., Learning to Rank) and contemporary (e.g., BERT, doc2query) search ranking and re-ranking techniques. Going further, we detail and demonstrate how these can easily be applied experimentally to new search tasks in a new declarative style of conducting experiments, exemplified by the PyTerrier and OpenNIR search toolkits.
This tutorial is interactive in nature for participants; it is broken into sessions, each of which mixes explanatory presentation with hands-on activities using prepared Jupyter notebooks running on the Google Colab platform.
At the end of the tutorial, participants will be comfortable accessing classical inverted index data structures, building declarative retrieval pipelines, and conducting experiments using state-of-the-art neural ranking models.
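As a rough illustration of the classical end of this spectrum, the sketch below builds a tiny inverted index and scores documents with a common BM25 formulation over bag-of-words term counts. It deliberately does not use the PyTerrier or OpenNIR APIs; the function names, the three toy documents, and the parameter defaults (`k1=1.2`, `b=0.75`) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents for a query using a standard BM25 variant."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        # Smoothed inverse document frequency; always positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalisation penalises longer-than-average documents.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": "neural ranking with bert",
    "d2": "classical bag of words retrieval",
    "d3": "bert improves neural retrieval ranking models",
}
index, lengths = build_index(docs)
ranked = bm25("bert ranking", index, lengths)
print(ranked)
```

Both `d1` and `d3` contain the two query terms once each, so the shorter document `d1` ranks first because of length normalisation; `d2` matches no query term and is not returned.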
Fake News, Disinformation, Propaganda, Media Bias, and Flattening the Curve of the COVID-19 Infodemic
The rise of social media has democratized content creation and has made it easy for anybody to share and spread information online. On the positive side, this has given rise to citizen journalism, thus enabling much faster dissemination of information compared to what was possible with newspapers, radio, and TV. On the negative side, stripping traditional media of their gate-keeping role has left the public unprotected against the spread of disinformation, which can now travel at breaking-news speed over the same democratic channel. This situation gave rise to the proliferation of false information specifically created to affect individual people’s beliefs, and ultimately to influence major events such as political elections; it also ushered in the Post-Truth Era, where appeal to emotions has become more important than the truth. More recently, with the emergence of the COVID-19 pandemic, a new blending of medical and political misinformation and disinformation has given rise to the first global infodemic. Limiting the impact of these negative developments has become a major focus for journalists, social media companies, and regulatory authorities. The tutorial offers an overview of the emerging and inter-connected research areas of fact-checking, misinformation, disinformation, “fake news”, propaganda, and media bias detection, with a focus on text and on computational approaches. It further explores the general fact-checking pipeline and important elements thereof, such as check-worthiness estimation, detecting previously fact-checked claims, stance detection, source reliability estimation, and detecting malicious users in social media. Finally, it covers some recent developments such as the emergence of large-scale pre-trained language models, and the challenges and opportunities they offer.
Operationalizing Treatments against Bias: Challenges and Solutions
Among the technologies that have gained attention in recent years, ranking and recommender systems play a key role in today’s online platforms, influencing how and what information individuals access. However, the adoption of machine learning in information retrieval has shown biased and even discriminatory impacts in various domains. Given that bias is becoming a threat to information seeking, uncovering, characterizing, and counteracting biases, while preserving the effectiveness of the system, is proving to be essential. The core of the problem deals with the study of interdisciplinary concepts, the design of bias-aware algorithmic pipelines, and the materialization and mitigation of biased effects.
This tutorial provides a timely perspective to consider while inspecting information retrieval outputs, leaving attendees with a solid understanding of how to integrate bias-related countermeasures into their research pipeline.
The first part introduces real-world examples of how bias can impact our society, the conceptual foundations underlying the study of bias and fairness in algorithmic decision-making, and the strategies to plan, uncover, assess, reduce, and evaluate bias in an information retrieval system.
The second part provides practical case studies, in which attendees are engaged in uncovering sources of bias and in designing countermeasures for rankings of results. By means of use cases on personalized rankings, the presented algorithmic approaches help academic researchers and industrial practitioners to better develop systems that tackle bias constraints.
Finally, this tutorial identifies the current challenges in bias-aware research and new directions in the context of information retrieval.
Large-Scale Information Extraction under Privacy-Aware Constraints
People spend a significant portion of their lives online, and this has led to an explosion of personal data from users and their activities. Typically, this data is private and nobody except the user can look at it. This poses interesting and complex challenges from a scalable information extraction point of view: extracting information under privacy-aware constraints, where there is little data to learn from but highly accurate models are needed to run on a large amount of data across different users.
We present techniques which involve building IE models on a small amount of eyes-on data and a large amount of eyes-off data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints. More than 60% of the emails are business to consumer (B2C) emails. At Microsoft, we have developed information extraction systems to extract relevant information from these emails for a large number of scenarios (e.g., flights, hotels, appointments, etc.), for thousands of sender domains (e.g., Hilton, British Airways, etc.) and templates (HTML DOM structures)—to power a number of AI applications.
How do IE techniques for private eyes-off data differ from those for eyes-on HTML data? How can labelled data be obtained in a privacy-preserving manner? How can scalable extraction models be built across several sender domains using different ways to represent the information? In this tutorial, we address all these questions from various perspectives, from research to production. As part of the hands-on exercise, we use publicly available hotel confirmation email data sets and extract various fields from them (e.g., check-in date, hotel address, etc.) under simulated privacy constraints. We use Python Jupyter notebooks and various machine learning algorithms for this purpose.
Information retrieval (IR) systems provide access to large amounts of content, some of which may be personal, confidential, or otherwise sensitive. While some information protection is legislated (e.g., through the European Union’s General Data Protection Regulation (GDPR) or disclosure exemptions in Freedom of Information Acts), other cases are regulated only by social expectations. Information retrieval research has traditionally focused on finding indexed content. However, the increased intermixing of sensitive content with content that can properly be disclosed now motivates research on systems that can balance multiple interests: serving the searcher’s interest in finding content while serving other stakeholders’ interests in appropriately protecting their sensitive information.
If the content requiring protection were marked, protecting it would be straightforward. There are, however, many cases in which sensitive content must be discovered before it can be protected. Discovering such sensitivities ranges in complexity from the detection of personally identifiable information (PII) to automated text classification for sensitive content, to human-in-the-loop techniques for identifying sensitivities that result from the context in which the information was produced.
Once sensitive content has been discovered, IR systems can use the results of sensitivity decisions to filter search results, or such systems can be designed to balance the risks of missing relevant content with the risks of disclosing sensitive content. Optimising such systems’ performance depends on how well the sensitivity classification works and requires the development of new evaluation measures. Moreover, the evaluation of such sensitivity-aware IR systems requires new test collections that contain (actual or simulated) sensitive content, and when actual sensitive content is used, secure ways of evaluating retrieval algorithms (e.g., algorithm deposit or trusted online evaluation) are needed. Where untrusted data centres provide services, encrypted search may also be needed. Many of these components rely on algorithmic privacy guarantees, such as those provided by k-anonymity, l-diversity, t-closeness or differential privacy.
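Of the privacy guarantees listed above, k-anonymity is the simplest to state: every combination of quasi-identifier values must be shared by at least k records. The following minimal sketch computes the largest k that a table satisfies. The function name, the field names, and the generalised records are hypothetical and serve only as an illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all quasi-identifier fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical records with already generalised age and zip values.
records = [
    {"age": "30-39", "zip": "123**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "123**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "456**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "456**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "456**", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["age", "zip"])
print(k)
```

Here the smallest group under `age` and `zip` has two members, so the table is 2-anonymous with respect to those fields. Treating `diagnosis` as a quasi-identifier as well would reduce k to 1, because some records would then be unique.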
This tutorial will introduce these challenges, review work to date on each aspect of the problem, describe current best practices, and identify open research questions.
Modern recommendation systems (RS) use various machine learning (ML) models to provide users with relevant and personalized recommendations about products in an extensive catalog. Despite the great success of ML models in making recommendations, they are often not robust to opposing actors, such as competitors, who may take action to alter the recommendations to a malicious outcome. While the injection of hand-engineered fake profiles, aka shilling attacks, was the focus of investigations between the years 2000 and 2015, the last few years have been characterized by the rise of Adversarial Machine Learning (AML) techniques, namely ML-based approaches to attack and defend RS.
In this tutorial, we present an overview of AML applications in RS and, in particular, we introduce a dual categorization of AML uses in RS: one based on the study of adversarial attacks and defences against model parameters, content data, or user-element interactions; the other concerning the use of Generative Adversarial Networks (GANs) to propose new recommendation models.
Exploring the vast amount of rapidly growing biomedical text is of utmost importance, but is also particularly challenging due to the highly specialized domain knowledge and inconsistency of the nomenclature. This introductory tutorial will be a hands-on session to explore the semantics encoded in biomedical ontologies to process text using shell scripting with minimal software dependencies. Participants will learn how to process OWL, retrieve synonyms and ancestors, perform entity linking, and construct large lexicons.
This tutorial will present how we can select an ontology that models a given domain and identify the official names and synonyms of biomedical entities. This tutorial will use two ontologies, one about human diseases and the other about chemical entities of biological interest. The semantics encoded in those ontologies will be explored to find the ancestors and related classes of a given entity. Participants will learn how to apply semantic similarity to address ambiguity in the entity linking process. After constructing large lexicons that include all the entities of a given domain, participants will learn how to recognize them in biomedical text.
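The ideas of retrieving ancestors and using semantic similarity to resolve ambiguity can be sketched as follows. The tutorial itself works with shell scripting on real ontologies; this example instead uses Python and a small hand-written is-a hierarchy, which is a toy structure assumed for illustration and not actual ontology content. The similarity shown (Jaccard overlap of ancestor sets) is just one simple choice among many possible measures.

```python
def ancestors(term, parents):
    """Collect every class reachable from `term` through is-a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def shared_ancestor_similarity(a, b, parents):
    """Jaccard overlap of the two ancestor sets (each including the term itself)."""
    anc_a = ancestors(a, parents) | {a}
    anc_b = ancestors(b, parents) | {b}
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Toy is-a hierarchy (hypothetical; child -> list of direct parents).
parents = {
    "malaria": ["parasitic disease"],
    "parasitic disease": ["infectious disease"],
    "influenza": ["viral disease"],
    "viral disease": ["infectious disease"],
    "infectious disease": ["disease"],
    "asthma": ["respiratory disease"],
    "respiratory disease": ["disease"],
}

print(sorted(ancestors("malaria", parents)))
sim_close = shared_ancestor_similarity("malaria", "influenza", parents)
sim_far = shared_ancestor_similarity("malaria", "asthma", parents)
print(sim_close, sim_far)
```

In an entity linking setting, a score like this could be used to prefer the candidate entity that is most similar to the other entities already recognized in the same text.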
The tutorial will be a half day session (3 hours), according to the following outline:
Synonyms and Ancestors
This is an introductory tutorial, so no prior knowledge or experience in bioinformatics, text mining, or ontologies is required. Participants should, however, have basic experience in shell scripting and pattern matching.
Participants need a computer (any operating system) with internet access and a terminal with a UNIX shell. Before the tutorial, participants should check that everything is available on their computer using the Test Script. This YouTube Playlist includes videos explaining how to use the Test Script on different operating systems.