Learning trustworthy models from positive and unlabelled data

Pawel Teisseyre

Assistant Professor at the Polish Academy of Sciences

The goal of the research stay is to explore learning classification models from positive-unlabelled (PU) data. In PU learning, it is assumed that only some observations in the training data are assigned a label, which is always positive, whereas the remaining observations are unlabelled and can be either positive or negative. PU datasets appear naturally in many domains. For example, in medical databases, some patients have been diagnosed with a disease (positive observations), whereas the remaining patients have not been diagnosed. However, the absence of a diagnosis does not mean that the patient does not have the disease in question. Further examples include detecting illegal or detrimental content in social networks, under-reporting, and image and text classification, among many others. The goal is to build a binary classification model using such incompletely labelled data. There are risks associated with modelling this type of data. Treating unlabelled examples as negative ones (biased learning) or ignoring labelling bias may lead to poor accuracy of the resulting model and to misleading conclusions. To learn trustworthy classification models from PU data, it is necessary to consider the problem of propensity score estimation, where the propensity score of a positive example is the probability that it is labelled. This probability may depend on the example's feature vector, which increases the difficulty of the problem. We plan to analyse the PU setting for multi-label datasets, as PU multi-label datasets appear even more often than single-label PU datasets. This is because in many situations it is difficult to assign all relevant labels to a given object, so some labels naturally remain unassigned. Accurate estimation of the propensity score will make it possible to estimate the multivariate joint posterior probability, which is a crucial task in multi-label learning.
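The effect of biased learning described above can be illustrated with a small simulation. The sketch below is only an illustration under simplifying assumptions that are not part of the proposal: a single binary feature, a constant propensity score (every positive example is labelled with the same probability), and a propensity value that is known rather than estimated. All names and numbers (`simulate_pu`, `TRUE_POSTERIOR`, `PROPENSITY`) are hypothetical. Under these assumptions the probability of being labelled equals the propensity score times the probability of being positive, so treating unlabelled examples as negative underestimates the true posterior, and dividing by the propensity score recovers it.

```python
import random

def simulate_pu(n, posterior, propensity, seed=0):
    """Draw (x, y, s) triples: x is a binary feature, y the true class,
    s the observed label (s = 1 only for labelled positives)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)
        y = 1 if rng.random() < posterior[x] else 0
        # Only positive examples can be labelled, each with probability `propensity`.
        s = 1 if (y == 1 and rng.random() < propensity) else 0
        data.append((x, y, s))
    return data

def labelled_rate(data, x_value):
    """Fraction of labelled examples among those with feature value x_value.
    This is what a model estimates when unlabelled examples are treated as negative."""
    labels = [s for x, _, s in data if x == x_value]
    return sum(labels) / len(labels)

# Hypothetical values chosen for illustration only.
TRUE_POSTERIOR = {0: 0.2, 1: 0.7}   # P(y = 1 | x)
PROPENSITY = 0.4                    # P(s = 1 | y = 1), assumed constant and known

data = simulate_pu(100_000, TRUE_POSTERIOR, PROPENSITY)
naive = {x: labelled_rate(data, x) for x in (0, 1)}
corrected = {x: naive[x] / PROPENSITY for x in (0, 1)}

for x in (0, 1):
    print(f"x={x}: true={TRUE_POSTERIOR[x]:.2f} "
          f"naive={naive[x]:.3f} corrected={corrected[x]:.3f}")
```

In practice the propensity score is unknown and may vary with the feature vector, which is why its estimation is the central problem addressed here.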

The objective of the research visit is to develop methods for learning trustworthy classification models from positive-unlabelled datasets. In particular, we will focus on the problem of propensity score estimation, which has been identified as an important problem in the machine learning community. Accurate estimation of the propensity score makes it possible to learn trustworthy models when labelling bias is present. This is particularly important in medical applications, where the probability of diagnosing a disease in a person who really has it may strongly depend on several factors, such as age, availability of health care facilities, social status, level of education and others. Labelling bias also appears naturally in the multi-label case. For example, the probability of diagnosing a disease in a person who really has it may additionally depend on the presence of other diseases. Using propensity score estimation techniques would make it possible to improve the accuracy of multi-label models. This would be particularly valuable for the scientific community, as there is a lack of propensity score estimation methods developed for the multi-label case.

Several application domains may potentially benefit from our research advances. The proposed PU learning methods can be used to address important problems occurring in medical applications, such as handling databases containing under-reported diseases. More specifically, the proposed methods make it possible to predict the probability of a disease (or multiple diseases) in a given patient when the model is trained on an incomplete dataset containing under-reported diseases. The PU methods may be valuable for detecting ‘false negative’ patients, i.e. those who have the disease but remain undiagnosed. Further examples include the application of the proposed methods to image and text classification based on incompletely labelled datasets, dealing with under-reporting in surveys, and detecting illegal content in social network analysis, among others.
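The role of a feature-dependent propensity score, such as a diagnosis probability that depends on age, can be sketched in a similar way. The example below is a minimal illustration and not the proposed method: it assumes two hypothetical age groups with the same true disease prevalence but different, known propensity scores, and it uses simple inverse propensity weighting (each labelled example is counted with weight one over its propensity score). The names `simulate`, `naive_rate`, `ipw_rate` and all numeric values are assumptions made for illustration.

```python
import random

def simulate(n, prevalence, propensity_by_group, seed=1):
    """Draw (group, y, s) triples: y is the true disease status,
    s = 1 only if the disease is present and has been diagnosed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["young", "old"])
        y = rng.random() < prevalence
        # Diagnosis probability depends on the group (labelling bias).
        s = y and rng.random() < propensity_by_group[group]
        rows.append((group, int(y), int(s)))
    return rows

def naive_rate(rows, group):
    """Diagnosed fraction in a group, i.e. undiagnosed treated as healthy."""
    labels = [s for g, _, s in rows if g == group]
    return sum(labels) / len(labels)

def ipw_rate(rows, group, propensity_by_group):
    """Inverse-propensity-weighted estimate of the true prevalence in a group."""
    labels = [s for g, _, s in rows if g == group]
    weight = 1.0 / propensity_by_group[group]
    return sum(s * weight for s in labels) / len(labels)

# Hypothetical values: equal prevalence, unequal chance of being diagnosed.
PREVALENCE = 0.3
PROPENSITY_BY_GROUP = {"young": 0.2, "old": 0.6}

rows = simulate(100_000, PREVALENCE, PROPENSITY_BY_GROUP)
for group in ("young", "old"):
    print(f"{group}: naive={naive_rate(rows, group):.3f} "
          f"ipw={ipw_rate(rows, group, PROPENSITY_BY_GROUP):.3f}")
```

In this sketch the naive diagnosed rates suggest that the disease is far more common in the older group, although the true prevalence is identical in both groups. The weighted estimates remove this artefact, but only because the propensity scores are assumed to be known; with real data they would have to be estimated.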

Keywords: positive-unlabelled data, propensity score estimation, multi-label learning
Scientific area: machine learning