PU Learning Fundamentals

This guide introduces the core ideas behind Positive-Unlabeled (PU) learning, explains the assumptions pulearn makes, and shows how those assumptions map to the library's API.


What Is PU Learning?

In standard supervised classification you have labeled examples for every class. PU learning is a semi-supervised setting where:

  • Labeled positives (P) — a set of confirmed positive examples.
  • Unlabeled (U) — a (usually much larger) set whose true class is unknown; it is a mixture of hidden positives and hidden negatives.

No confirmed negative examples exist in the training set. This arises naturally in domains such as:

Domain              Positive                        Unlabeled
Medical screening   Diagnosed patients              Unscreened population
Fraud detection     Known fraudulent transactions   Unreviewed transactions
Web mining          Pages matching a topic          Uncurated web pages
Drug discovery      Known active compounds          Unannotated compound library

The fundamental challenge: you cannot directly compute standard metrics (precision, recall, F1, AUC) because the unlabeled set contains an unknown fraction of hidden positives.


Core Assumptions

SCAR — Selected Completely At Random

The SCAR assumption states that the labeled positives are a random sample drawn uniformly from all true positives:

P(s = 1 | x, y = 1) = P(s = 1 | y = 1) = c

where s = 1 means "labeled" and c is the constant labeling propensity. Under SCAR:

  • The labeled positives are a representative sample of the full positive class (no selection bias in which positives get labeled).
  • The label rate c can be estimated from the data; all standard corrected metrics are valid.
  • Stratified cross-validation is valid using the binary PU label.

Most pulearn estimators and metrics assume SCAR.
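
To make SCAR concrete, the sketch below simulates a PU dataset in which every true positive is labeled with the same constant propensity c, independent of its features. This is plain NumPy, not the pulearn API, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, pi, c = 10_000, 0.4, 0.3

# Ground truth: a fraction pi of the population is positive.
y_true = (rng.random(n) < pi).astype(int)
X = rng.normal(loc=y_true[:, None], scale=1.0, size=(n, 2))

# SCAR labeling: each true positive is labeled with the same
# probability c, regardless of X.
y_pu = ((y_true == 1) & (rng.random(n) < c)).astype(int)

print(y_pu[y_true == 1].mean())  # ~ c = 0.3
print(y_pu.mean())               # ~ c * pi = 0.12 (the label frequency)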

SAR — Selected At Random

The SAR assumption relaxes SCAR by allowing the propensity to depend on observed features:

P(s = 1 | x, y = 1) = e(x)

SAR is more realistic in many real-world settings (e.g., a doctor is more likely to screen high-risk patients) but harder to work with because e(x) must be modeled separately. pulearn provides experimental SAR hooks (ExperimentalSarHook, predict_sar_propensity, compute_inverse_propensity_weights) that are still under active development.
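
For contrast, the same simulation under SAR lets the propensity depend on an observed feature, mirroring the doctor-screening example above. Again this is plain NumPy with illustrative names, not the experimental hooks themselves.

import numpy as np

rng = np.random.default_rng(0)
n, pi = 10_000, 0.4

y_true = (rng.random(n) < pi).astype(int)
risk = rng.random(n)  # an observed feature, e.g. a risk score

# SAR labeling: e(x) grows with the risk score, so high-risk
# positives are much more likely to be labeled.
e_x = 0.05 + 0.6 * risk
y_pu = ((y_true == 1) & (rng.random(n) < e_x)).astype(int)

# The labeled positives are now biased toward high risk:
print(risk[y_pu == 1].mean())    # noticeably above...
print(risk[y_true == 1].mean())  # ...the all-positives mean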

When SCAR Is Likely Violated

Watch for these warning signs:

  • Systematic selection bias: labeled positives differ from unlabeled samples on observable features (age, income, region, etc.).
  • SCAR sanity check warnings: scar_sanity_check reports group_separable, high_mean_shift, or max_feature_shift.
  • Propensity bootstrap flags: high resample variance or unstable diagnostics from diagnose_prior_estimator.

If SCAR is violated, standard corrected metrics will be biased. Consider a covariate-adjusted propensity model, or move to the experimental SAR hooks.
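
In code, a pre-flight check could look like the sketch below. The call signature is the one quoted above; what the returned report exposes is an assumption here, so treat it as illustrative.

from pulearn import scar_sanity_check

# y_pu: canonical PU labels; s_proba: scores from a classifier trained
# on the PU labels; X: feature matrix. The printed report is assumed to
# surface warnings such as group_separable or high_mean_shift.
report = scar_sanity_check(y_pu, s_proba=s_proba, X=X)
print(report)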


Key Quantities

Class Prior pi = P(y = 1)

The true fraction of positives in the population. This is almost never the observed labeled-positive fraction because only a subset of positives are labeled.

pi is required by most corrected metrics (pu_f1_score, pu_roc_auc_score, etc.) and by NNPUClassifier. Typical workflow:

  1. Use LabelFrequencyPriorEstimator for a lower-bound baseline.
  2. Use HistogramMatchPriorEstimator or ScarEMPriorEstimator for a data-driven estimate.
  3. Optionally bootstrap confidence intervals via .bootstrap(…).
  4. If uncertain, sweep corrected metrics over a plausible range with analyze_prior_sensitivity.

from pulearn import (
    LabelFrequencyPriorEstimator,
    HistogramMatchPriorEstimator,
    ScarEMPriorEstimator,
)

# X_train and y_pu (canonical PU labels) are assumed to be defined already.
baseline = LabelFrequencyPriorEstimator().estimate(X_train, y_pu)
histogram = HistogramMatchPriorEstimator().estimate(X_train, y_pu)
scar_em = ScarEMPriorEstimator().estimate(X_train, y_pu)

print(f"Lower bound: {baseline.pi:.3f}")
print(f"Histogram match: {histogram.pi:.3f}")
print(f"EM estimate: {scar_em.pi:.3f}")

Labeling Propensity c = P(s = 1 | y = 1)

The probability that a true positive is labeled. Under SCAR, c is a constant. It relates to the prior as:

c = P(s = 1) / P(y = 1) = label_frequency / pi

c is used for probability calibration (calibrate_posterior_p_y1) and by the Elkanoto classifiers internally. Typical workflow:

from pulearn import MeanPositivePropensityEstimator

# y_scores: predicted P(s = 1 | x) from a classifier trained on the PU labels
c_hat = MeanPositivePropensityEstimator().estimate(y_pu, s_proba=y_scores).c
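
Under SCAR, c_hat can be cross-checked against the identity above, because the label frequency is directly observable. In this sketch, pi_hat stands for any of the prior estimates from the previous section.

# label_frequency = P(s = 1) is just the fraction of labeled examples.
label_frequency = y_pu.mean()
c_from_prior = label_frequency / pi_hat  # c = label_frequency / pi

# Under SCAR the two estimates should roughly agree.
print(c_hat, c_from_prior)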

See the Evaluation Guide for how pi and c interact with corrected metrics.


Label Conventions

pulearn normalizes labels to a canonical internal form on every API boundary:

External label   Meaning            Internal canonical
1 or True        Labeled positive   1
0 or False       Unlabeled          0
-1               Unlabeled          0

Use pulearn.normalize_pu_labels(y) to convert labels at any boundary:

import numpy as np
from pulearn import normalize_pu_labels

y_raw = np.array([1, -1, -1, 1, -1])
y_pu = normalize_pu_labels(y_raw)  # array([1, 0, 0, 1, 0])

pulearn.pu_label_masks(y) returns (positive_mask, unlabeled_mask) for arrays already in canonical form.
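
For example, the masks make it easy to split a feature matrix X into its positive and unlabeled parts:

from pulearn import pu_label_masks

positive_mask, unlabeled_mask = pu_label_masks(y_pu)
X_pos = X[positive_mask]    # confirmed positives
X_unl = X[unlabeled_mask]   # mixture of hidden positives and negatives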


Common Pitfalls

The list below summarizes the most frequent mistakes new PU practitioners make, along with quick mitigations. Full details and warning-message explanations are in the Failure-Mode Playbook.

  • Using standard sklearn metrics on PU data. Symptom: inflated F1 / misleading AUC. Mitigation: switch to the corrected versions in pulearn.metrics (supply pi).
  • Treating unlabeled as confirmed negatives. Symptom: model learns a trivial boundary; recall collapses. Mitigation: use a PU classifier from pulearn, not a plain sklearn estimator.
  • Guessing pi without estimation. Symptom: corrected metrics silently biased. Mitigation: estimate pi with LabelFrequencyPriorEstimator plus a second method.
  • Assuming SCAR without checking. Symptom: metrics and propensity estimates biased. Mitigation: run scar_sanity_check(y_pu, s_proba=s_proba, X=X) before committing to SCAR.
  • Mixing label conventions (-1 vs 0). Symptom: ValueError or silent mislabeling. Mitigation: call normalize_pu_labels(y) at every data boundary.
  • Using standard StratifiedKFold in CV. Symptom: folds with zero labeled positives; unstable results. Mitigation: replace with PUStratifiedKFold or PUCrossValidator.
  • Interpreting raw output probabilities as P(y = 1 | x). Symptom: scores are shifted by c, not calibrated. Mitigation: use calibrate_posterior_p_y1 or clf.fit_calibrator(…) before thresholding.
  • Ignoring pi sensitivity. Symptom: conclusions depend on a single uncertain estimate. Mitigation: sweep with analyze_prior_sensitivity and report the range.
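
To illustrate the cross-validation pitfall, the sketch below assumes PUStratifiedKFold follows sklearn's splitter interface (an n_splits argument and a split(X, y) method); stratifying on the PU label then keeps labeled positives in every fold.

from pulearn import PUStratifiedKFold

cv = PUStratifiedKFold(n_splits=5)  # n_splits is assumed, sklearn-style
for train_idx, test_idx in cv.split(X, y_pu):
    # Every fold retains labeled positives, unlike a naive split.
    assert y_pu[test_idx].sum() > 0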

Quick Rule of Thumb

If you are unsure which assumption your data satisfies, always run scar_sanity_check first, estimate pi with at least two methods, and evaluate with ranking metrics (pu_roc_auc_score) before threshold-based metrics (pu_f1_score).


The PU Learning Pipeline

A typical pulearn workflow follows these steps:

Raw data
    │
    ▼
1. Label normalization          normalize_pu_labels(y)
    │
    ▼
2. Prior / propensity estimation   LabelFrequencyPriorEstimator, etc.
    │
    ▼
3. Optional SCAR check          scar_sanity_check(y_pu, s_proba=s_proba, X=X)
    │
    ▼
4. Train PU classifier          e.g. ElkanotoPuClassifier.fit(X, y_pu)
    │
    ▼
5. Evaluate with corrected metrics   pu_f1_score(y_pu, y_pred, pi=pi)
    │
    ▼
6. Optional calibration         calibrate_pu_classifier(clf, X_cal, y_cal)
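
Put together, a minimal end-to-end sketch might look as follows. ElkanotoPuClassifier wrapping a probabilistic scikit-learn estimator is pulearn's documented pattern; the surrounding calls reuse the signatures quoted in this guide, and the SCAR check and calibration steps are omitted for brevity.

from sklearn.svm import SVC
from pulearn import (
    ElkanotoPuClassifier,
    LabelFrequencyPriorEstimator,
    normalize_pu_labels,
    pu_f1_score,
)

# 1. Normalize raw labels to the canonical {1, 0} form.
y_pu = normalize_pu_labels(y_raw)

# 2. Estimate the class prior (lower-bound baseline shown here).
pi = LabelFrequencyPriorEstimator().estimate(X, y_pu).pi

# 4. Train a PU classifier around a probabilistic base estimator.
clf = ElkanotoPuClassifier(estimator=SVC(probability=True), hold_out_ratio=0.2)
clf.fit(X, y_pu)

# 5. Evaluate with a corrected metric.
print(pu_f1_score(y_pu, clf.predict(X), pi=pi))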

See the companion guides for details on each step.


References

  • Elkan & Noto (2008). Learning classifiers from only positive and unlabeled data. KDD 2008. https://cseweb.ucsd.edu/~elkan/posonly.pdf
  • Mordelet & Vert (2013). A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters.
  • Kiryo et al. (2017). Positive-unlabeled learning with non-negative risk estimator. NeurIPS 2017. https://arxiv.org/abs/1703.00593
  • du Plessis, Niu & Sugiyama (2014). Analysis of learning from positive and unlabeled data. NeurIPS 2014.