Module pulearn.risk
Risk-objective PU learning wrapper for sklearn estimators.
Implements uPU and nnPU risk-based training for any sklearn estimator
that supports predict_proba, with optional sample-weight passthrough
and sparse-matrix compatibility.
References
du Plessis, M. C., Niu, G., and Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. ICML 2015.
Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. (2017). Positive-Unlabeled Learning with Non-Negative Risk Estimator. NeurIPS 2017.
Classes
class PURiskClassifier(estimator, prior, objective='nnpu', n_iter=10, beta=0.0)
class PURiskClassifier(BasePUClassifier):
    """Generalized uPU / nnPU risk-objective wrapper for sklearn estimators.

    Wraps any sklearn-compatible probabilistic classifier and trains it using
    either the unbiased PU (uPU) or non-negative PU (nnPU) risk objective from
    the PU learning literature.

    Training proceeds in an iterative expectation–maximisation style:

    1. An initial model is fitted with class-prior-based sample weights
       (labeled positives weighted by ``prior``; unlabeled samples weighted
       by ``1``).
    2. At each subsequent iteration the model's predicted probabilities are
       used to recompute per-sample weights according to the chosen risk
       objective, and the model is re-fitted.

    This design is compatible with any sklearn estimator that implements
    ``predict_proba`` and ``fit``. When the base estimator's ``fit`` method
    accepts ``sample_weight``, weights are passed directly; otherwise a
    ``UserWarning`` is raised and training falls back to unweighted fitting
    (only the initial model is trained).

    Sparse feature matrices are passed through unchanged to the base
    estimator, so sparse-aware estimators (e.g. ``LogisticRegression``)
    benefit from sparse storage throughout.

    Parameters
    ----------
    estimator : sklearn estimator
        Base probabilistic classifier implementing ``fit(X, y)`` and
        ``predict_proba(X)``. If ``fit`` also accepts ``sample_weight``,
        the uPU/nnPU weights are forwarded on every training iteration.
    prior : float
        Prior probability of the positive class P(y=1). Must be strictly
        in ``(0, 1)``.
    objective : {'nnpu', 'upu'}, default 'nnpu'
        Risk objective to optimise.

        * ``'nnpu'`` - non-negative PU risk (Kiryo et al., 2017): applies
          an adaptive per-sample reweighting at every iteration using the
          current model's predictions. The nnPU correction prevents the
          estimated negative risk from going below ``-beta``.
        * ``'upu'`` - unbiased PU risk (du Plessis et al., 2015): trains
          the model once with fixed class-prior weights (labeled positives
          at ``prior``, unlabeled samples at ``1``). ``n_iter`` is ignored
          for ``'upu'``.
    n_iter : int, default 10
        Number of reweighting-and-refit iterations for ``'nnpu'``.
        Ignored when ``objective='upu'``.
    beta : float, default 0.0
        nnPU correction threshold. The nnPU correction branch is
        triggered when the estimated negative risk falls below ``-beta``.
        Larger values make the correction less frequent. Only used when
        ``objective='nnpu'``.

    Attributes
    ----------
    estimator_ : sklearn estimator
        The fitted base estimator (a clone of ``estimator``).
    classes_ : ndarray of shape (2,)
        Class labels ``[0, 1]``.
    n_iter_ : int
        Number of EM iterations actually performed.
    supports_sample_weight_ : bool
        ``True`` if the base estimator's ``fit`` method accepts
        ``sample_weight``.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.linear_model import LogisticRegression
    >>> from pulearn import PURiskClassifier
    >>> rng = np.random.RandomState(0)
    >>> X = rng.randn(200, 4)
    >>> y = np.where(X[:, 0] > 0, 1, 0)
    >>> clf = PURiskClassifier(
    ...     LogisticRegression(random_state=0),
    ...     prior=0.5,
    ...     n_iter=3,
    ... )
    >>> clf.fit(X, y)  # doctest: +ELLIPSIS
    PURiskClassifier(...)
    >>> clf.predict(X[:3])  # doctest: +ELLIPSIS
    array(...)

    """

    def __init__(
        self,
        estimator,
        prior,
        objective="nnpu",
        n_iter=10,
        beta=0.0,
    ):
        """Initialize the PURiskClassifier."""
        self.estimator = estimator
        self.prior = prior
        self.objective = objective
        self.n_iter = n_iter
        self.beta = beta

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    def _fit_with_weights(self, estimator, X, y, weights):
        """Fit *estimator* on (X, y), forwarding weights if supported."""
        if self.supports_sample_weight_:
            estimator.fit(X, y, sample_weight=weights)
        else:
            estimator.fit(X, y)

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    def fit(self, X, y, sample_weight=None):
        """Fit the PU risk classifier.

        Parameters
        ----------
        X : array-like or sparse matrix of shape (n_samples, n_features)
            Training data. Sparse matrices are supported when the base
            estimator supports them.
        y : array-like of shape (n_samples,)
            PU labels. Labeled positive examples must carry label ``1``.
            Unlabeled examples may be labeled ``0``, ``-1``, or ``False``
            and are normalized to ``0`` internally.
        sample_weight : array-like of shape (n_samples,) or None, \
                default None
            Optional external per-sample importance weights. When the
            base estimator supports ``sample_weight`` these are multiplied
            element-wise with the internal uPU/nnPU weights. Ignored when
            the base estimator does not support ``sample_weight``.

        Returns
        -------
        self : PURiskClassifier
            Fitted estimator.

        """
        # ---- validate objective ----------------------------------------
        if self.objective not in _OBJECTIVES:
            raise ValueError(
                "objective must be one of {}; got {!r}.".format(
                    _OBJECTIVES, self.objective
                )
            )
        if not (0.0 < self.prior < 1.0):
            raise ValueError(
                "prior must be in (0, 1); got {}.".format(self.prior)
            )
        if self.n_iter < 1:
            raise ValueError(
                "n_iter must be >= 1; got {}.".format(self.n_iter)
            )

        # ---- validate inputs -------------------------------------------
        y = validate_pu_fit_inputs(X, y, context="fit PURiskClassifier")
        y = self._normalize_pu_y(
            y, require_positive=True, require_unlabeled=True
        )

        # Validate external sample_weight if provided
        ext_w = None
        if sample_weight is not None:
            ext_w = np.asarray(sample_weight, dtype=float)
            if ext_w.shape != (len(y),):
                raise ValueError(
                    "sample_weight must have shape (n_samples,); "
                    "got {}.".format(ext_w.shape)
                )

        # ---- check base estimator capabilities -------------------------
        self.supports_sample_weight_ = has_fit_parameter(
            self.estimator, "sample_weight"
        )
        if not self.supports_sample_weight_:
            warnings.warn(
                "Base estimator {!r} does not accept sample_weight in "
                "fit(). PU risk weights will be ignored and training "
                "will use a single unweighted fit.".format(
                    type(self.estimator).__name__
                ),
                UserWarning,
                stacklevel=2,
            )

        # ---- clone the base estimator ----------------------------------
        self.estimator_ = clone(self.estimator)

        # ---- compute initial weights -----------------------------------
        init_weights = _compute_pu_risk_weights(
            y,
            self.prior,
            np.full(len(y), 0.5),  # neutral initial predictions
            objective=self.objective,
            beta=self.beta,
        )
        if ext_w is not None and self.supports_sample_weight_:
            init_weights = init_weights * ext_w

        # ---- initial fit -----------------------------------------------
        self._fit_with_weights(self.estimator_, X, y, init_weights)
        iters_done = 1

        # ---- iterative refinement (nnPU only, requires sample_weight) --
        # When the base estimator cannot accept sample_weight, the risk
        # weights cannot be applied, so only the initial fit is performed.
        if self.objective == "nnpu" and self.supports_sample_weight_:
            for _ in range(1, self.n_iter):
                proba = self._validate_predict_proba_output(
                    np.asarray(self.estimator_.predict_proba(X))
                )
                p_hat = proba[:, 1]
                weights = _compute_pu_risk_weights(
                    y,
                    self.prior,
                    p_hat,
                    objective=self.objective,
                    beta=self.beta,
                )
                if ext_w is not None:
                    weights = weights * ext_w
                self._fit_with_weights(self.estimator_, X, y, weights)
                iters_done += 1

        self.n_iter_ = iters_done
        self.classes_ = np.array([0, 1])
        return self

    def predict_proba(self, X):
        """Predict class probabilities.

        Parameters
        ----------
        X : array-like or sparse matrix of shape (n_samples, n_features)
            Input samples.

        Returns
        -------
        proba : ndarray of shape (n_samples, 2)
            Estimated class probabilities. Column 0 corresponds to the
            unlabeled/negative class; column 1 to the positive class.

        """
        check_is_fitted(self, "estimator_")
        proba = self.estimator_.predict_proba(X)
        return self._validate_predict_proba_output(np.asarray(proba))

    def predict(self, X, threshold=0.5):
        """Predict class labels.

        Parameters
        ----------
        X : array-like or sparse matrix of shape (n_samples, n_features)
            Input samples.
        threshold : float, default 0.5
            Decision threshold on the positive-class probability.

        Returns
        -------
        y : ndarray of shape (n_samples,)
            Predicted labels: ``1`` for positive, ``0`` for
            unlabeled/negative.

        """
        check_is_fitted(self, "estimator_")
        proba = self.predict_proba(X)
        return np.where(proba[:, 1] >= threshold, 1, 0)

Generalized uPU / nnPU risk-objective wrapper for sklearn estimators.
Wraps any sklearn-compatible probabilistic classifier and trains it using either the unbiased PU (uPU) or non-negative PU (nnPU) risk objective from the PU learning literature.
Training proceeds in an iterative expectation–maximisation style:
1. An initial model is fitted with class-prior-based sample weights (labeled positives weighted by ``prior``; unlabeled samples weighted by ``1``).
2. At each subsequent iteration the model's predicted probabilities are used to recompute per-sample weights according to the chosen risk objective, and the model is re-fitted.
This design is compatible with any sklearn estimator that implements ``predict_proba`` and ``fit``. When the base estimator's ``fit`` method accepts ``sample_weight``, weights are passed directly; otherwise a ``UserWarning`` is raised and training falls back to unweighted fitting (only the initial model is trained).
Sparse feature matrices are passed through unchanged to the base estimator, so sparse-aware estimators (e.g. ``LogisticRegression``) benefit from sparse storage throughout.
Parameters
estimator : sklearn estimator
    Base probabilistic classifier implementing ``fit(X, y)`` and ``predict_proba(X)``. If ``fit`` also accepts ``sample_weight``, the uPU/nnPU weights are forwarded on every training iteration.
prior : float
    Prior probability of the positive class P(y=1). Must be strictly in ``(0, 1)``.
objective : {'nnpu', 'upu'}, default 'nnpu'
    Risk objective to optimise.
    - ``'nnpu'`` - non-negative PU risk (Kiryo et al., 2017): applies an adaptive per-sample reweighting at every iteration using the current model's predictions. The nnPU correction prevents the estimated negative risk from going below ``-beta``.
    - ``'upu'`` - unbiased PU risk (du Plessis et al., 2015): trains the model once with fixed class-prior weights (labeled positives at ``prior``, unlabeled samples at ``1``). ``n_iter`` is ignored for ``'upu'``.
n_iter : int, default 10
    Number of reweighting-and-refit iterations for ``'nnpu'``. Ignored when ``objective='upu'``.
beta : float, default 0.0
    nnPU correction threshold. The nnPU correction branch is triggered when the estimated negative risk falls below ``-beta``. Larger values make the correction less frequent. Only used when ``objective='nnpu'``.
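To make the role of ``prior`` and ``beta`` concrete, the following is a minimal sketch of the empirical uPU and nnPU risks as formulated by Kiryo et al. (2017), using a sigmoid loss on raw decision scores. It is an illustration under stated assumptions, not this module's internal ``_compute_pu_risk_weights``; the helper names ``sigmoid_loss`` and ``pu_risks`` are hypothetical.

```python
import numpy as np


def sigmoid_loss(scores, target):
    """Sigmoid loss 1 / (1 + exp(target * score)) for target in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(target * scores))


def pu_risks(scores, y, prior, beta=0.0):
    """Empirical uPU and nnPU risks (hypothetical helper, for illustration).

    scores : raw decision scores, larger means "more positive"
    y      : PU labels, 1 = labeled positive, 0 = unlabeled
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    pos, unl = scores[y == 1], scores[y == 0]

    risk_p_pos = sigmoid_loss(pos, +1).mean()  # positives treated as positive
    risk_p_neg = sigmoid_loss(pos, -1).mean()  # positives treated as negative
    risk_u_neg = sigmoid_loss(unl, -1).mean()  # unlabeled treated as negative

    # Estimated negative-class risk; it can drop below zero when overfitting.
    negative_risk = risk_u_neg - prior * risk_p_neg

    return {
        "upu": prior * risk_p_pos + negative_risk,
        "nnpu": prior * risk_p_pos + max(0.0, negative_risk),
        "negative_risk": negative_risk,
        # The correction branch fires when the negative risk is below -beta.
        "corrected": bool(negative_risk < -beta),
    }


# An overconfident scorer: the unbiased risk goes negative, nnPU clamps it.
overfit = pu_risks([5.0, 5.0, -5.0, -5.0], [1, 1, 0, 0], prior=0.5)
print(overfit)
```

With these scores the negative-risk term is below zero, so the correction is triggered at ``beta=0.0`` but not at a larger threshold such as ``beta=1.0``.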
Attributes
estimator_ : sklearn estimator
    The fitted base estimator (a clone of ``estimator``).
classes_ : ndarray of shape (2,)
    Class labels ``[0, 1]``.
n_iter_ : int
    Number of EM iterations actually performed.
supports_sample_weight_ : bool
    ``True`` if the base estimator's ``fit`` method accepts ``sample_weight``.
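Whether a given base estimator will receive the risk weights can be checked before fitting. The class source calls ``has_fit_parameter`` on the base estimator; the sketch below applies the same scikit-learn utility to two stock estimators. The wrapper name ``accepts_sample_weight`` is hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.validation import has_fit_parameter


def accepts_sample_weight(estimator):
    """Return True if estimator.fit has a ``sample_weight`` parameter."""
    return has_fit_parameter(estimator, "sample_weight")


# LogisticRegression.fit takes sample_weight; KNeighborsClassifier.fit does not.
weighted = accepts_sample_weight(LogisticRegression())
unweighted = accepts_sample_weight(KNeighborsClassifier())
print(weighted, unweighted)
```

An estimator for which this check is false would, per the description above, be trained with a single unweighted fit.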
Examples
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from pulearn import PURiskClassifier
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(200, 4)
>>> y = np.where(X[:, 0] > 0, 1, 0)
>>> clf = PURiskClassifier(
...     LogisticRegression(random_state=0),
...     prior=0.5,
...     n_iter=3,
... )
>>> clf.fit(X, y)  # doctest: +ELLIPSIS
PURiskClassifier(...)
>>> clf.predict(X[:3])  # doctest: +ELLIPSIS
array(...)
Initialize the PURiskClassifier.
Ancestors
- BasePUClassifier
- sklearn.base.ClassifierMixin
- sklearn.base.BaseEstimator
- sklearn.utils._repr_html.base.ReprHTMLMixin
- sklearn.utils._repr_html.base._HTMLDocumentationLinkMixin
- sklearn.utils._metadata_requests._MetadataRequester
Methods
def fit(self, X, y, sample_weight=None)
Fit the PU risk classifier.
Parameters
X : array-like or sparse matrix of shape (n_samples, n_features)
    Training data. Sparse matrices are supported when the base estimator supports them.
y : array-like of shape (n_samples,)
    PU labels. Labeled positive examples must carry label ``1``. Unlabeled examples may be labeled ``0``, ``-1``, or ``False`` and are normalized to ``0`` internally.
sample_weight : array-like of shape (n_samples,) or None, default None
    Optional external per-sample importance weights. When the base estimator supports ``sample_weight`` these are multiplied element-wise with the internal uPU/nnPU weights. Ignored when the base estimator does not support ``sample_weight``.
Returns
self : PURiskClassifier
    Fitted estimator.
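The label convention and the weight combination described for ``fit`` can be sketched in NumPy. This is an illustration only: ``normalize_pu_labels`` and ``initial_weights`` are hypothetical helpers, and the weighting rule (labeled positives at ``prior``, unlabeled samples at ``1``) is taken from the class description rather than from the module's internal weight function.

```python
import numpy as np


def normalize_pu_labels(y):
    """Map PU labels to {0, 1}: 1 stays positive; 0, -1 and False become 0."""
    y = np.asarray(y)
    return np.where(y == 1, 1, 0)


def initial_weights(y01, prior, sample_weight=None):
    """Class-prior weights, optionally scaled by external sample weights."""
    y01 = np.asarray(y01)
    weights = np.where(y01 == 1, prior, 1.0)
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight, dtype=float)
        if sample_weight.shape != y01.shape:
            raise ValueError("sample_weight must have shape (n_samples,)")
        # External weights multiply the internal weights element-wise.
        weights = weights * sample_weight
    return weights


y01 = normalize_pu_labels([1, -1, 0, 1])
w = initial_weights(y01, prior=0.3, sample_weight=[2.0, 1.0, 0.5, 1.0])
print(y01, w)
```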
def predict(self, X, threshold=0.5)
Predict class labels.
Parameters
X : array-like or sparse matrix of shape (n_samples, n_features)
    Input samples.
threshold : float, default 0.5
    Decision threshold on the positive-class probability.
Returns
y : ndarray of shape (n_samples,)
    Predicted labels: ``1`` for positive, ``0`` for unlabeled/negative.
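The thresholding rule is a single comparison on the positive-class column, as in the class source. The standalone sketch below reproduces that comparison on a hand-made probability array; the function name ``labels_from_proba`` is hypothetical. Note that the comparison is inclusive, so a probability exactly equal to ``threshold`` is labeled ``1``.

```python
import numpy as np


def labels_from_proba(proba, threshold=0.5):
    """Label a row 1 when its positive-class probability is >= threshold."""
    proba = np.asarray(proba, dtype=float)
    return np.where(proba[:, 1] >= threshold, 1, 0)


proba = np.array([[0.5, 0.5], [0.7, 0.3], [0.1, 0.9]])
default_labels = labels_from_proba(proba)                # threshold 0.5
strict_labels = labels_from_proba(proba, threshold=0.9)  # stricter cut-off
print(default_labels, strict_labels)
```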
def predict_proba(self, X)
Predict class probabilities.
Parameters
X : array-like or sparse matrix of shape (n_samples, n_features)
    Input samples.
Returns
proba : ndarray of shape (n_samples, 2)
    Estimated class probabilities. Column 0 corresponds to the unlabeled/negative class; column 1 to the positive class.
def set_fit_request(self: PURiskClassifier,
                    *,
                    sample_weight: bool | str | None = '$UNCHANGED$') -> PURiskClassifier
def func(*args, **kw):
    """Updates the `_metadata_request` attribute of the consumer (`instance`)
    for the parameters provided as `**kw`.

    This docstring is overwritten below.
    See REQUESTER_DOC for expected functionality.
    """
    if not _routing_enabled():
        raise RuntimeError(
            "This method is only available when metadata routing is enabled."
            " You can enable it using"
            " sklearn.set_config(enable_metadata_routing=True)."
        )

    if self.validate_keys and (set(kw) - set(self.keys)):
        raise TypeError(
            f"Unexpected args: {set(kw) - set(self.keys)} in {self.name}. "
            f"Accepted arguments are: {set(self.keys)}"
        )

    # This makes it possible to use the decorated method as an unbound method,
    # for instance when monkeypatching.
    # https://github.com/scikit-learn/scikit-learn/issues/28632
    if instance is None:
        _instance = args[0]
        args = args[1:]
    else:
        _instance = instance

    # Replicating python's behavior when positional args are given other than
    # `self`, and `self` is only allowed if this method is unbound.
    if args:
        raise TypeError(
            f"set_{self.name}_request() takes 0 positional argument but"
            f" {len(args)} were given"
        )

    requests = _instance._get_metadata_request()
    method_metadata_request = getattr(requests, self.name)

    for prop, alias in kw.items():
        if alias is not UNCHANGED:
            method_metadata_request.add_request(param=prop, alias=alias)
    _instance._metadata_request = requests

    return _instance

Configure whether metadata should be requested to be passed to the ``fit`` method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a :term:`meta-estimator` and metadata routing is enabled with ``enable_metadata_routing=True`` (see :func:`sklearn.set_config`). Please check the :ref:`User Guide <metadata_routing>` on how the routing mechanism works.
The options for each parameter are:
- ``True``: metadata is requested, and passed to ``fit`` if provided. The request is ignored if metadata is not provided.
- ``False``: metadata is not requested and the meta-estimator will not pass it to ``fit``.
- ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version: 1.3
Parameters
sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
    Metadata routing for ``sample_weight`` parameter in ``fit``.
Returns
self : object
    The updated object.
def set_predict_request(self: PURiskClassifier,
                        *,
                        threshold: bool | str | None = '$UNCHANGED$') -> PURiskClassifier
Configure whether metadata should be requested to be passed to the ``predict`` method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a :term:`meta-estimator` and metadata routing is enabled with ``enable_metadata_routing=True`` (see :func:`sklearn.set_config`). Please check the :ref:`User Guide <metadata_routing>` on how the routing mechanism works.
The options for each parameter are:
- ``True``: metadata is requested, and passed to ``predict`` if provided. The request is ignored if metadata is not provided.
- ``False``: metadata is not requested and the meta-estimator will not pass it to ``predict``.
- ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version: 1.3
Parameters
threshold : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
    Metadata routing for ``threshold`` parameter in ``predict``.
Returns
self : object
    The updated object.
Inherited members