hmm.labeling.models module

class hmm.labeling.models.Labeler(lfs=[], model=LabelModel())[source]

Bases: object

Wrapper for the Snorkel label model. Supports:

  • addition/change of labeling functions

  • label aggregation

  • model fitting

  • model evaluation: scoring and bucket analysis

  • filtering NAs

Parameters
  • lfs – list of labeling functions (heuristic functions)

  • model – the model to use; by default, Snorkel’s generative label model
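
A minimal construction sketch; the two labeling functions below are hypothetical stand-ins for real heuristics:

    from snorkel.labeling import labeling_function

    from hmm.labeling.models import Labeler

    # Hypothetical heuristics for a binary task: return 1 (positive),
    # 0 (negative), or -1 (abstain).
    @labeling_function()
    def lf_keyword(x):
        return 1 if "urgent" in x.text.lower() else -1

    @labeling_function()
    def lf_short_text(x):
        return 0 if len(x.text) < 20 else -1

    # By default the wrapper aggregates with Snorkel's generative LabelModel.
    labeler = Labeler(lfs=[lf_keyword, lf_short_text])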

add_lfs(lfs)[source]

Add labeling functions to the model.

Parameters

lfs – list of labeling functions to add
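
For example, to register an additional (hypothetical) heuristic after construction:

    @labeling_function()
    def lf_exclamation(x):
        # Another toy heuristic: exclamation marks suggest the positive class.
        return 1 if "!" in x.text else -1

    labeler.add_lfs([lf_exclamation])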

filter_probs(X, L)[source]

Filter unlabeled rows (where all the labeling functions abstain) from the dataset.

Parameters
  • X – the dataset

  • L – an n x l matrix of candidate labels, where n is the size of the dataset and l is the number of labeling functions

Returns

the dataset with any unlabeled tuples removed
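
A usage sketch, assuming df_train is a dataframe and L_train is the candidate-label matrix obtained by applying the labeling functions to it:

    # Rows on which every labeling function abstained carry no signal;
    # drop them before training a downstream model.
    df_train_filtered = labeler.filter_probs(df_train, L_train)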

fit(L_train, Y_dev=None, fit_params={})[source]

Fit the generative label model on a set of candidate labels. No ground-truth labels are required for fitting, but they can be included to help the automatically generated label distribution match the ground-truth label distribution. Fitting itself uses only the candidate labels in the training set L_train.

Parameters
  • L_train – an n x l matrix of candidate labels, where n is the size of the training dataset and l is the number of labeling functions

  • Y_dev – a held-out set of ground-truth labels

  • fit_params – optional set of fitting parameters; see the Snorkel docs for all options

Returns

the fitted label model
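
A fitting sketch, assuming y_dev holds ground-truth dev labels; the fit_params shown are standard Snorkel LabelModel.fit options and are assumed to be forwarded unchanged:

    # Y_dev is optional: fitting uses only L_train, but ground-truth dev
    # labels help match the learned label distribution to the true one.
    label_model = labeler.fit(
        L_train,
        Y_dev=y_dev,
        fit_params={"n_epochs": 500, "log_freq": 100, "seed": 123},
    )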

get_confusion_matrix(L_dev, y_dev)[source]

Compute the confusion matrix for the final labels on a held-out development set.

Parameters
  • L_dev – an n x l matrix of candidate labels, where n is the size of the dev dataset and l is the number of labeling functions

  • y_dev – ground truth labels for the dev set

Returns

the confusion matrix as a pandas crosstab
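
A usage sketch, assuming L_dev and y_dev are the candidate labels and ground truth for a dev split:

    # A pandas crosstab of ground-truth labels against aggregated labels.
    cm = labeler.get_confusion_matrix(L_dev, y_dev)
    print(cm)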

get_label_buckets(L_dev, y_dev)[source]

Fetch buckets of labels (e.g. false positives and false negatives).

Parameters
  • L_dev – an n x l matrix of candidate labels, where n is the size of the dev dataset and l is the number of labeling functions

  • y_dev – ground truth labels for the dev set

Returns

a set of label buckets; see the Moral Machine example for some analyses with label buckets
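
A sketch of bucket analysis over a dev dataframe df_dev; keying buckets by (true label, predicted label) pairs follows Snorkel's get_label_buckets convention and is an assumption here:

    buckets = labeler.get_label_buckets(L_dev, y_dev)

    # Hypothetical inspection of false negatives: tuples whose ground
    # truth is 1 but whose aggregated label is 0.
    false_negatives = df_dev.iloc[buckets[(1, 0)]]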

get_preds(L, threshold=0.5)[source]

Produce rounded (discrete) labels from a set of candidate labels for a dataset.

Parameters
  • L – an n x l matrix of candidate labels, where n is the size of the dataset and l is the number of labeling functions

  • threshold – threshold for rounding posterior probabilities to discrete labels

Returns

the rounded labels
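
A thresholding sketch; the rounding direction (posteriors at or above the threshold map to the positive class) is an assumption:

    # With threshold=0.5, posteriors >= 0.5 become 1 and the rest 0.
    preds = labeler.get_preds(L_train, threshold=0.5)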

label(data, verbose=True)[source]

Aggregate candidate labels into a single label for each tuple in the dataframes in data.

Parameters
  • data – a set of dataframes, each containing a set of tuples to label

  • verbose – whether to periodically print labeling status

Returns

a set of labels for each dataframe in data
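
A usage sketch over two splits, assuming data accepts an ordinary list of dataframes and labels come back in input order:

    # One set of aggregated labels is returned per input dataframe.
    train_labels, dev_labels = labeler.label([df_train, df_dev])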

static probs_to_preds(probs)[source]

Convert posterior label probabilities into discrete label predictions.

Parameters

probs – an n x k matrix of posterior probabilities, where n is the size of the dataset and k is the number of classes

Returns

the discrete predictions

static score(model, L_valid, y_val, verbose=True)[source]

Validate the label model on a held-out validation set.

Parameters
  • model – a label aggregation model

  • L_valid – an n x l matrix of candidate labels, where n is the size of the held-out validation set and l is the number of labeling functions

  • y_val – ground-truth labels for the held-out validation set

  • verbose – whether to periodically print scoring status

Returns

the evaluation score(s) of the label model on the validation set
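
A hedged scoring sketch, reusing the fitted model returned by fit and assuming L_valid and y_val are the candidate labels and ground truth for a held-out split:

    # score is static, so any compatible label aggregation model works.
    metrics = Labeler.score(label_model, L_valid, y_val, verbose=True)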

set_lfs(lfs)[source]

Set the list of labeling functions.

Parameters

lfs – labeling functions for the model

update_applier()[source]

Update the labeling-function applier with the current set of labeling functions. The applier is responsible for generating candidate labels for each tuple in the dataset, one set of candidate labels per labeling function.
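
A sketch of refreshing the applier after swapping out the labeling-function set; whether set_lfs refreshes the applier on its own is not documented here, so the sketch calls update_applier explicitly:

    # Replace the LF set, then rebuild the applier so future candidate
    # labels are generated by the current heuristics.
    labeler.set_lfs([lf_keyword, lf_short_text, lf_exclamation])
    labeler.update_applier()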