hmm.labeling.models module

class hmm.labeling.models.Labeler(lfs=[], model=LabelModel())[source]

Bases: object

Wrapper for the Snorkel label model. Supports:

  • addition/change of labeling functions

  • label aggregation

  • model fitting

  • model evaluation: scoring and bucket analysis

  • filtering NAs

Parameters
  • lfs – list of labeling functions (heuristic functions)

  • model – the model to use; by default, Snorkel’s generative label model
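
A minimal construction sketch; the two labeling functions below are hypothetical stand-ins for real heuristics:

    from snorkel.labeling import labeling_function

    from hmm.labeling.models import Labeler

    # Hypothetical heuristics for a binary task: return 1 (positive),
    # 0 (negative), or -1 (abstain).
    @labeling_function()
    def lf_keyword(x):
        return 1 if "urgent" in x.text.lower() else -1

    @labeling_function()
    def lf_short_text(x):
        return 0 if len(x.text) < 20 else -1

    # By default the wrapper aggregates with Snorkel's generative LabelModel.
    labeler = Labeler(lfs=[lf_keyword, lf_short_text])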

add_lfs(lfs)[source]

Add labeling functions to the model.

Parameters

lfs – list of labeling functions to add
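
For example, to register an additional (hypothetical) heuristic after construction:

    @labeling_function()
    def lf_exclamation(x):
        # Another toy heuristic: exclamation marks suggest the positive class.
        return 1 if "!" in x.text else -1

    labeler.add_lfs([lf_exclamation])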

filter_probs(X, L)[source]

Filter unlabeled rows (where all the labeling functions abstain) from the dataset.

Parameters
  • X – the dataset

  • L – an n x l matrix of candidate labels, where n is the size of the dataset and l is the number of labeling functions

Returns

the dataset with any unlabeled tuples removed
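
A usage sketch, assuming df_train is a dataframe and L_train is the candidate-label matrix obtained by applying the labeling functions to it:

    # Rows on which every labeling function abstained carry no signal;
    # drop them before training a downstream model.
    df_train_filtered = labeler.filter_probs(df_train, L_train)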

fit(L_train, Y_dev=None, fit_params={})[source]

Fit the generative label model on a set of candidate labels. No ground-truth labels are required for fitting, but they can be included to help the automatically generated label distribution match the ground-truth label distribution. Fitting itself uses only the candidate labels in the training set L_train.

Parameters
  • L_train – an n x l matrix of candidate labels, where n is the size of the training dataset and l is the number of labeling functions

  • Y_dev – a held-out set of ground-truth labels

  • fit_params – optional set of fitting parameters; see the Snorkel docs for all options

Returns

the fitted label model
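
A fitting sketch, assuming y_dev holds ground-truth dev labels; the fit_params shown are standard Snorkel LabelModel.fit options and are assumed to be forwarded unchanged:

    # Y_dev is optional: fitting uses only L_train, but ground-truth dev
    # labels help match the learned label distribution to the true one.
    label_model = labeler.fit(
        L_train,
        Y_dev=y_dev,
        fit_params={"n_epochs": 500, "log_freq": 100, "seed": 123},
    )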

get_confusion_matrix(L_dev, y_dev)[source]

Compute the confusion matrix for the final labels on a held-out development set.

Parameters
  • L_dev – an n x l matrix of candidate labels, where n is the size of the dev dataset and l is the number of labeling functions

  • y_dev – ground truth labels for the dev set

Returns

the confusion matrix as a pandas crosstab
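
A usage sketch, assuming L_dev and y_dev are the candidate labels and ground truth for a dev split:

    # A pandas crosstab of ground-truth labels against aggregated labels.
    cm = labeler.get_confusion_matrix(L_dev, y_dev)
    print(cm)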

get_label_buckets(L_dev, y_dev)[source]

Fetch buckets of labels (e.g. false positives and false negatives).

Parameters
  • L_dev – an n x l matrix of candidate labels, where n is the size of the dev dataset and l is the number of labeling functions

  • y_dev – ground truth labels for the dev set

Returns

a set of label buckets; see the Moral Machine example for some analyses with label buckets
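
A sketch of bucket analysis over a dev dataframe df_dev; keying buckets by (true label, predicted label) pairs follows Snorkel's get_label_buckets convention and is an assumption here:

    buckets = labeler.get_label_buckets(L_dev, y_dev)

    # Hypothetical inspection of false negatives: tuples whose ground
    # truth is 1 but whose aggregated label is 0.
    false_negatives = df_dev.iloc[buckets[(1, 0)]]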

get_preds(L, threshold=0.5)[source]

Produce rounded (discrete) labels from a set of candidate labels for a dataset.

Parameters
  • L – an n x l matrix of candidate labels, where n is the size of the dataset and l is the number of labeling functions

  • threshold – threshold for rounding posterior probabilities to discrete labels

Returns

the rounded labels
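
A thresholding sketch; the rounding direction (posteriors at or above the threshold map to the positive class) is an assumption:

    # With threshold=0.5, posteriors >= 0.5 become 1 and the rest 0.
    preds = labeler.get_preds(L_train, threshold=0.5)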

label(data, verbose=True)[source]

Aggregate candidate labels into a single label for each tuple in the dataframes in data.

Parameters
  • data – a set of dataframes, each containing a set of tuples to label

  • verbose – whether to periodically print labeling status

Returns

a set of labels for each dataframe in data
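
A usage sketch over two splits, assuming data accepts an ordinary list of dataframes and labels come back in input order:

    # One set of aggregated labels is returned per input dataframe.
    train_labels, dev_labels = labeler.label([df_train, df_dev])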

static probs_to_preds(probs)[source]

Convert posterior label probabilities into discrete label predictions.

Parameters

probs – an n x k matrix of posterior probabilities, where n is the size of the dataset and k is the number of classes

Returns

the discrete predictions

static score(model, L_valid, y_val, verbose=True)[source]

Validate the label model on a held-out validation set.

Parameters
  • model – a label aggregation model

  • L_valid – an n x l matrix of candidate labels, where n is the size of the held-out validation set and l is the number of labeling functions

  • y_val – ground-truth labels for the held-out validation set

  • verbose – whether to periodically print scoring status

Returns

the evaluation score(s) of the label model on the validation set
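
A hedged scoring sketch, reusing the fitted model returned by fit and assuming L_valid and y_val are the candidate labels and ground truth for a held-out split:

    # score is static, so any compatible label aggregation model works.
    metrics = Labeler.score(label_model, L_valid, y_val, verbose=True)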

set_lfs(lfs)[source]

Set the list of labeling functions.

Parameters

lfs – labeling functions for the model

update_applier()[source]

Update the labeling-function applier with the current set of labeling functions. The applier is responsible for generating candidate labels for each tuple in the dataset, one set of candidate labels per labeling function.
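
A sketch of refreshing the applier after swapping out the labeling-function set; whether set_lfs refreshes the applier on its own is not documented here, so the sketch calls update_applier explicitly:

    # Replace the LF set, then rebuild the applier so future candidate
    # labels are generated by the current heuristics.
    labeler.set_lfs([lf_keyword, lf_short_text, lf_exclamation])
    labeler.update_applier()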