hmm.classification module

class hmm.classification.Classifier(num_features, cat_features, clf=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False))[source]

Bases: object

A simple classification pipeline wrapping the sklearn library (a minimal usage sketch follows the parameter list below). The pipeline:

  • transforms (imputes, encodes/scales) categorical and numerical features

  • fits a classifier

  • computes accuracy scores for the classifier

Parameters
  • num_features – a list of DataFrame column names (keys) for the numerical features

  • cat_features – a list of DataFrame column names (keys) for the categorical features

  • clf – a classification (discriminative) model
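A minimal usage sketch (the DataFrame, column names, and labels below are illustrative and not part of the library; the later method examples reuse this toy setup):

    import pandas as pd
    from hmm.classification import Classifier

    # toy dataset: one numerical and one categorical column (names are illustrative)
    X = pd.DataFrame({
        "age":    [23, 41, 35, 52, 29, 47],
        "colour": ["red", "blue", "blue", "red", "red", "blue"],
    })
    y = [0, 1, 0, 1, 0, 1]

    # the default clf is the RandomForestClassifier shown in the signature above;
    # another sklearn-style estimator can presumably be passed via clf=...
    classifier = Classifier(num_features=["age"], cat_features=["colour"])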

cross_val(X, y, cv=5, verbose=True)[source]

Cross-validate the pipeline.

Parameters
  • X – a dataset

  • y – the ground-truth labels

  • cv – the number of folds to use for cross-validation

  • verbose – whether to print the cross-validation accuracy

Returns

the cross-validation score object (sklearn)
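Continuing the toy setup from the class-level sketch above; the return value is assumed here to behave like the array of per-fold scores from sklearn.model_selection.cross_val_score, which the "score object" mentioned above most likely refers to:

    # 3-fold cross-validation on the labeled toy data (too few rows for the default cv=5)
    scores = classifier.cross_val(X, y, cv=3, verbose=False)
    print(scores)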

fit(X, y)[source]

Fit the pipeline on a labeled dataset.

Parameters
  • X – the data

  • y – the ground-truth labels

Returns

the fitted pipeline
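Continuing the toy setup, fitting is a single call; since the fitted pipeline is returned, it can also be assigned directly:

    # fit the preprocessing + classification pipeline on the labeled toy data
    # (whether the return value is the wrapper or the underlying sklearn pipeline
    # is not specified above)
    fitted = classifier.fit(X, y)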

get_clf(model)[source]

Construct the pipeline: a feature preprocessor followed by a classification model.

Parameters

model – a classification (discriminative) model

Returns

an sklearn Pipeline
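The documentation above only states that the pipeline imputes, encodes, and scales features before the classifier; the sketch below shows one plausible sklearn construction along those lines. The specific transformers (SimpleImputer, StandardScaler, OneHotEncoder) are assumptions, not necessarily what hmm uses:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    num_features = ["age"]      # illustrative column names
    cat_features = ["colour"]

    # numerical branch: impute missing values, then scale
    num_transform = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # categorical branch: impute missing values, then one-hot encode
    cat_transform = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("num", num_transform, num_features),
        ("cat", cat_transform, cat_features),
    ])
    # feature preprocessor followed by the classification model, as get_clf describes
    pipeline = Pipeline([
        ("preprocess", preprocessor),
        ("model", RandomForestClassifier(n_estimators=100)),
    ])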

score(X, y, verbose=True)[source]

Score the pipeline for accuracy on a test set.

Parameters
  • X – the test data

  • y – ground-truth labels for the test data

  • verbose – whether to print the test accuracy

Returns

accuracy on the test set
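Continuing the toy setup, a typical pattern is to fit on one split and score on held-out data; sklearn's train_test_split is used here only for the illustration (see also train_test_val_dev_split below):

    from sklearn.model_selection import train_test_split

    # hold out a third of the toy data, fit on the rest, and report test accuracy
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=0)
    classifier.fit(X_train, y_train)
    accuracy = classifier.score(X_test, y_test, verbose=True)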

hmm.classification.train_test_val_dev_split(X, y)[source]

Split the dataset into four partitions: training (64%), testing (16%), validation (16%), and development (4%).

  • Training is for fitting the model.

  • Testing is for evaluating the fitted model and tuning its parameters.

  • Validation is for a final evaluation after parameter tuning.

  • Development is for examining individual rows and running unit tests.

One way to produce these proportions is sketched after the parameter list below.

Parameters
  • X – the dataset

  • y – the ground-truth labels

Returns

four partitions of the dataset
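The exact return signature is not spelled out above (it is documented only as four partitions), but the 64/16/16/4 proportions can be reproduced with nested sklearn splits; the helper below is an illustrative sketch, not the hmm implementation:

    from sklearn.model_selection import train_test_split

    def train_test_val_dev_split_sketch(X, y, random_state=0):
        """Illustrative 64/16/16/4 split built from nested train_test_split calls."""
        # 80% -> training + testing pool, 20% -> validation + development pool
        X_pool, X_rest, y_pool, y_rest = train_test_split(
            X, y, test_size=0.2, random_state=random_state)
        # 0.8 * 0.8 = 64% training, 0.8 * 0.2 = 16% testing
        X_train, X_test, y_train, y_test = train_test_split(
            X_pool, y_pool, test_size=0.2, random_state=random_state)
        # 0.2 * 0.8 = 16% validation, 0.2 * 0.2 = 4% development
        X_val, X_dev, y_val, y_dev = train_test_split(
            X_rest, y_rest, test_size=0.2, random_state=random_state)
        return (X_train, y_train), (X_test, y_test), (X_val, y_val), (X_dev, y_dev)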