hmm.classification module

class hmm.classification.Classifier(num_features, cat_features, clf=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False))[source]

Bases: object

A simple classification pipeline wrapping the sklearn library (a minimal usage sketch follows the parameter list below). The pipeline:

  • transforms (imputes, encodes/scales) categorical and numerical features

  • fits a classifier

  • computes accuracy scores for the classifier

Parameters
  • num_features – a list of DataFrame column names (keys) for the numerical features

  • cat_features – a list of DataFrame column names (keys) for the categorical features

  • clf – a classification (discriminative) model
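A minimal usage sketch (the DataFrame, column names, and labels below are illustrative and not part of the library; the later method examples reuse this toy setup):

    import pandas as pd
    from hmm.classification import Classifier

    # toy dataset: one numerical and one categorical column (names are illustrative)
    X = pd.DataFrame({
        "age":    [23, 41, 35, 52, 29, 47],
        "colour": ["red", "blue", "blue", "red", "red", "blue"],
    })
    y = [0, 1, 0, 1, 0, 1]

    # the default clf is the RandomForestClassifier shown in the signature above;
    # another sklearn-style estimator can presumably be passed via clf=...
    classifier = Classifier(num_features=["age"], cat_features=["colour"])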

cross_val(X, y, cv=5, verbose=True)[source]

Cross-validate the pipeline.

Parameters
  • X – a dataset

  • y – the ground-truth labels

  • cv – the number of folds to use for cross-validation

  • verbose – whether to print the cross-validation accuracy

Returns

the cross-validation score object (sklearn)
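Continuing the toy setup from the class-level sketch above; the return value is assumed here to behave like the array of per-fold scores from sklearn.model_selection.cross_val_score, which the "score object" mentioned above most likely refers to:

    # 3-fold cross-validation on the labeled toy data (too few rows for the default cv=5)
    scores = classifier.cross_val(X, y, cv=3, verbose=False)
    print(scores)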

fit(X, y)[source]

Fit the pipeline on a labeled dataset.

Parameters
  • X – the data

  • y – the ground-truth labels

Returns

the fitted pipeline
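Continuing the toy setup, fitting is a single call; since the fitted pipeline is returned, it can also be assigned directly:

    # fit the preprocessing + classification pipeline on the labeled toy data
    # (whether the return value is the wrapper or the underlying sklearn pipeline
    # is not specified above)
    fitted = classifier.fit(X, y)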

get_clf(model)[source]

Construct the pipeline: a feature preprocessor followed by a classification model.

Parameters

model – a classification (discriminative) model

Returns

an sklearn Pipeline
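The documentation above only states that the pipeline imputes, encodes, and scales features before the classifier; the sketch below shows one plausible sklearn construction along those lines. The specific transformers (SimpleImputer, StandardScaler, OneHotEncoder) are assumptions, not necessarily what hmm uses:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    num_features = ["age"]      # illustrative column names
    cat_features = ["colour"]

    # numerical branch: impute missing values, then scale
    num_transform = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # categorical branch: impute missing values, then one-hot encode
    cat_transform = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("num", num_transform, num_features),
        ("cat", cat_transform, cat_features),
    ])
    # feature preprocessor followed by the classification model, as get_clf describes
    pipeline = Pipeline([
        ("preprocess", preprocessor),
        ("model", RandomForestClassifier(n_estimators=100)),
    ])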

score(X, y, verbose=True)[source]

Score the pipeline for accuracy on a test set.

Parameters
  • X – the test data

  • y – ground-truth labels for the test data

  • verbose – whether to print the test accuracy

Returns

accuracy on the test set
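Continuing the toy setup, a typical pattern is to fit on one split and score on held-out data; sklearn's train_test_split is used here only for the illustration (see also train_test_val_dev_split below):

    from sklearn.model_selection import train_test_split

    # hold out a third of the toy data, fit on the rest, and report test accuracy
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=0)
    classifier.fit(X_train, y_train)
    accuracy = classifier.score(X_test, y_test, verbose=True)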

hmm.classification.train_test_val_dev_split(X, y)[source]

Split the dataset into four partitions: training (64%), testing (16%), validation (16%), and development (4%).

  • Training is for fitting the model.

  • Testing is for evaluating the fitted model and tuning its parameters.

  • Validation is for a final evaluation after parameter tuning.

  • Development is for examining individual rows and running unit tests.

One way to produce these proportions is sketched after the parameter list below.

Parameters
  • X – the dataset

  • y – the ground-truth labels

Returns

four partitions of the dataset
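The exact return signature is not spelled out above (it is documented only as four partitions), but the 64/16/16/4 proportions can be reproduced with nested sklearn splits; the helper below is an illustrative sketch, not the hmm implementation:

    from sklearn.model_selection import train_test_split

    def train_test_val_dev_split_sketch(X, y, random_state=0):
        """Illustrative 64/16/16/4 split built from nested train_test_split calls."""
        # 80% -> training + testing pool, 20% -> validation + development pool
        X_pool, X_rest, y_pool, y_rest = train_test_split(
            X, y, test_size=0.2, random_state=random_state)
        # 0.8 * 0.8 = 64% training, 0.8 * 0.2 = 16% testing
        X_train, X_test, y_train, y_test = train_test_split(
            X_pool, y_pool, test_size=0.2, random_state=random_state)
        # 0.2 * 0.8 = 16% validation, 0.2 * 0.2 = 4% development
        X_val, X_dev, y_val, y_dev = train_test_split(
            X_rest, y_rest, test_size=0.2, random_state=random_state)
        return (X_train, y_train), (X_test, y_test), (X_val, y_val), (X_dev, y_dev)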