TF-Hub is a platform to share machine learning expertise packaged in reusable resources, notably pre-trained modules. This tutorial is organized into two main parts.
Introduction: Training a text classifier with TF-Hub
We will use a TF-Hub text embedding module to train a simple sentiment classifier with a reasonable baseline accuracy. We will then analyze the predictions to make sure our model is reasonable and propose improvements to increase the accuracy.
Advanced: Transfer learning analysis
In this section, we will use various TF-Hub modules to compare their effect on the accuracy of the estimator and demonstrate advantages and pitfalls of transfer learning.
# Install the latest Tensorflow version.
!pip install --quiet "tensorflow>=1.7"
# Install TF-Hub.
!pip install -q tensorflow-hub
More detailed information about installing Tensorflow can be found at
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
We will try to solve the Large Movie Review Dataset v1.0 task from Maas et al. The dataset consists of IMDB movie reviews labeled by positivity from 1 to 10. The task is to label each review as negative or positive.
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      # The rating is encoded in the file name after the underscore.
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz",
      origin="",
      extract=True)

  train_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                      "aclImdb", "test"))

  return train_df, test_df

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

train_df, test_df = download_and_load_datasets()
train_df.head()
| sentence | sentiment | polarity |
0 | Next to "Star Wars" and "The Wizard of Oz," th... | 10 | 1 |
1 | I can't help but laugh at the people who prais... | 1 | 0 |
2 | Based on a true story, this series is a gem wi... | 10 | 1 |
3 | Van Dien must cringe with embarrassment at the... | 1 | 0 |
4 | This film had such promise!! What a great idea... | 4 | 0 |
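The sentiment column above is read from each review's file name. As a minimal sketch of that step, assuming file names of the form `<id>_<rating>.txt` (the file name `123_7.txt` and the helper `parse_rating` are hypothetical, and the rating is converted to an integer here, whereas the loader above keeps it as a string):

```python
import re

# Pattern for file names of the form "<id>_<rating>.txt" (assumed layout).
FILENAME_PATTERN = re.compile(r"\d+_(\d+)\.txt")


def parse_rating(file_name):
    """Return the integer rating encoded in a review file name."""
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError("Unexpected file name: {}".format(file_name))
    return int(match.group(1))


# Hypothetical file name used only for illustration.
print(parse_rating("123_7.txt"))  # 7
```

Note that the polarity column is not derived from this rating. It comes from the directory ("pos" or "neg") that the file was loaded from.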
The Estimator framework provides input functions that wrap Pandas DataFrames.
# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)

# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], shuffle=False)

# Prediction on the test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(
    test_df, test_df["polarity"], shuffle=False)
TF-Hub provides a feature column that applies a module to the given text feature and passes the module's outputs on to the model. In this tutorial we will be using the nnlm-en-dim128 module. For the purpose of this tutorial, the most important facts are:
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="")
For classification we can use a DNN Classifier (note further remarks about different modelling of the label function at the end of the tutorial).
estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))
Train the estimator for a reasonable number of steps.
# Training for 1,000 steps means 128,000 training examples with the default
# batch size. This is roughly equivalent to 5 epochs since the training dataset
# contains 25,000 examples.
estimator.train(input_fn=train_input_fn, steps=1000);
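The arithmetic in the comment above can be checked directly. This is a small sketch that takes the batch size of 128 and the 25,000 training examples from that comment as given:

```python
# Values taken from the comment in the training cell above.
steps = 1000
batch_size = 128       # default batch size, as stated in the comment
num_examples = 25000   # size of the training set

examples_seen = steps * batch_size
epochs = examples_seen / float(num_examples)

print(examples_seen)  # 128000
print(epochs)         # 5.12
```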
Run predictions for both the training and test sets.
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

# print() with a single argument works under both Python 2 and Python 3.
print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))
Training set accuracy: 0.801320016384
Test set accuracy: 0.793600022793
We can visually check the confusion matrix to understand the distribution of misclassifications.
def get_predictions(estimator, input_fn):
  return [x["class_ids"][0] for x in estimator.predict(input_fn=input_fn)]

LABELS = [
    "negative", "positive"
]

# Create a confusion matrix on training data.
with tf.Graph().as_default():
  cm = tf.confusion_matrix(train_df["polarity"],
                           get_predictions(estimator, predict_train_input_fn))
  with tf.Session() as session:
    # Evaluate the confusion matrix tensor to get a NumPy array.
    cm_out = session.run(cm)

# Normalize the confusion matrix so that each row sums to 1.
cm_out = cm_out.astype(float) / cm_out.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_out, annot=True, xticklabels=LABELS, yticklabels=LABELS);
plt.xlabel("Predicted");
plt.ylabel("True");
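The row normalization used above can be tried in isolation. The sketch below uses NumPy only, and the counts are made up for illustration; they are not results from this tutorial:

```python
import numpy as np

# Hypothetical raw counts: rows are true labels, columns are predicted labels.
counts = np.array([[80, 20],
                   [30, 70]])

# Sum each row and keep a column shape so the division broadcasts per row.
row_totals = counts.sum(axis=1)[:, np.newaxis]
normalized = counts.astype(float) / row_totals

print(normalized)
```

After normalization each row sums to 1, so a cell reads as the fraction of examples with that true label that received the given prediction.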
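Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. In this part, we will demonstrate this by training with two different TF-Hub modules, nnlm-en-dim128 and random-nnlm-en-dim128.
And by training in two modes: training only the classifier (the module is not trainable), and training the classifier together with the module.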
Let's run a couple of trainings and evaluations to see how using various modules can affect the accuracy.
def train_and_evaluate_with_module(hub_module, train_module=False):
  embedded_text_feature_column = hub.text_embedding_column(
      key="sentence", module_spec=hub_module, trainable=train_module)

  estimator = tf.estimator.DNNClassifier(
      hidden_units=[500, 100],
      feature_columns=[embedded_text_feature_column],
      n_classes=2,
      optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))

  estimator.train(input_fn=train_input_fn, steps=1000)

  train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
  test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

  training_set_accuracy = train_eval_result["accuracy"]
  test_set_accuracy = test_eval_result["accuracy"]

  return {
      "Training accuracy": training_set_accuracy,
      "Test accuracy": test_set_accuracy
  }


results = {}
results["nnlm-en-dim128"] = train_and_evaluate_with_module(
    "")
results["nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "", True)
results["random-nnlm-en-dim128"] = train_and_evaluate_with_module(
    "")
results["random-nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "", True)
Let's look at the results.
pd.DataFrame.from_dict(results, orient="index")
| Training accuracy | Test accuracy |
nnlm-en-dim128 | 0.80148 | 0.79384 |
nnlm-en-dim128-with-module-training | 0.94212 | 0.86792 |
random-nnlm-en-dim128 | 0.72072 | 0.67664 |
random-nnlm-en-dim128-with-module-training | 0.76124 | 0.71944 |
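One way to read this table is to look at the gap between training and test accuracy for each run. The sketch below copies the four rows from the table above into a plain dictionary; the name `accuracy_gaps` is introduced here only for illustration:

```python
# Accuracies copied from the results table above: (training, test).
reported = {
    "nnlm-en-dim128": (0.80148, 0.79384),
    "nnlm-en-dim128-with-module-training": (0.94212, 0.86792),
    "random-nnlm-en-dim128": (0.72072, 0.67664),
    "random-nnlm-en-dim128-with-module-training": (0.76124, 0.71944),
}

# Gap between training and test accuracy for each run.
accuracy_gaps = {
    name: train_acc - test_acc
    for name, (train_acc, test_acc) in reported.items()
}

for name in sorted(accuracy_gaps):
    print("{}: {:.5f}".format(name, accuracy_gaps[name]))
```

A larger gap suggests the model fits the training set more closely than it generalizes. In these numbers the run with the highest test accuracy also has the largest gap.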
We can already see some patterns, but first we should establish the baseline accuracy on the test set, that is, the lower bound that can be achieved by always outputting the label of the most represented class:
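As a rough sketch of how such a baseline can be computed from labels alone, the snippet below counts the most frequent class in a small list. The labels and the helper `majority_baseline` are hypothetical and stand in for the polarity column of the test set; this is not necessarily how the tutorial itself obtains the number:

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))


# Hypothetical, perfectly balanced labels (1 = positive, 0 = negative).
balanced_labels = [1, 0, 1, 0, 1, 0, 1, 0]
print(majority_baseline(balanced_labels))  # 0.5
```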
Always assigning the most represented class gives us an accuracy of 50%. There are a couple of things to notice here:
