Note: You can run this notebook live in Colab with zero setup.
TF-Hub is a platform to share machine learning expertise packaged in reusable resources, notably pre-trained modules. This tutorial is organized into two main parts.
Introduction: Training a text classifier with TF-Hub
We will use a TF-Hub text embedding module to train a simple sentiment classifier with a reasonable baseline accuracy. We will then analyze the predictions to make sure our model is reasonable and propose improvements to increase the accuracy.
Advanced: Transfer learning analysis
In this section, we will use various TF-Hub modules to compare their effect on the accuracy of the estimator and demonstrate advantages and pitfalls of transfer learning.
# Install the latest TensorFlow version.
!pip install --quiet "tensorflow>=1.7"
# Install TF-Hub.
!pip install -q tensorflow-hub
More detailed information about installing TensorFlow can be found at https://www.tensorflow.org/install/.
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
We will try to solve the Large Movie Review Dataset v1.0 task from Maas et al. The dataset consists of IMDB movie reviews labeled by positivity from 1 to 10. The task is to label the reviews as negative or positive.
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz",
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
      extract=True)

  train_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                      "aclImdb", "test"))

  return train_df, test_df

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

train_df, test_df = download_and_load_datasets()
train_df.head()
Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 1s 0us/step
|   | sentence | sentiment | polarity |
|---|---|---|---|
| 0 | Next to "Star Wars" and "The Wizard of Oz," th... | 10 | 1 |
| 1 | I can't help but laugh at the people who prais... | 1 | 0 |
| 2 | Based on a true story, this series is a gem wi... | 10 | 1 |
| 3 | Van Dien must cringe with embarrassment at the... | 1 | 0 |
| 4 | This film had such promise!! What a great idea... | 4 | 0 |
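As a sanity check, the two splits should each contain 25,000 labeled reviews. The following one-liner is illustrative and not part of the original notebook:

# Quick check of the split sizes (illustrative).
print(len(train_df), len(test_df))  # 25000 25000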
The Estimator framework provides input functions that wrap Pandas dataframes.
# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)

# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], shuffle=False)

# Prediction on the test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(
    test_df, test_df["polarity"], shuffle=False)
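If you are curious what these functions actually yield, you can pull a single batch by hand. This is an illustrative sketch, not part of the original notebook; it assumes the queue-based TF 1.x runtime behind pandas_input_fn, whose default batch size is 128:

# Illustrative only: peek at one batch produced by an input function.
# pandas_input_fn is queue-based in TF 1.x, so queue runners must be started.
with tf.Graph().as_default():
  features, labels = predict_test_input_fn()
  with tf.Session() as session:
    session.run([tf.global_variables_initializer(),
                 tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(session, coord)
    sentences, polarities = session.run([features["sentence"], labels])
    coord.request_stop()
    coord.join(threads)

print(sentences.shape, polarities.shape)  # (128,) (128,) with the default batch size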
TF-Hub provides a feature column that applies a module to the given text feature and passes the module's outputs on to the model. In this tutorial we will be using the nnlm-en-dim128 module. For the purpose of this tutorial, the most important facts are:

  * The module takes a batch of sentences in a 1-D tensor of strings as input.
  * The module is responsible for preprocessing of sentences (e.g. removal of punctuation and splitting on spaces).
  * The module works with whatever input it gets (e.g. out-of-vocabulary words are mapped to hash buckets).
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")
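To build some intuition for what this column feeds into the network, the module can also be applied directly to raw sentences. A minimal sketch (not part of the original notebook) using the TF 1.x hub.Module API, with made-up example sentences:

# Illustrative only: apply the module directly to a batch of sentences.
with tf.Graph().as_default():
  embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
  embeddings = embed(["The movie was great!", "A disappointing sequel."])
  with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(session.run(embeddings).shape)  # (2, 128)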
For classification we can use a DNN Classifier (see the remarks at the end of the tutorial about different ways of modeling the label function).
estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))
Train the estimator for a reasonable number of steps.
# Training for 1,000 steps means 128,000 training examples with the default
# batch size. This is roughly equivalent to 5 epochs since the training dataset
# contains 25,000 examples.
estimator.train(input_fn=train_input_fn, steps=1000);
Run predictions for both the training and test sets.
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))
Training set accuracy: 0.801320016384
Test set accuracy: 0.793600022793
We can visually check the confusion matrix to understand the distribution of misclassifications.
def get_predictions(estimator, input_fn):
  return [x["class_ids"][0] for x in estimator.predict(input_fn=input_fn)]

LABELS = ["negative", "positive"]

# Create a confusion matrix on training data.
with tf.Graph().as_default():
  cm = tf.confusion_matrix(train_df["polarity"],
                           get_predictions(estimator, predict_train_input_fn))
  with tf.Session() as session:
    cm_out = session.run(cm)

# Normalize the confusion matrix so that each row sums to 1.
cm_out = cm_out.astype(float) / cm_out.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_out, annot=True, xticklabels=LABELS, yticklabels=LABELS);
plt.xlabel("Predicted");
plt.ylabel("True");
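For comparison, the same row-normalized matrix can be computed without building a TF graph; this pandas alternative is an illustrative sketch, not part of the original notebook:

# Illustrative alternative: row-normalized confusion matrix in pure pandas.
train_predictions = get_predictions(estimator, predict_train_input_fn)
print(pd.crosstab(train_df["polarity"], pd.Series(train_predictions),
                  rownames=["True"], colnames=["Predicted"],
                  normalize="index"))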
Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. In this part, we will demonstrate this by training with two different TF-Hub modules:

  * nnlm-en-dim128 - a text embedding module pre-trained on real data, and
  * random-nnlm-en-dim128 - a module with the same vocabulary and network as nnlm-en-dim128, but whose weights were randomly initialized and never trained on real data.
And by training in two modes:

  * training only the classifier (i.e. freezing the module), and
  * training the classifier together with the module.
Let's run a couple of trainings and evaluations to see how using various modules can affect the accuracy.
def train_and_evaluate_with_module(hub_module, train_module=False):
  embedded_text_feature_column = hub.text_embedding_column(
      key="sentence", module_spec=hub_module, trainable=train_module)

  estimator = tf.estimator.DNNClassifier(
      hidden_units=[500, 100],
      feature_columns=[embedded_text_feature_column],
      n_classes=2,
      optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))

  estimator.train(input_fn=train_input_fn, steps=1000)

  train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
  test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

  training_set_accuracy = train_eval_result["accuracy"]
  test_set_accuracy = test_eval_result["accuracy"]

  return {
      "Training accuracy": training_set_accuracy,
      "Test accuracy": test_set_accuracy
  }

results = {}
results["nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1")
results["nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1", True)
results["random-nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1")
results["random-nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1", True)
Let's look at the results.
pd.DataFrame.from_dict(results, orient="index")
|   | Training accuracy | Test accuracy |
|---|---|---|
| nnlm-en-dim128 | 0.80148 | 0.79384 |
| nnlm-en-dim128-with-module-training | 0.94212 | 0.86792 |
| random-nnlm-en-dim128 | 0.72072 | 0.67664 |
| random-nnlm-en-dim128-with-module-training | 0.76124 | 0.71944 |
We can already see some patterns, but first we should establish the baseline accuracy on the test set - the lower bound achievable by always predicting the most represented class:
estimator.evaluate(input_fn=predict_test_input_fn)["accuracy_baseline"]
0.5
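The same number can be read directly off the label distribution; this quick check is illustrative and not part of the original notebook:

# Illustrative check: the majority-class fraction in the test labels.
print(test_df["polarity"].value_counts(normalize=True).max())  # 0.5

The test set is perfectly balanced (12,500 positive and 12,500 negative reviews), hence the 50% baseline.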
Assigning the most represented class will give us an accuracy of 50%. There are a couple of things to notice here:

  * Maybe surprisingly, a model can still be learned on top of fixed, random embeddings. Even if every word in the vocabulary is mapped to a random vector, the classifier can separate the space with its fully connected layers, which is why random-nnlm-en-dim128 still reaches roughly 68% test accuracy, well above the 50% baseline.
  * Allowing training of the module (with-module-training) improves both training and test accuracy, for random as well as pre-trained embeddings.
  * Training of the module with pre-trained embeddings yields the best test accuracy, but note the gap between training accuracy (94.2%) and test accuracy (86.8%): fine-tuning a pre-trained module risks overfitting, because the embedding weights drift away from the language model trained on diverse data toward a representation specific to this dataset.
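To quantify the overfitting observation, we can look at the gap between the two accuracy columns; a small illustrative sketch, not part of the original notebook:

# Illustrative: the training/test accuracy gap as a rough overfitting signal.
results_df = pd.DataFrame.from_dict(results, orient="index")
results_df["Gap"] = results_df["Training accuracy"] - results_df["Test accuracy"]
print(results_df.sort_values("Gap", ascending=False))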
© 2018 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/tutorials/text_classification_with_tf_hub