Note: You can run this notebook live in Colab with zero setup.
TF-Hub is a platform to share machine learning expertise packaged in reusable resources, notably pre-trained modules. This tutorial is organized into two main parts.
Introduction: Training a text classifier with TF-Hub
We will use a TF-Hub text embedding module to train a simple sentiment classifier with a reasonable baseline accuracy. We will then analyze the predictions to check that our model is sensible and propose improvements to increase the accuracy.
Advanced: Transfer learning analysis
In this section, we will use various TF-Hub modules to compare their effect on the accuracy of the estimator and demonstrate advantages and pitfalls of transfer learning.
# Install the latest TensorFlow version.
!pip install --quiet "tensorflow>=1.7"
# Install TF-Hub.
!pip install -q tensorflow-hub
More detailed information about installing TensorFlow can be found at https://www.tensorflow.org/install/.
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
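As a quick sanity check, you can print the installed versions to confirm they satisfy the "tensorflow>=1.7" requirement above (this snippet is an addition, not part of the original notebook):

# Both packages expose __version__.
print(tf.__version__)
print(hub.__version__)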
We will try to solve the Large Movie Review Dataset v1.0 task from Maas et al. The dataset consists of IMDB movie reviews labeled by positivity from 1 to 10. The task is to label the reviews as negative or positive.
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      # File names look like "<id>_<rating>.txt"; the rating is the sentiment.
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)
# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)
# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz",
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
      extract=True)

  train_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                      "aclImdb", "test"))

  return train_df, test_df
# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)
train_df, test_df = download_and_load_datasets()
train_df.head()
Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 1s 0us/step
|   | sentence | sentiment | polarity |
|---|---|---|---|
| 0 | Next to "Star Wars" and "The Wizard of Oz," th... | 10 | 1 |
| 1 | I can't help but laugh at the people who prais... | 1 | 0 |
| 2 | Based on a true story, this series is a gem wi... | 10 | 1 |
| 3 | Van Dien must cringe with embarrassment at the... | 1 | 0 |
| 4 | This film had such promise!! What a great idea... | 4 | 0 |
The Estimator framework provides input functions that wrap Pandas dataframes.
# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)

# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], shuffle=False)

# Prediction on the test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(
    test_df, test_df["polarity"], shuffle=False)
TF-Hub provides a feature column that applies a module on the given text feature and passes the module's outputs onward. In this tutorial we will be using the nnlm-en-dim128 module. For the purpose of this tutorial, the most important facts are:

* The module takes a batch of sentences in a 1-D tensor of strings as input.
* The module is responsible for preprocessing of sentences (e.g. removal of punctuation and splitting on spaces).
* The module works with any input (e.g. nnlm-en-dim128 hashes words not present in the vocabulary into ~20,000 buckets).
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")
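To make the first fact above concrete, here is a small sanity check (an addition to the tutorial, with made-up example sentences) that applies the module directly and confirms each sentence maps to a 128-dimensional vector:

with tf.Graph().as_default():
  embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
  embeddings = embed(["The movie was great!", "A dull, predictable plot."])
  with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # Expect a (2, 128) array: one 128-dimensional embedding per sentence.
    print(session.run(embeddings).shape)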
For classification we can use a DNNClassifier (see the remarks at the end of the tutorial about modeling the label function differently).
estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))
Train the estimator for a reasonable number of steps.
# Training for 1,000 steps means 128,000 training examples with the default
# batch size. This is roughly equivalent to 5 epochs since the training dataset
# contains 25,000 examples.
estimator.train(input_fn=train_input_fn, steps=1000);
Run predictions for both the training and test sets.
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)
print "Training set accuracy: {accuracy}".format(**train_eval_result)
print "Test set accuracy: {accuracy}".format(**test_eval_result)
Training set accuracy: 0.801320016384
Test set accuracy: 0.793600022793
We can visually check the confusion matrix to understand the distribution of misclassifications.
def get_predictions(estimator, input_fn):
  return [x["class_ids"][0] for x in estimator.predict(input_fn=input_fn)]

LABELS = [
    "negative", "positive"
]
# Create a confusion matrix on training data.
with tf.Graph().as_default():
  cm = tf.confusion_matrix(train_df["polarity"],
                           get_predictions(estimator, predict_train_input_fn))
  with tf.Session() as session:
    cm_out = session.run(cm)

# Normalize the confusion matrix so that each row sums to 1.
cm_out = cm_out.astype(float) / cm_out.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_out, annot=True, xticklabels=LABELS, yticklabels=LABELS);
plt.xlabel("Predicted");
plt.ylabel("True");
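As a consistency check (again an addition, not part of the original notebook), the accuracy recovered from the raw, unnormalized confusion-matrix counts should match the ~0.80 training accuracy reported by estimator.evaluate above:

with tf.Graph().as_default():
  cm_raw = tf.confusion_matrix(train_df["polarity"],
                               get_predictions(estimator, predict_train_input_fn))
  with tf.Session() as session:
    counts = session.run(cm_raw)
# Correct predictions lie on the diagonal, so accuracy = trace / total.
print(np.trace(counts) / float(counts.sum()))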
Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. In this part, we will demonstrate this by training with two different TF-Hub modules:

* nnlm-en-dim128 - a trained text embedding module, and
* random-nnlm-en-dim128 - a text embedding module with the same vocabulary and network as nnlm-en-dim128, but whose weights were just randomly initialized and never trained on real data.

And by training in two modes:

* training only the classifier (i.e. freezing the module), and
* training the classifier together with the module.
Let's run a couple of trainings and evaluations to see how using various modules can affect the accuracy.
def train_and_evaluate_with_module(hub_module, train_module=False):
  embedded_text_feature_column = hub.text_embedding_column(
      key="sentence", module_spec=hub_module, trainable=train_module)

  estimator = tf.estimator.DNNClassifier(
      hidden_units=[500, 100],
      feature_columns=[embedded_text_feature_column],
      n_classes=2,
      optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))

  estimator.train(input_fn=train_input_fn, steps=1000)

  train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
  test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

  training_set_accuracy = train_eval_result["accuracy"]
  test_set_accuracy = test_eval_result["accuracy"]

  return {
      "Training accuracy": training_set_accuracy,
      "Test accuracy": test_set_accuracy
  }
results = {}
results["nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1")
results["nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1", True)
results["random-nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1")
results["random-nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1", True)
Let's look at the results.
pd.DataFrame.from_dict(results, orient="index")
|   | Training accuracy | Test accuracy |
|---|---|---|
| nnlm-en-dim128 | 0.80148 | 0.79384 |
| nnlm-en-dim128-with-module-training | 0.94212 | 0.86792 |
| random-nnlm-en-dim128 | 0.72072 | 0.67664 |
| random-nnlm-en-dim128-with-module-training | 0.76124 | 0.71944 |
We can already see some patterns, but first we should establish the baseline accuracy of the test set, i.e. the lower bound that can be achieved by outputting only the label of the most represented class:
estimator.evaluate(input_fn=predict_test_input_fn)["accuracy_baseline"]
0.5
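The same baseline can be computed directly from the label distribution (an added cross-check): the test set is balanced, with 12,500 positive and 12,500 negative reviews, so the majority class covers exactly half of it.

# Frequency of the most common label equals the majority-class baseline.
print(test_df["polarity"].value_counts(normalize=True).max())  # 0.5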
Assigning the most represented class will give us an accuracy of 50%. There are a couple of things to notice here:

* Maybe surprisingly, a model can still be learned on top of fixed, random embeddings (random-nnlm-en-dim128 reaches ~68% test accuracy, well above the 50% baseline). The reason is that even if every word in the dictionary is mapped to a random vector, the estimator can separate the space purely using its fully connected layers.
* Allowing training of the module with random embeddings increases both training and test accuracy, as does training of the module with pre-trained embeddings.
* Training of the module with pre-trained embeddings also increases both accuracies. Note however the overfitting on the training set (94% training vs. 87% test accuracy).
© 2018 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/tutorials/text_classification_with_tf_hub