torchvision.models

The models subpackage contains definitions of models for addressing different tasks, including: image classification, pixelwise semantic segmentation, object detection, instance segmentation, person keypoint detection and video classification.

Classification

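The models subpackage contains definitions for the following model architectures for image classification: AlexNet, VGG, ResNet, SqueezeNet, DenseNet, Inception v3, GoogLeNet, ShuffleNet v2, MobileNet v2, ResNeXt, Wide ResNet and MNASNet.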

You can construct a model with random weights by calling its constructor:

import torchvision.models as models
resnet18 = models.resnet18()
alexnet = models.alexnet()
vgg16 = models.vgg16()
squeezenet = models.squeezenet1_0()
densenet = models.densenet161()
inception = models.inception_v3()
googlenet = models.googlenet()
shufflenet = models.shufflenet_v2_x1_0()
mobilenet = models.mobilenet_v2()
resnext50_32x4d = models.resnext50_32x4d()
wide_resnet50_2 = models.wide_resnet50_2()
mnasnet = models.mnasnet1_0()

We provide pre-trained models, using the PyTorch torch.utils.model_zoo. These can be constructed by passing pretrained=True:

import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
squeezenet = models.squeezenet1_0(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
densenet = models.densenet161(pretrained=True)
inception = models.inception_v3(pretrained=True)
googlenet = models.googlenet(pretrained=True)
shufflenet = models.shufflenet_v2_x1_0(pretrained=True)
mobilenet = models.mobilenet_v2(pretrained=True)
resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
wide_resnet50_2 = models.wide_resnet50_2(pretrained=True)
mnasnet = models.mnasnet1_0(pretrained=True)

Instantiating a pre-trained model will download its weights to a cache directory. This directory can be set using the TORCH_MODEL_ZOO environment variable. See torch.utils.model_zoo.load_url() for details.

Some models use modules which have different training and evaluation behavior, such as batch normalization. To switch between these modes, use model.train() or model.eval() as appropriate. See train() or eval() for details.

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. You can use the following transform to normalize:

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

An example of such normalization can be found in the imagenet example here
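To make the arithmetic concrete, here is a minimal plain-Python sketch of what the normalization does to a single RGB pixel already scaled to [0, 1]. The helper name `normalize_pixel` is hypothetical and not part of torchvision; it only illustrates the per-channel formula (x - mean) / std with the values quoted above.

```python
# Per-channel ImageNet statistics quoted above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std to one RGB pixel with values in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, MEAN, STD)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel([0.485, 0.456, 0.406]))
# A pure white pixel (1.0 in every channel) maps to positive values.
print(normalize_pixel([1.0, 1.0, 1.0]))
```

After normalization the values are no longer confined to [0, 1]; they are expressed in units of the per-channel standard deviation.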

The process for obtaining the values of mean and std is roughly equivalent to:

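import torch
from torchvision import datasets, transforms as T

transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
dataset = datasets.ImageNet(".", split="train", transform=transform)

means = []
stds = []
# `subset` is a placeholder: the subset of the training set that was used is unknown.
for img, _ in subset(dataset):
    # Per-channel statistics over the spatial dimensions of a (3, H, W) tensor
    means.append(img.mean(dim=(1, 2)))
    stds.append(img.std(dim=(1, 2)))

# Average the per-image statistics to get one value per channel
mean = torch.stack(means).mean(dim=0)
std = torch.stack(stds).mean(dim=0)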

Unfortunately, the concrete subset that was used is lost. For more information see this discussion or these experiments.

ImageNet 1-crop error rates (224x224)

Network                            Top-1 error   Top-5 error
AlexNet                            43.45         20.91
VGG-11                             30.98         11.37
VGG-13                             30.07         10.75
VGG-16                             28.41         9.62
VGG-19                             27.62         9.12
VGG-11 with batch normalization    29.62         10.19
VGG-13 with batch normalization    28.45         9.63
VGG-16 with batch normalization    26.63         8.50
VGG-19 with batch normalization    25.76         8.15
ResNet-18                          30.24         10.92
ResNet-34                          26.70         8.58
ResNet-50                          23.85         7.13
ResNet-101                         22.63         6.44
ResNet-152                         21.69         5.94
SqueezeNet 1.0                     41.90         19.58
SqueezeNet 1.1                     41.81         19.38
Densenet-121                       25.35         7.83
Densenet-169                       24.00         7.00
Densenet-201                       22.80         6.43
Densenet-161                       22.35         6.20
Inception v3                       22.55         6.44
GoogleNet                          30.22         10.47
ShuffleNet V2                      30.64         11.68
MobileNet V2                       28.12         9.71
ResNeXt-50-32x4d                   22.38         6.30
ResNeXt-101-32x8d                  20.69         5.47
Wide ResNet-50-2                   21.49         5.91
Wide ResNet-101-2                  21.16         5.72
MNASNet 1.0                        26.49         8.456
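For reference, top-k error is the percentage of images whose true label is not among the k highest-scoring classes. A minimal plain-Python sketch of that definition follows; the function name `top_k_error` and the toy scores are illustrative assumptions, not part of torchvision and not the evaluation code used to produce the numbers above.

```python
def top_k_error(scores, labels, k):
    """Return the percentage of samples whose label is not in the top-k scores.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    errors = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return 100.0 * errors / len(labels)


# Toy example with 4 classes and 4 samples.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # highest score: class 1
    [0.5, 0.3, 0.1, 0.1],  # highest score: class 0
    [0.2, 0.2, 0.5, 0.1],  # highest score: class 2
    [0.4, 0.1, 0.2, 0.3],  # highest score: class 0
]
labels = [1, 1, 3, 2]
print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

Top-5 error is never larger than top-1 error, because a label that ranks first also ranks within the first five.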

AlexNet

torchvision.models.alexnet(pretrained=False, progress=True, **kwargs) [source]

AlexNet model architecture from the “One weird trick…” paper.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

VGG

torchvision.models.vgg11(pretrained=False, progress=True, **kwargs) [source]

VGG 11-layer model (configuration “A”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg11_bn(pretrained=False, progress=True, **kwargs) [source]

VGG 11-layer model (configuration “A”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg13(pretrained=False, progress=True, **kwargs) [source]

VGG 13-layer model (configuration “B”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg13_bn(pretrained=False, progress=True, **kwargs) [source]

VGG 13-layer model (configuration “B”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg16(pretrained=False, progress=True, **kwargs) [source]

VGG 16-layer model (configuration “D”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg16_bn(pretrained=False, progress=True, **kwargs) [source]

VGG 16-layer model (configuration “D”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg19(pretrained=False, progress=True, **kwargs) [source]

VGG 19-layer model (configuration “E”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg19_bn(pretrained=False, progress=True, **kwargs) [source]

VGG 19-layer model (configuration “E”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

ResNet

torchvision.models.resnet18(pretrained=False, progress=True, **kwargs) [source]

ResNet-18 model from “Deep Residual Learning for Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet34(pretrained=False, progress=True, **kwargs) [source]

ResNet-34 model from “Deep Residual Learning for Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet50(pretrained=False, progress=True, **kwargs) [source]

ResNet-50 model from “Deep Residual Learning for Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet101(pretrained=False, progress=True, **kwargs) [source]

ResNet-101 model from “Deep Residual Learning for Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet152(pretrained=False, progress=True, **kwargs) [source]

ResNet-152 model from “Deep Residual Learning for Image Recognition”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

SqueezeNet

torchvision.models.squeezenet1_0(pretrained=False, progress=True, **kwargs) [source]

SqueezeNet model architecture from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.squeezenet1_1(pretrained=False, progress=True, **kwargs) [source]

SqueezeNet 1.1 model from the official SqueezeNet repo. SqueezeNet 1.1 has 2.4x less computation and slightly fewer parameters than SqueezeNet 1.0, without sacrificing accuracy.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

DenseNet

torchvision.models.densenet121(pretrained=False, progress=True, **kwargs) [source]

Densenet-121 model from “Densely Connected Convolutional Networks”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • memory_efficient (bool) – If True, uses checkpointing. Much more memory efficient, but slower. Default: False. See “paper”

torchvision.models.densenet169(pretrained=False, progress=True, **kwargs) [source]

Densenet-169 model from “Densely Connected Convolutional Networks”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • memory_efficient (bool) – If True, uses checkpointing. Much more memory efficient, but slower. Default: False. See “paper”

torchvision.models.densenet161(pretrained=False, progress=True, **kwargs) [source]

Densenet-161 model from “Densely Connected Convolutional Networks”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • memory_efficient (bool) – If True, uses checkpointing. Much more memory efficient, but slower. Default: False. See “paper”

torchvision.models.densenet201(pretrained=False, progress=True, **kwargs) [source]

Densenet-201 model from “Densely Connected Convolutional Networks”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • memory_efficient (bool) – If True, uses checkpointing. Much more memory efficient, but slower. Default: False. See “paper”

Inception v3

torchvision.models.inception_v3(pretrained=False, progress=True, **kwargs) [source]

Inception v3 model architecture from “Rethinking the Inception Architecture for Computer Vision”.

Note

Important: In contrast to the other models the inception_v3 expects tensors with a size of N x 3 x 299 x 299, so ensure your images are sized accordingly.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • aux_logits (bool) – If True, add an auxiliary branch that can improve training. Default: True
  • transform_input (bool) – If True, preprocesses the input according to the method with which it was trained on ImageNet. Default: False

Note

This requires scipy to be installed

GoogLeNet

torchvision.models.googlenet(pretrained=False, progress=True, **kwargs) [source]

GoogLeNet (Inception v1) model architecture from “Going Deeper with Convolutions”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • aux_logits (bool) – If True, adds two auxiliary branches that can improve training. Default: False when pretrained is True otherwise True
  • transform_input (bool) – If True, preprocesses the input according to the method with which it was trained on ImageNet. Default: False

Note

This requires scipy to be installed

ShuffleNet v2

torchvision.models.shufflenet_v2_x0_5(pretrained=False, progress=True, **kwargs) [source]

Constructs a ShuffleNetV2 with 0.5x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.shufflenet_v2_x1_0(pretrained=False, progress=True, **kwargs) [source]

Constructs a ShuffleNetV2 with 1.0x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.shufflenet_v2_x1_5(pretrained=False, progress=True, **kwargs) [source]

Constructs a ShuffleNetV2 with 1.5x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.shufflenet_v2_x2_0(pretrained=False, progress=True, **kwargs) [source]

Constructs a ShuffleNetV2 with 2.0x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

MobileNet v2

torchvision.models.mobilenet_v2(pretrained=False, progress=True, **kwargs) [source]

Constructs a MobileNetV2 architecture from “MobileNetV2: Inverted Residuals and Linear Bottlenecks”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

ResNeXt

torchvision.models.resnext50_32x4d(pretrained=False, progress=True, **kwargs) [source]

ResNeXt-50 32x4d model from “Aggregated Residual Transformations for Deep Neural Networks”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnext101_32x8d(pretrained=False, progress=True, **kwargs) [source]

ResNeXt-101 32x8d model from “Aggregated Residual Transformations for Deep Neural Networks”

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

Wide ResNet

torchvision.models.wide_resnet50_2(pretrained=False, progress=True, **kwargs) [source]

Wide ResNet-50-2 model from “Wide Residual Networks”

The model is the same as ResNet except that the number of bottleneck channels is twice as large in every block. The number of channels in the outer 1x1 convolutions is the same, e.g. the last block in ResNet-50 has 2048-512-2048 channels, and in Wide ResNet-50-2 it has 2048-1024-2048.
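The channel counts quoted above can be reproduced with a little arithmetic. The sketch below is plain Python; the helper `bottleneck_channels` is hypothetical, and it assumes the bottleneck width scales linearly with a per-group width of 64 for ResNet and 128 for the wide variants, while the outer 1x1 convolutions stay at four times the base width of the stage.

```python
def bottleneck_channels(planes, width_per_group=64, expansion=4):
    """Return (outer, bottleneck, outer) channel counts of a bottleneck block.

    planes: base width of the stage (512 for the last stage of ResNet-50).
    width_per_group: 64 for ResNet, 128 for the "-2" wide variants (assumed).
    """
    outer = planes * expansion                    # outer 1x1 convolutions
    inner = int(planes * (width_per_group / 64))  # bottleneck 3x3 convolution
    return outer, inner, outer


print(bottleneck_channels(512))                       # ResNet-50, last block
print(bottleneck_channels(512, width_per_group=128))  # Wide ResNet-50-2, last block
```

Only the middle value changes between the two calls, which is exactly the difference the description points out.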

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.wide_resnet101_2(pretrained=False, progress=True, **kwargs) [source]

Wide ResNet-101-2 model from “Wide Residual Networks”

The model is the same as ResNet except that the number of bottleneck channels is twice as large in every block. The number of channels in the outer 1x1 convolutions is the same, e.g. the last block in ResNet-50 has 2048-512-2048 channels, and in Wide ResNet-50-2 it has 2048-1024-2048.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

MNASNet

torchvision.models.mnasnet0_5(pretrained=False, progress=True, **kwargs) [source]

MNASNet with depth multiplier of 0.5 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mnasnet0_75(pretrained=False, progress=True, **kwargs) [source]

MNASNet with depth multiplier of 0.75 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mnasnet1_0(pretrained=False, progress=True, **kwargs) [source]

MNASNet with depth multiplier of 1.0 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mnasnet1_3(pretrained=False, progress=True, **kwargs) [source]

MNASNet with depth multiplier of 1.3 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on ImageNet
  • progress (bool) – If True, displays a progress bar of the download to stderr

Semantic Segmentation

The models subpackage contains definitions for the following model architectures for semantic segmentation: FCN ResNet50, FCN ResNet101, DeepLabV3 ResNet50 and DeepLabV3 ResNet101.

As with image classification models, all pre-trained models expect input images normalized in the same way. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. They have been trained on images resized such that their minimum size is 520.

The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are present in the Pascal VOC dataset. You can see more information on how the subset has been selected in references/segmentation/coco_utils.py. The classes that the pre-trained model outputs are the following, in order:

['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
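As an illustration of how these class indices are used, the per-pixel prediction is the index of the highest-scoring class at each pixel, which can then be looked up in the list above. The sketch below uses plain Python lists in place of tensors; the names `VOC_CLASSES`, `argmax_classes` and `one_hot_scores`, as well as the toy scores, are illustrative assumptions.

```python
VOC_CLASSES = [
    '__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor',
]


def argmax_classes(pixel_scores):
    """Map each pixel's list of 21 class scores to the winning class name."""
    names = []
    for scores in pixel_scores:
        best = max(range(len(scores)), key=lambda c: scores[c])
        names.append(VOC_CLASSES[best])
    return names


def one_hot_scores(index, num_classes=21):
    """Build toy scores in which a single class clearly wins."""
    return [1.0 if c == index else 0.0 for c in range(num_classes)]


# Three toy pixels whose highest scores are at indices 0, 8 and 15.
pixels = [one_hot_scores(0), one_hot_scores(8), one_hot_scores(15)]
print(argmax_classes(pixels))
```

Index 0 is the background class, so any pixel that does not belong to one of the 20 object categories is expected to map to it.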

The accuracies of the pre-trained models evaluated on COCO val2017 are as follows

Network                mean IoU   global pixelwise acc
FCN ResNet50           60.5       91.4
FCN ResNet101          63.7       91.9
DeepLabV3 ResNet50     66.4       92.4
DeepLabV3 ResNet101    67.4       92.4
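For readers unfamiliar with the two metrics, the sketch below computes them from a confusion matrix in plain Python. It assumes the common definitions (per-class IoU = TP / (TP + FP + FN), averaged over classes, and global accuracy = correctly classified pixels / all pixels). The function name and the toy matrix are illustrative, and this is not necessarily the exact evaluation code used to produce the table.

```python
def segmentation_metrics(confusion):
    """Return (mean IoU, global pixel accuracy), both in percent.

    confusion[i][j] counts pixels of true class i predicted as class j.
    Every class is assumed to occur at least once, so no denominator is zero.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    ious = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                       # missed pixels of class i
        fp = sum(confusion[r][i] for r in range(n)) - tp  # pixels wrongly labeled i
        ious.append(tp / (tp + fp + fn))
    return 100.0 * sum(ious) / n, 100.0 * correct / total


# Toy 2-class confusion matrix (rows: ground truth, columns: prediction).
confusion = [
    [8, 2],
    [1, 9],
]
mean_iou, pixel_acc = segmentation_metrics(confusion)
print(mean_iou, pixel_acc)
```

Mean IoU weights every class equally, while global pixel accuracy is dominated by frequent classes such as the background, which is one reason the two columns differ so much.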

Fully Convolutional Networks

torchvision.models.segmentation.fcn_resnet50(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs) [source]

Constructs a Fully-Convolutional Network model with a ResNet-50 backbone.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017 which contains the same classes as Pascal VOC
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.segmentation.fcn_resnet101(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs) [source]

Constructs a Fully-Convolutional Network model with a ResNet-101 backbone.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017 which contains the same classes as Pascal VOC
  • progress (bool) – If True, displays a progress bar of the download to stderr

DeepLabV3

torchvision.models.segmentation.deeplabv3_resnet50(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs) [source]

Constructs a DeepLabV3 model with a ResNet-50 backbone.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017 which contains the same classes as Pascal VOC
  • progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.segmentation.deeplabv3_resnet101(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs) [source]

Constructs a DeepLabV3 model with a ResNet-101 backbone.

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017 which contains the same classes as Pascal VOC
  • progress (bool) – If True, displays a progress bar of the download to stderr

Object Detection, Instance Segmentation and Person Keypoint Detection

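The models subpackage contains definitions for the following model architectures for detection: Faster R-CNN ResNet-50 FPN, RetinaNet ResNet-50 FPN, Mask R-CNN ResNet-50 FPN and Keypoint R-CNN ResNet-50 FPN.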

The pre-trained models for detection, instance segmentation and keypoint detection are initialized with the classification models in torchvision.

The models expect a list of Tensor[C, H, W], in the range 0-1. The models internally resize the images so that they have a minimum size of 800. This option can be changed by passing the option min_size to the constructor of the models.
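To illustrate the resizing rule, the sketch below computes the scale factor applied to an image of a given size. It is plain Python and rests on an assumption not stated above: besides min_size, the longer side is capped by a max_size option, for which 1333 is used here as an assumed default. The helper name `resize_scale` is hypothetical.

```python
def resize_scale(height, width, min_size=800, max_size=1333):
    """Scale factor that brings the shorter side to min_size,
    unless that would push the longer side past max_size (assumed cap)."""
    shorter, longer = min(height, width), max(height, width)
    return min(min_size / shorter, max_size / longer)


# 600 x 1200 image: the longer side limits the scale (1333 / 1200).
print(resize_scale(600, 1200))
# 480 x 640 image: the shorter side reaches 800 (800 / 480).
print(resize_scale(480, 640))
```

Because the aspect ratio is preserved, elongated images may end up with a shorter side below min_size under this assumption.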

For object detection and instance segmentation, the pre-trained models return the predictions of the following classes:

COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

Here is a summary of the accuracies for the models trained on the instances set of COCO train2017 and evaluated on COCO val2017.

Network                       box AP   mask AP   keypoint AP
Faster R-CNN ResNet-50 FPN    37.0     -         -
RetinaNet ResNet-50 FPN       36.4     -         -
Mask R-CNN ResNet-50 FPN      37.9     34.6      -

For person keypoint detection, the accuracies for the pre-trained models are as follows

Network                         box AP   mask AP   keypoint AP
Keypoint R-CNN ResNet-50 FPN    54.6     -         65.0

For person keypoint detection, the pre-trained model returns the keypoints in the following order:

COCO_PERSON_KEYPOINT_NAMES = [
    'nose',
    'left_eye',
    'right_eye',
    'left_ear',
    'right_ear',
    'left_shoulder',
    'right_shoulder',
    'left_elbow',
    'right_elbow',
    'left_wrist',
    'right_wrist',
    'left_hip',
    'right_hip',
    'left_knee',
    'right_knee',
    'left_ankle',
    'right_ankle'
]
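As an illustration, the keypoint names can be paired with the [x, y, v] triples predicted for one person. The sketch uses a plain nested list standing in for one row of a keypoints tensor; the helper `name_keypoints` and the coordinates are made up for the example.

```python
COCO_PERSON_KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]


def name_keypoints(person_keypoints):
    """Pair each [x, y, v] triple with its keypoint name."""
    if len(person_keypoints) != len(COCO_PERSON_KEYPOINT_NAMES):
        raise ValueError("expected one [x, y, v] triple per keypoint name")
    return {name: tuple(kp)
            for name, kp in zip(COCO_PERSON_KEYPOINT_NAMES, person_keypoints)}


# Made-up coordinates for one detected person: keypoint i sits at (10*i, 5*i).
person = [[10.0 * i, 5.0 * i, 1.0] for i in range(17)]
named = name_keypoints(person)
print(named['nose'], named['left_shoulder'])
```

Looking keypoints up by name avoids hard-coding positions such as index 5 for the left shoulder throughout downstream code.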

Runtime characteristics

The implementations of the models for object detection, instance segmentation and keypoint detection are efficient.

In the following table, we use 8 V100 GPUs, with CUDA 10.0 and CUDNN 7.4 to report the results. During training, we use a batch size of 2 per GPU, and during testing a batch size of 1 is used.

For test time, we report the time for the model evaluation and postprocessing (including mask pasting in image), but not the time for computing the precision-recall.

Network                         train time (s / it)   test time (s / it)   memory (GB)
Faster R-CNN ResNet-50 FPN      0.2288                0.0590               5.2
RetinaNet ResNet-50 FPN         0.2514                0.0939               4.1
Mask R-CNN ResNet-50 FPN        0.2728                0.0903               5.4
Keypoint R-CNN ResNet-50 FPN    0.3789                0.1242               6.8

Faster R-CNN

torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=3, **kwargs) [source]

Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.

The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in 0-1 range. Different images can have different sizes.

The behavior of the model changes depending on whether it is in training or evaluation mode.

During training, the model expects both the input tensors and targets (a list of dictionaries), containing:

  • boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the class label for each ground-truth box
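A quick sanity check of the target format can catch malformed boxes before training. The sketch below is plain Python operating on nested lists standing in for the tensors; `check_target` is a hypothetical helper, and the requirement that x2 > x1 and y2 > y1 is an assumption about what counts as a well-formed box.

```python
def check_target(target, height, width):
    """Raise ValueError if a target dict violates the expected format."""
    boxes, labels = target['boxes'], target['labels']
    if len(boxes) != len(labels):
        raise ValueError("boxes and labels must have the same length N")
    for x1, y1, x2, y2 in boxes:
        # x must lie in [0, W], y in [0, H], and the box must have positive area.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            raise ValueError(f"invalid box {[x1, y1, x2, y2]}")


# One image of size H=300, W=400 with two ground-truth boxes.
target = {
    'boxes': [[10.0, 20.0, 110.0, 220.0], [50.0, 60.0, 400.0, 300.0]],
    'labels': [1, 18],
}
check_target(target, height=300, width=400)  # passes silently

bad = {'boxes': [[120.0, 20.0, 110.0, 220.0]], 'labels': [1]}  # x1 > x2
try:
    check_target(bad, height=300, width=400)
except ValueError as err:
    print(err)
```

Running such a check once per dataset sample is cheap compared with a failed training run.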

The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the R-CNN.

During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:

  • boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the predicted labels for each image
  • scores (Tensor[N]): the scores of each prediction
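A common post-processing step is to keep only confident detections. The sketch below filters one prediction dict by score in plain Python; the lists stand in for the tensors, and `filter_by_score` and the 0.5 threshold are illustrative choices rather than part of the API.

```python
def filter_by_score(prediction, threshold=0.5):
    """Keep only detections whose score is at least `threshold`."""
    keep = [i for i, s in enumerate(prediction['scores']) if s >= threshold]
    return {key: [values[i] for i in keep] for key, values in prediction.items()}


# Made-up output for one image with three detections.
prediction = {
    'boxes': [[10.0, 20.0, 110.0, 220.0],
              [30.0, 40.0, 90.0, 120.0],
              [0.0, 0.0, 50.0, 50.0]],
    'labels': [1, 18, 3],
    'scores': [0.98, 0.42, 0.75],
}
print(filter_by_score(prediction))
```

The same index list is applied to every field so that boxes, labels and scores stay aligned.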

Faster R-CNN is exportable to ONNX for a fixed batch size with input images of fixed size.

Example:

>>> import torch
>>> import torchvision
>>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
>>> # For training
>>> images, boxes = torch.rand(4, 3, 600, 1200), torch.rand(4, 11, 4)
>>> # Make x2 >= x1 and y2 >= y1 so that every random box is well-formed
>>> boxes[:, :, 2:4] = boxes[:, :, 0:2] + boxes[:, :, 2:4]
>>> labels = torch.randint(1, 91, (4, 11))
>>> images = list(image for image in images)
>>> targets = []
>>> for i in range(len(images)):
...     d = {}
...     d['boxes'] = boxes[i]
...     d['labels'] = labels[i]
...     targets.append(d)
>>> output = model(images, targets)
>>> # For inference
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
>>>
>>> # optionally, if you want to export the model to ONNX:
>>> torch.onnx.export(model, x, "faster_rcnn.onnx", opset_version=11)

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • pretrained_backbone (bool) – If True, returns a model with backbone pre-trained on Imagenet
  • num_classes (int) – number of output classes of the model (including the background)
  • trainable_backbone_layers (int) – number of trainable (not frozen) resnet layers starting from final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
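To make the counting direction concrete, the sketch below lists which backbone stages stay trainable for a given value. It is a plain-Python illustration; the stage names ('conv1', 'layer1' to 'layer4') follow the usual ResNet naming and are an assumption here, and the helper `trainable_stages` is hypothetical.

```python
# ResNet stages ordered from the final block back to the stem (assumed naming).
STAGES_FROM_END = ['layer4', 'layer3', 'layer2', 'layer1', 'conv1']


def trainable_stages(trainable_backbone_layers):
    """Return the stages left unfrozen, counting from the final block."""
    if not 0 <= trainable_backbone_layers <= 5:
        raise ValueError("valid values are between 0 and 5")
    return STAGES_FROM_END[:trainable_backbone_layers]


print(trainable_stages(3))  # the default value
print(trainable_stages(0))  # fully frozen backbone
```

Smaller values freeze more of the backbone, which reduces memory use during fine-tuning at the cost of less adaptation to the new data.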

RetinaNet

torchvision.models.detection.retinanet_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, **kwargs) [source]

Constructs a RetinaNet model with a ResNet-50-FPN backbone.

The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in 0-1 range. Different images can have different sizes.

The behavior of the model changes depending on whether it is in training or evaluation mode.

During training, the model expects both the input tensors and targets (a list of dictionaries), containing:

  • boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the class label for each ground-truth box

The model returns a Dict[Tensor] during training, containing the classification and regression losses.

During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:

  • boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the predicted labels for each image
  • scores (Tensor[N]): the scores of each prediction

Example:

>>> import torch
>>> import torchvision
>>> model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017
  • progress (bool) – If True, displays a progress bar of the download to stderr

Mask R-CNN

torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=3, **kwargs) [source]

Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.

The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in 0-1 range. Different images can have different sizes.

The behavior of the model changes depending on whether it is in training or evaluation mode.

During training, the model expects both the input tensors and targets (a list of dictionaries), containing:

  • boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the class label for each ground-truth box
  • masks (UInt8Tensor[N, H, W]): the segmentation binary masks for each instance

The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the R-CNN, and the mask loss.

During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:

  • boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the predicted labels for each image
  • scores (Tensor[N]): the scores of each prediction
  • masks (UInt8Tensor[N, 1, H, W]): the predicted masks for each instance, in 0-1 range. In order to obtain the final segmentation masks, the soft masks can be thresholded, generally with a value of 0.5 (mask >= 0.5)
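The thresholding step can be sketched as follows. Plain nested lists stand in for a single soft mask of shape [H, W]; `binarize_mask` is a hypothetical helper and the mask values are made up. With tensors, the comparison mask >= 0.5 mentioned above is the equivalent operation.

```python
def binarize_mask(soft_mask, threshold=0.5):
    """Turn a soft mask with values in [0, 1] into a binary 0/1 mask."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in soft_mask]


# Made-up 3 x 4 soft mask for one instance.
soft_mask = [
    [0.05, 0.40, 0.60, 0.10],
    [0.20, 0.90, 0.95, 0.50],
    [0.01, 0.55, 0.49, 0.02],
]
for row in binarize_mask(soft_mask):
    print(row)
```

A higher threshold yields tighter masks, while a lower one keeps more uncertain boundary pixels.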

Mask R-CNN is exportable to ONNX for a fixed batch size with input images of fixed size.

Example:

>>> import torch
>>> import torchvision
>>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
>>>
>>> # optionally, if you want to export the model to ONNX:
>>> torch.onnx.export(model, x, "mask_rcnn.onnx", opset_version=11)

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • pretrained_backbone (bool) – If True, returns a model with backbone pre-trained on Imagenet
  • num_classes (int) – number of output classes of the model (including the background)
  • trainable_backbone_layers (int) – number of trainable (not frozen) resnet layers starting from final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.

Keypoint R-CNN

torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=2, num_keypoints=17, pretrained_backbone=True, trainable_backbone_layers=3, **kwargs) [source]

Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.

The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in 0-1 range. Different images can have different sizes.

The behavior of the model changes depending on whether it is in training or evaluation mode.

During training, the model expects both the input tensors and targets (a list of dictionaries, one per image), each containing:

  • boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the class label for each ground-truth box
  • keypoints (FloatTensor[N, K, 3]): the locations of the K keypoints for each of the N instances, in the format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
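To make the expected target layout concrete, here is a small sketch that builds random targets with the three fields listed above. The helper make_dummy_target is hypothetical, the data is meaningless, and the choice of label 1 for the foreground class is an assumption based on num_classes=2 including the background.

```python
import torch


def make_dummy_target(num_instances: int, num_keypoints: int, height: int, width: int) -> dict:
    """Build one random target dict with boxes, labels and keypoints (illustrative data only)."""
    # Top-left corners stay in the upper-left quadrant so that x2 <= W and y2 <= H below.
    x1 = torch.rand(num_instances) * (width / 2)
    y1 = torch.rand(num_instances) * (height / 2)
    # Boxes are at least 1 pixel wide and tall.
    x2 = x1 + 1 + torch.rand(num_instances) * (width / 2 - 1)
    y2 = y1 + 1 + torch.rand(num_instances) * (height / 2 - 1)
    boxes = torch.stack([x1, y1, x2, y2], dim=1)

    # Place every keypoint inside its instance's box and mark it as visible.
    kx = x1[:, None] + torch.rand(num_instances, num_keypoints) * (x2 - x1)[:, None]
    ky = y1[:, None] + torch.rand(num_instances, num_keypoints) * (y2 - y1)[:, None]
    visibility = torch.ones(num_instances, num_keypoints)
    keypoints = torch.stack([kx, ky, visibility], dim=2)

    # Assumption: label 1 is the single foreground class.
    labels = torch.ones(num_instances, dtype=torch.int64)
    return {"boxes": boxes, "labels": labels, "keypoints": keypoints}


# One target dictionary per image; images may differ in size.
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
targets = [make_dummy_target(2, 17, img.shape[1], img.shape[2]) for img in images]
```

In training mode, a list like this would be passed to the model together with the images.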

The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the R-CNN, and the keypoint loss.

During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:

  • boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H
  • labels (Int64Tensor[N]): the predicted label for each detection
  • scores (Tensor[N]): the score of each prediction
  • keypoints (FloatTensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
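As an illustration of how the fields above fit together, the following sketch filters one prediction dictionary by score and extracts the (x, y) keypoint coordinates. The prediction is hand-made rather than real model output, the helper keep_confident is hypothetical, and the 0.5 score cut-off is an arbitrary assumption.

```python
import torch


def keep_confident(prediction: dict, min_score: float = 0.5) -> dict:
    """Keep only the detections whose score is at least min_score."""
    keep = prediction["scores"] >= min_score
    # Every field has N as its first dimension, so one boolean mask indexes them all.
    return {name: value[keep] for name, value in prediction.items()}


# Hand-made stand-in for one element of the prediction list (N=3, K=17).
prediction = {
    "boxes": torch.tensor([[0.0, 0.0, 10.0, 20.0],
                           [5.0, 5.0, 50.0, 60.0],
                           [1.0, 2.0, 3.0, 4.0]]),
    "labels": torch.ones(3, dtype=torch.int64),
    "scores": torch.tensor([0.9, 0.3, 0.7]),
    "keypoints": torch.rand(3, 17, 3),
}
confident = keep_confident(prediction)
# Drop the v component to get plain coordinates of shape [M, K, 2].
xy = confident["keypoints"][..., :2]
```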

Keypoint R-CNN is exportable to ONNX for a fixed batch size with input images of fixed size.

Example:

>>> import torch
>>> import torchvision
>>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
>>>
>>> # optionally, if you want to export the model to ONNX:
>>> torch.onnx.export(model, x, "keypoint_rcnn.onnx", opset_version=11)

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on COCO train2017
  • progress (bool) – If True, displays a progress bar of the download to stderr
  • pretrained_backbone (bool) – If True, returns a model with backbone pre-trained on ImageNet
  • num_classes (int) – number of output classes of the model (including the background)
  • num_keypoints (int) – number of keypoints predicted for each instance (17 by default)
  • trainable_backbone_layers (int) – number of trainable (not frozen) ResNet layers, counted from the final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.

Video classification

We provide models for action recognition pre-trained on Kinetics-400. They have all been trained with the scripts provided in references/video_classification.

All pre-trained models expect inputs normalized in the same way: mini-batches of 3-channel RGB videos of shape (3 x T x H x W), where H and W are expected to be 112 and T is the number of video frames in a clip. The frames have to be loaded into a range of [0, 1] and then normalized using mean = [0.43216, 0.394666, 0.37645] and std = [0.22803, 0.22145, 0.216989].

Note

The normalization parameters are different from the image classification ones, and correspond to the mean and std from Kinetics-400.

Note

For now, normalization code can be found in references/video_classification/transforms.py; see the Normalize function there. Note that it differs from standard normalization for images because it assumes the video is 4d.
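The per-channel normalization of a 4d clip described above can be sketched with plain tensor broadcasting. This is not the Normalize function from the reference scripts, only a minimal stand-in; the names normalize_clip, KINETICS_MEAN and KINETICS_STD are hypothetical, and the clip is random data.

```python
import torch

# Kinetics-400 statistics quoted in the text above.
KINETICS_MEAN = (0.43216, 0.394666, 0.37645)
KINETICS_STD = (0.22803, 0.22145, 0.216989)


def normalize_clip(clip: torch.Tensor, mean=KINETICS_MEAN, std=KINETICS_STD) -> torch.Tensor:
    """Normalize a clip of shape [3, T, H, W] with values in [0, 1], channel by channel."""
    # Reshape the statistics to [3, 1, 1, 1] so they broadcast over T, H and W.
    mean_t = torch.tensor(mean, dtype=clip.dtype, device=clip.device).view(-1, 1, 1, 1)
    std_t = torch.tensor(std, dtype=clip.dtype, device=clip.device).view(-1, 1, 1, 1)
    return (clip - mean_t) / std_t


# Random stand-in for a 16-frame RGB clip at 112x112.
clip = torch.rand(3, 16, 112, 112)
normalized = normalize_clip(clip)
# Add a leading batch dimension to form a mini-batch of one clip.
batch = normalized.unsqueeze(0)
```

The view(-1, 1, 1, 1) reshape is what distinguishes this from image normalization, where the statistics only need to broadcast over H and W.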

Kinetics 1-crop accuracies for clip length 16 (16x112x112)

Network          Clip acc@1    Clip acc@5
ResNet 3D 18     52.75         75.45
ResNet MC 18     53.90         76.29
ResNet (2+1)D    57.50         78.81

ResNet 3D

torchvision.models.video.r3d_18(pretrained=False, progress=True, **kwargs) [source]

Constructs an 18-layer ResNet3D model as in https://arxiv.org/abs/1711.11248

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on Kinetics-400
  • progress (bool) – If True, displays a progress bar of the download to stderr
Returns

R3D-18 network

Return type

nn.Module

ResNet Mixed Convolution

torchvision.models.video.mc3_18(pretrained=False, progress=True, **kwargs) [source]

Constructs an 18-layer Mixed Convolution network as in https://arxiv.org/abs/1711.11248

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on Kinetics-400
  • progress (bool) – If True, displays a progress bar of the download to stderr
Returns

MC3 Network definition

Return type

nn.Module

ResNet (2+1)D

torchvision.models.video.r2plus1d_18(pretrained=False, progress=True, **kwargs) [source]

Constructs an 18-layer R(2+1)D network as in https://arxiv.org/abs/1711.11248

Parameters
  • pretrained (bool) – If True, returns a model pre-trained on Kinetics-400
  • progress (bool) – If True, displays a progress bar of the download to stderr
Returns

R(2+1)D-18 network

Return type

nn.Module

© 2019 Torch Contributors
Licensed under the 3-clause BSD License.
https://pytorch.org/docs/1.7.0/torchvision/models.html