- What is object detection?
- How is object detection different from object classification?
- Types of object detection algorithms
- Code for object detection using PyTorch
What is object detection?
Object detection is a computer vision technique in which a software system detects, locates, and traces objects in a given image or video. What distinguishes object detection is that it identifies both the class of each object (person, table, chair, etc.) and its location in the image. The location is indicated by drawing a bounding box around the object. The bounding box may or may not locate the object accurately, and how precisely an algorithm localizes objects is a key measure of its performance.
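One common way to quantify how well a predicted bounding box matches the true object location is intersection over union (IoU). The text above does not name a specific metric, so the following is only a minimal sketch of that idea. The function name `iou`, the `(x1, y1, x2, y2)` corner format, and the sample boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (50, 50, 150, 150)       # hypothetical predicted box
ground_truth = (100, 100, 200, 200)  # hypothetical labeled box
score = iou(predicted, ground_truth)
```

A score of 1.0 means the two boxes coincide exactly, and 0.0 means they do not overlap at all.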
These object detection algorithms can be used pre-trained or trained from scratch. In most use cases, we start from the weights of a pre-trained model and fine-tune them for our own requirements.
Labeled data is of paramount importance in these tasks, and every algorithm, when put into practice, requires a lot of well-labeled data. The algorithms need varied data to work well, which can be obtained either by collecting more samples or by augmenting the available data in some form.
Data augmentation is needed when we have limited access to labeled data. With data augmentation, we create new images that show the same content as an existing image but that the algorithm treats as different samples. Let's discuss a particular use case.
Let's say we are given the task of detecting and classifying different types of fruit. The task is to detect both the type of fruit present and the precise coordinates of the fruit in the image. But we have a problem. For training, we have 250 images containing bananas, but only 120 images for apples and oranges. This dataset imbalance can be reduced with data augmentation. We can create synthetic images by distorting the existing ones. For example, we can rotate an image so that the point of view of the objects in the picture changes, and try different rotation angles to create new images. Similarly, we can vary the lighting conditions or sharpness, or shift the image vertically or horizontally, to create images that are digitally different from the existing one.
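The distortions described above can be sketched with a few NumPy operations. This is a minimal illustration, not a full augmentation pipeline: the helper names (`rotate_90`, `adjust_brightness`, `shift`) are hypothetical, rotation is limited to multiples of 90 degrees for simplicity, and a random array stands in for a real photo. For detection tasks, the bounding-box labels would also have to be transformed along with the image, which this sketch does not do.

```python
import numpy as np


def rotate_90(image, times=1):
    """Rotate an H x W x C image counter-clockwise by 90 degrees, `times` times."""
    return np.rot90(image, k=times, axes=(0, 1))


def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor`, keeping them in the uint8 range."""
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)


def shift(image, dx=0, dy=0):
    """Shift the image by (dx, dy) pixels, filling the exposed border with zeros."""
    h, w = image.shape[:2]
    shifted = np.zeros_like(image)
    # Source and destination windows that stay inside the image
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted


rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)  # stand-in image

augmented = [
    rotate_90(original, times=1),
    adjust_brightness(original, 1.3),
    adjust_brightness(original, 0.7),
    shift(original, dx=5, dy=-3),
]
```

Each entry in `augmented` is a new training sample derived from the same source image, which is how a class with fewer images can be brought closer in size to a larger one.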
Now let us see a simple program for object detection using Python. The code is very simple if you ignore the underlying architecture.
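import cv2
import matplotlib.pyplot as plt
import cvlib
from cvlib.object_detection import draw_bbox

# Read the image (OpenCV loads it in BGR channel order)
im = cv2.imread('Vegetable - market.jpg')
# Detect common objects: bounding boxes, class labels, confidence scores
bbox, label, conf = cvlib.detect_common_objects(im)
# Draw the detections on the image
output_image = draw_bbox(im, bbox, label, conf)
# Convert BGR to RGB so that matplotlib shows the correct colors
plt.imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB))
plt.show()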
Here cvlib is the library that provides an object detection function for common objects. The model is trained to detect a variety of common objects like fruits, people, cars, etc.
Every detected object can be seen in the resulting image with a bounding box around it. This is a picture of a vegetable market we picked at random from the internet. You can experiment with your own image. Just change the name of the image in the given code and you are good to go.
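The three values returned by `detect_common_objects` are parallel lists, so they can be post-processed with plain Python. The sketch below keeps only detections above a confidence threshold. The helper name `filter_detections`, the threshold value, and the sample data are all made up for illustration, and the `[x1, y1, x2, y2]` box layout is an assumption about cvlib's output that should be checked against the installed version.

```python
def filter_detections(bbox, label, conf, min_conf=0.5):
    """Keep only detections whose confidence is at least `min_conf`."""
    kept = [
        (box, name, score)
        for box, name, score in zip(bbox, label, conf)
        if score >= min_conf
    ]
    # Unzip back into three parallel lists (empty lists if nothing is kept)
    if not kept:
        return [], [], []
    boxes, names, scores = (list(column) for column in zip(*kept))
    return boxes, names, scores


# Hypothetical detections in the same shape as the cvlib output above
bbox = [[10, 20, 110, 220], [200, 40, 260, 90], [5, 5, 50, 50]]
label = ['banana', 'apple', 'orange']
conf = [0.92, 0.41, 0.77]

boxes, names, scores = filter_detections(bbox, label, conf, min_conf=0.5)
```

The filtered lists can then be passed to `draw_bbox` in place of the originals, so that only the more confident detections are drawn.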
Another simple use case of object detection is face detection. Face detection is a specialized case of object detection in images or videos (a video being a sequence of images). In a general object detection algorithm, the task is to identify objects of particular classes, whether dogs, cats, trees, fruits, cars, etc.
In face detection, we have a database of images with faces, along with the ratios of various distances between facial features. This facial feature data is stored in the database.
When a new object comes in, its features are compared to those of the faces stored in the database. Any feature mismatch disqualifies the object as a face. If all features match, a bounding box is drawn around the detected face.
We will use the same concept, with all the attributes of a face stored in an XML file. We will read each frame from our webcam and, if a face is found in a frame, draw a bounding box around it.
For this we will require the OpenCV module and the haarcascade_frontalface_default.xml file.
We begin by importing the cv2 module. If you have not already installed it, you can do so with the following command.
!pip install opencv-python
import cv2
We then load the XML file which has all data about the facial features.
# Load the cascade
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
We then start capturing video from the webcam.
# To capture video from a webcam.
cap = cv2.VideoCapture(0)
# To use a video file as input
# cap = cv2.VideoCapture('filename.mp4')
The webcam stays active until we press the Escape key. We read each frame and then convert that frame to a grayscale image.
while True:
    # Read the frame
    _, img = cap.read()
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
We then call the detectMultiScale function of OpenCV to detect faces in the frame. It detects multiple faces, so if you hold a mobile phone showing faces in front of the webcam, it detects them as well.
    # Detect the faces (scaleFactor=1.1, minNeighbors=4)
    faces = face_cascade.detectMultiScale(gray, 1.1, 4)
    # Draw the rectangle around each face
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)
    # Display
    cv2.imshow('img', img)
    # Stop if escape key is pressed
    k = cv2.waitKey(30) & 0xff
    if k == 27:
        break
# Release the VideoCapture object
cap.release()
# Close the display window
cv2.destroyAllWindows()
How is object detection different from object classification?
Object classification is a traditional computer vision task that determines the class of the object in an image. It finds out what the object in a given picture or video is, and each result comes with a probability score that serves as its confidence. Unlike object detection, classification does not report where the object is: it outputs only a class label, with no bounding box.
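The difference is easiest to see in the shape of the outputs. The following sketch uses hypothetical result types (`ClassificationResult` and `Detection` are names invented here, not part of any library, and the values are made up) to show that a classifier returns one label per image, while a detector returns a list of labeled boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClassificationResult:
    label: str         # what the image shows
    confidence: float  # probability score for that label


@dataclass
class Detection:
    label: str                      # what the object is
    confidence: float               # probability score for that label
    box: Tuple[int, int, int, int]  # where it is: (x1, y1, x2, y2) in pixels


# Classification: one answer for the whole image
classified = ClassificationResult(label='banana', confidence=0.97)

# Detection: one entry per object found, each with its own location
detected: List[Detection] = [
    Detection(label='banana', confidence=0.95, box=(34, 50, 210, 180)),
    Detection(label='apple', confidence=0.88, box=(240, 60, 330, 150)),
]

labels_in_image = sorted({d.label for d in detected})
```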
Let's perform object classification on the MNIST and Fashion-MNIST datasets to make the topic clearer.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
predictions
Output:
array([[-0.63204 , 0.29606453, 0.24910979, 0.28201205, -0.17138952,
0.3396452 , 0.37800127, -0.9318958 , 0.0439647 , -0.0467336 ]],
dtype=float32)
tf.nn.softmax(predictions).numpy()
Output:
array([[0.05021724, 0.12703504, 0.12120801, 0.12526236, 0.07959959,
0.13269372, 0.1378822 , 0.03720722, 0.09872746, 0.09016712]],
dtype=float32)
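To make the step from raw model outputs (logits) to probabilities concrete, the sketch below reimplements softmax in plain Python and applies it to the logits printed above. It also computes the quantity that a sparse categorical cross-entropy loss on logits is built on, namely the negative log of the probability assigned to the true class. The function names are invented for this illustration, and the true label `5` is a hypothetical value chosen for the example.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, true_class):
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[true_class])


# Logits copied from the model output shown above
logits = [-0.63204, 0.29606453, 0.24910979, 0.28201205, -0.17138952,
          0.3396452, 0.37800127, -0.9318958, 0.0439647, -0.0467336]

probs = softmax(logits)
loss = cross_entropy_from_logits(logits, true_class=5)  # hypothetical label
```

For an untrained model the probabilities are all close to 1/10, so the loss is close to -log(1/10), roughly 2.3.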
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20)
Output:
Epoch 1/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0672 - accuracy: 0.9791
Epoch 2/20
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0580 - accuracy: 0.9811
Epoch 3/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0537 - accuracy: 0.9829
Epoch 4/20
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0472 - accuracy: 0.9851
Epoch 5/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0446 - accuracy: 0.9855
Epoch 6/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0399 - accuracy: 0.9870
Epoch 7/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0403 - accuracy: 0.9857
Epoch 8/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0351 - accuracy: 0.9885
Epoch 9/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0343 - accuracy: 0.9886
Epoch 10/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0347 - accuracy: 0.9880
Epoch 11/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0296 - accuracy: 0.9901
Epoch 12/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0285 - accuracy: 0.9901
Epoch 13/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0288 - accuracy: 0.9902
Epoch 14/20
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0268 - accuracy: 0.9908
Epoch 15/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0277 - accuracy: 0.9901
Epoch 16/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0228 - accuracy: 0.9919
Epoch 17/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0236 - accuracy: 0.9918
Epoch 18/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0233 - accuracy: 0.9920
Epoch 19/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0230 - accuracy: 0.9920
Epoch 20/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0227 - accuracy: 0.9919
<tensorflow.python.keras.callbacks.History at 0x7fa0f06cd390>
model.evaluate(x_test, y_test, verbose=2)
Output:
313/313 - 0s - loss: 0.0765 - accuracy: 0.9762
[0.07645969837903976, 0.9761999845504761]
probability_model = tf.keras.Sequential([
    model,
    tf.keras.layers.Softmax()
])
probability_model(x_test[:5])
Output:
<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[3.9212882e-12, 2.1834714e-19, 1.9253871e-10, 2.2876110e-07,
9.0482010e-19, 1.1011923e-11, 2.5250806e-23, 9.9999976e-01,
1.7883041e-12, 1.3832281e-09],
[1.4191020e-17, 1.3323700e-10, 1.0000000e+00, 7.2097401e-16,
6.5754260e-37, 2.3290989e-16, 8.8370928e-17, 1.0187791e-29,
2.0311796e-18, 0.0000000e+00],
[7.0981394e-17, 9.9999857e-01, 5.5766418e-07, 7.3810041e-11,
4.1638457e-09, 5.4865166e-12, 1.6843820e-12, 7.9530673e-07,
2.9518892e-08, 2.5004247e-15],
[9.9999964e-01, 6.0739493e-21, 1.9297003e-07, 4.0246032e-13,
1.5357564e-12, 2.8772764e-08, 9.8391717e-10, 4.7179654e-08,
3.7541407e-17, 7.9969936e-10],
[9.2232035e-14, 2.7456325e-20, 1.8037905e-14, 7.4756340e-18,
9.9999642e-01, 7.5487475e-15, 6.5344392e-12, 6.5705713e-08,
7.8566824e-13, 3.4821376e-06]], dtype=float32)>
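Each row of this output is a probability distribution over the ten digit classes, so the predicted digit is the index of the largest entry. A minimal sketch, using the first row printed above (the helper name `top_prediction` is hypothetical):

```python
def top_prediction(probabilities):
    """Return (class_index, probability) for the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, probabilities[best]


# First row of the probability_model output shown above
row = [3.9212882e-12, 2.1834714e-19, 1.9253871e-10, 2.2876110e-07,
       9.0482010e-19, 1.1011923e-11, 2.5250806e-23, 9.9999976e-01,
       1.7883041e-12, 1.3832281e-09]

digit, confidence = top_prediction(row)
print(digit, confidence)  # 7 0.99999976
```

With NumPy or TensorFlow arrays, the same result for all rows at once comes from an `argmax` along the last axis.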
The example above is a use case of object classification on MNIST.
Let's see another example, using the Fashion-MNIST dataset.
# TensorFlow and tf.keras
# Based on the TensorFlow examples from Google
import tensorflow as tf
from tensorflow import keras
# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()
train_images[10]
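Output (note that these values already lie in [0, 1], so this output was captured after the pixel values had been divided by 255):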
array([[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.04313725, 0.55686275, 0.78431373,
0.41568627, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.33333333, 0.7254902 ,
0.43921569, 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0.59607843, 0.83921569, 0.85098039, 0.76078431,
0.9254902 , 0.84705882, 0.73333333, 0.58431373, 0.52941176,
0.6 , 0.82745098, 0.85098039, 0.90588235, 0.80392157,
0.85098039, 0.7372549 , 0.13333333, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.25882353, 0.7254902 , 0.65098039, 0.70588235, 0.70980392,
0.74509804, 0.82745098, 0.86666667, 0.77254902, 0.57254902,
0.77647059, 0.80784314, 0.74901961, 0.65882353, 0.74509804,
0.6745098 , 0.7372549 , 0.68627451, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.52941176, 0.6 , 0.62745098, 0.68627451, 0.70588235,
0.66666667, 0.72941176, 0.73333333, 0.74509804, 0.7372549 ,
0.74509804, 0.73333333, 0.68235294, 0.76470588, 0.7254902 ,
0.68235294, 0.63137255, 0.68627451, 0.23137255, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.63137255, 0.57647059, 0.62745098, 0.66666667, 0.69803922,
0.69411765, 0.70588235, 0.65882353, 0.67843137, 0.68235294,
0.67058824, 0.7254902 , 0.72156863, 0.7254902 , 0.6745098 ,
0.67058824, 0.64313725, 0.68235294, 0.47058824, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.00784314,
0.68627451, 0.57254902, 0.56862745, 0.65882353, 0.69803922,
0.70980392, 0.7254902 , 0.70588235, 0.72156863, 0.69803922,
0.70196078, 0.73333333, 0.74901961, 0.75686275, 0.74509804,
0.70980392, 0.67058824, 0.6745098 , 0.61960784, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.1372549 ,
0.69411765, 0.60784314, 0.54901961, 0.59215686, 0.6745098 ,
0.74901961, 0.73333333, 0.72941176, 0.73333333, 0.72941176,
0.73333333, 0.71372549, 0.74901961, 0.76078431, 0.7372549 ,
0.70588235, 0.63137255, 0.63137255, 0.7254902 , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.23137255,
0.66666667, 0.6 , 0.55294118, 0.47058824, 0.60392157,
0.62745098, 0.63137255, 0.6745098 , 0.65882353, 0.65098039,
0.63137255, 0.64705882, 0.6745098 , 0.66666667, 0.64313725,
0.54509804, 0.58431373, 0.63529412, 0.65098039, 0.08235294,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.30980392,
0.56862745, 0.62745098, 0.83921569, 0.48235294, 0.50196078,
0.6 , 0.62745098, 0.64313725, 0.61960784, 0.61568627,
0.60392157, 0.60784314, 0.66666667, 0.64705882, 0.55294118,
0.76470588, 0.75686275, 0.59607843, 0.65098039, 0.23921569,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.39215686,
0.61568627, 0.88235294, 0.96078431, 0.68627451, 0.44313725,
0.68235294, 0.61960784, 0.61960784, 0.62745098, 0.60784314,
0.62745098, 0.64313725, 0.69803922, 0.7372549 , 0.52941176,
0.7254902 , 0.94117647, 0.78823529, 0.6745098 , 0.42352941,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.12156863, 0.68235294, 0.10980392, 0.49411765, 0.6 ,
0.65098039, 0.59607843, 0.61960784, 0.61960784, 0.62745098,
0.63137255, 0.61568627, 0.65882353, 0.74901961, 0.7372549 ,
0.07058824, 0.51764706, 0.62352941, 0.02745098, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.32156863, 0.73333333,
0.62352941, 0.6 , 0.61568627, 0.61960784, 0.63529412,
0.64313725, 0.64313725, 0.60392157, 0.73333333, 0.74509804,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.00392157,
0.01176471, 0.01960784, 0. , 0.14509804, 0.68627451,
0.61960784, 0.60784314, 0.63529412, 0.61960784, 0.62745098,
0.63529412, 0.64705882, 0.6 , 0.69411765, 0.80392157,
0. , 0. , 0.01176471, 0.01176471, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0.00392157, 0. , 0.09803922, 0.68627451,
0.59607843, 0.62745098, 0.61960784, 0.63137255, 0.62745098,
0.64313725, 0.64313725, 0.63137255, 0.65098039, 0.78431373,
0. , 0. , 0.00392157, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0.01568627, 0. , 0.11764706, 0.67058824,
0.57647059, 0.64313725, 0.60784314, 0.64705882, 0.63137255,
0.64705882, 0.63529412, 0.66666667, 0.64313725, 0.63529412,
0. , 0. , 0.00784314, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0.01568627, 0. , 0.22352941, 0.65098039,
0.60784314, 0.64313725, 0.65098039, 0.63137255, 0.63137255,
0.64313725, 0.65490196, 0.64705882, 0.64705882, 0.63529412,
0.10980392, 0. , 0.01176471, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0.01176471, 0. , 0.44705882, 0.63137255,
0.63137255, 0.65098039, 0.62352941, 0.65882353, 0.63137255,
0.63137255, 0.6745098 , 0.63529412, 0.64705882, 0.67058824,
0.19607843, 0. , 0.01960784, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0.00392157, 0. , 0.58431373, 0.61568627,
0.65490196, 0.6745098 , 0.62352941, 0.6745098 , 0.64313725,
0.63137255, 0.6745098 , 0.66666667, 0.62745098, 0.67058824,
0.34901961, 0. , 0.01568627, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.00784314, 0. , 0.01568627, 0.67058824, 0.64313725,
0.65098039, 0.67843137, 0.62352941, 0.70196078, 0.65098039,
0.62745098, 0.68235294, 0.65490196, 0.63529412, 0.65098039,
0.50196078, 0. , 0.00784314, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01176471, 0. , 0.07058824, 0.59607843, 0.67843137,
0.62745098, 0.70196078, 0.60392157, 0.70980392, 0.65098039,
0.64313725, 0.68627451, 0.66666667, 0.65098039, 0.66666667,
0.64313725, 0. , 0. , 0.00392157, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01568627, 0. , 0.18431373, 0.64705882, 0.6745098 ,
0.65490196, 0.7254902 , 0.6 , 0.73333333, 0.67843137,
0.64705882, 0.68235294, 0.70196078, 0.65098039, 0.65098039,
0.61960784, 0.01960784, 0. , 0.01176471, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01568627, 0. , 0.34117647, 0.70588235, 0.63529412,
0.70196078, 0.70196078, 0.61568627, 0.74901961, 0.71372549,
0.64705882, 0.65882353, 0.74509804, 0.67843137, 0.64705882,
0.65098039, 0.07843137, 0. , 0.01568627, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01568627, 0. , 0.41176471, 0.73333333, 0.61568627,
0.76078431, 0.68627451, 0.63137255, 0.74509804, 0.72156863,
0.66666667, 0.61960784, 0.80392157, 0.69411765, 0.65882353,
0.67058824, 0.17254902, 0. , 0.01568627, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01960784, 0. , 0.54117647, 0.70980392, 0.61960784,
0.80392157, 0.62745098, 0.65490196, 0.74509804, 0.77647059,
0.65490196, 0.59607843, 0.85490196, 0.72941176, 0.66666667,
0.6745098 , 0.22352941, 0. , 0.01960784, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01960784, 0. , 0.52941176, 0.68235294, 0.65490196,
0.78039216, 0.60784314, 0.65098039, 0.78823529, 0.85882353,
0.64705882, 0.61960784, 0.85490196, 0.7372549 , 0.65490196,
0.68627451, 0.21960784, 0. , 0.02745098, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01960784, 0. , 0.50588235, 0.67058824, 0.6745098 ,
0.69411765, 0.6 , 0.62352941, 0.80784314, 0.84705882,
0.58039216, 0.61568627, 0.80784314, 0.74509804, 0.64705882,
0.68627451, 0.18823529, 0. , 0.01960784, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01960784, 0. , 0.65490196, 0.73333333, 0.71372549,
0.77647059, 0.76078431, 0.78431373, 0.88627451, 0.94117647,
0.72156863, 0.80784314, 1. , 0.77254902, 0.69803922,
0.70196078, 0.16470588, 0. , 0.01960784, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.01176471, 0. , 0.45098039, 0.52941176, 0.44313725,
0.41568627, 0.33333333, 0.32156863, 0.42352941, 0.52156863,
0.3254902 , 0.35294118, 0.4745098 , 0.47058824, 0.43137255,
0.61960784, 0.07058824, 0. , 0.01176471, 0. ,
0. , 0. , 0. ]])
# Scale pixel values to the range [0, 1].
# Run this step only once: repeating it divides by 255 again.
x = 255.0
train_images = train_images / x
test_images = test_images / x
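Because dividing by 255 more than once silently shrinks the inputs, it can help to make the scaling step safe to repeat. The sketch below is one possible guard, not part of the original code: the helper name `scale_to_unit_range` is hypothetical, and it assumes the raw images arrive as `uint8` arrays, as Fashion-MNIST does when loaded through Keras. A small random array stands in for the dataset.

```python
import numpy as np


def scale_to_unit_range(images):
    """Scale uint8 images to float values in [0, 1]; leave float arrays untouched."""
    if images.dtype == np.uint8:
        return images.astype(np.float32) / 255.0
    # Already converted to float, so assume scaling has been applied before
    return images


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)  # stand-in data

once = scale_to_unit_range(raw)
twice = scale_to_unit_range(once)  # second call is a no-op
```

The design choice here is to key the guard on the array's dtype rather than on its value range, since a very dark image could legitimately have a small maximum value even before scaling.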
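plt.figure(figsize=(15, 15))
# Pick 100 random images from the first 1000 training samples
j = np.random.randint(0, 1000, 100)
for i in range(100):
    plt.subplot(10, 10, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[j[i]], cmap=plt.cm.binary)
    # Label each image with the class of the same sample that is shown
    plt.xlabel(class_names[train_labels[j[i]]])
plt.show()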
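# After scaling
train_images[10]
Output (the values below are on the order of 1e-5 rather than in [0, 1], which indicates that the scaling step had been applied more than once when this output was captured):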
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 6.63394924e-07,
8.56382538e-06, 1.20617259e-05, 6.39271472e-06, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 5.12623350e-06, 1.11570964e-05,
6.75456649e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 9.16691167e-06, 1.29060467e-05,
1.30869726e-05, 1.16998741e-05, 1.42328365e-05, 1.30266640e-05,
1.12777137e-05, 8.98598578e-06, 8.14166497e-06, 9.22722030e-06,
1.27251208e-05, 1.30869726e-05, 1.39312934e-05, 1.23632690e-05,
1.30869726e-05, 1.13380223e-05, 2.05049340e-06, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.98036954e-06, 1.11570964e-05, 1.00112325e-05,
1.08555533e-05, 1.09158619e-05, 1.14586396e-05, 1.27251208e-05,
1.33282071e-05, 1.18808000e-05, 8.80505989e-06, 1.19411086e-05,
1.24235777e-05, 1.15189482e-05, 1.01318497e-05, 1.14586396e-05,
1.03730843e-05, 1.13380223e-05, 1.05540101e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 8.14166497e-06, 9.22722030e-06, 9.64938071e-06,
1.05540101e-05, 1.08555533e-05, 1.02524670e-05, 1.12174051e-05,
1.12777137e-05, 1.14586396e-05, 1.13380223e-05, 1.14586396e-05,
1.12777137e-05, 1.04937015e-05, 1.17601827e-05, 1.11570964e-05,
1.04937015e-05, 9.70968934e-06, 1.05540101e-05, 3.55820914e-06,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 9.70968934e-06, 8.86536852e-06, 9.64938071e-06,
1.02524670e-05, 1.07349360e-05, 1.06746274e-05, 1.08555533e-05,
1.01318497e-05, 1.04333929e-05, 1.04937015e-05, 1.03127756e-05,
1.11570964e-05, 1.10967878e-05, 1.11570964e-05, 1.03730843e-05,
1.03127756e-05, 9.89061522e-06, 1.04937015e-05, 7.23703553e-06,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.20617259e-07, 1.05540101e-05, 8.80505989e-06, 8.74475126e-06,
1.01318497e-05, 1.07349360e-05, 1.09158619e-05, 1.11570964e-05,
1.08555533e-05, 1.10967878e-05, 1.07349360e-05, 1.07952447e-05,
1.12777137e-05, 1.15189482e-05, 1.16395655e-05, 1.14586396e-05,
1.09158619e-05, 1.03127756e-05, 1.03730843e-05, 9.52876345e-06,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
2.11080203e-06, 1.06746274e-05, 9.34783756e-06, 8.44320812e-06,
9.10660304e-06, 1.03730843e-05, 1.15189482e-05, 1.12777137e-05,
1.12174051e-05, 1.12777137e-05, 1.12174051e-05, 1.12777137e-05,
1.09761706e-05, 1.15189482e-05, 1.16998741e-05, 1.13380223e-05,
1.08555533e-05, 9.70968934e-06, 9.70968934e-06, 1.11570964e-05,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
3.55820914e-06, 1.02524670e-05, 9.22722030e-06, 8.50351675e-06,
7.23703553e-06, 9.28752893e-06, 9.64938071e-06, 9.70968934e-06,
1.03730843e-05, 1.01318497e-05, 1.00112325e-05, 9.70968934e-06,
9.95092385e-06, 1.03730843e-05, 1.02524670e-05, 9.89061522e-06,
8.38289949e-06, 8.98598578e-06, 9.76999796e-06, 1.00112325e-05,
1.26648122e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
4.76438172e-06, 8.74475126e-06, 9.64938071e-06, 1.29060467e-05,
7.41796142e-06, 7.71950456e-06, 9.22722030e-06, 9.64938071e-06,
9.89061522e-06, 9.52876345e-06, 9.46845482e-06, 9.28752893e-06,
9.34783756e-06, 1.02524670e-05, 9.95092385e-06, 8.50351675e-06,
1.17601827e-05, 1.16395655e-05, 9.16691167e-06, 1.00112325e-05,
3.67882639e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
6.03086294e-06, 9.46845482e-06, 1.35694416e-05, 1.47756142e-05,
1.05540101e-05, 6.81487512e-06, 1.04937015e-05, 9.52876345e-06,
9.52876345e-06, 9.64938071e-06, 9.34783756e-06, 9.64938071e-06,
9.89061522e-06, 1.07349360e-05, 1.13380223e-05, 8.14166497e-06,
1.11570964e-05, 1.44740711e-05, 1.21220345e-05, 1.03730843e-05,
6.51333198e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.86956751e-06, 1.04937015e-05, 1.68864162e-06,
7.59888731e-06, 9.22722030e-06, 1.00112325e-05, 9.16691167e-06,
9.52876345e-06, 9.52876345e-06, 9.64938071e-06, 9.70968934e-06,
9.46845482e-06, 1.01318497e-05, 1.15189482e-05, 1.13380223e-05,
1.08555533e-06, 7.96073908e-06, 9.58907208e-06, 4.22160406e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
4.94530761e-06, 1.12777137e-05, 9.58907208e-06, 9.22722030e-06,
9.46845482e-06, 9.52876345e-06, 9.76999796e-06, 9.89061522e-06,
9.89061522e-06, 9.28752893e-06, 1.12777137e-05, 1.14586396e-05,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
6.03086294e-08, 1.80925888e-07, 3.01543147e-07, 0.00000000e+00,
2.23141929e-06, 1.05540101e-05, 9.52876345e-06, 9.34783756e-06,
9.76999796e-06, 9.52876345e-06, 9.64938071e-06, 9.76999796e-06,
9.95092385e-06, 9.22722030e-06, 1.06746274e-05, 1.23632690e-05,
0.00000000e+00, 0.00000000e+00, 1.80925888e-07, 1.80925888e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 6.03086294e-08, 0.00000000e+00,
1.50771574e-06, 1.05540101e-05, 9.16691167e-06, 9.64938071e-06,
9.52876345e-06, 9.70968934e-06, 9.64938071e-06, 9.89061522e-06,
9.89061522e-06, 9.70968934e-06, 1.00112325e-05, 1.20617259e-05,
0.00000000e+00, 0.00000000e+00, 6.03086294e-08, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 2.41234518e-07, 0.00000000e+00,
1.80925888e-06, 1.03127756e-05, 8.86536852e-06, 9.89061522e-06,
9.34783756e-06, 9.95092385e-06, 9.70968934e-06, 9.95092385e-06,
9.76999796e-06, 1.02524670e-05, 9.89061522e-06, 9.76999796e-06,
0.00000000e+00, 0.00000000e+00, 1.20617259e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 2.41234518e-07, 0.00000000e+00,
3.43759188e-06, 1.00112325e-05, 9.34783756e-06, 9.89061522e-06,
1.00112325e-05, 9.70968934e-06, 9.70968934e-06, 9.89061522e-06,
1.00715411e-05, 9.95092385e-06, 9.95092385e-06, 9.76999796e-06,
1.68864162e-06, 0.00000000e+00, 1.80925888e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 1.80925888e-07, 0.00000000e+00,
6.87518375e-06, 9.70968934e-06, 9.70968934e-06, 1.00112325e-05,
9.58907208e-06, 1.01318497e-05, 9.70968934e-06, 9.70968934e-06,
1.03730843e-05, 9.76999796e-06, 9.95092385e-06, 1.03127756e-05,
3.01543147e-06, 0.00000000e+00, 3.01543147e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 6.03086294e-08, 0.00000000e+00,
8.98598578e-06, 9.46845482e-06, 1.00715411e-05, 1.03730843e-05,
9.58907208e-06, 1.03730843e-05, 9.89061522e-06, 9.70968934e-06,
1.03730843e-05, 1.02524670e-05, 9.64938071e-06, 1.03127756e-05,
5.36746802e-06, 0.00000000e+00, 2.41234518e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.20617259e-07, 0.00000000e+00, 2.41234518e-07,
1.03127756e-05, 9.89061522e-06, 1.00112325e-05, 1.04333929e-05,
9.58907208e-06, 1.07952447e-05, 1.00112325e-05, 9.64938071e-06,
1.04937015e-05, 1.00715411e-05, 9.76999796e-06, 1.00112325e-05,
7.71950456e-06, 0.00000000e+00, 1.20617259e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.80925888e-07, 0.00000000e+00, 1.08555533e-06,
9.16691167e-06, 1.04333929e-05, 9.64938071e-06, 1.07952447e-05,
9.28752893e-06, 1.09158619e-05, 1.00112325e-05, 9.89061522e-06,
1.05540101e-05, 1.02524670e-05, 1.00112325e-05, 1.02524670e-05,
9.89061522e-06, 0.00000000e+00, 0.00000000e+00, 6.03086294e-08,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.41234518e-07, 0.00000000e+00, 2.83450558e-06,
9.95092385e-06, 1.03730843e-05, 1.00715411e-05, 1.11570964e-05,
9.22722030e-06, 1.12777137e-05, 1.04333929e-05, 9.95092385e-06,
1.04937015e-05, 1.07952447e-05, 1.00112325e-05, 1.00112325e-05,
9.52876345e-06, 3.01543147e-07, 0.00000000e+00, 1.80925888e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.41234518e-07, 0.00000000e+00, 5.24685076e-06,
1.08555533e-05, 9.76999796e-06, 1.07952447e-05, 1.07952447e-05,
9.46845482e-06, 1.15189482e-05, 1.09761706e-05, 9.95092385e-06,
1.01318497e-05, 1.14586396e-05, 1.04333929e-05, 9.95092385e-06,
1.00112325e-05, 1.20617259e-06, 0.00000000e+00, 2.41234518e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.41234518e-07, 0.00000000e+00, 6.33240609e-06,
1.12777137e-05, 9.46845482e-06, 1.16998741e-05, 1.05540101e-05,
9.70968934e-06, 1.14586396e-05, 1.10967878e-05, 1.02524670e-05,
9.52876345e-06, 1.23632690e-05, 1.06746274e-05, 1.01318497e-05,
1.03127756e-05, 2.65357969e-06, 0.00000000e+00, 2.41234518e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.01543147e-07, 0.00000000e+00, 8.32259086e-06,
1.09158619e-05, 9.52876345e-06, 1.23632690e-05, 9.64938071e-06,
1.00715411e-05, 1.14586396e-05, 1.19411086e-05, 1.00715411e-05,
9.16691167e-06, 1.31472812e-05, 1.12174051e-05, 1.02524670e-05,
1.03730843e-05, 3.43759188e-06, 0.00000000e+00, 3.01543147e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.01543147e-07, 0.00000000e+00, 8.14166497e-06,
1.04937015e-05, 1.00715411e-05, 1.20014173e-05, 9.34783756e-06,
1.00112325e-05, 1.21220345e-05, 1.32075898e-05, 9.95092385e-06,
9.52876345e-06, 1.31472812e-05, 1.13380223e-05, 1.00715411e-05,
1.05540101e-05, 3.37728325e-06, 0.00000000e+00, 4.22160406e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.01543147e-07, 0.00000000e+00, 7.77981319e-06,
1.03127756e-05, 1.03730843e-05, 1.06746274e-05, 9.22722030e-06,
9.58907208e-06, 1.24235777e-05, 1.30266640e-05, 8.92567715e-06,
9.46845482e-06, 1.24235777e-05, 1.14586396e-05, 9.95092385e-06,
1.05540101e-05, 2.89481421e-06, 0.00000000e+00, 3.01543147e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.01543147e-07, 0.00000000e+00, 1.00715411e-05,
1.12777137e-05, 1.09761706e-05, 1.19411086e-05, 1.16998741e-05,
1.20617259e-05, 1.36297502e-05, 1.44740711e-05, 1.10967878e-05,
1.24235777e-05, 1.53787005e-05, 1.18808000e-05, 1.07349360e-05,
1.07952447e-05, 2.53296244e-06, 0.00000000e+00, 3.01543147e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.80925888e-07, 0.00000000e+00, 6.93549238e-06,
8.14166497e-06, 6.81487512e-06, 6.39271472e-06, 5.12623350e-06,
4.94530761e-06, 6.51333198e-06, 8.02104771e-06, 5.00561624e-06,
5.42777665e-06, 7.29734416e-06, 7.23703553e-06, 6.63394924e-06,
9.52876345e-06, 1.08555533e-06, 0.00000000e+00, 1.80925888e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=20)
Output:
Train on 60000 samples
Epoch 1/20
60000/60000 [==============================] – 10s 160us/sample – loss: 2.3026 – accuracy: 0.1003
Epoch 2/20
60000/60000 [==============================] – 10s 159us/sample – loss: 2.3004 – accuracy: 0.1214
Epoch 3/20
60000/60000 [==============================] – 9s 156us/sample – loss: 2.2912 – accuracy: 0.1519
Epoch 4/20
60000/60000 [==============================] – 9s 150us/sample – loss: 2.2710 – accuracy: 0.1963
Epoch 5/20
60000/60000 [==============================] – 9s 156us/sample – loss: 2.2398 – accuracy: 0.2117
Epoch 6/20
60000/60000 [==============================] – 9s 147us/sample – loss: 2.2013 – accuracy: 0.2122
Epoch 7/20
60000/60000 [==============================] – 9s 148us/sample – loss: 2.1584 – accuracy: 0.2189
Epoch 8/20
60000/60000 [==============================] – 9s 146us/sample – loss: 2.1131 – accuracy: 0.2311
Epoch 9/20
60000/60000 [==============================] – 9s 154us/sample – loss: 2.0680 – accuracy: 0.2320
Epoch 10/20
60000/60000 [==============================] – 9s 146us/sample – loss: 2.0240 – accuracy: 0.2304
Epoch 11/20
60000/60000 [==============================] – 9s 144us/sample – loss: 1.9825 – accuracy: 0.2491
Epoch 12/20
60000/60000 [==============================] – 9s 149us/sample – loss: 1.9438 – accuracy: 0.2526
Epoch 13/20
60000/60000 [==============================] – 8s 129us/sample – loss: 1.9083 – accuracy: 0.2649
Epoch 14/20
60000/60000 [==============================] – 8s 128us/sample – loss: 1.8761 – accuracy: 0.2816
Epoch 15/20
60000/60000 [==============================] – 8s 129us/sample – loss: 1.8466 – accuracy: 0.3038
Epoch 16/20
60000/60000 [==============================] – 8s 135us/sample – loss: 1.8195 – accuracy: 0.2962
Epoch 17/20
60000/60000 [==============================] – 8s 128us/sample – loss: 1.7948 – accuracy: 0.3250
Epoch 18/20
60000/60000 [==============================] – 8s 127us/sample – loss: 1.7716 – accuracy: 0.3496
Epoch 19/20
60000/60000 [==============================] – 8s 130us/sample – loss: 1.7495 – accuracy: 0.3587
Epoch 20/20
60000/60000 [==============================] – 8s 129us/sample – loss: 1.7280 – accuracy: 0.3801
<tensorflow.python.keras.callbacks.History at 0x1e9515f2088>
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy:', test_acc)
10000/10000 – 2s – loss: 1.7180 – accuracy: 0.4055
Test accuracy: 0.4055
probability_model = tf.keras.Sequential([model,
                                         tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
predictions[10]
array([0.12555718, 0.13396162, 0.14664494, 0.13513418, 0.14349538,
0.02516511, 0.14363666, 0.01587282, 0.10545293, 0.02507922],
dtype=float32)
Output:
np.argmax(predictions[10])
test_labels[10]
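Because the model was compiled with from_logits=True, its raw outputs are unnormalized scores, and the Softmax layer wrapped around it is what turns them into probabilities. As a rough illustration of that step, here is a minimal pure-Python sketch. The softmax function and the sample logits are made up for this example and are not part of Keras.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; it does not change the result.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class problem
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # plays the same role as np.argmax
```

The largest logit always ends up with the largest probability, which is why taking the argmax of the probabilities gives the predicted class.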
def plot_image(i, predictions_array, true_label, img):
    predictions_array, true_label, img = predictions_array, true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img, cmap=plt.cm.binary)
    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'
    plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                         100*np.max(predictions_array),
                                         class_names[true_label]),
               color=color)
def plot_value_array(i, predictions_array, true_label):
    predictions_array, true_label = predictions_array, true_label[i]
    plt.grid(False)
    plt.xticks(range(10))
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)
    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')
i = 10
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(i, predictions[i], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(i, predictions[i], test_labels)
plt.show()
i = 122
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(i, predictions[i], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(i, predictions[i], test_labels)
plt.show()
# Plot a random sample of test images (drawn from the first 1000),
# their predicted labels, and the true labels.
# Color correct predictions in blue and incorrect predictions in red.
num_rows = 7
num_cols = 7
j = np.random.randint(0, 1000, num_rows*num_cols)
num_images = num_rows*num_cols
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for i in range(num_images):
    plt.subplot(num_rows, 2*num_cols, 2*i+1)
    plot_image(j[i], predictions[j[i]], test_labels, test_images)
    plt.subplot(num_rows, 2*num_cols, 2*i+2)
    plot_value_array(j[i], predictions[j[i]], test_labels)
plt.tight_layout()
plt.show()
Types of Object Detection Algorithms
1. Region-based Convolutional Neural Networks (R-CNN):
Object detection can be framed as a classification problem: a classifier is run on many candidate patches of the image, so the success of the model depends on how accurately it classifies all of them. The general idea is to use CNNs as the classifier. The problem is that CNNs are slow and computationally expensive, so it is not feasible to run them on the very large number of patches generated by a sliding window detector.
R-CNN was introduced to solve this problem. It uses an object proposal algorithm called Selective Search to reduce the number of bounding boxes fed to the classifier to a maximum of 2000 region proposals. Selective Search uses features such as texture, pixel intensity, and color to propose likely object locations in an image. These boxes can then be fed to our CNN-based classifier. The steps are:
- We run Selective Search to generate probable objects.
- These patches are then fed to a CNN, followed by an SVM classifier that predicts the class of the object in each patch.
- We then refine the boxes by training a separate model for bounding box regression.
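To get a feel for why a sliding-window detector is expensive, the following sketch counts how many patches it would hand to the classifier. The image size, window shapes, and stride are hypothetical values chosen only for illustration; count_sliding_windows is not a library function.

```python
def count_sliding_windows(image_w, image_h, window_sizes, stride):
    """Count the patches a sliding-window detector would have to classify."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit in the image
        steps_x = (image_w - win_w) // stride + 1
        steps_y = (image_h - win_h) // stride + 1
        total += steps_x * steps_y
    return total

# Hypothetical setup: a 640x480 image, three window shapes, stride of 8 pixels
windows = [(64, 64), (128, 128), (128, 256)]
n_patches = count_sliding_windows(640, 480, windows, stride=8)
max_proposals = 2000  # upper bound on Selective Search proposals quoted above
```

Even with only three window shapes, this toy setup yields several thousand patches per image, well above the 2000 proposals that R-CNN classifies.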
2. Fast R-CNN:
Fast R-CNN was introduced because R-CNN architectures were very slow. It builds on the concepts of R-CNN but makes a few architectural changes. First, it applies spatial pooling (RoI pooling) to each proposed region, and the back-propagation through this layer is calculated in much the same way as for max-pooling, so gradients can flow through the whole network.
Second, bounding box regression was added to the neural network training instead of being done separately. This gives the network two heads: a classification head and a bounding box regression head.
These two changes reduced the overall training time and increased the accuracy.
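The spatial pooling step can be sketched on a single-channel feature map as follows. This is a simplified illustration under stated assumptions: the region is given in integer feature-map coordinates with exclusive right and bottom edges, and roi_max_pool is a made-up name. Real implementations work on multi-channel tensors and on regions projected from image coordinates.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1) into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi  # covers columns x0..x1-1 and rows y0..y1-1
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin edges: floor for the start, ceil for the end, so bins cover the region
        r_start = y0 + (i * h) // out_h
        r_end = y0 + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c_start = x0 + (j * w) // out_w
            c_end = x0 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled

# A 4x4 feature map with values 0..15, pooled to a fixed 2x2 output
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(feature_map, (0, 0, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the region, the output grid has a fixed size, which is what lets regions of different shapes share the same fully connected layers.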
3. Faster R-CNN:
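Faster R-CNN is an improvement over Fast R-CNN. It replaces Selective Search with a Region Proposal Network (RPN) that is trained together with the detector and shares its convolutional features.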
Apart from these, two more networks are very popular:
- YOLO
- SSD
A comparative graph of performances of all networks.
SSD seems to be a good choice, as we are able to run it on video and the accuracy trade-off is very small. However, it may not be that simple. Consider the chart that compares the performance of SSD, YOLO, and Faster R-CNN on objects of various sizes. For large objects, SSD performs similarly to Faster R-CNN. For small objects, however, the accuracy gap widens.
YOLO vs SSD vs Faster-RCNN for various sizes
Code for object detection using PyTorch
Defining the Dataset
To define the dataset, we inherit from the torch.utils.data.Dataset class and implement __len__ and __getitem__.
The reference scripts for training object detection, instance segmentation, and person keypoint detection make it easy to add new custom datasets.
Our class should return the following values from __getitem__:
image: a PIL image of size (H, W).
target: a dictionary which contains the following keys:
- boxes (FloatTensor[N, 4]): the coordinates of the N bounding boxes in [x0, y0, x1, y1] format, with x ranging from 0 to W and y from 0 to H
- labels (Int64Tensor[N]): the label for each bounding box. 0 is reserved for the background class.
- image_id (Int64Tensor[1]): an image identifier. It should be unique across all the images in the dataset and is used during evaluation.
- area (Tensor[N]): the area of each bounding box, calculated from its coordinates. The COCO metric uses it during evaluation to report separate scores for small, medium, and large boxes.
- iscrowd (UInt8Tensor[N]): instances with iscrowd=True are ignored during evaluation
- masks (UInt8Tensor[N, H, W]): the segmentation mask for each of the objects (optional)
- keypoints (FloatTensor[N, K, 3]): for each of the N objects, the K keypoints in [x, y, visibility] format that define the object (optional). visibility=0 means that the keypoint is not visible. Note that for data augmentation, how a keypoint should be flipped depends on the data representation, so we will probably need to adapt references/detection/transforms.py for any new keypoint representation.
If our dataset returns the values specified above, it will work for both the training and evaluation phases, and it will use the evaluation scripts from pycocotools.
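To make the role of the area entry concrete, here is a small sketch that computes a box area and assigns it to a size bucket. The thresholds of 32^2 and 96^2 pixels follow the COCO convention as I understand it. The helper names box_area and coco_size_bucket are made up for this example and are not part of pycocotools.

```python
def box_area(box):
    """Area of a box given in [x0, y0, x1, y1] format."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)

def coco_size_bucket(area):
    """Assumed COCO convention: small < 32^2, medium < 96^2, otherwise large."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

# Three hypothetical boxes of increasing size
boxes = [[10, 10, 30, 40], [0, 0, 64, 64], [5, 5, 205, 305]]
buckets = [coco_size_bucket(box_area(b)) for b in boxes]
```

These buckets correspond to the area=small, area=medium, and area=large rows in the evaluation output.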
A note on the labels:
The model treats class 0 as background. If the dataset does not have a background class, 0 should not appear in our labels. For instance, assuming we have only two classes, cat and dog, we define 1 (and not 0) to represent cats and 2 to represent dogs. So, for example, if an image contains both classes, its labels tensor will look like [1, 2].
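One simple way to keep 0 free for the background is to number the classes from 1. The sketch below assumes the class names are available as a plain list. build_label_map is a hypothetical helper, not a torchvision function.

```python
def build_label_map(class_names):
    """Map class names to integer labels, keeping 0 free for background."""
    return {name: index for index, name in enumerate(class_names, start=1)}

label_map = build_label_map(["cat", "dog"])
# Labels for an image that contains one cat and one dog
labels = [label_map[name] for name in ["cat", "dog"]]
```

With this mapping, the number of classes passed to the model is len(label_map) + 1, because the background counts as a class.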
Also, if we want to use aspect ratio grouping during training (so that each batch contains only images with similar aspect ratios), it is advisable to also implement a get_height_and_width method, which returns the height and width of an image. If this method is not defined, all the elements of the dataset are queried via __getitem__, which loads each image into memory and is slower than a custom method.
Writing a custom dataset
Let’s write a dataset for the PennFudan dataset. First, we have to download and extract the dataset as described in the official PyTorch documentation. After extracting the zip file, we have the following directory structure:
PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    …
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png
So each image has a corresponding segmentation mask, where each color corresponds to a different instance. Let’s write a torch.utils.data.Dataset class for this dataset.
# importing libraries
import os  # os for folder operations
import numpy as np
import torch  # PyTorch library
from PIL import Image  # for image operations
class PennFudanDataset(torch.utils.data.Dataset):  # returns the values specified above
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all the image files, sorting them to ensure that images and masks are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))
    def __getitem__(self, idx):
        # load the image and its mask
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we do not convert the mask to RGB,
        # because each color corresponds to a different instance,
        # with 0 representing the background
        mask = Image.open(mask_path)
        # convert the PIL Image into a numpy array
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # the first id is the background (0), so we remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks, one per instance
        masks = mask == obj_ids[:, None, None]
        # get the bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        # convert everything to a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # only one class (person) is present
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # assume that no instance is a crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target
    def __len__(self):
        return len(self.imgs)
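To see what the mask-to-box step computes, here is a plain-Python equivalent run on a tiny hand-made mask. It is only a sketch: the dataset class uses NumPy for this, and both boxes_from_instance_mask and toy_mask are invented for the illustration.

```python
def boxes_from_instance_mask(mask):
    """Return {instance_id: [xmin, ymin, xmax, ymax]} for a 2D mask of integer ids."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue  # 0 is the background
            if obj_id not in boxes:
                boxes[obj_id] = [x, y, x, y]
            else:
                box = boxes[obj_id]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return boxes

# Two instances: id 1 in the top-left area, id 2 along the right edge
toy_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
    [0, 0, 0, 2, 2],
]
toy_boxes = boxes_from_instance_mask(toy_mask)
```

Each instance id yields one box, which is why the number of boxes equals the number of distinct non-zero values in the mask.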
Now we have the dataset in the desired format. Next, we define a model that can make predictions on it.
Model Definition
In this code demonstration, we use Mask R-CNN, which is built on top of Faster R-CNN. Faster R-CNN is an object detection model that predicts both bounding boxes and class scores for each potential object in the image.
Mask R-CNN, an instance segmentation technique, adds an extra branch to Faster R-CNN that also predicts a segmentation mask for each instance.
There are two common situations where we might want to modify one of the models available in the torchvision model zoo. The first is when we want to start from a pre-trained model and fine-tune just the last layer. The other is when we want to replace the backbone of the model with a different one (for faster predictions, for example).
In the following sections we look at both scenarios:
1. Fine-tuning a pre-trained model
Let’s assume that we want to start from a model pre-trained on the COCO dataset and fine-tune it for our particular classes. Here is one way of doing it:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
# loading a model pre-trained on the COCO dataset, resnet50 in this case
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# then we replace the box classifier head with a new one that has
# the number of classes defined by the user
num_classes = 2  # 1 class (person) + background
# specify number of input features required by the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the already existing head with new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
2. Modifying the model to add a different backbone
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
# we initially load a pre-trained model
# we only return the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN requires the number of output channels in the backbone.
# For mobilenet_v2, it’s 1280 so we need to add it in our model
backbone.out_channels = 1280
Now let’s make the RPN generate 5 x 3 anchors for each spatial location, with 5 different sizes and 3 different aspect ratios. We pass a Tuple[Tuple[int]] because each feature map could have different sizes and aspect ratios.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
# Now let's define the feature maps that will be used to perform the ROI cropping,
# as well as the size of the crop after rescaling.
# If the backbone returns a Tensor, featmap_names is expected to be ['0']
# (recent torchvision versions expect string names here).
# More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                output_size=7,
                                                sampling_ratio=2)
# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
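The "5 x 3 anchors" passed to the anchor generator can be enumerated with a few lines of plain Python. This sketch assumes that an aspect ratio means height divided by width and that each anchor keeps an area of about size * size. torchvision's AnchorGenerator additionally rounds and centers the anchors, so treat anchor_shapes as an illustration only, not as the library's implementation.

```python
import math

def anchor_shapes(sizes, aspect_ratios):
    """Return (width, height) pairs, one per size/ratio combination.

    Assumes aspect ratio = height / width and an area of size * size.
    """
    shapes = []
    for size in sizes:
        for ratio in aspect_ratios:
            height = size * math.sqrt(ratio)
            width = size / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes

# Same sizes and aspect ratios as in the AnchorGenerator call
shapes = anchor_shapes((32, 64, 128, 256, 512), (0.5, 1.0, 2.0))
```

The list contains 15 shapes, one for every combination of size and aspect ratio, and this set is repeated at every spatial location of the feature map.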
An instance segmentation model for the PennFudan dataset
In our case, we want to fine-tune a pre-trained model, and since our dataset is very small, we will follow approach 1.
Here we will also compute the instance segmentation masks, so we use a Mask R-CNN type of model.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on the COCO dataset
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)
    return model
Now our model is ready to be trained and evaluated on our custom dataset.
Putting everything together
In references/detection/, we have a number of helper functions to simplify training and evaluating detection models. Here, we will use references/detection/engine.py, references/detection/utils.py and references/detection/transforms.py. Just copy them to your folder and use them here.
Let’s write some helper functions for data augmentation/transformation:
import transforms as T
def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
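A horizontal flip has to be applied to the target as well as to the image, which is why the detection transforms take both. The sketch below shows the box arithmetic only. hflip_boxes is a hypothetical helper written for this example and is not the code in references/detection/transforms.py.

```python
def hflip_boxes(boxes, image_width):
    """Mirror [x0, y0, x1, y1] boxes around the vertical center line of the image."""
    flipped = []
    for x0, y0, x1, y1 in boxes:
        # The old right edge becomes the new left edge, and vice versa.
        flipped.append([image_width - x1, y0, image_width - x0, y1])
    return flipped

# One hypothetical box in a 100-pixel-wide image
boxes = [[10, 20, 50, 80]]
flipped = hflip_boxes(boxes, image_width=100)
```

Swapping the two x coordinates keeps x0 smaller than x1 after the flip, so the boxes stay in the [x0, y0, x1, y1] format that the model expects.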
Testing forward() method (Optional)
Before iterating over the dataset, it’s good to see what the model expects during training and inference time on sample data.
import utils  # references/detection/utils.py, copied into the working folder
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)
# For training
images, targets = next(iter(data_loader))
images = list(image for image in images)
targets = [{k: v for k, v in t.items()} for t in targets]
output = model(images, targets)  # in training mode, returns a dict of losses
# For inference
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)  # in eval mode, returns the predictions
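The data loader needs a custom collate_fn because images in a detection batch can have different sizes and each target holds a different number of boxes, so they cannot be stacked into a single tensor. To my knowledge, the helper in the reference utils.py amounts to the one-liner below, but treat that as an assumption. The string "images" are stand-ins so that the sketch runs without PyTorch.

```python
def collate_fn(batch):
    """Turn [(img, target), ...] into ((img, ...), (target, ...)) without stacking."""
    return tuple(zip(*batch))

# Stand-ins for two samples whose images have different sizes
sample_a = ("image_300x400", {"labels": [1]})
sample_b = ("image_500x400", {"labels": [1, 1]})
images, targets = collate_fn([sample_a, sample_b])
```

The result is a tuple of images and a tuple of targets, which is the form in which the model receives them during training.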
Let’s now write the main function which performs the training and the validation:
from engine import train_one_epoch, evaluate
import utils
def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))
    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
    # define training and validation data loaders
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)
    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)
    # move model to the right device
    model.to(device)
    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)
    # let's train it for 10 epochs
    num_epochs = 10
    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)
    print("That's it!")
You should get as output for the first epoch:
Epoch: [0] [ 0/60] eta: 0:01:18 lr: 0.000090 loss: 2.5213 (2.5213) loss_classifier: 0.8025 (0.8025) loss_box_reg: 0.2634 (0.2634) loss_mask: 1.4265 (1.4265) loss_objectness: 0.0190 (0.0190) loss_rpn_box_reg: 0.0099 (0.0099) time: 1.3121 data: 0.3024 max mem: 3485
Epoch: [0] [10/60] eta: 0:00:20 lr: 0.000936 loss: 1.3007 (1.5313) loss_classifier: 0.3979 (0.4719) loss_box_reg: 0.2454 (0.2272) loss_mask: 0.6089 (0.7953) loss_objectness: 0.0197 (0.0228) loss_rpn_box_reg: 0.0121 (0.0141) time: 0.4198 data: 0.0298 max mem: 5081
Epoch: [0] [20/60] eta: 0:00:15 lr: 0.001783 loss: 0.7567 (1.1056) loss_classifier: 0.2221 (0.3319) loss_box_reg: 0.2002 (0.2106) loss_mask: 0.2904 (0.5332) loss_objectness: 0.0146 (0.0176) loss_rpn_box_reg: 0.0094 (0.0123) time: 0.3293 data: 0.0035 max mem: 5081
Epoch: [0] [30/60] eta: 0:00:11 lr: 0.002629 loss: 0.4705 (0.8935) loss_classifier: 0.0991 (0.2517) loss_box_reg: 0.1578 (0.1957) loss_mask: 0.1970 (0.4204) loss_objectness: 0.0061 (0.0140) loss_rpn_box_reg: 0.0075 (0.0118) time: 0.3403 data: 0.0044 max mem: 5081
Epoch: [0] [40/60] eta: 0:00:07 lr: 0.003476 loss: 0.3901 (0.7568) loss_classifier: 0.0648 (0.2022) loss_box_reg: 0.1207 (0.1736) loss_mask: 0.1705 (0.3585) loss_objectness: 0.0018 (0.0113) loss_rpn_box_reg: 0.0075 (0.0112) time: 0.3407 data: 0.0044 max mem: 5081
Epoch: [0] [50/60] eta: 0:00:03 lr: 0.004323 loss: 0.3237 (0.6703) loss_classifier: 0.0474 (0.1731) loss_box_reg: 0.1109 (0.1561) loss_mask: 0.1658 (0.3201) loss_objectness: 0.0015 (0.0093) loss_rpn_box_reg: 0.0093 (0.0116) time: 0.3379 data: 0.0043 max mem: 5081
Epoch: [0] [59/60] eta: 0:00:00 lr: 0.005000 loss: 0.2540 (0.6082) loss_classifier: 0.0309 (0.1526) loss_box_reg: 0.0463 (0.1405) loss_mask: 0.1568 (0.2945) loss_objectness: 0.0012 (0.0083) loss_rpn_box_reg: 0.0093 (0.0123) time: 0.3489 data: 0.0042 max mem: 5081
Epoch: [0] Total time: 0:00:21 (0.3570 s / it)
creating an index…
index created!
Test: [ 0/50] eta: 0:00:19 model_time: 0.2152 (0.2152) evaluator_time: 0.0133 (0.0133) time: 0.4000 data: 0.1701 max mem: 5081
Test: [49/50] eta: 0:00:00 model_time: 0.0628 (0.0687) evaluator_time: 0.0039 (0.0064) time: 0.0735 data: 0.0022 max mem: 5081
Test: Total time: 0:00:04 (0.0828 s / it)
Averaged stats: model_time: 0.0628 (0.0687) evaluator_time: 0.0039 (0.0064)
Accumulating evaluation results…
DONE (t=0.01s).
Accumulating evaluation results…
DONE (t=0.01s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.606
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.984
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.780
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.313
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.612
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.270
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.672
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.672
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.650
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.755
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.664
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.704
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.979
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.871
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.325
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.488
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.727
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.316
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.748
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.749
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.650
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.673
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
So after one epoch of training, we obtain a COCO-style mAP of 60.6, and a mask mAP of 70.4.
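The IoU (intersection over union) values in these tables measure how well a predicted box overlaps a ground-truth box, and IoU=0.50:0.95 means that the precision is averaged over IoU thresholds from 0.50 to 0.95. A minimal sketch of the IoU computation for two boxes is given below. box_iou and the sample boxes are made up for illustration and are not the pycocotools implementation.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    # Corners of the overlapping rectangle (if there is one)
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted by half the box width relative to the ground truth
ground_truth = [0, 0, 10, 10]
prediction = [5, 0, 15, 10]
iou = box_iou(ground_truth, prediction)
```

Here the boxes share half of their width, which gives an IoU of one third, so this prediction would not count as a match even at the most lenient threshold of 0.50.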
After training for 10 epochs, I got the following metrics
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.799
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.969
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.935
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.349
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.592
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.831
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.324
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.844
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.844
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.777
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.870
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.761
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.969
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.919
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.341
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.464
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.788
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.303
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.799
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.799
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.769
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.818
So this was our model on object detection. Hope this helps!