The purpose of computer vision (CV) is to allow machines to obtain valuable information from their surroundings by analyzing visual data from a variety of sources, such as digital images and video. The nature of that information depends on the end goal of the machine. For example, think about self-driving cars. A CV module capable of detecting the objects in front of the car in real time is essential to avoid accidents. On the other hand, a robot that guides people inside a railway station could adapt the way it speaks depending on whether the listener is a child or an adult. This information can be obtained thanks to CV software that applies image classification methods to the frames captured by the cameras installed on the robot.
Since CV is one of the most lucrative areas of Artificial Intelligence (AI), the Deep Learning (DL) revolution has significantly transformed this research branch as well. Indeed, DL solutions are widely used nowadays to achieve various CV tasks such as object detection, face recognition, image classification, object tracking in video, motion estimation, and many more. DL methods have become so popular because of their ability to automatically extract meaningful features from available data. This minimizes the effort required to perform handcrafted feature extraction such as locating corners in an image.
On the other hand, the main drawbacks of DL solutions are their intrinsic opacity and their need for labeled data. Indeed, the high complexity of deep neural networks makes it practically impossible for humans to understand the logic behind their predictions. However, in many areas it is necessary to understand why the machine made a specific decision, for example, to prevent ethical and racial issues. At the same time, a large amount of labeled data is often needed to train a DL model. During the training process, the model needs to be fed data samples paired with the labels we want the system to predict after deployment (people's age based on their faces, what kind of animal appears in a picture, ...). Linking training data to the associated labels is not always a viable process and, in any case, requires enormous human effort in terms of time and cost.
For these reasons, CV researchers are focusing their efforts on discovering and leveraging solutions that can soften these issues.
Here are some computer vision trends to watch:
Trend 1: Explainable AI Solutions
Explainable AI (XAI) includes methods that help humans understand the decisions of DL solutions, to make them more transparent and trustworthy. Most XAI methods have been developed to be applied to any existing DL model, without the need to replace it. Such methods have recently been criticized by the research community because they do not provide enough detail about the decision process of the DL model, as you can see in the figure below.
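One simple post-hoc XAI technique of the kind described above is occlusion sensitivity: slide an occluding patch over the input and record how much the model's score drops at each location. The sketch below is a minimal illustration with a hypothetical `toy_model` standing in for a trained network; any black-box scoring function could be plugged in.

```python
import numpy as np

def occlusion_saliency(model, image, patch=4, baseline=0.0):
    """Post-hoc saliency map: occlude each patch of the image and
    measure how much the model's score drops as a result."""
    h, w = image.shape
    base_score = model(image)
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = baseline
            # A large score drop means the occluded region mattered.
            saliency[i // patch, j // patch] = base_score - model(occluded)
    return saliency

# Hypothetical "model": scores an image by the brightness of its
# top-left quadrant, so only that region should light up in the map.
def toy_model(img):
    return float(img[:8, :8].mean())

img = np.zeros((16, 16))
img[:8, :8] = 1.0  # the evidence lives in the top-left quadrant
sal = occlusion_saliency(toy_model, img)
print(sal.round(2))
```

Note that the saliency map only shows *where* the model looked, not *why* that region led to the prediction, which is exactly the criticism raised against post-hoc methods.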
Therefore, the trend will be the deployment of DL solutions that are interpretable by design. This means that the DL model itself is able to formulate an explanation for each of its predictions. In the figure below, you can see an example of an interpretable-by-design DL model recently developed at the University of Twente in the Netherlands.
Some recent papers about explainable AI methods applied to CV tasks:
Trend 2: Self-supervised learning
Self-supervised learning aims to leverage large amounts of unlabeled data to learn meaningful features through pretext tasks, and then fine-tune those features on some available labeled data by learning downstream tasks. Consider the following example. Our ultimate goal is to train an image captioning deep neural network that will operate on animal images; however, we do not have enough labeled data to train our model accurately. We can exploit available unlabeled data so that the model learns the features needed to differentiate between types of animals. As you can see in the figure below, the pretext task consists of a simple classification problem, where our network has to detect the rotation applied to the input image. Therefore, we apply a random rotation to each available unlabeled image, and then we pseudo-label it with that rotation. While the model learns to detect which rotation has been applied to an input image, it will learn higher-level features about the animals in those images. Indeed, in order to detect the rotation applied to a cat's image, for example, it is important to recognize its muzzle in different orientations. These features will also be very useful for the downstream task of distinguishing between different animals.
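The pseudo-labeling step of the rotation pretext task described above can be sketched in a few lines. This is a minimal illustration using NumPy; `make_pretext_batch` is a hypothetical helper name, and in practice the resulting `(samples, labels)` pairs would be fed to a neural network trained to classify the rotation.

```python
import numpy as np

ROTATIONS = [0, 90, 180, 270]  # the four classes of the pretext task

def make_pretext_batch(images, rng):
    """Pseudo-label unlabeled images with a random rotation.
    The network is then trained to predict which rotation was applied."""
    samples, labels = [], []
    for img in images:
        k = rng.integers(len(ROTATIONS))   # pick 0, 90, 180 or 270 degrees
        samples.append(np.rot90(img, k=k)) # rotate the image k quarter-turns
        labels.append(k)                   # the rotation index is the pseudo-label
    return np.stack(samples), np.array(labels)

rng = np.random.default_rng(0)
unlabeled = [rng.random((32, 32)) for _ in range(8)]  # stand-in for unlabeled photos
x, y = make_pretext_batch(unlabeled, rng)
print(x.shape, y.shape)  # (8, 32, 32) (8,)
```

No human labeling is needed here: the labels are generated for free from the transformation itself, which is the whole point of a pretext task.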
Self-supervision is currently a very hot topic among AI researchers. For example, the popular GANs (Generative Adversarial Networks) are based on a specific type of self-supervised learning method called generative-contrastive (or adversarial).
Here, you can find some recent research that focuses on self-supervised learning:
Trend 3: Neuro-Symbolic AI
Neuro-Symbolic AI aims to combine modern deep learning techniques with traditional symbolic AI methods that typically rely on rule-based reasoning about entities and their relationships. For example, if we know that Bob and Alice are children of Carl, we can conclude that Bob and Alice are siblings. The main advantages of the neuro-symbolic approach are the ability to learn from less data and an inherently interpretable model. The MIT-IBM Watson AI Lab is already focusing its efforts on this very promising research area. One of the contributions of this laboratory is CLEVRER: CoLlision Events for Video REpresentation and Reasoning, a work developed as a collaboration between MIT CSAIL, IBM Research, Harvard University and Google DeepMind.
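The Bob-and-Alice example above is exactly the kind of inference the symbolic half of such a system performs. A minimal sketch, assuming the facts are given as (child, parent) pairs; in a full neuro-symbolic pipeline these facts could instead be extracted from images or video by a neural perception module.

```python
# Facts as (child, parent) pairs; the rule below derives siblings from them.
parents = {("Bob", "Carl"), ("Alice", "Carl")}

def siblings(facts):
    """Symbolic rule: X and Y are siblings if they share a parent and X != Y."""
    found = set()
    for (c1, p1) in facts:
        for (c2, p2) in facts:
            if p1 == p2 and c1 != c2:
                found.add(tuple(sorted((c1, c2))))  # store each pair once
    return found

print(siblings(parents))  # {('Alice', 'Bob')}
```

Because the rule is explicit, the conclusion comes with its own explanation (the shared parent), which is why neuro-symbolic models are inherently interpretable.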
Some recent papers that talk about Neuro-Symbolic AI: