From the course: Artificial Intelligence Foundations: Neural Networks

Why we need more than CNNs

Earlier, we explored how convolutional neural networks became the backbone of computer vision. In this video, we'll look at how and why the field is shifting from CNNs toward transformer-based architectures for computer vision. Some visual tasks require understanding relationships between distant regions: for example, matching a logo on one side of an image with text on the other, or recognizing that two far-apart objects belong to the same scene. A convolutional layer compares only nearby pixels, so CNNs need many stacked layers to connect distant regions. This makes them slower, harder to train, and sometimes less accurate for global reasoning. This raises an important question: what if we could see everything at once? What if every part of the image could directly attend to every other part simultaneously? This example shows two cars far apart in a parking lot scene. The cars are spatially distant. CNNs would need many…
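To make the contrast concrete, here is a minimal sketch in plain Python. It rests on simplifying assumptions that are not from the video: the convolutions are 3x3 with stride 1 and no pooling or dilation (real CNNs use those to grow the receptive field faster), the 200-pixel distance is a made-up figure, and the patch vectors are toy values. The function names are hypothetical, chosen only for illustration.

```python
import math


def receptive_field(num_layers, kernel_size=3):
    """Width in pixels seen by one output unit after stacking
    stride-1 convolutions (assumes no pooling or dilation)."""
    return 1 + num_layers * (kernel_size - 1)


def layers_to_connect(distance, kernel_size=3):
    """Fewest stacked layers before a single unit can see two pixels
    that are `distance` pixels apart (same assumptions as above)."""
    return math.ceil(distance / (kernel_size - 1))


def attention_weights(query, keys):
    """Softmax of scaled dot products: one query patch scored
    against every patch in the image, in a single step."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scene: two cars whose centers are 200 pixels apart.
needed = layers_to_connect(200)
print(needed, receptive_field(needed))

# Toy patch embeddings; patches 0 and 2 look alike (say, the two cars).
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
weights = attention_weights(patches[0], patches)
print(weights)
```

Under these assumptions the convolutional stack needs on the order of a hundred layers before one unit sees both cars, while the attention step gives the first patch a nonzero weight on every other patch at once, with the largest weights on the patches most similar to it.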
