But despite their ingenuity, ConvNets remained on the sidelines of computer vision and artificial intelligence because they faced a serious problem: They could not scale.
CNNs needed a lot of data and compute resources to work efficiently for large images. At the time, the technique was only applicable to images with low resolutions. In 2012, AlexNet showed that perhaps the time had come to revisit deep learning, the branch of AI that uses multi-layered neural networks. The availability of large datasets, namely the ImageNet dataset with millions of labeled pictures, and vast compute resources enabled researchers to create complex CNNs that could perform computer vision tasks that were previously impossible.
Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial neurons, a rough imitation of their biological counterparts, are mathematical functions that calculate the weighted sum of multiple inputs and output an activation value.
The behavior of each neuron is defined by its weights. When fed with the pixel values, the artificial neurons of a CNN pick out various visual features. When you input an image into a ConvNet, each of its layers generates several activation maps. Activation maps highlight the relevant features of the image. Each of the neurons takes a patch of pixels as input, multiplies their color values by its weights, sums them up, and runs them through the activation function.
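To make this concrete, here is a minimal sketch of a single artificial neuron operating on one image patch; the patch values and the weights are illustrative, not taken from any trained network.

```python
# A single artificial neuron: weighted sum of a pixel patch, then an activation.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

patch = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.9, 0.2],
                  [0.1, 0.7, 0.1]])      # pixel values from one image patch (made up)
weights = np.array([[-1.0, 2.0, -1.0],
                    [-1.0, 2.0, -1.0],
                    [-1.0, 2.0, -1.0]])  # a kernel that responds to vertical edges

activation = relu(np.sum(patch * weights))  # weighted sum -> activation value
print(activation)                           # large value: a vertical edge is present
```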
The first or bottom layer of the CNN usually detects basic features such as horizontal, vertical, and diagonal edges. The output of the first layer is fed as input of the next layer, which extracts more complex features, such as corners and combinations of edges.
As you move deeper into the convolutional neural network, the layers start detecting higher-level features such as objects, faces, and more.
A CNN is usually composed of several convolution layers, but it also contains other components. The final layer of a CNN is a classification layer, which takes the output of the final convolution layer as input (remember, the higher convolution layers detect complex objects).
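A minimal sketch of this layout in PyTorch, with a stack of convolution layers followed by a final classification layer; the layer sizes, input resolution, and number of classes are arbitrary placeholders, not a recommended architecture.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(               # convolution layers -> activation maps
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # classification layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)                    # class scores

logits = TinyConvNet()(torch.randn(1, 3, 224, 224))  # one random 224x224 RGB image
```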
One of the great challenges of developing CNNs is adjusting the weights of the individual neurons to extract the right features from images. In the beginning, the CNN starts off with random weights. During training, the developers provide the neural network with a large dataset of images annotated with their corresponding classes (cat, dog, horse, etc.). The network's predictions are compared with the correct labels, and the weights are corrected to reduce the error. The corrections are made through a technique called backpropagation (or backprop).
Essentially, backpropagation optimizes the tuning process and makes it easier for the network to decide which units to adjust instead of making random corrections. After each epoch, the neural network becomes a bit better at classifying the training images. As the CNN improves, the adjustments it makes to the weights become smaller and smaller. After training the CNN, the developers use a test dataset to verify its accuracy.
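A hedged sketch of that training loop, reusing the TinyConvNet sketch above; `train_loader` is assumed to be an existing DataLoader yielding batches of annotated images and is not defined here.

```python
import torch.nn as nn
import torch.optim as optim

model = TinyConvNet(num_classes=3)               # starts off with random weights
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):                          # one pass over the data per epoch
    for images, labels in train_loader:          # assumed loader of annotated images
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # compare output with the true classes
        loss.backward()                          # backpropagation computes the corrections
        optimizer.step()                         # apply the corrections to the weights
```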
The test dataset is a set of labeled images that were not part of the training process. Each image is run through the ConvNet, and the output is compared to the actual label of the image. Essentially, the test dataset evaluates how good the neural network has become at classifying images it has not seen before. If the model performs well on the training set but poorly on the validation set, then the model has been overfit to the training data.
If the model performs poorly on both the training and validation sets, then the model has been underfit to the data. The longer a network is trained, the better it performs on the training set; at some point, however, the network fits too closely to the training data and loses its capability to generalize. An abundance of well-labeled data in medical imaging is desirable but rarely available due to the cost and the workload required of radiology experts.
There are a couple of techniques available to train a model efficiently on a smaller dataset: data augmentation and transfer learning. As data augmentation was briefly covered in the previous section, this section focuses on transfer learning. Transfer learning is a common and effective strategy to train a network on a small dataset, where a network is pretrained on an extremely large dataset, such as ImageNet, then reused and applied to the given task of interest. The underlying assumption of transfer learning is that generic features learned on a large enough dataset can be shared among seemingly disparate datasets.
This portability of learned generic features is a unique advantage of deep learning that makes itself useful in various domain tasks with small datasets. At present, many models pretrained on the ImageNet challenge dataset are open to the public and readily accessible, along with their learned kernels and weights, such as AlexNet [ 3 ], VGG [ 30 ], ResNet [ 31 ], Inception [ 32 ], and DenseNet [ 33 ].
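For reference, these pretrained models can be loaded through torchvision, for example; a sketch is below (the exact `weights` argument depends on the installed torchvision version — older versions use `pretrained=True` instead).

```python
from torchvision import models

alexnet   = models.alexnet(weights="IMAGENET1K_V1")       # AlexNet pretrained on ImageNet
vgg16     = models.vgg16(weights="IMAGENET1K_V1")         # VGG
resnet50  = models.resnet50(weights="IMAGENET1K_V1")      # ResNet
inception = models.inception_v3(weights="IMAGENET1K_V1")  # Inception
densenet  = models.densenet121(weights="IMAGENET1K_V1")   # DenseNet
```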
In practice, there are two ways to utilize a pretrained network: fixed feature extraction and fine-tuning.
A fixed feature extraction method removes the fully connected (FC) layers from a network pretrained on ImageNet while maintaining the remaining network, which consists of a series of convolution and pooling layers and is referred to as the convolutional base, as a fixed feature extractor.
In this scenario, any machine learning classifier, such as random forests and support vector machines, as well as the usual fully connected layers in CNNs, can be added on top of the fixed feature extractor, resulting in training limited to the added classifier on a given dataset of interest.
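A sketch of fixed feature extraction under these assumptions: the convolutional base of a pretrained ResNet is frozen and used only to compute features, and a separate classifier (here a random forest) is trained on them; `images` and `labels` stand in for an existing small labeled dataset and are not defined here.

```python
import torch
from torchvision import models
from sklearn.ensemble import RandomForestClassifier

resnet = models.resnet50(weights="IMAGENET1K_V1")
base = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final FC layer
base.eval()
for p in base.parameters():
    p.requires_grad = False                                # freeze the convolutional base

with torch.no_grad():
    feats = base(images).flatten(1).numpy()                # fixed features, shape (N, 2048)

clf = RandomForestClassifier().fit(feats, labels)          # only this classifier is trained
```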
This approach is not common in deep learning research on medical images because of the dissimilarity between ImageNet and typical medical images. A fine-tuning method, which is more often applied in radiology research, not only replaces the fully connected layers of the pretrained model with a new set of fully connected layers retrained on the given dataset, but also fine-tunes all or part of the kernels in the pretrained convolutional base by means of backpropagation.
All the layers in the convolutional base can be fine-tuned or, alternatively, some earlier layers can be fixed while fine-tuning the rest of the deeper layers. This is motivated by the observation that the early-layer features appear more generic, including features such as edges applicable to a variety of datasets and tasks, whereas later features progressively become more specific to a particular dataset or task [ 34 , 35 ].
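A sketch of this fine-tuning strategy: the FC head is replaced for the new task, the earlier layers are kept fixed, and only the deeper layers are fine-tuned; the split point (`layer4`) and the two output classes are illustrative choices, not a prescription.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)      # new FC layer for the task of interest

for name, p in model.named_parameters():
    # keep the earlier, more generic layers fixed; fine-tune the deeper ones and the new head
    p.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
```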
One drawback of transfer learning is its constraints on input dimensions. The input image has to be 2D with three channels because the ImageNet dataset consists of 2D color images with three channels (RGB: red, green, and blue), whereas medical grayscale images have only one channel (levels of gray).
On the other hand, the height and width of an input image can be arbitrary, as long as they are not too small, by adding a global pooling layer between the convolutional base and the added fully connected layers; both workarounds are sketched below. There has also been increasing interest in taking advantage of unlabeled data, i.e., semi-supervised learning. Examples of this attempt include pseudo-label [36] and incorporating generative models, such as generative adversarial networks (GANs) [37].
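A sketch of the two input workarounds mentioned above: replicating a grayscale image into three channels, and relying on global (adaptive) pooling so that the input height and width need not match ImageNet's 224 x 224; the image size here is illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

gray = torch.randn(1, 1, 320, 320)          # one-channel grayscale image (illustrative size)
rgb_like = gray.repeat(1, 3, 1, 1)          # copy the single channel into R, G and B

resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.avgpool = nn.AdaptiveAvgPool2d(1)    # global pooling -> fixed-length feature vector
logits = resnet(rgb_like)                   # accepted even though the input is not 224x224
```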
However, whether these semi-supervised techniques can really help improve the performance of deep learning in radiology is not clear and remains an area of active investigation. This section introduces recent applications within radiology, which are divided into the following categories: classification, segmentation, detection, and others.
In medical image analysis, classification with deep learning usually utilizes target lesions depicted in medical images, and these lesions are classified into two or more classes. For example, deep learning is frequently used for the classification of lung nodules on computed tomography (CT) images as benign or malignant (Fig. 11a). As shown there, it is necessary to prepare a large number of training data with corresponding labels for efficient classification using CNN.
For lung nodule classification, CT images of lung nodules and their labels (i.e., benign or malignant) are used as training data. Figure 11b, c shows two examples of training data for lung nodule classification between benign lung nodule and primary lung cancer.
After training the CNN, the target lesions of medical images can be specified in the deployment phase by medical doctors or computer-aided detection (CADe) systems [38].
Figure 11 shows a schematic illustration of a classification system with CNN and representative examples of its training data. Because 2D images are frequently utilized in computer vision, deep learning networks developed for 2D images (2D-CNN) are not directly applicable to the 3D images obtained in radiology (thin-slice CT or 3D magnetic resonance imaging [MRI] images). To apply deep learning to 3D radiological images, different approaches, such as custom architectures, are used.
For example, Setio et al. used a multi-stream 2D-CNN for the classification of pulmonary nodule candidates on CT images. They extracted differently oriented 2D image patches based on multiplanar reconstruction from one nodule candidate (one or nine patches per candidate), and these patches were used in separate streams and merged in the fully connected layers to obtain the final classification output.
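A simplified sketch of the multiplanar idea: from a 3D CT volume, axial, coronal, and sagittal 2D patches are extracted around a nodule candidate, which could then be fed to separate 2D-CNN streams; the volume, coordinates, and patch size are placeholders.

```python
import numpy as np

volume = np.random.rand(128, 256, 256)      # placeholder (z, y, x) CT volume
z, y, x, half = 60, 120, 130, 16            # nodule candidate centre and patch half-size

axial    = volume[z, y - half:y + half, x - half:x + half]
coronal  = volume[z - half:z + half, y, x - half:x + half]
sagittal = volume[z - half:z + half, y - half:y + half, x]
patches = np.stack([axial, coronal, sagittal])   # three differently oriented 2D patches
```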
One previous study used 3D-CNN for fully capturing the spatial 3D context information of lung nodules [43]. They used a multiview strategy in 3D-CNN, whose inputs were obtained by cropping three 3D patches of a lung nodule in different sizes and then resizing them into the same size. They also used the 3D Inception model in their 3D-CNN, where the network path was divided into multiple branches with different convolution and pooling operators. Another previous study used CT image sets of liver masses over three phases (non-enhanced CT, and enhanced CT in the arterial and delayed phases) for the classification of liver masses with 2D-CNN [8].
Segmentation of organs or anatomical structures is a fundamental image processing technique for medical image analysis, such as quantitative evaluation of clinical parameters (e.g., organ volume and shape) and computer-aided diagnosis (CAD) systems.
As described in the previous section, classification often depends on the segmentation of lesions of interest. Segmentation can be performed manually by radiologists or dedicated personnel, a time-consuming process.
However, CNN can also be applied to this task. Figure 12a shows a representative example of segmentation of the uterus with a malignant tumor on MRI [24, 44, 45]. In most cases, a segmentation system directly receives an entire image and outputs its segmentation result.
Training data for the segmentation system consist of the medical images containing the organ or structure of interest and the segmentation result; the latter is mainly obtained from previously performed manual segmentation. Figure 12 b shows a representative example of training data for the segmentation system of a uterus with a malignant tumor. In contrast to classification, because an entire image is inputted to the segmentation system, it is necessary for the system to capture the global spatial context of the entire image for efficient segmentation.
Figure 12 shows a schematic illustration of the system for segmenting a uterus with a malignant tumor and representative examples of its training data; note that original images and corresponding manual segmentations are arranged next to each other. One way to perform segmentation is to use a CNN classifier for calculating the probability that an organ or anatomical structure is present. In this approach, the segmentation process is divided into two steps; the first step is construction of a probability map of the organ or anatomical structure using CNN and image patches, and the second is a refinement step where the global context of the images and the probability map are utilized.
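A simplified sketch of the first (patch-based) step: a trained patch classifier, assumed here as `patch_model` returning a single logit, is slid across the image and its outputs are assembled into a probability map that a later refinement step (e.g., graph cut) would then use.

```python
import numpy as np
import torch

def probability_map(image, patch_model, patch=32, stride=16):
    """Slide a trained patch classifier over a 2D image and average its outputs."""
    h, w = image.shape
    prob = np.zeros((h, w))
    count = np.zeros((h, w))
    with torch.no_grad():
        for i in range(0, h - patch + 1, stride):
            for j in range(0, w - patch + 1, stride):
                x = torch.tensor(image[i:i + patch, j:j + patch],
                                 dtype=torch.float32)[None, None]   # shape (1, 1, patch, patch)
                p = torch.sigmoid(patch_model(x)).item()            # probability organ is present
                prob[i:i + patch, j:j + patch] += p
                count[i:i + patch, j:j + patch] += 1
    return prob / np.maximum(count, 1)      # a refinement step (e.g. graph cut) would follow
```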
In one previous study of liver segmentation on CT images, the probabilities of the liver being present were calculated for each image patch with a 3D-CNN, and a 3D probability map of the liver was obtained. Then, an algorithm called graph cut [47] was used for refinement of the liver segmentation, based on the probability map of the liver. In this method, the local context of the CT images was evaluated by the 3D-CNN and the global context was evaluated by the graph cut algorithm. Although segmentation based on image patches was successfully performed in deep learning, the U-net of Ronneberger et al. made it possible to segment an entire image in a single pass.
The architecture of U-net consists of a contracting path to capture anatomical context and a symmetric expanding path that enables precise localization. Although it was difficult to capture global context and local context at the same time by using the image patch-based method, U-net enabled the segmentation process to incorporate a multiscale spatial context.
As a result, U-net could be trained end-to-end from a limited number of training data. One potential approach to using U-net in radiology is to extend U-net for 3D radiological images, as was done for classification.
For example, V-net was suggested as an extension of U-net for segmentation of the prostate on volumetric MRI images [ 49 ]. In the study, V-net utilized a loss function based on the Dice coefficient between segmentation results and ground truth, which directly reflected the quality of prostate segmentation.
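A sketch of a Dice-based loss of this kind; `pred` is assumed to contain predicted probabilities in [0, 1] and `target` the binary ground-truth mask.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Overlap-based loss: approaches 0 when prediction and ground truth match perfectly.
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```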
Another study [9] utilized two types of 3D U-net for segmenting the liver and liver masses on 3D CT images, in an approach named cascaded fully convolutional neural networks; one U-net was used for segmentation of the liver and the other for segmentation of liver masses using the liver segmentation results. Because the second 3D U-net focused on the segmentation of liver masses, this segmentation was performed more efficiently than with a single 3D U-net.
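To make the U-net idea above concrete, here is a minimal 2D sketch with one level of contraction, one level of expansion, and a skip connection; channel counts and depth are heavily reduced, and the 3D variants discussed above follow the same pattern with 3D convolutions.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.down1, self.down2 = block(1, 16), block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = block(32, 16)                      # 32 = 16 upsampled + 16 skipped
        self.out = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        d1 = self.down1(x)                            # contracting path
        d2 = self.down2(self.pool(d1))
        u = self.up(d2)                               # expanding path
        u = self.dec(torch.cat([u, d1], dim=1))       # skip connection preserves localization
        return self.out(u)                            # per-pixel class scores

mask_logits = TinyUNet()(torch.randn(1, 1, 256, 256))
```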
A common task for radiologists is to detect abnormalities within medical images. Abnormalities can be rare, and they must be detected among many normal cases. One previous study investigated the usefulness of 2D-CNN for detecting tuberculosis on chest radiographs [7]. To develop the detection system and evaluate its performance, a dataset of chest radiographs was used. According to the results, the best area under the receiver operating characteristic curve for detecting pulmonary tuberculosis from healthy cases was 0.
Nearly 40 million mammography examinations are performed in the USA every year. These examinations are mainly performed for screening programs aimed at detecting breast cancer at an early stage. One previous study compared a CNN-based CADe system with a reference CADe system relying on hand-crafted imaging features. Both systems were trained on a large dataset of around 45, images and shared the same candidate detection system. The CNN-based CADe system classified each candidate based on its region of interest, whereas the reference CADe system classified it based on hand-crafted imaging features obtained from the results of a traditional segmentation algorithm.
Low-dose CT has been increasingly used in clinical situations. For example, low-dose CT was shown to be useful for lung cancer screening [ 51 ].
Because noisy images of low-dose CT hindered the reliable evaluation of CT images, many techniques of image processing were used for denoising low-dose CT images. Two previous studies showed that low-dose and ultra-low-dose CT images could be effectively denoised using deep learning [ 52 , 53 ].
There is no special instruction for the CNN to focus on more complex objects in deeper layers; this behavior emerges from training. Layer 3 is where we start to see some complex patterns, such as eyes and faces. We can assume that these feature maps were obtained from a model trained for the detection of human faces. In Layer 4, we see the features finding patterns in more complex parts of the faces, such as eyes.
In Layer 5, you can see that the feature maps capture specific objects: human faces, tyres of cars, faces of animals, and so on. These feature maps contain the most information about the patterns found in the images. In general terms, a CNN is not too different from other machine learning algorithms in that it tries to find patterns in the dataset. Here, we try to find suitable weights for the kernels that will generate a specific pattern found in the dataset.
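One common way to inspect what a CNN "sees" is to register a forward hook on an intermediate layer of a pretrained network and plot the captured activation maps; the choice of network, layer index, and input below is purely illustrative.

```python
import torch
import matplotlib.pyplot as plt
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
captured = {}
model.features[10].register_forward_hook(lambda m, i, o: captured.update(maps=o))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))           # replace with a real, preprocessed image

for k in range(8):                               # plot the first 8 activation maps
    plt.subplot(2, 4, k + 1)
    plt.imshow(captured["maps"][0, k].numpy(), cmap="gray")
    plt.axis("off")
plt.show()
```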
To learn more about this, you can refer to papers on CNN visualization, which will help you understand more about visualizations in CNNs. You can also use Picasso, a deep neural network visualization tool.
In CNNs, the layer that applies this kind of filtering is referred to as a convolution layer, hinting at the fact that it will soon have other layers added to it. Although we can sketch our CNN on the back of a napkin, the number of additions, multiplications and divisions can add up fast.
In math speak, they scale linearly with the number of pixels in the image, with the number of pixels in each feature and with the number of features.
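A back-of-the-envelope illustration of that scaling, with placeholder values:

```python
image_pixels  = 1920 * 1080   # pixels in the image
kernel_pixels = 3 * 3         # pixels in each feature (kernel)
num_features  = 64            # number of features

multiply_adds = image_pixels * kernel_pixels * num_features
print(f"{multiply_adds:,}")   # roughly 1.2 billion multiply-adds for one layer
```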
Small wonder that microchip manufacturers are now making specialized chips in an effort to keep up with the demands of CNNs. Another power tool that CNNs use is called pooling. Pooling is a way to take large images and shrink them down while preserving the most important information in them.
The math behind pooling is second-grade level at most. It consists of stepping a small window across an image and taking the maximum value from the window at each step. In practice, a window 2 or 3 pixels on a side and steps of 2 pixels work well. After pooling, an image has about a quarter as many pixels as it started with.
Because it keeps the maximum value from each window, it preserves the best fits of each feature within the window. The result of this is that CNNs can find whether a feature is in an image without worrying about where it is. This helps solve the problem of computers being hyper-literal. A pooling layer is just the operation of performing pooling on an image or a collection of images. The output will have the same number of images, but they will each have fewer pixels.
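A minimal sketch of max pooling as described: step a small window across the image and keep only the maximum value in each window (here a 2 x 2 window with a stride of 2).

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    h, w = image.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            out[i // stride, j // stride] = image[i:i + size, j:j + size].max()
    return out

pooled = max_pool(np.random.rand(8, 8))   # 8x8 -> 4x4: about a quarter of the pixels remain
```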
This is also helpful in managing the computational load. Taking an 8 megapixel image down to a 2 megapixel image makes life a lot easier for everything downstream. Another small but important tool is the rectified linear unit (ReLU), which simply swaps out every negative value for a zero. This helps the CNN stay mathematically healthy by keeping learned values from getting stuck near 0 or blowing up toward infinity. The output of a ReLU layer is the same size as whatever is put into it, just with all the negative values removed. Because the output of each of these layers looks a lot like its input, we can stack them like Lego bricks. Raw images get filtered, rectified and pooled to create a set of shrunken, feature-filtered images.
These can be filtered and shrunken again and again. Each time, the features become larger and more complex, and the images become more compact. This lets lower layers represent simple aspects of the image, such as edges and bright spots. Higher layers can represent increasingly sophisticated aspects of the image, such as shapes and patterns.
These tend to be readily recognizable. For instance, in a CNN trained on human faces, the highest layers represent patterns that are clearly face-like. CNNs have one more arrow in their quiver. Fully connected layers take the high-level filtered images and translate them into votes. In our case, we only have to decide between two categories, X and O.
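A minimal sketch of this final step: the pooled, feature-filtered maps are flattened into one long vector and a fully connected layer turns it into two votes, one for X and one for O; the sizes are illustrative.

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 4, 4)    # shrunken, feature-filtered images
flat = torch.flatten(feature_maps, 1)      # 16 * 4 * 4 = 256 values in a single row
fc = nn.Linear(256, 2)                     # one output per category: X and O
votes = torch.softmax(fc(flat), dim=1)     # two probabilities ("votes") that sum to 1
```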
Fully connected layers are the primary building block of traditional neural networks.