1 Introduction

A convolutional neural network (CNN) is a deep learning technique designed for processing image data. Due to the use of matrix convolutions, a CNN is able to understand the spatial relationships between adjacent image points instead of treating them as variables in a random order. Consequently, we can easily increase the amount of image data with augmentation transformations that preserve the meaningful structures in the images but modify each image so that it is not a perfect copy of the original one. For instance, common types of augmentation include reflections, rotations, and cropping out borders.

Table 1 The abbreviation used for the data set in tables of this paper, the image type, the diagnosis or the meaning of the positive and the negative instances, the total number of images, and the reference (Ref.) to the original source of the data for our 11 data sets

Image augmentation is especially important in the medical field. Unlike regular photographs, tomography images of human patients are highly sensitive and private material protected by strict regulation, which is why very few open data sets of them are publicly available. Because of this, a researcher is often able to use only a data set collected at their own institution and, even if many patients are imaged there, the physicians might not have enough time to annotate these images so that they can be given to a CNN. Furthermore, some diseases might be so rare that there are simply not many diagnosed cases.

However, several augmentation techniques were originally designed for typical photographs and are therefore not necessarily well-suited for medical images. Because of the bilateral symmetry of the human body, a reflection over the vertical axis of coronal or transaxial images is an intuitive approach for multiplying the number of images of the brain or the head and neck area, as the results are often nearly indistinguishable from the original images. While a reflection switching the top and the bottom of the image differs notably from the original image, it would also preserve this symmetry, which might be beneficial for a CNN trying to detect asymmetrically located targets such as tumors. A rotation does not preserve this symmetry, but it might be useful when the task is to find signs of disease in chest X-rays, which are often visible as wispy white sections in the normally clear areas of the lungs. Furthermore, one simple augmentation transformation is to add blur to the images, which might produce very different results for an imaging method such as positron emission tomography (PET), whose images are already blurry without sharp borders between regions, than it would for magnetic resonance imaging (MRI). In the existing literature [4, 9, 10, 13, 22], these augmentations have typically been compared by using only one data set rather than analyzing more systematically the differences caused by the different imaging modalities.

Fig. 1

Examples of positive images from our data sets, including a coronal chest X-ray of a COVID-19 patient A, a shoulder X-ray showing an artificial joint B, a wrist X-ray with an internal fixation device C, a breast ultrasound image of a patient with a malignant tumor D, a transaxial slice from an MRI image of a patient with a brain tumor E, a transaxial MRI image slice of a head and neck cancer patient F, the corresponding transaxial PET image slice from the same head and neck cancer patient G, a transaxial PET image slice showing a prostate cancer tumor H, and a myocardial polar map of an ischemic patient I

Here, we study the impact of seven simple augmentation types on the predictions of a CNN performing binary classification. We compare the augmentation types both with each other and with results obtained by using no augmentation. We use 11 different data sets, including lung X-ray images from patients with COVID-19 (CoV), pneumonia (PNA), or tuberculosis (TB), limb X-rays with artificial joints or internal fixation devices, breast cancer (BC) ultrasound (US) images, MRI images from patients with a brain tumor (BT) or head and neck cancer (HNC), PET images from HNC and prostate cancer (PCa) patients, and myocardial polar maps from patients with heart disease.

2 Materials and methods

2.1 Software requirements

The CNN of this article was built and tested in Python (version: 3.9.9) [25] with the packages TensorFlow (version: 2.7.0) [1] and Keras (version: 2.7.0) [6].

2.2 Data

In this study, we use 11 different data sets, the important details of which are summarized in Table 1, and example images are shown in Fig. 1. Seven of them are created from six publicly available repositories. These repositories are COVID-19 Radiography Database [7, 16], Chest X-Ray Images (Pneumonia) [14], Tuberculosis (TB) Chest X-ray Database [17], MURA: Large Data set for Abnormality Detection in Musculoskeletal Radiographs [21], Breast Ultrasound Images Data set [2], and Br35h:: Brain Tumor Detection 2020 [11], and their links are listed in the data availability statement. They include chest X-rays of patients with either CoV, PNA, or TB, limb X-rays with different bone abnormalities, US images of BC patients, and two-dimensional (2D) MRI images of BT patients. Each data set also has similar images of negative cases. We use the MURA data set to create two smaller image sets so that one of them has shoulder X-rays with or without shoulder joint replacements or internal fixation devices, and the other one has wrist X-rays with or without internal fixation devices. Furthermore, we only include breast US images with a clearly visible tumor as positive images in our own data set so that the CNN is able to recognize them. For each data set, we choose the images so that we have equally many positive and negative instances and the total number of images is divisible by 10.

We also use four private data sets retrospectively collected from patients imaged at Turku PET Centre in Turku, Finland. The first two of these data sets are from 200 HNC patients, 182 of whom were diagnosed with head and neck squamous cell carcinoma, while the rest had adenocarcinoma, adenoid cystic carcinoma, parotid cancer, or some other HNC. As a part of their treatment at Turku University Hospital, they had been referred to a PET/MRI scan at Turku PET Centre during 2014-2022, and they were imaged with either a Philips Ingenuity TF PET/MRI scanner (Philips Health Care) or a SIGNA PET/MRI with QuantWorks (GE Healthcare) by using \(^{18}\)F-fluorodeoxyglucose as a tracer. The presence of cancer was confirmed for 100 patients by re-imaging or histopathological sampling, and a medical doctor created three-dimensional (3D) binary tumor masks with Carimas [18] for the 100 positive PET/MRI images. We create two separate data sets so that one of them is based on MRI and the other one on PET. By using the tumor masks, we choose a total of 1115 transaxial slices depicting a tumor for both modalities and equally many random slices from the images of the 100 negative patients who did not have recurrence of the cancer. The same PET data set was also used in [12], and MRI/PET data from some of these patients have been studied in [15].

Our third private data set is from 78 PCa patients, who were imaged with a Discovery MI digital PET/computed tomography (CT) system (GE Healthcare) at Turku PET Centre during 2018–2019 after a dosage of \(^{18}\)F-prostate-specific membrane antigen-1007 (\(^{18}\)F-PSMA-1007). A physician created 3D binary masks with Carimas to denote the primary tumor in the prostate. Since there were no negative patients imaged with \(^{18}\)F-PSMA-1007, our classification task is to detect which transaxial slices show the intraprostatic tumor and which of the slices depict healthy parts of the prostate or the area outside the prostate. For this purpose, we use the binary masks to find all the slices showing the PCa tumor and choose negative slices below and above the positive slices so that we have equally many negative slices in total. We use only the PET data here and fully exclude the CT data. For each transaxial PET slice, we use the same square around the prostate area as the region of interest and crop its borders out. The data set was originally introduced in [3] and it is also studied in [19].
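As an illustration of the slice selection, the NumPy sketch below picks the tumor-positive transaxial slices of a 3D volume by using the binary mask and takes the surrounding slices as negatives. The array orientation and the exact rule for splitting the negative slices between the regions below and above the tumor are assumptions made for illustration and are not details of the original pipeline.

```python
import numpy as np

def select_slices(volume, mask):
    """Pick tumor-positive transaxial slices and equally many negative ones.
    Assumes axis 0 of both arrays is the transaxial direction."""
    positive = np.where(mask.any(axis=(1, 2)))[0]        # slices containing tumor voxels
    n_below = len(positive) // 2
    n_above = len(positive) - n_below
    below = np.arange(positive.min() - n_below, positive.min())
    above = np.arange(positive.max() + 1, positive.max() + 1 + n_above)
    negative = np.concatenate([below, above])            # hypothetical balanced split
    return volume[positive], volume[negative]

# toy example: a 64-slice volume with a small "tumor" in slices 30-33
volume = np.random.rand(64, 128, 128)
mask = np.zeros_like(volume, dtype=bool)
mask[30:34, 60:70, 60:70] = True
pos_slices, neg_slices = select_slices(volume, mask)     # 4 positive and 4 negative slices
```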

The last private data set is from 138 patients who had been treated at Turku University Hospital in Turku, Finland, during the years 2007-2011 and who had had stable chest pain or a similar symptom of a possible heart disease. Dynamic myocardial PET perfusion imaging was performed with a Discovery VCT PET/CT scanner (GE Healthcare) by first injecting the patients with \(^{15}\)O-labelled water as an intravenous bolus and then infusing them with adenosine to see how stress affected their heart. Carimas was used to combine the dynamic image sequences of each patient into one 2D polar map. All the patients also had an invasive coronary artery angiography on a Siemens Axiom Artis coronary angiography system (Siemens), which was used to classify the polar maps as ischemic or non-ischemic based on finding obstructive coronary artery disease. Since there were 55 ischemic polar maps, we included them and equally many non-ischemic polar maps in our data set. The polar maps had been converted into RGB images with Carimas’ rainbow scaling function at some point, but we converted them back into grayscale images by choosing only the value of the first channel for each pixel. More details of the original data can be found in [24].

2.3 Pre-processing and cross-validation

Our data sets had grayscale images with pixel values in [0, 255], and the images were stretched into squares of \(128\times 128\) pixels. The input of the CNN therefore consisted of \(128\times 128\times 1\) matrices whose every element was an integer from [0, 255]. We created the training and test sets with fivefold cross-validation. It was done patient-wise for all the data sets, including the ones with multiple images from the same patients. All the possible test sets contained exactly 20% of the data and had equally many positive and negative images.
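A minimal sketch of this step is given below. The resampling method used for stretching the images and the exact fold construction are not fixed by the description above, so the nearest-neighbour stretching and the use of scikit-learn's StratifiedGroupKFold are only illustrative assumptions; the toy labels, patient identifiers, and image sizes are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def stretch_to_square(image, size=128):
    """Stretch a 2D grayscale array to size x size pixels
    (nearest-neighbour resampling, assumed for illustration)."""
    rows = np.linspace(0, image.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, size).astype(int)
    return image[np.ix_(rows, cols)][..., np.newaxis]     # size x size x 1

# toy stand-ins: 40 images from 20 patients, half of them positive
labels = np.array([0, 1] * 20)
patients = np.repeat(np.arange(20), 2)
images = np.stack([stretch_to_square(np.random.randint(0, 256, (300, 250)))
                   for _ in labels])

# patient-wise fivefold cross-validation with class-balanced folds
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(images, labels, groups=patients):
    x_train, y_train = images[train_idx], labels[train_idx]
    x_test, y_test = images[test_idx], labels[test_idx]
```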

2.4 Convolutional neural network

In this article, we use the same CNN as in [12]. This CNN was inspired by the well-known U-Net [23], which is commonly used in the medical field. The U-Net used in segmentation consists of a contracting and an expanding path, but, since our aim is binary classification instead of image segmentation, we only use the contracting path of the U-Net followed by three dense layers. This path consists of four sequences, each of which first has two convolution layers and then a maximum pooling operation. We use the ReLU activation function on all the layers except the last one, which has a sigmoid function instead. We use stochastic gradient descent with a learning rate of 0.001 as the optimizer and a validation set containing 30% of the training data. The number of epochs is 15 for the shoulder X-ray, wrist X-ray, BC US, HNC MRI, and HNC PET data sets, and 10 for the other six data sets.
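A Keras sketch of this architecture is given below. The filter counts, kernel sizes, widths of the dense layers, and input rescaling are not stated above and are therefore only plausible placeholder values; the sketch illustrates the overall structure (four contracting blocks followed by three dense layers and a sigmoid output) rather than reproducing the exact network of [12].

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(input_shape=(128, 128, 1)):
    """U-Net-style contracting path followed by three dense layers."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Rescaling(1.0 / 255)(inputs)        # scale [0, 255] to [0, 1] (assumption)
    for filters in (16, 32, 64, 128):              # four contracting blocks
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(16, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # binary classification output
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_classifier()
# model.fit(x_train, y_train, validation_split=0.3, epochs=15)  # 10 or 15 epochs per data set
```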

2.5 Augmentation

We compare the following seven types of augmentation transformations:

1. Reflection over the vertical axis

2. Reflection over the horizontal axis

3. Rotation of 90 degrees in the clockwise direction

4. Rotation of \(k\) degrees, where \(k\) is a randomly chosen number from the interval \((-15,15)\)

5. Translation that moves the image \(k_0\)% and \(k_1\)% in the horizontal and vertical directions, respectively, where \(k_0,k_1\) are randomly chosen from the interval \((-10,10)\)

6. Cropping the borders so that the size of the image decreases by \(k\)%, where \(k\) is randomly chosen from the interval \((0,10)\)

7. Adding Gaussian blur whose standard deviation \(\sigma\) is randomly chosen from the interval \((0.5,1.5)\)

The augmentation transformations are tested separately so that only one of them is used during each iteration round. With these transformations, we create exactly one copy of every image in the training data, thus doubling the amount of data used for training. The first three transformations always produce the same end result from a given image, but there is some variation in the augmented images created with the other four. See Fig. 2 for an example of each augmentation type.
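The seven transformations could be implemented, for instance, with NumPy and SciPy as in the sketch below; the interpolation orders and border handling are assumptions, since the exact implementation is not fixed above.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def augment(image, aug_type):
    """Return one augmented copy of a 2D grayscale image (e.g. 128 x 128)."""
    if aug_type == 1:                              # reflection over the vertical axis
        return np.fliplr(image)
    if aug_type == 2:                              # reflection over the horizontal axis
        return np.flipud(image)
    if aug_type == 3:                              # rotation of 90 degrees clockwise
        return np.rot90(image, k=-1)
    if aug_type == 4:                              # rotation by (-15, 15) degrees
        return ndimage.rotate(image, rng.uniform(-15, 15), reshape=False, mode="nearest")
    if aug_type == 5:                              # translation by (-10, 10) % per axis
        shift = rng.uniform(-0.1, 0.1, size=2) * image.shape
        return ndimage.shift(image, shift, mode="nearest")
    if aug_type == 6:                              # crop borders so the size decreases by (0, 10) %
        m = int(rng.uniform(0, 0.1) * min(image.shape) / 2)
        cropped = image[m:image.shape[0] - m, m:image.shape[1] - m]
        return ndimage.zoom(cropped, np.array(image.shape) / cropped.shape, order=1)
    if aug_type == 7:                              # Gaussian blur with sigma in (0.5, 1.5)
        return ndimage.gaussian_filter(image, sigma=rng.uniform(0.5, 1.5))
    raise ValueError("unknown augmentation type")

# one augmented copy per training image doubles the training data, e.g. with Gaussian blur:
example = np.random.randint(0, 256, (128, 128)).astype(float)
blurred = augment(example, 7)
```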

Fig. 2

An original X-ray image of a negative patient from the CoV X-ray data set and the new versions of this image created with the seven different types of augmentation

2.6 Evaluation metrics

To convert the numerical output of the CNN into binary classifications, we compute Youden’s threshold [26] from the predictions of the training data. This threshold is the one that maximizes the sum of sensitivity (percentage of positive instances classified correctly) and specificity (percentage of negative instances classified correctly). After that, we can compute the accuracy of the predictions of the test set. We also consider the receiver operating characteristic (ROC) curve, which is sensitivity plotted against the false positive rate (percentage of negative instances classified incorrectly), and compute the area under the ROC curve (AUC). The AUC value can be used as an evaluation metric, but, unlike accuracy, it does not depend on the choice of the threshold.
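A minimal sketch of the thresholding and the evaluation metrics is given below, assuming scikit-learn is available; the library actually used for these computations is not stated above.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def youden_threshold(y_true, y_score):
    """Threshold maximizing sensitivity + specificity on the training predictions."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]   # tpr - fpr = sensitivity + specificity - 1

def evaluate(y_train, p_train, y_test, p_test):
    """Accuracy with Youden's threshold from the training set, and threshold-free AUC."""
    threshold = youden_threshold(y_train, p_train)
    accuracy = np.mean((p_test >= threshold) == y_test)
    fpr, tpr, _ = roc_curve(y_test, p_test)
    return accuracy, auc(fpr, tpr)
```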

2.7 Structure of the experiment

For each data set and augmentation type, we repeat the process of initializing the CNN, training it with augmented data, and predicting the contents of the training and test sets 30 times. This means six repeats of each of the different test sets of the fivefold cross-validation. For each data set, we also run 30 iteration rounds without using any augmentation. The results are evaluated by their accuracy and AUC, and the values of these metrics are compared with the Wilcoxon test at a 1% level of significance so that we can estimate whether the differences between the augmentation options are statistically significant or not.
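The statistical comparison could be carried out, for example, with SciPy as sketched below. The accuracy values here are synthetic stand-ins, and since the text above only says "Wilcoxon test", the paired signed-rank form and the two-sided alternative used in the sketch are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# synthetic stand-ins for the 30 accuracy values (6 repeats x 5 folds)
# obtained with two augmentation options on the same data set
acc_option_a = rng.normal(0.95, 0.01, 30)
acc_option_b = rng.normal(0.92, 0.02, 30)

stat, p_value = wilcoxon(acc_option_a, acc_option_b)   # paired Wilcoxon signed-rank test
significant = p_value < 0.01                           # 1% level of significance
```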

3 Results

Our results are summarized in three tables: Table 2 contains the medians of the 30 accuracy values computed from the test set predictions given by the CNN for each data set when either no augmentation or one of the seven augmentation types presented in Subsect. 2.5 is used for the training data. Table 3 contains the similar medians but for AUC instead. Table 4 tells us which of the other augmentation options produce a significantly higher median accuracy or AUC than the augmentation option in the given column for each data set according to the Wilcoxon tests with a 1% level of significance.

Table 2 Median accuracy (%) of the predictions of the CNN from test sets of the different data sets when either no augmentation (denoted by 0) or one of the seven different augmentation types listed in Subsect. 2.5 is used
Table 3 Median area under the receiver operating characteristic curve (AUC) values (%) of the predictions of the CNN from test sets of the different data sets when either no augmentation (denoted by 0) or one of the seven different augmentation types listed in Subsect. 2.5 is used
Table 4 The numbers in each cell refer to the augmentation options giving a statistically higher median accuracy or area under the receiver operating characteristic curve (AUC) than the augmentation option in the given column according to Wilcoxon tests with a 1% level of significance

According to Table 4, there are very few statistically significant differences between the different augmentation options for the one US and the three PET data sets. However, the choice of augmentation is much more important for the X-ray and MRI data sets. The use of non-augmented data leads to significantly more inaccurate predictions than the seven types of augmentation for all five X-ray data sets and the BT MRI data set. The best choice seems to be augmentation by adding Gaussian blur (augmentation type 7), which gives the best accuracy or AUC values for five data sets (CoV X-ray, PNA X-ray, shoulder X-ray, BT MRI, and polar map) and never has significantly lower accuracy or AUC than any of the other augmentation options. According to our results, other good augmentation types are cropping the image (type 6), slight rotation (type 4), and translation (type 5).

Our results also reveal that the reflection over the vertical axis (type 1) only gives the highest accuracy or AUC for the HNC PET and MRI data sets. It is still often better than the reflection over the horizontal axis (type 2): the first option works significantly better for the CoV X-ray, TB X-ray, wrist X-ray, and HNC MRI data sets, while the second is significantly better only for the PNA X-ray and BT MRI data sets. However, neither of these reflections is significantly better than the slight rotation (type 4), which works better than at least one of the reflection types for four X-ray data sets and the BT MRI data set.

4 Discussion

We expected that the reflection over the vertical axis would have worked very well due to the bilateral symmetry of the human body, but it performed quite poorly in comparison to the random rotation of 15 degrees or less. In the research by Rama et al. [22], augmentation based on reflections also led to the lowest accuracy in the classification of lung X-rays of TB patients (a different data set than here). However, the reflection over the vertical axis worked the best on the HNC MRI and PET data sets, which contain highly heterogeneous images of the human head and neck area and are therefore difficult to classify correctly.

Adding Gaussian blur was clearly the best method according to our results, and, in particular, it worked very well on all five X-ray data sets and the BT MRI data set. For instance, the AUC of the predictions of the BT MRI data augmented with this method was 97.5, while the use of non-augmented data resulted in a statistically significantly lower AUC of 92.6. In earlier research by Haekal et al. [10], both Gaussian and Perlin noise worked well when classifying X-rays of lung cancer patients. However, adding Gaussian blur did not work so well on US or PET images, possibly because the targets in these images do not have clear boundaries and can only be detected because they are a lighter or darker shade than their surroundings, as can be seen in Subfigures (D), (G), and (H) in Fig. 1. It was also noted by Hussain et al. [13] that adding Gaussian noise works poorly for mammography images in comparison with reflections.

A natural continuation of this study would be extending the research to 3D medical images. However, there are some challenges: there is a very limited number of publicly available data sets with 3D medical images that also have similar negative images so that they can be used for classification. Furthermore, training CNNs for 3D data requires much more data, which might place too high requirements on the running time of the tests and on the computational efficiency or memory of the computer.

Another question for further study would be the comparison between these simple transformations and more complicated augmentation types specifically designed for medical data. For instance, a generative adversarial network (GAN), a type of neural network generating synthetic samples of the original images introduced by Goodfellow et al. [8], has been widely applied to deep learning tasks related to medical images [5]. While GANs have been noted to lead to better results than some affine transformations at least for certain data sets [4, 9], the simple methods studied here might still be better suited for some imaging modalities. Also, a new method of augmentation based on the use of conformal mappings was recently introduced [20].

5 Conclusion

In this article, we compared the impact of seven simple augmentation techniques on the accuracy of the predictions of a CNN in different binary classification tasks related to medical images. We used several data sets, most of which contained either X-rays of patients with lung infections or internal fixation devices, or images of cancer patients from other modalities. According to our results, the best method of augmentation for X-ray and MRI images is adding Gaussian blur to the images, but slight rotation of 15 degrees or less, cropping the image, and translation also work quite well.