Introduction

Information on species distribution in the natural environment is critically important for understanding and monitoring ecosystems. In practice, the European Union Nature Information System is being implemented to monitor ecosystems at a pan-European level. In Japan, the national census on river environments-riparian zone (NCRER) has been conducted on-site for over 30 years to monitor aquatic habitats and communities. However, such ground surveys are generally complicated and costly, particularly for aquatic ecosystems. Thus, there is a growing demand for relatively simple and inexpensive alternative methods to assess habitat conditions and species distribution in rivers. For indirect estimation of species distribution, empirical modeling approaches based on habitat information have been developed, for example species distribution models (Guisan and Thuiller 2005; Kameyama et al. 2007). Nevertheless, these methods also require physical habitat information, and field surveys are therefore still unavoidable.

Meanwhile, the integration of environmental data using the latest information technology is being actively explored. For instance, the combination of photographic images and artificial intelligence has been proposed and applied for species identification (Pankhurst 1979; Rauf et al. 2019). Such image recognition has drastically improved the utility of graphical information, and this approach is currently being applied in various fields of biology and ecology (e.g., Favorskaya and Pakhirka 2019; Mohanty et al. 2016). These studies commonly use a convolutional neural network (CNN), one of the deep learning methods, which is trained on many sample images to identify the key features and patterns needed to detect specific species and other biological information in images. This approach can technically be applied to aerial and satellite images for habitat assessment, which targets a larger spatial scale than common species identification (Elango et al. 2022). In the field of environmental assessment of rivers, however, only one investigation has applied machine learning to photographic images (Harrison et al. 2020). In that work, unmanned aircraft systems and machine learning were used to identify spawning sites of salmon based on the color distribution, demonstrating the potential applicability of the method to other cases.

In fact, current knowledge of riverine ecology supports the approach of using riverine images for ecosystem assessment. Firstly, rivers have various hydraulic conditions in terms of flow velocity and depth, which affect species composition and the trophic structure of fish (Kobayashi et al. 2013; Zeni and Casatti 2014). Each fish species shows a specific preference for habitat conditions in relation to its feeding habits and life cycle, and therefore its distribution is closely related to riverine morphology (Wang et al. 2006; Fialho et al. 2008). Secondly, fish species richness is also positively related to riverine morphology, for example the variability in the longitudinal slope of streams (Camana et al. 2016). This is because morphologically heterogeneous habitats accommodate species with different habitat requirements. For example, relatively stagnant habitats such as pools and fluvial lagoons within a river section provide refuges for fish. Huang et al. (2019) also revealed that velocity, turbidity, depth, and wetted width of rivers showed significant relationships with the number of fish species, emphasizing the importance of diverse habitat conditions for the conservation of fish communities.

Against this background, we hypothesized that CNN can be applied to riverine aerial photography for estimating fish distribution. That is, the key attributes of river morphology for fish distribution can possibly be identified by CNN from the appearance of the river. If so, aerial photographs of rivers could be used for estimating the distribution of fish species. For instance, such image analysis may identify the heterogeneity in riverine morphology based on the river shape, degree of meandering, width, depth, and vegetation, which are indicated by the color distribution of the land and water surfaces. Given this possibility, the present study aimed to explore the applicability of CNN and archived aerial photography of rivers for modeling fish distribution in rivers in the Kanto region of Japan. In reality, fish distribution is also determined by water temperature (Neill 1979), water quality (Araújo et al. 2000), and longitudinal connectivity (Tummers et al. 2016). Thus, our attempt also clarifies the relative importance of riverine geomorphology and flow condition in fish distribution. To this end, we tested three specific hypotheses: H1, CNN can identify important attributes of aerial riverine photography for the distribution of fish species; H2, the performance of CNN-based fish distribution model is positively correlated with number of training images; H3, the model performance depends on habitat preferences of fish species. The hypotheses were tested using publicly available data on aerial photographs and fish distribution in rivers in Japan.

Materials and methods

Study area and data

The study area was the Kanto region of Japan. For this region, we collected aerial photographs and fish distribution data at a total of 72 sites in the main streams of the Tama River, the Arakawa River, and the Tone River, and in the tributaries of the Tone River. The data used were archived for the period from 2010 to 2019.

Aerial photographs for the above sites were obtained from the Geographical Survey Institute (URL. https://mapps.gsi.go.jp/maplibSearch.do). To ensure spatial correspondence with the biological data described below, the aerial photographs were taken from the river sections for which corresponding fish data are available in NCRER. In the site selection procedure, the time difference between the photographs and the biological data was kept within five years. All the original photographs collected in this study had a spatial resolution of 20 cm, and their image size varied depending on location and timing. Therefore, to perform machine learning, all the images were standardized to 500 × 550 pixels, corresponding to a consistent spatial coverage of 2.0 km in latitude and 2.2 km in longitude. After this image processing, the spatial resolution of all aerial photographs was 4 m. Ortho-correction was not applied because we did not align the aerial photographs with other geospatial images. In this process, we adjusted the river channel to the center of each image, while the terrestrial parts were treated differently depending on the tested hypothesis. Consequently, the aerial photographs of the Tama River from 2017 to 2019, the Arakawa River from 2010 to 2019, and the Tone River basin from 2010 to 2019 were adopted. In total, we collected 121 aerial images from the 72 sites in the targeted rivers.
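The resolution stated above follows directly from the standardized image size and ground coverage. The short sketch below reproduces that arithmetic; the function name is hypothetical and is not part of the original workflow.

```python
def ground_resolution_m(extent_m: float, pixels: int) -> float:
    """Return the ground distance covered by one pixel, in metres."""
    if pixels <= 0:
        raise ValueError("pixels must be positive")
    return extent_m / pixels


# Standardized image: 500 x 550 pixels covering 2.0 km x 2.2 km.
res_short_side = ground_resolution_m(2000.0, 500)  # 2.0 km side
res_long_side = ground_resolution_m(2200.0, 550)   # 2.2 km side

# Coarsening factor relative to the original 20 cm photographs.
coarsening_factor = res_short_side / 0.20
```

Both sides give 4 m per pixel, so each standardized pixel covers roughly 20 × 20 pixels of the original 20 cm imagery.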

Data on fish distribution were obtained from NCRER (URL. http://www.nilim.go.jp/lab/fbg/ksnkankyo/). Each survey section covers approximately 2 km in the longitudinal direction. In this study, the time periods of the fish data were 2016 (June and October) for the Tama River, 2015 (June, August, and September) for the Arakawa River, and 2014 (June and November) for the Tone River basin. Migratory fish in the Kanto region are generally known to inhabit rivers from spring to fall (Kawanabe and Mizuno 2001). Thus, the fish data collected from June to November represent their typical distribution in rivers. Among the observed species, we selected three: yellowfin goby (Acanthogobius flavimanus, Mahaze in Japanese), dark chub (Nipponocypris temminckii, Kawamutsu in Japanese), and sweetfish (Plecoglossus altivelis, Ayu in Japanese). These three species were selected because they are commonly observed in the targeted rivers and show different habitat characteristics (Kawanabe and Mizuno 2001). Thus, they are suitable for investigating the applicability of the image recognition technique for estimating fish habitats in rivers. If a species was observed at least once during the survey period, the river section was considered to be inhabited by that species.
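The presence rule in the last sentence can be written as a small aggregation step. The following is a minimal sketch only; the record layout, site identifiers, and function name are assumptions made for illustration and do not reflect the actual NCRER data format.

```python
from collections import defaultdict


def presence_absence(records, sites, species):
    """Collapse repeated surveys into one label per site.

    records: iterable of (site_id, survey_month, species_name) observations
             (hypothetical layout).
    Returns {site_id: 1 or 0}; a site is labelled 1 (present) if the species
    was recorded in at least one survey, otherwise 0 (absent).
    """
    observed = defaultdict(set)
    for site_id, _month, name in records:
        observed[site_id].add(name)
    return {site: int(species in observed[site]) for site in sites}


# Hypothetical example records, not real survey data.
example_records = [
    ("site_01", "June", "sweetfish"),
    ("site_01", "October", "dark chub"),
    ("site_02", "June", "dark chub"),
]
example_labels = presence_absence(
    example_records, ["site_01", "site_02", "site_03"], "sweetfish"
)
```

A site with no records at all, such as `site_03` here, is labelled absent, which matches the rule only if every listed site was actually surveyed.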

The general habitat preferences of the three target fish species can be described as follows, according to Kawanabe and Mizuno (2001). Yellowfin goby (a benthic species) is mainly distributed in brackish water and estuarine areas. Dark chub (a freshwater fish) is mainly found in pools in the upper and middle reaches of rivers. Sweetfish (a migratory species) is distributed mainly in the middle reaches of rivers, although this species changes its habitat depending on its developmental stage. In general, fish species are known to change their habitats with life history stage and season (Yoshimura et al. 2005). Specifically, yellowfin goby is known to migrate closer to the sea as it grows older, and yellowfin goby and juvenile sweetfish can inhabit marine areas in winter. In this study, therefore, fish data were collected only in summer and autumn, specifically from June to November. This way of data collection minimized the effect of seasonal habitat change, although this study did not distinguish habitat changes over life stages (or habitat preference patterns). For model building, we compiled the presence and absence (P/A) of each species at the 72 sites in the three river basins. As the number of images is 121, there are 121 sets of aerial photographs and P/A information for each species. The overall proportions of presence in all studied sites were 51.3% for sweetfish, 43.0% for dark chub, and 37.5% for yellowfin goby (Fig. A1).

Fig. 1

Aerial photographs for Hypothesis 1. Panel A: an original image; B: an image with only the river channel; C: an image without the river channel. Panels A, B, and C are at the same scale

Besides local geomorphological conditions, the spatial distribution of fish species is influenced by other conditions such as water temperature, water quality, and connectivity. We confirmed that the water temperature at the targeted sites ranged from 9.1 °C to 29.5 °C from June to November (NCRER, URL. http://www.nilim.go.jp/lab/fbg/ksnkankyo/) and that no substantial water pollution was present (i.e., BOD < 5.0 mg/L, URL. https://www.mlit.go.jp/river/toukei_chousa/kankyo/kankyou/suisitu/r2_suisitu.html). In addition, the several water intake weirs built in the studied rivers are equipped with fish passages, and we confirmed the presence of sweetfish in the upstream sections (Fig. A1). Thus, the local water conditions and longitudinal connectivity in the targeted river sections are considered to have only a minor influence on the distribution of the three species.

Model building

To link the aerial photographs to fish distribution, we applied a residual neural network (Res-Net), which is a type of CNN (He et al. 2016; Krizhevsky et al. 2012). Res-Net has a unique structure of shortcut connections in the neural network, which provides high accuracy in image-based prediction. Thus, it is commonly used for image recognition models (e.g., He et al. 2016; Veit et al. 2016). Deep learning using Res-Net was performed on Neural Network Console, an artificial intelligence development tool provided online by SONY (URL. https://dl.sony.com/ja/). An NVIDIA T4 Graphics Processing Unit (GPU) was used for deep learning. This type of GPU can process images of the above-mentioned resolution at a relatively low cost. In model building, the presence and absence of each target species were encoded as 1 and 0, respectively. In the learning phase, Res-Net was trained on these binary data to link each image to P/A based on the image properties. In this analysis, model outputs equal to or over 0.5 were considered to be “present”, while outputs less than 0.5 were considered to be “absent”.
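The two ideas in this paragraph, the shortcut connection and the 0.5 decision threshold, can be illustrated with a toy sketch. It is not the Neural Network Console model: for brevity it uses small fully connected layers instead of convolutions, and all function names and weights are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(weights, values):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between.

    The '+ x' term is the shortcut connection: the input bypasses F and is
    added back to its output.
    """
    hidden = relu(linear(w1, x))
    transformed = linear(w2, hidden)
    return relu([t + xi for t, xi in zip(transformed, x)])


def classify(outputs, threshold=0.5):
    """Map model outputs to presence (1) or absence (0); ties count as present."""
    return [1 if score >= threshold else 0 for score in outputs]


# With all-zero weights F(x) is zero, so the block passes its input through.
zeros = [[0.0, 0.0], [0.0, 0.0]]
passed_through = residual_block([1.0, 2.0], zeros, zeros)

# Hypothetical model outputs for four images.
labels = classify([0.12, 0.50, 0.49, 0.93])
```

The pass-through behaviour with zero weights shows why shortcut connections make deep networks easier to train: a block can fall back to the identity mapping rather than degrading its input.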

For all analyses (hypothesis tests), Res-Net with 18 layers (Res-Net 18 hereafter) was used as the deep learning model, because a preliminary test confirmed that the model suffers from overfitting if a larger number of layers is used for the present data size. Res-Net first learned from pairs of an image and the corresponding P/A information of a specific fish species in the training data, meaning that Res-Net sought image attributes that explain the P/A information of the target species. Then, the performance of each built model was assessed using the validation data. Model performance was evaluated by the area under the curve (AUC, Hand 2009) and by accuracy, precision, and recall (Fawcett 2006). In short, AUC is a measure of how well the model separates the images of the two classes. Accuracy is the proportion of correct predictions, precision is the proportion of predicted presences that are correct, and recall is the proportion of actual presences that are correctly predicted. These indices were calculated on the validation data and range from 0 to 1, with a larger value indicating higher model performance.
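The four indicators can be computed from the true labels, the predicted labels, and the raw model outputs. The sketch below uses only the standard library and computes AUC with the rank-based (pairwise comparison) formulation; the function names and the example values are illustrative assumptions, not the tool or data used in the study.

```python
def accuracy_precision_recall(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels (1 = present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def auc(y_true, scores):
    """Probability that a random presence image scores higher than a random
    absence image; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and model outputs (not results of the study).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc, prec, rec = accuracy_precision_recall(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy, precision, and recall depend on the 0.5 threshold, whereas AUC is computed from the raw outputs and is therefore threshold-independent.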

Hypothesis test

To evaluate the applicability of the proposed approach, we tested three hypotheses (H1, H2, and H3), which are described below. From the 121 collected aerial images, 80 and 20 images were randomly selected for training and validation, respectively, for all hypothesis tests except H2, which involved different numbers of training images. By repeating this random selection, we created five different datasets (A, B, C, D, and E) for testing each hypothesis. We set the batch size to 1 because of the relatively small size of the prepared data, and the number of iterations to 100. We then evaluated the model showing the best accuracy among the 100 training iterations. Based on the model performance, the hypotheses were tested by one-way ANOVA and Pearson’s correlation analysis using R (version 4.0.3).
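The repeated random selection described above can be sketched as follows. The function name, the seed, and the use of integer image identifiers are assumptions for illustration; the original selection procedure is not specified beyond being random.

```python
import random


def make_datasets(image_ids, n_train=80, n_valid=20, n_repeats=5, seed=0):
    """Draw n_repeats independent train/validation splits without overlap
    between the training and validation images of the same split."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    datasets = []
    for _ in range(n_repeats):
        chosen = rng.sample(image_ids, n_train + n_valid)
        datasets.append((chosen[:n_train], chosen[n_train:]))
    return datasets


# 121 images, identified here simply by index.
datasets = make_datasets(list(range(121)))
```

This sketch splits by image, as the text describes. Because the 121 images come from 72 sites, images of the same site may fall into both the training and the validation set of one split; whether this was controlled in the study is not stated.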

H1: CNN can identify important attributes of aerial riverine photography for the distribution of fish species

To test this hypothesis, image processing was applied to the obtained aerial photographs to produce three types of images: a) original images, b) images containing only the extracted river channel, and c) images excluding the river channel as a negative control (Fig. 1). The river channel here refers to the area inside the dikes; where no dike exists, we took the boundary of the inundated area as the boundary of the river channel. This image processing (i.e., setting border lines) was performed manually, which could introduce bias into the subsequent analyses. Nevertheless, in this image recognition, the separation of terrestrial and riverine areas should be more influential than this bias, because the land areas carry much more information than the difference created by the line assignment. For each of the targeted species, we applied CNN to each type of dataset together with the P/A information at each site. The hypothesis was assessed by testing whether the performance differed significantly among the models based on the three types of images. A significant difference would indicate that CNN is able to recognize geomorphological characteristics that are related to fish habitat conditions. In addition, we discussed the importance of image preprocessing by comparing the performance of the models.
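The comparison among the three image types rests on the one-way ANOVA F statistic. The analysis in this study was run in R; the Python sketch below only illustrates how the statistic is formed from between-group and within-group variation. The function name and the accuracy values are hypothetical, and the p-value (which requires the F distribution) is omitted.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical accuracies for datasets A-E under each image type
# (illustrative values only, not results of this study).
original = [0.80, 0.75, 0.85, 0.80, 0.75]
channel_only = [0.75, 0.70, 0.80, 0.75, 0.70]
no_channel = [0.60, 0.65, 0.70, 0.60, 0.65]

f_stat, df_b, df_w = one_way_anova_f([original, channel_only, no_channel])
```

With three image types and five datasets each, the degrees of freedom are 2 (between groups) and 12 (within groups); a large F indicates that the image type explains more variation than the random dataset selection does.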

H2: The performance of CNN-based fish distribution model is positively correlated with number of training images

To test this hypothesis, six models were prepared based on different numbers of training images: 10, 30, 40, 60, 80, and 100. CNN was applied to five randomly produced datasets of each size for each species. For each model, the number of validation images was set to 20, which is the same as in the tests of H1 and H3. The hypothesis was assessed by testing whether the model performance is positively correlated with the number of training images. Note that this is a general pattern in machine learning including CNN (Cho et al. 2015; Gomez Villa et al. 2017), and thus we examined whether it also holds for estimating fish distribution from riverine aerial photography.
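The correlation between training-set size and a performance indicator can be sketched as below. This is a plain Python illustration of Pearson's correlation coefficient, not the R code used in the study; the function name and the AUC values are hypothetical, and the significance test is omitted.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Training-set sizes from the text, paired with hypothetical AUC values
# (illustrative only, one value per size instead of five datasets).
n_training = [10, 30, 40, 60, 80, 100]
hypothetical_auc = [0.55, 0.60, 0.58, 0.66, 0.70, 0.72]

r = pearson_r(n_training, hypothetical_auc)
```

In the study design, each training-set size contributes five values (one per random dataset), so the correlation would be computed over 30 pairs per species and indicator.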

H3: The model performance depends on habitat preferences of fish species

This hypothesis examined how the performance of the model differs with the habitat preferences of fish species. CNN was applied to five datasets consisting of pairs of aerial images and P/A information. The hypothesis was assessed by testing whether the performance of the models differs significantly among the three species. Yellowfin goby and dark chub inhabit freshwater or brackish water, while sweetfish is a diadromous species. Although many riverine fish species migrate and change their habitats over their life cycle, the typical habitats of yellowfin goby and dark chub are known to be the riverbed and stagnant water, respectively (Kawanabe and Mizuno 2001). Thus, we may speculate that the model accuracy for the distribution of those two species is higher than that for sweetfish, because sweetfish is migratory and thus widely distributed.

Results

H1: CNN can identify important attributes of aerial riverine photography for the distribution of fish species

Regarding the performance indicators for the three species, AUC ranged from 0.61 to 0.79, accuracy ranged from 0.64 to 0.80, precision ranged from 0.65 to 0.84, and recall ranged from 0.59 to 0.79 (Fig. 2). According to one-way ANOVA, images containing the river channel resulted in a CNN-based model with significantly higher performance than images without the river channel only for yellowfin goby, while the type of image did not affect the model performance for dark chub and sweetfish. For yellowfin goby in particular, models based on original and river channel images showed higher indicators (i.e., AUC, accuracy, and recall) than those based on images without the river channel (hereafter, urban images). The average AUC was 0.79 for the models based on original images, 0.74 for the models based only on river channel images, and 0.61 for the models based only on urban images. Precision did not show a significant difference, although it was low when only urban images were used.

Fig. 2

Performance indicators of models for Hypothesis 1, stating that CNN can identify important attributes of aerial riverine photography for the distribution of fish species. Models were based on 80 training images and 20 validation images. Error bars indicate standard deviation. * 0.01 < p < 0.05, ** 0.001 < p < 0.01 (ANOVA)

H2: The performance of CNN-based fish distribution model is positively correlated with number of training images

According to the correlation test, the model for yellowfin goby showed that all four performance indicators were positively correlated with the number of training images (Fig. 3), and their correlation coefficients ranged from 0.37 to 0.42. The model for dark chub showed significant positive correlation only for AUC but not for other indicators, while sweetfish showed no correlation between the number of training images and any performance indicator (Fig. 3).

Fig. 3

Performance indicators of models for Hypothesis 2, stating that the performance of CNN-based fish distribution model is positively correlated with number of training images. Error bars indicate standard deviation. * 0.01 < p < 0.05 (probability of no correlation)

H3: The model performance depends on habitat preferences of fish species

Regarding the performance indicators for the three species, AUC ranged from 0.61 to 0.73, accuracy ranged from 0.67 to 0.80, precision ranged from 0.65 to 0.81, and recall ranged from 0.64 to 0.77 (Fig. 4). According to one-way ANOVA, no significant difference among species was found for any performance indicator.

Fig. 4

Performance indicators of models for Hypothesis 3, stating that the model performance depends on habitat preferences of fish species. Error bars indicate standard deviation. Models were based on 80 training images and 20 validation images. According to one-way ANOVA, significant difference was not found for any performance indicator

Discussion

The hypothesis test for H1 confirmed that CNN is able to identify important attributes of aerial riverine photography for the distribution of yellowfin goby, but not for dark chub and sweetfish. The reason for this difference among species is possibly related to the habitat conditions suitable for each species. By visual comparison of aerial images (Fig. A2), we confirmed that the habitat of yellowfin goby was characterized by wide river sections and relatively dark water surfaces. Previous research has also shown that yellowfin goby mainly inhabits the bottom of downstream river sections and brackish water (Kawanabe and Mizuno 2001). Relatively dark-colored water surfaces in particular are often found in deeper sections. Thus, a relatively dark water surface and a wide channel seem to serve as important factors in this CNN-based machine learning.

In contrast, CNN failed to detect common features in the images for dark chub and sweetfish. Nevertheless, based on visual comparison, the habitat of dark chub seems to be characterized by highly meandering sections (Fig. A3 A–D), and the habitat of sweetfish by spatially heterogeneous flow conditions (Fig. A4). These patterns are reasonable to some extent, as they indicate suitable habitats in relatively natural sections of the rivers, although dark chub is known to inhabit mainly the upper reaches of rivers, while sweetfish is found in both upstream and downstream sections as it is a migratory fish (Kawanabe and Mizuno 2001). Thus, the results imply that whether key hydrological and geomorphological conditions for fish distribution can be automatically extracted by CNN depends on the fish species.

Classification by color is a particular advantage of machine learning. For instance, Harrison et al. (2020) were able to estimate the spawning locations of salmon based on color characteristics with an accuracy of up to 0.9 by integrating spectral analysis and machine learning. Thus, contrasting colors of the water surface, which indicate depth contours or bathymetry, should be key information in aerial images for habitat assessment. Our results also provide evidence that machine learning is able to recognize the information in aerial imagery that determines the habitat of yellowfin goby. In addition, we should note that CNN might also recognize geomorphological features such as the degree of meandering, hydraulic variability, sand bars, and vegetation, directly or indirectly. Such features may also be key information for explaining suitable habitat conditions. The key information within each aerial image could be further specified by applying gradient-weighted class activation mapping (Grad-CAM) (Chen et al. 2020), which visualizes the parts of an image responsible for the predicted fish distribution.

Interestingly, in the results for yellowfin goby, precision did not show a significant difference among the types of images, although the other indicators did (Fig. 2). Precision indicates the percentage of correct predictions among the sites where a model predicts the presence of a targeted species, whereas recall indicates how well a model predicts the sites where the species is actually present. Due to the relatively small number of images used for training (i.e., n = 80), there might not be much diversity in the habitat patterns learned by the CNN in this study. In other words, if the characteristics of a limited set of habitat patterns could be captured, it would be relatively easy to reproduce and estimate those habitat patterns, and thus recall would show a significant difference among the types of images. However, if some of the key habitats were not recognized in the training phase but are present in the images used for validation, they could not be properly judged by machine learning. This is a possible reason why precision did not show a significant difference in the hypothesis test (H1). Previous studies have also shown that these two indicators tend to show a trade-off relationship (Buckland and Gey 1994), which might be the case in this study as well. In that case, if a sufficient number of images covering the majority of typical habitat patterns were available for training CNN on the distribution of yellowfin goby, precision would possibly also show a significant difference, as the other indicators did.

Dark chub and sweetfish did not show a significant difference in performance indicators among the types of images (Fig. 2). In this study, we assumed that information within river channels was important for estimating fish habitat. However, the results imply that CNN also identified clues in terrestrial areas (e.g., Fig. 1C) that were useful for estimating fish distributions. For example, forests and farms located in the upstream reaches might provide clues for estimating the distribution of dark chub, as this species tends to inhabit upstream reaches (Kawanabe and Mizuno 2001). Residential areas in the middle and downstream reaches of the rivers might also provide hints for this estimation. In short, the relative importance of terrestrial and in-channel information was not significantly different for these two species. To develop more accurate models for these types of fish, careful integration of other factors (e.g., water temperature, water quality, flow velocity, and depth profile) would be required.

The hypothesis test for H2 also indicated that key hydrological and geomorphological conditions for yellowfin goby distribution could be extracted by CNN, which is consistent with the result of the hypothesis test for H1. In other words, based on the hypothesis tests, we can expect further improvement in accuracy from increasing the number of training images for fish species whose habitat conditions are closely related to the visual information in the river channel. In addition, the hypothesis regarding habitat preferences of fish species (H3) was not accepted, meaning that the model performance did not vary among fish species, at least in the present analysis. Although the model for yellowfin goby showed a higher performance than the others, it also has a certain margin of error, which may explain why the results show no significant differences among the fish species. The errors may be related to the randomness of CNN, the relatively small set of training data, and the difficulty of learning the characteristics of the fish habitat. The randomness of CNN means that the filters used to extract image features are randomly initialized and then adjusted by the training process, resulting in variations in accuracy (Albawi et al. 2018).

In this study, the number of training images was at most 100, while general image-based deep learning often uses 50,000 to 100,000 training images (Ando et al. 2019; Mohanty et al. 2016). Thus, further accumulation of data is clearly worthwhile. This study collected 121 aerial images from the basins of the Tama River, the Arakawa River, and the Tone River. Considering the data availability of NCRER, we confirmed that at most approximately 60 additional aerial images can be collected for the remaining first-class rivers in the Kanto region. Extra aerial river images might be available near cities. If we widen the target area to the whole of Japan, it would be possible to collect at least 500 aerial images. Although this would still not be as many as in typical deep learning practice, a significant improvement in describing fish distribution could be expected compared with the results presented here.

As a further possibility, the use of satellite imagery is also expected to increase the number of images for training and calibration. In this case, based on the results for Hypothesis 1, images with a spatial resolution sufficient to recognize the river width and the water color can be applied for the estimation of fish habitats. For example, WorldView-3 (spatial resolution of 31 cm) and GeoEye-1 (spatial resolution of 41 cm) would be candidates, although they are not free of charge. Nevertheless, when these satellite images are used for this purpose, the effect of their spatial resolution on accuracy should be carefully examined. In addition, the distribution of species showing relatively limited migration, such as yellowfin goby and others (e.g., Odontobutis obscura, Acanthogobius lactipes, Mugilogobius abei), could possibly be estimated from aerial photographs with high accuracy, which may be tested by collecting additional data on a national scale. For fish species inhabiting upstream sections, such as dark chub, it would be effective to revise the model structure by integrating information other than images, such as water quality and the presence of weirs and dams. Migratory fish such as sweetfish would be better modeled by including hydrological conditions such as flow regime and water temperature, including their seasonal shifts, and not only the topographical factors that can be determined from aerial imagery.

This study demonstrated the possibility of directly applying a residual neural network to riverine aerial photography for estimating fish distribution. In conclusion, it was possible to estimate the habitat of yellowfin goby to a certain degree using aerial imagery of river channels, and further improvement in accuracy can be expected by increasing the number of training images. However, practical application is still challenging in terms of available data and model validity. For some fish species, information obtained from aerial photographs alone is not sufficient to identify habitats. Rare and endangered species would pose an additional challenge, as fewer presence records would be available for them than for the species presented in this study. Nevertheless, further improvement in accuracy could be expected if models included more detailed factors such as habitat structure (e.g., rapids, pools, and isolated water), water quality, water temperature, and flow velocity.