Data derived from biological samples, such as spectra, chromatograms, and microscopic images, contain a wealth of information, some of which often remains unharvested. Traditionally, the interpretation of such data has been grounded in prior knowledge, such as marker molecules that correlate with or are causally related to the target phenomenon. However, this approach does not always encapsulate the entirety of the phenomena in question, and potentially overlooks other valuable information embedded in the data. In other words, focusing too narrowly on certain parameters limits the insights that can be gained from the data, and thus the scope of the analysis.

Recent advances in data-analysis techniques, including machine learning (ML), show promise for overcoming this limitation. ML can effectively extract latent, useful information from complex bioanalytical data, thus facilitating the exploration of unknown relationships and patterns. This article highlights several instances in which ML has helped realize the potential inherent in bioanalytical data.

The infrared and Raman spectra of biological samples reflect the characteristics of the many different molecules contained in such samples. The spectral peaks are usually assigned based on prior knowledge of specific vibrational modes, a practice that can be complicated by overlapping peaks resulting from complex sample compositions. ML addresses this issue by recognizing the overall pattern of the spectrum, thereby facilitating identification of the state and properties of biological samples [1, 2].

Vieira and colleagues have proposed a method that leverages an ML approach, namely principal component analysis (PCA) combined with linear discriminant analysis (LDA), to analyze attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy data for human saliva, which contains various substances. This approach proved effective in identifying the cardiopulmonary exercise load [3]. ML has also enabled the differentiation of counterfeit or low-quality pharmaceuticals from standard ones using near-infrared (NIR) spectroscopy [4].
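To make the PCA-plus-LDA idea concrete, the following is a minimal sketch in Python using only NumPy. It is not the pipeline used in ref. [3]: the spectra are purely synthetic, the two classes differ only by a weak shoulder hidden under a broader band, and all function names (`make_synthetic_spectra`, `pca_fit`, `lda_fit`, `pca_lda_accuracy`) as well as the choice of five principal components are assumptions made for illustration.

```python
import numpy as np


def make_synthetic_spectra(n_per_class=40, n_points=200, seed=0):
    """Return (X, y): two classes of noisy synthetic 'spectra' with overlapping bands."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)

    def band(center, width):
        return np.exp(-0.5 * ((x - center) / width) ** 2)

    base = band(0.30, 0.05) + 0.8 * band(0.60, 0.08)
    shoulder = 0.3 * band(0.65, 0.04)  # class 1 only, hidden under the 0.60 band
    X = base + 0.05 * rng.standard_normal((2 * n_per_class, n_points))
    X[n_per_class:] += shoulder
    y = np.repeat([0, 1], n_per_class)
    return X, y


def pca_fit(X, n_components):
    """Return the mean spectrum and the leading principal axes (one per row)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def lda_fit(Z, y):
    """Two-class Fisher LDA on PCA scores Z; returns weights w and threshold b."""
    Z0, Z1 = Z[y == 0], Z[y == 1]
    m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
    # Pooled within-class scatter matrix.
    Sw = (Z0 - m0).T @ (Z0 - m0) + (Z1 - m1).T @ (Z1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    b = w @ (m0 + m1) / 2.0  # midpoint threshold, assumes equal class priors
    return w, b


def pca_lda_accuracy(X, y, n_components=5):
    """Fit PCA + LDA on even-indexed samples; return accuracy on odd-indexed ones."""
    X_train, y_train = X[0::2], y[0::2]
    X_test, y_test = X[1::2], y[1::2]
    mean, axes = pca_fit(X_train, n_components)
    w, b = lda_fit((X_train - mean) @ axes.T, y_train)
    predicted = ((X_test - mean) @ axes.T @ w > b).astype(int)
    return float(np.mean(predicted == y_test))


if __name__ == "__main__":
    X, y = make_synthetic_spectra()
    print("held-out accuracy:", pca_lda_accuracy(X, y))
```

PCA first compresses each spectrum into a handful of scores, which keeps the within-class scatter matrix well conditioned even when there are far more spectral points than samples. LDA then finds the single direction in score space that best separates the two classes. For real spectra, preprocessing (baseline correction, normalization) and proper cross-validation would be needed, and a library such as scikit-learn would typically be used instead of hand-written routines.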

Nuclear-magnetic-resonance (NMR) spectroscopy is often used to identify the chemical and higher-order structures of compounds and proteins. It is also a powerful tool for acquiring detailed information on complex biological samples. For instance, Xia and colleagues have established a model to predict the moisture content and shelf life of food products by feeding low-field NMR spectra into partial least squares (PLS) and back-propagation artificial-neural-network (BP-ANN) models [5]. Absorption and fluorescence spectra have also been employed in this manner. Examples include the analysis of variations in fluorescence spectra arising from the interaction between biological samples and probe materials that incorporate environmentally responsive dyes. Such approaches are branching out widely, allowing the identification of proteins and cells [6] as well as advancing toward applications in diagnosis and drug discovery [7].
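As an illustration of how a PLS model can map a whole signal onto a single quantity such as moisture content, the sketch below implements single-response PLS with the NIPALS algorithm in NumPy. It is not the model of ref. [5]: the "signals" are synthetic two-component decays whose mixing ratio stands in for a moisture fraction, and the function names, decay constants, noise level, and number of latent components are all hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit single-response PLS (NIPALS); return (x_mean, y_mean, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # latent scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X and y before the next component
        yk = yk - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def make_synthetic_decays(n_samples=120, n_points=100, seed=0):
    """Synthetic decay curves; 'moisture' is the weight of the slow component."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_points)
    moisture = rng.uniform(0.1, 0.5, n_samples)
    slow, fast = np.exp(-t / 0.3), np.exp(-t / 0.05)
    X = np.outer(moisture, slow) + np.outer(1.0 - moisture, fast)
    X += 0.01 * rng.standard_normal(X.shape)
    return X, moisture


def pls1_r2(X, y, n_components=2):
    """Fit on even-indexed samples; return R^2 on odd-indexed ones."""
    train, test = slice(0, None, 2), slice(1, None, 2)
    x_mean, y_mean, coef = pls1_fit(X[train], y[train], n_components)
    predicted = (X[test] - x_mean) @ coef + y_mean
    ss_res = np.sum((y[test] - predicted) ** 2)
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


if __name__ == "__main__":
    X, moisture = make_synthetic_decays()
    print("held-out R^2:", pls1_r2(X, moisture))
```

Unlike ordinary least squares, PLS builds its latent components from the covariance between the signal and the target, so it remains usable when neighboring signal points are strongly correlated, as is typical for spectra and relaxation curves.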

These technologies may not identify the causative substances, but they enable the extraction of the desired information through recognition of overall spectral patterns. Consequently, they show promise for distinguishing the state and type of samples, including non-invasive assessments of physical activity and stress load [3], determination of the quality of commercial pharmaceuticals [4] and food products [5], and cellular diagnostics [6], thus representing a forward-looking approach to the use of bioanalytical data.

Image-acquisition techniques, which are often relatively straightforward, are indispensable in the evaluation of biological samples. In the field of drug discovery, so-called high-content screening (HCS) is widely used, wherein ML is applied to recognize the morphological features of cells from fluorescence-microscopy images of stained cellular structures such as the cytoskeleton and nucleus [8]. Recently, the application scope of HCS has expanded to include unstained optical-microscopy images for, e.g., evaluating the quality of stem cells [9] and selecting in-vitro-fertilized embryos [10]. Vaughan and colleagues have extensively reviewed the effectiveness of ML for the analysis of unstained images in the early detection of harmful algal blooms and water-quality risk assessment, focusing on cyanobacteria [11]. With the proliferation of smartphones, their use as image-acquisition devices is also increasing [12]. Kılıç and colleagues have used smartphones to capture the color change triggered by the reaction between glucose and metal nanoparticles in an aqueous solution [13]. By applying LDA to multiple selected features of these images, they were able to distinguish between different glucose concentrations.
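The image features used in ref. [13] are not specified here, so the following Python sketch only illustrates what a color-feature extraction step might look like before a classifier such as LDA is applied. The function name `color_features`, the choice of mean RGB plus hue, saturation, and value, and the representation of an image region as a list of `(r, g, b)` tuples are all assumptions.

```python
import colorsys


def color_features(pixels):
    """Summarize a region of interest given as (r, g, b) tuples with 0-255 channels.

    Returns the mean of each RGB channel plus the hue, saturation, and value
    (each in the range 0-1) of the mean color.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("at least one pixel is required")
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_g = sum(p[1] for p in pixels) / n
    mean_b = sum(p[2] for p in pixels) / n
    hue, saturation, value = colorsys.rgb_to_hsv(
        mean_r / 255.0, mean_g / 255.0, mean_b / 255.0
    )
    return {
        "mean_r": mean_r,
        "mean_g": mean_g,
        "mean_b": mean_b,
        "hue": hue,
        "saturation": saturation,
        "value": value,
    }


if __name__ == "__main__":
    # A hypothetical reddish patch, e.g. cropped from a photograph of a cuvette.
    patch = [(200, 40, 60), (210, 50, 55), (190, 45, 65)]
    print(color_features(patch))
```

Each photograph would then be reduced to one such feature vector, and the vectors from solutions of known concentration could serve as training data for a classifier. In practice, lighting and camera settings vary between smartphones, so some form of color calibration would likely be needed as well.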

The analysis of data obtained through single-channel current recordings is also garnering attention [14]. Komoto and colleagues have successfully used ML to identify the length of homo-oligomer nucleic acids, a generally difficult task, from current-time profiles recorded while the DNA passes through nanopores [15]. As valuable information can be extracted even from noisy single-channel current-recording data, this approach has the potential to further advance next-generation sequencing technologies.
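Before a classifier can be trained on current-time profiles, individual translocation events usually have to be located and summarized. The sketch below shows one simple, threshold-based way of doing this in plain Python. It is not the procedure of ref. [15]: the function name `detect_blockades`, the use of the median as the open-pore baseline, the 20% drop threshold, and the two features (dwell time and mean blockade depth) are assumptions chosen for illustration.

```python
from statistics import median


def _summarize(segment, baseline, sample_interval):
    """Return (dwell time, mean blockade depth) for one run of blocked samples."""
    dwell = len(segment) * sample_interval
    depth = baseline - sum(segment) / len(segment)
    return dwell, depth


def detect_blockades(current, sample_interval, drop_fraction=0.2):
    """Find contiguous runs where the current falls below the baseline.

    The baseline is taken as the median of the trace, which assumes that the
    pore is open most of the time and that the open-pore current is positive.
    A sample counts as blocked if it lies below (1 - drop_fraction) * baseline.
    """
    baseline = median(current)
    threshold = baseline * (1.0 - drop_fraction)
    events = []
    start = None
    for i, value in enumerate(current):
        if value < threshold and start is None:
            start = i  # event begins
        elif value >= threshold and start is not None:
            events.append(_summarize(current[start:i], baseline, sample_interval))
            start = None  # event ends
    if start is not None:  # trace ended while still blocked
        events.append(_summarize(current[start:], baseline, sample_interval))
    return events


if __name__ == "__main__":
    # Hypothetical trace in arbitrary units, sampled every 1 ms.
    trace = [100] * 10 + [60] * 3 + [100] * 10 + [50] * 5 + [100] * 10
    for dwell, depth in detect_blockades(trace, sample_interval=0.001):
        print(f"dwell = {dwell:.3f} s, depth = {depth:.1f}")
```

The resulting per-event features could then be passed to a classifier. Real recordings are far noisier than this toy trace, so filtering and a more robust baseline estimate would be required, and ML models may also operate on the raw event shapes rather than on hand-picked features.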

Hopefully, this article has demonstrated that ML can significantly enhance the interpretation of bioanalytical data acquired from a diverse range of measurement techniques. The advent of the Python programming language, which offers user-friendly ML libraries and toolkits, along with the use of AI assistant tools with natural-language processing (e.g., ChatGPT) for efficient script creation, has made ML easily accessible to researchers at all levels. These advancements can be expected to pave the way for a future in which the integration of bioanalysis and ML will become commonplace, thereby opening novel avenues in the realm of analytical science.

Figure a. Schematic illustration of bioanalytical-data analysis through machine learning. Reproduced with permission from refs. 3, 6, 13, 15. Copyright 2023 Springer Nature