The Molecular Twin artificial-intelligence platform integrates multi-omic data to predict outcomes for pancreatic adenocarcinoma patients

Osipov, Arsen; Nikolic, Ognjen; Gertych, Arkadiusz; Parker, Sarah; Hendifar, Andrew; Singh, Pranav; Filippova, Darya; Dagliyan, Grant; Ferrone, Cristina R.; Zheng, Lei; Moore, Jason H.; Tourtellotte, Warren; Van Eyk, Jennifer E.; Theodorescu, Dan

doi:10.1038/s43018-023-00697-7

Download PDF

Article
Open access
Published: 22 January 2024

The Molecular Twin artificial-intelligence platform integrates multi-omic data to predict outcomes for pancreatic adenocarcinoma patients

Arsen Osipov^1,2,3^na1,
Ognjen Nikolic⁴^na1,
Arkadiusz Gertych ORCID: orcid.org/0000-0002-3107-602X^2,5,6^na1,
Sarah Parker^2,7,
Andrew Hendifar^1,2,
Pranav Singh⁴,
Darya Filippova⁴,
Grant Dagliyan²,
Cristina R. Ferrone^2,6,
Lei Zheng³,
Jason H. Moore^2,8,
Warren Tourtellotte^2,5,
Jennifer E. Van Eyk ORCID: orcid.org/0000-0001-9050-148X^2,5,7 &
…
Dan Theodorescu ORCID: orcid.org/0000-0002-8708-8206^2,5,9

Nature Cancer volume 5, pages 299–314 (2024)Cite this article

16k Accesses
2 Citations
98 Altmetric
Metrics details

Subjects

Abstract

Contemporary analyses focused on a limited number of clinical and molecular biomarkers have been unable to accurately predict clinical outcomes in pancreatic ductal adenocarcinoma. Here we describe a precision medicine platform known as the Molecular Twin consisting of advanced machine-learning models and use it to analyze a dataset of 6,363 clinical and multi-omic molecular features from patients with resected pancreatic ductal adenocarcinoma to accurately predict disease survival (DS). We show that a full multi-omic model predicts DS with the highest accuracy and that plasma protein is the top single-omic predictor of DS. A parsimonious model learning only 589 multi-omic features demonstrated similar predictive performance as the full multi-omic model. Our platform enables discovery of parsimonious biomarker panels and performance assessment of outcome prediction models learning from resource-intensive panels. This approach has considerable potential to impact clinical care and democratize precision cancer medicine worldwide.

Harnessing multimodal data integration to advance precision oncology

Article 18 October 2021

Serum biomarker-based early detection of pancreatic ductal adenocarcinomas with ensemble learning

Article Open access 20 January 2023

High-dimensional role of AI and machine learning in cancer research

Article 10 January 2022

Main

Pancreatic ductal adenocarcinoma (PDAC) is one of the most aggressive malignancies. It accounts for 55,550 deaths in the United States and is expected to become the second-leading cause of cancer-related deaths nationally by 2030 (refs. ^1,2). While only 30–40% of patients with PDAC present with localized disease and undergo potentially curative surgical resection after diagnosis or following neoadjuvant chemotherapy, most develop recurrence and succumb to their disease^3,4,5. Despite advancements in molecular testing, serum carbohydrate antigen 19-9 (CA 19-9), first discovered in 1979, is presently the only US Food and Drug Administration (FDA)-approved biomarker widely employed for diagnostic management and preoperative prognostication of PDAC⁶; however, CA 19-9 has limitations, with a high false-positive rate due to other pathologic conditions and can result in false negatives in about 10% of the population⁷. Thus, there is an urgent need for improvement in new markers aimed at identifying patients most likely to be cured by surgery and/or respond to systemic therapies^8,9.

One approach that could lead to such improvements is combining comprehensive molecular evaluation of the tumor and host with machine-learning (ML) models. Studies in other tumor types have employed ML and utilized various molecular analytes to predict therapy response and refine prognosis^10,11,12. Most of these investigations, especially those on PDAC, have only focused on limited biological variables such as DNA and combined these with ML to determine whether findings can predict outcomes¹³. Multi-omic proteogenomic studies in PDAC have revealed unique phenotypes of PDAC, but they have shown limited ability to predict clinical outcomes¹⁴. Even if effective, the nature of such multi-omic analyses comes with high complexity and resource cost. Thus, an important consideration in the development of new predictive biomarkers is how to utilize the power of multi-omics to develop parsimonious panels of markers that would be both cost-effective and deployable in clinical practice in both resource rich and limited countries.

Here we use a multi-omic analytic platform that incorporates advanced molecular profiling beyond the examination of common analytes. Molecular profiling data were collected from both tumor and host samples and included computational pathology features. Multiple ML models were developed and applied to this dataset to test the hypothesis that this approach can provide biomarker panels that accurately predict DS after surgery in patients with resectable PDAC. Through recursive feature/analyte elimination, our approach was also able to provide a parsimonious model employing a limited number of features/analytes, which maintains a high degree of performance in prediction of DS compared to the full optimal models that we developed. Utilizing external samples/data from The Cancer Genome Atlas (TCGA), Johns Hopkins University (JHU) and Massachusetts General Hospital (MGH), we independently validated the power of our full and parsimonious ML models to predict DS. Through this analysis, we also discovered that among all analytes available in the preoperative setting, plasma protein is the most critical biomarker with substantial predictive power for survival and superior to CA 19-9. This work is a proof of concept for our Molecular Twin platform; a virtual, bioinformatic computational replica of the patient that can be updated and enriched in space and time with additional analyte types obtained longitudinally. While we utilize PDAC here, this approach is tumor-type agnostic, allowing it to potentially impact clinical care and scientific discovery across all cancers.

Results

Patient baseline demographics and specimen handling

Our Molecular Twin Pilot (MT-Pilot) cohort included 74 patients at clinical stage I (n = 47) and II (n = 27) with surgically resected PDAC between March 2015 and April 2019. Tumor specimens were collected at the time of surgery and plasma specimens were collected preoperatively. DS was recorded and treated as a binary end point at the time of analysis as of 21 October 2021. At this time, 45 (61%) patients were deceased. All demographic and clinical characteristics (Supplementary Table 1) were included as features for the clinical analyte in our multi-omic analysis. The surgical pathology information was obtained from a pancreas tumor resection. Tumor and plasma specimens were assessed for individual features by molecular profiling, including targeted next-generation DNA sequencing (NGS), full-transcriptome RNA sequencing, paired (tumor and normal from the same patient) tissue proteomics, unpaired (tumor from patients and normal unrelated controls) plasma proteomics, lipidomics and computational pathology. Analyte profiling yielded features that we used to validate single- and multi-omic models for predicting DS; a leave-one-out cross-validation approach was applied to the MT-Pilot cohort, whereas the four independent datasets: TCGA, JHU cohort 1, JHU cohort 2 and MGH were used to validate our ML models and feature panels developed based on the MT-Pilot data (Fig. 1).

**Fig. 1: Study classification methodology overview.**

Clinical and surgical pathology features influence outcomes

The 331 clinical features, including surgical pathology features and chemotherapy treatment (Supplementary Table 1), as well as comorbidities (Supplementary Table 2) were analyzed using multiple ML models. When trained with these features, the random forest (RF) was the top performing model in determining DS and achieved an accuracy of 0.70 (95% confidence interval (CI) 0.60–0.81) and positive predictive value (PPV) of 0.71 (95% CI 0.60–0.82) (Table 1and Extended Data Fig. 1). The top features predicting outcome included comorbidities, such as hyperlipidemia, jaundice and pancreatitis, as well as surgical margin status (Supplementary Table 2), which are known in the PDAC field^15,16,17. The model for DS was predominantly driven by comorbid conditions, which accounted for 306 of the 331 total features. The RF model was also trained using the remaining 25 features, which included known PDAC predictors such as previous chemotherapy and margin status. This model performed similarly to ones that included all clinical features (Supplementary Table 2). Notably, the top ten features of this model included surgical margin status, tumor grade and chemotherapy, which are known to influence patient outcome^18,19.

Table 1 Top single-omic and multi-omic analytes for predicting disease survival in PDAC in the MT-Pilot cohort

Full size table

DNA analysis reveals alterations with prognostic importance

Point mutations and insertion/deletion (INDEL) polymorphisms are common in established PDAC oncogenes and tumor suppressor genes²⁰. Tissue samples were processed for 611 somatic single-nucleotide variants (SNVs), 648 copy-number variations (CNVs) and 126 INDELs. These features were then used in patient DS prediction models (Supplementary Table 3).

Using SNV features, the top-performing model to determine DS was RF, with accuracy of 0.64 (95% CI 0.53–0.75) and PPV of 0.66 (95% CI 0.55–0.77) (Table 1 and Extended Data Fig. 1). In models evaluating SNVs, we found that alterations in RAD51, IL6R, FGF20 and SOX2 genes were the top DS predictors (Supplementary Table 3) and their associated signaling pathways have important prognostic implications in PDAC^21,22,23. In addition, we found genes, such as RIT1, that were top predictive DS markers identified by our model and not previously associated with PDAC prognosis or targetable pathways.

Using CNV features, the top-performing model to determine DS was an RF model with an accuracy of 0.65 (95% CI 0.57–0.80) and PPV of 0.68 (95% CI 0.57–0.80) (Table 1and Extended Data Fig. 1). The top CNV features for DS are noted in Supplementary Table 3. Notably, we found that FOXQ1 and KDM5D were top predictors associated with DS. Both are markers for PDAC prognosis and potential therapeutic targets^24,25,26. In our cohort, the four commonly mutated genes, KRAS, TP53, CDKN2A and SMAD4 (ref. ²⁷), were included among a total of 126 specific INDEL features and were learned by multiple ML model types. The top model predicting DS was RF with an accuracy of 0.64 (95% CI 0.53–0.75) and PPV of 0.70 (95% CI 0.58–0.82) (Table 1 and Extended Data Fig. 1). The top features in the model included mutations of TP53, CDKN2A and SMAD4 (refs. ^28,29), which have been shown to correlate with poor prognosis and more aggressive phenotypes of PDAC.

RNA signatures of drug resistance impact prognosis

Whole-transcriptome sequencing was performed on 72 of the 74 formalin-fixed paraffin-embedded (FFPE) tumor tissue samples. To optimize the most predictive features, we first ran a differential expression analysis between cancer and noncancer samples from the GTex Consortium to select RNA gene transcripts for downstream modeling³⁰. The top-performing model to determine DS was L1-normalized RF, which yielded an accuracy of 0.68 (95% CI 0.56–0.80) and PPV of 0.70 (95% CI 0.57–0.83) (Table 1 and Extended Data Fig. 1). In our top model for DS prediction the NFE2L2 and LRIG3 genes were the two top features (Supplementary Table 4). Recent investigations have shown that the NRF2 pathway through NFE2L2 regulates resistance to drugs and immunotherapy^31,32. Additionally, a total of 29 RNA fusions were analyzed using multiple model types (Supplementary Table 4). The top performing model featuring RNA fusions to determine DS, was support vector machine (SVM) with an accuracy of 0.75 (95% CI 0.64–0.87) and PPV of 0.74 (95% CI 0.62–0.87) (Table 1 and Extended Data Fig. 1).

Plasma proteins are critical biomarkers in survival prediction

Proteomics and lipidomics analysis initially generated 3,777 tumor tissue proteomic, 1,051 plasma proteomic and 939 lipidomic features (Supplementary Table 5).

Using tissue protein features, the top performing model to predict DS was RF model with accuracy of 0.73 (95% CI 0.61–0.86) and PPV of 0.76 (95% CI 0.63–0.89) (Table 1 and Extended Data Fig. 1). For plasma protein features, the top-performing model for DS was the five-hidden-layer-deep neural network model with an accuracy of 0.75 (95% CI 0.63–0.86) and PPV of 0.80 (95% CI 0.68–0.90) (Table 1 and Extended Data Fig. 1). Among DS predictive plasma proteins, we identified ANXA1, which is an important emerging player in pancreatic carcinogenesis and PDAC drug resistance^33,34. The top performing model using plasma lipid features to determine DS was the RF model with an accuracy of 0.71 (95% CI 0.58–0.83) and PPV of 0.74 (95% CI 0.61–0.87) (Table 1 and Extended Data Fig. 1). The top plasma lipidomics features for DS were driven by diacylglycerols and cholesteryl esters (Supplementary Table 5).

As discussed above, CA 19-9 is routinely utilized in clinical practice at PDAC diagnosis, pre- and postoperatively to assess disease biology, treatment response and prognosis. CA 19-9 readouts obtained at diagnosis, before surgery and postoperatively, were learned by the RF model, but the DS prediction had a low accuracy (0.59–0.64, 95% CI 0.47–0.76) and PPV (0.52–0.61, 95% CI 0.40–73) across all time points (Supplementary Table 6).

Predictive nuclear morphology via computational pathology

The 71 hematoxylin and eosin (H&E)-stained PDAC tissue whole-slide images (WSIs) were evaluated by a artificial intelligence (AI)-based computational pathology pipeline (Fig. 2). The pipeline included two convolutional neural network models: a model to mask-out cancer cells (Fig. 2a) and a model to delineate nuclei (Fig. 2b). When tested on images from an independent set of 40 PDAC cases, the cancer-masking model achieved 0.90 global accuracy, 0.784 mean intersection over union (mIoU) and mean F1-scores of 0.83 and 0.77 in identifying nontumor and tumor tissue pixels, respectively. Next, the pipeline was run on 2,908 regions (~41 ± 11 regions per case) randomly selected from the 71 WSIs in our cohort and automatically isolated 345,038 tumor cell nuclei (~4,860 nuclei per case). Nuclear morphology and texture were quantitated by a panel of 63 characteristics. Distribution of characteristics in each case was further summarized by 13 order statistics, yielding 819 features per case (Fig. 2c and Supplementary Table 7). A uniform manifold approximation and projection (UMAP) plot revealed clusters of cases with similar outcome (Fig. 2d) suggesting that some of the features in the panel bear prognostic potential. Using the leave-one-out approach and the 819 features per case, we cross-validated seven classification models for DS prediction. An RF with an accuracy of 0.66 (95% CI 0.55–0.77) and PPV of 0.76 (95% CI 0.63–0.88) (Fig. 2e) performed the best. Throughout all validation steps, features learned by the top models were ranked based on the impact on predicting the outcome and the frequency of occurrence of impactful features measured. Impactful features that occurred in at least 10% of validation steps were considered top features. The 17 out of 39 top features to predict survival in Fig. 2f originated from the same 10 out of 63 nuclear characteristics in Fig. 2c.

**Fig. 2: Computational pathology pipeline.**

To assess whether the computational pathology-based prediction of DS could benefit from the inclusion of percent of stroma or cancer to stroma ratio in our samples, we applied our pipeline (Fig. 2b) to the cancer regions marked by our pathologist (W.T.) and measured the proportion of tumor pixels (pCA), stromal pixels (pST) and the ratio of these two (r = pCA/pST) in the regions (Extended Data Fig. 2a,b). No statistically significant difference in pCA (t-test P value = 0.3) and r (t-test P value = 0.257) was found when tumors associated with poor survival (DS = 1, n = 28) were compared to those with better survival (DS = 0, n = 43). As no difference was observed, we did not incorporate the above features into the computational pathology analyte. Regardless, we found that the percentage of stroma is significantly larger in tissue after neoadjuvant therapy, which can occur following neoadjuvant therapy. Additionally, the percentage of cancer was smaller in tissue after neoadjuvant therapy, which is the intent of neoadjuvant therapy (Supplementary Table 8).

Multi-omic analysis suggests hierarchical complementarity

The 6,363 individual features from the single-omic sources were combined and analyzed using seven independent ML models cross-validated with a leave-one-patient-out technique (complete multi-omic feature dataset: Table 1 and Source Data Fig. 1). Each single-omic source and multi-omic combinations were evaluated using all ML models. The hyperparameters of each model were fixed at the initial design of the study to prevent over-optimization and overfitting due to the small cohort size. The top model for prediction of DS was the multi-omic model, which had an accuracy of 0.85 (95% CI 0.73–0.96) and PPV of 0.87 (95% CI 0.75–0.99), followed by single-omic analyte models that learned plasma protein, RNA fusions, tissue protein, plasma lipids, clinical and surgical pathology, RNA gene expression, computational pathology, DNA CNV, DNA INDELS and DNA SNV features in decreasing order of model prediction accuracy (Table 1 and Extended Data Fig. 1).

The accuracy and PPV performance yielded by single-omic models suggest that each single-omic analyte in isolation carries some predictive power and thus potential clinical utility. The best predictors of DS were plasma proteins leading to development of a model with an accuracy of 0.75 (95% CI 0.63–0.86) and PPV of 0.80 (95% CI 0.68–0.92). The model learning only presurgery CA 19-9 achieved an accuracy of 0.59 (95% CI 0.47–0.71) and PPV of 0.53 (95% CI 0.40–0.65) and it was considered the worst among all the single-omic models. As observed in the top two rows of the model performance (Table 1), the top multi-omic models outperformed the single-omic ones in accuracy (by 10–21%) and PPV (by 7–19%) in predicting DS, suggesting complementarity and information gain across analytes when combined under the multi-omic analytical approach. On the other hand, the multi-omic models had a larger dispersion of accuracy and PPV, when compared to the single-omic models (Table 1 and Extended Data Fig. 1) likely resulting from the involvement of a much larger set of features available for training.

The 1,024 individual analyte combinations (single and multiple) with all seven modeling strategies per analyte combination resulted in 7,168 grid search runs (Fig. 1). To establish per-analyte importance, the drop-column importance strategy was utilized and adapted, where each analyte’s set of features were dropped in their entirety. Using results from the 7,168 runs, we evaluated the model’s predictive performance, analyte composition and feature contributions (Fig. 3). Models trained with features from any 2–4 or 9–10 analytes were inferior in accuracy and PPV to the models trained with features from any 4–8 analytes (Fig. 3a).

**Fig. 3: Multi-omic performance by number of analytes and contribution.**

Additionally, with the drop-column importance approach, we were also able to quantify the importance of each analyte category (Supplementary Table 9) and showed that the exclusion of any one analyte from the study generally reduced but did not substantially alter the performance; where the accuracy and PPV for DS prediction were in the range of 0.85–0.83 and 0.84–0.83, respectively.

Next, we focused on the top 15 multi-omic models for DS prediction (Fig. 3b), which were those with an accuracy >0.80 and PPV > 0.78. We plotted proportions of analyte’s features learned by each model (Fig. 3c) and observed that the top models had nearly similar accuracies and PPVs, however the proportions of contributing features varied across the top 15 models. The predominant feature contribution was from the plasma protein analyte (green bar, Fig. 3c).

Multi-omic models provide biological insights into PDAC

Given the relative paucity of predictive biomarkers and therapeutic advances in PDAC compared to other cancers, a notable exploratory objective of our study was to assess whether our platform can identify potential pathways and targets of therapy. Using a differentially expressed feature set, we were able to ascertain features to study objective Spearman correlations and the importance for all analyte features (Fig. 4a). By evaluating analyte contribution for each model, it was possible to generate ontology visualizations for protein, DNA and RNA as shown for the top multi-omic models for DS (Fig. 4b).

**Fig. 4: Biological relevance of top features in muti-omic model and clustering.**

mTOR signaling, a known pathway in many tumors^34,35 including PDAC, was found in the Gene Ontology network visualizations of the top multi-omic models³⁶ (Fig. 4b). It has been targeted in PDAC alone and in combination with other agents³⁷ with mixed results. Excluding mTOR, our Gene Ontology network visualizations revealed other clinically and biologically relevant pathways in PDAC, including glycolysis and cellular metabolism^38,39.

To examine the relationship of tumor to outcome heterogeneity, all 6,363 features across all analytes were used to create patient-level clustering based on multi-omic molecular signatures and labeled for DS (Fig. 4c). Cluster 1 represents patients homogeneous for their clinical outcome (all deceased). To better understand the association of the heterogenous clusters, (2 and 3), with other clinical and computational pathology features, we compared the expression of a feature in one cluster to that in the two other clusters combined using t-test or Fisher’s test. This analysis revealed proportions of relevant features (P < 0.05) in each analyte (Supplementary Table 10), where except for computational pathology, no other analyte contained features that were present in all three pair-wise comparisons. Subsequently, we used one-way analysis of variance, which identified eight differentially expressed computational pathology features (Supplementary Table 11). These eight features were then analyzed by the Tukey–Kramer test for multiple comparisons. No feature was significantly different between the three clusters, but there were features that differed between two clusters. Furthermore, hierarchical clustering of 39 subjects characterized by the eight computational pathology features (Extended Data Fig. 3) suggested that they strongly contributed to the formation of clusters 1, 2 and 3. Together, these findings suggest that with more patients and with prospective iterative analysis over time, our approach will result in progressively more accurate predictions especially for patients who fit membership in specific clusters (for example, cluster 1) and deeper insight into what features are critical to individual patient clusters.

The parsimonious multi-omic models for disease survival

The complementarity of analytes observed in multi-omic models in Table 1 and Fig. 3, suggested that a parsimonious multi-omic model offering similar predictive performance to models with larger and more complex analyte compositions could be developed. If true, the global public health and societal impact would be consequential as it would potentially begin the process of democratizing precision cancer medicine, especially to areas of the world with limited financial and technical healthcare resources. To test this hypothesis, we started with the complete multi-omic feature space of 6,363 features and trained an RF model for DS utilizing a recursive feature elimination (RFE) strategy such that at each step the least-informative features were eliminated from further model iterations (Fig. 5a). Most notably, Fig. 5a highlights the inflection point of the ‘parsimonious model’ location on the curve (accuracy of 0.85, PPV of 0.85) learning only 589 multi-omic features. Further, the contribution of respective analytes to the parsimonious model remains mostly stable across iterations after the inflection point, with plasma lipids and RNA being the most relevant; however, note that plasma (proteins or lipids) alone can provide accurate prediction with fewer features. This opens the possibility that a screening of plasma could eventually be used for decision-making regarding pancreatic surgery.

**Fig. 5: Performance of parsimonious multi-omic models and analyte contribution for disease survival.**

Trying to examine the potential of this approach for eventual globalization of precision oncology, we assessed specific limited analyte combinations and feature sets that could be applied to our parsimonious model. These analytes were selected based on criteria of standard availability (pathology specimens or clinical data including surgical pathology) or easily obtained (plasma lipids or proteins) as part of the diagnostic workup. Using this approach, we identified accurate parsimonious models that learned features from clinical, surgical pathology and computational pathology analytes (Fig. 5b), all plasma analytes (lipidomics and protein) (Fig. 5c) and clinical, combined with computational pathology and plasma analytes (Fig. 5d) and which had similar accuracy and PPV to the models that learned features from the entire set of 6,363 features in Fig. 5a.

Validation of RNA markers as predictors of survival

Whole-transcriptome sequencing and analysis was performed on 57 samples from our pilot cohort (Supplementary Table 4). Employing L1-normalized RF modeling, RNA gene transcripts significantly (P ≤ 0.05) predicting survival (n = 79) were used to develop gene signatures for improved (positive Pearson and Spearman rho for survival, n = 40 genes) and for poor (negative Pearson and Spearman rho for survival, n = 39 genes) survival (Supplementary Table 12). These two signatures were evaluated in an independent dataset of 177 PDAC patients⁴⁰ for their ability to stratify DS. High score of the signature composed of genes whose expression was associated with poor prognosis in our data (n = 39) was also associated with poor DS in this set (hazard ratio (HR) = 2.17, (1.28–3.66), log-rank P = 0.0031) (Extended Data Fig. 4a), whereas that of genes whose expression was defined as a good prognostic in our data (n = 40), had a trend toward improved DS (HR = 0.74 (0.49–1.12), log-rank P = 0.15) (Extended Data Fig. 4b). We also performed gene set enrichment analysis on the RNA transcripts used in the two signatures above (n = 79). Enrichr⁴¹ found numerous significant pathways (Supplementary Table 13) implicated in PDAC resistance and treatment-targeting, including interferon signaling, AMP-activated protein kinase (AMPK) and CXCR4 signaling pathways^42,43,44,45. Together, these data independently validate the clinical relevance of our RNA expression discoveries.

Validation of multi-omic models as predictors of survival

To further validate our single-omic, multi-omic and parsimonious analytes for DS prediction, we evaluated their predictive performance on the TCGA dataset, containing 157 evaluable samples that had at least one analyte type (Supplementary Table 1)⁴⁶. As TCGA has data only on DNA, RNA, digital H&E slides and clinical analytes, our modeling had a reduced set of 3,423 total features compared to the 6,363 in our MT-Pilot cohort (Table 1 and Fig. 1e). The full 3,423 analyte model had an accuracy and PPV of 0.94 (95 CI 0.83–1.00) and 0.95 (95% CI 0.84–1.00) (Table 2 and Supplementary Table 14) for DS prediction with computational pathology, DNA SNVs and RNA gene expressions performing strongly in single-omic validation of DS (Table 2 and Supplementary Table 14).

Table 2 Top single-omic and multi-omic performance for predicting disease survival in PDAC: study validation cohorts

Full size table

Next, we examined the validity of our multi-omic parsimonious model on the TCGA dataset. Because this cohort had an overall reduced analyte set, we used an RFE strategy to retrain a RF model for DS on our cohort (MT-Pilot) and determined that the optimal (top of peak) parsimonious model employed 202 features out of 3,423 and had accuracy and PPV of 0.74 (0.63–0.85) and 0.77 (0.65–0.89), respectively (Extended Data Fig. 4c). Notably, when the model was applied to these same 202 features (Supplementary Table 15) in the TCGA dataset, it yielded an accuracy of 0.88 and PPV of 0.95 for DS prediction. Furthermore, in both our MT-Pilot cohort and the TCGA cohort, computational pathology and RNA gene expression were found to be primary analytes learned by the DS predicting models, with CNV and the clinical analyte providing minor additional improvement (Extended Data Fig. 4c). The signal dominance of RNA is not driven by expression of any single gene, but by a specific set of genes.

As TCGA lacked tissue proteomic level data, we sought an external dataset with tissue protein data, along with other critical single-omic informative analytes such as DNA, RNA and clinical data. We found an independent publicly available dataset¹⁴ named JHU cohort 1 that met these criteria. With DNA, RNA, clinical data and tissue protein analytes from our MT-Pilot cohort serving as the training set, we trained an L1-normalized RF model and applied it to this validation test set. This model predicted DS with an accuracy and PPV of 0.89 (95% CI 0.83–0.95) and 0.91 (95% CI 0.85–0.98), respectively (Table 2 and Supplementary Table 14). While a model trained on the tissue protein as a single-omic analyte had an accuracy and PPV of 0.56 (95% CI 0.50–0.63) and 0.53 (95% CI 0.47–0.60) in the JHU cohort 1 (Table 2 and Supplementary Table 14), addition of DNA, RNA and clinical analytes improved the predictive performance of the model and validated the multi-omics approach.

Validation of plasma proteins as a preoperative biomarker

Through our multi-omic and parsimonious modeling of the MT-Pilot cohort, we discovered that plasma protein is an analyte that provides not only accurate prediction of DS in PDAC, but does so with the fewest features compared to other analytes. As a result of these findings, as well as the poor performance of CA 19-9 as a preoperative marker for decision-making regarding the benefit of surgery, we next sought to validate our findings solely on analytes that would be available to the clinical practitioner before surgery.

Besides the TCGA and JHU cohort 1, we utilized two more cohorts; JHU cohort 2 and the MGH cohort (Supplementary Table 1). They included similar stage I/II resected PDAC with clinical and demographic data collected longitudinally and preoperative plasma samples, including CA 19-9 obtained and analyzed as described above. Application of the L1-normalized RF model trained on the MT-Pilot data on the two cohorts showed that plasma proteins remained highly predictive of DS in both validation cohorts, with accuracy and PPV of 0.98 (95% CI 0.83–1.00) 0.92 (95% CI 0.79–1.00), respectively in JHU cohort 2 and 0.89 (95% CI 0.76–1.00) 0.80 (95% CI 0.69–0.91), respectively in the MGH cohort (Table 2 and Supplementary Table 14). The addition of clinical data to plasma protein improves the multi-omics model for DS prediction. Overall, preoperative plasma protein was highly predictive of DS among three separate independent datasets and provided a unique preoperative biomarker with significantly better predictive performance than routinely utilized CA 19-9 (Table 2 and Supplementary Table 14).

Discussion

Here we demonstrate a ‘proof-of-principle’ Molecular Twin platform that incorporates multiple molecular, histopathological and clinical features from both host and tumor and a comprehensive machine learning-based multi-omic analysis to provide accurate clinical outcome prediction (Fig. 6). The Molecular Twin platform has allowed us to develop predictive multi-omic models and led to the discovery that plasma proteins are a highly predictive analyte for DS prediction in PDAC. Most notably, validation of the approach on four independent datasets have confirmed its value in predicting DS and revealed its superiority to CA 19-9. This approach has potential to substantially impact how we develop biomarkers in the future and in the case of preoperative markers, may have provided enough rationale to initiate clinical development and large-scale testing to determine its value in treatment decision-making. Finally, this platform, by virtue of its ability to generate parsimonious models has laid a foundation for the future democratization of precision oncology and thus reduce national and global disparities in its use.

**Fig. 6: The Molecular Twin platform.**

Our study reveals that the multi-omic analytes incorporating individual single-omic sources are the most accurate clinical predictors of DS. We also show that multi-omic models with limited, but highly predictive analytes, perform just as well as the top multi-omic models with higher number of individual single-omic analytes. None of the top multi-omic models consisted of all ten available analytes. This reinforces the concept of complementarity and highlights the overlap of signal across analytes, suggesting that it may not be necessary to carry out the comprehensive ten-analyte workup to obtain accurate predictions. A strength of this platform is its resilience, allowing interchangeability and complementarity among analytes. This observation also suggests flexibility in analyte selection to approximate optimal predictive performance, with patient burden, efficiency, ease of testing, time and cost of analyte acquisition being other notable considerations. Many analytic techniques, especially comprehensive genomics, can be expensive as well as time and labor intensive; however, our study reveals single-omic sources employed in this platform, such as computational pathology-based features or plasma proteins which offer the opportunity to circumvent these challenges using near-term practical solutions with clinical implications in resource-poor geographies. In computational pathology analysis, features of nuclear architecture can predict survival^47,48,49 in many cancer types, and our results were consistent with these reports. To extract features, computational pathology uses only H&E slides prepared to obtain routine pathology reports. As no special tissue processing or chemical reagents are necessary, the cost of measuring a feature through this platform is low. In addition, digital H&E slides can be sent for analysis through the cloud and results sent back to the requester as a multi-omic score generated by combining all other information on the patient electronically.

Studies employing smaller cohorts, for example one study with 14 patients, has shown that certain predefined plasma proteins can predict early recurrence⁵⁰. Our study is more comprehensive, includes 74 patients and identifies more plasma proteins as critical predictors of DS. Plasma proteins within multi-omic panels also represent a unique opportunity for efficient, informative and clinically impactful testing as this specific analyte can be obtained quickly and preoperatively in a non-invasive manner. Although preoperative antigen testing, such as CA 19-9, continues to be routinely utilized in predicting tumor biology and survival⁵¹, we demonstrated that plasma proteins alone, and even more so when combined with other preoperative analytes such as clinical data, is superior to CA 19-9 alone. These results are not surprising as it is well appreciated that preoperative CA 19-9 has limitations that may contribute to its poor performance as a tool predicting DS⁶. Unlike CA 19-9, plasma proteins have the potential to inform subsequent therapeutic decisions involving surgery and chemotherapy. Our approach provided insight into molecular drivers and clinically useful markers of survival in PDAC; the latter findings helping to validate the value of our approach.

Multi-omic analysis across tumor types has been undertaken before, but not to this extent. One study employed a smaller number of analytes¹³, identified hurdles in multi-omic analyses and highlighted the high cost of genomic signatures in clinical practice. In our study, we sought to address the cost of data generation and technical sophistication, which are two of the major issues impeding the global use of precision therapy in cancer care. As a solution, we employed an RFE strategy to identify optimal features across analytes and developed the parsimonious model, which achieved an accuracy and PPV of 0.85 with 589 features, which was similar to the full multi-omic model trained with 6,363 features. We also found that plasma analytes were the dominant feature type of the parsimonious panel. The parsimonious model uncovered highly informative features, while simultaneously minimizing the number of required analytes without compromising performance.

A strength of our study is that we validated our findings in independent datasets of PDAC, including the TCGA cohort, two separate cohorts from JHU and a cohort from MGH. In our validation approach, we recognize that no single multi-omic model contains all ten single-omic analytes concurrently. This is an inherent shortcoming of our validation datasets as well as many currently available datasets, where none contain complete data of all ten single-omic sources that our original MT-Pilot cohort provided. Regardless, we externally validated our multi-omic panels with available and complete data where it was possible. Of note, for the potential democratization aspects of this work, the 202 highly predictive features learned from the MT-Pilot cohort data by the parsimonious model were sufficient to accurately predict DS in the TCGA cohort. Additionally, models learning single- and multi-omic panels incorporating plasma proteins were validated in two separate prospective test cohorts. This further supports the development of plasma proteins as a potentially clinically usable assay in PDAC.

Despite evaluating a larger number of patients and use of more comprehensive analyte panels, our study has several limitations. First, specific features of certain analytes have outcome relationships that are contradictory to published literature, likely a reflection of the sample size of our MT-Pilot cohort. Second, despite comprehensive panels, it still does not incorporate all accessible analyte sources. These include radiomic features extracted from preoperative computed tomography scans and microbiome features as just two examples; however, our platform is flexible so integration of additional sources of data is facile and has already begun. We also recognize that our molecular profiling of tissue was conducted on bulk tissue, which provides limited insight into tumor heterogeneity, but single-cell analysis may address this issue. Similarly, dissection of the tumor microenvironment with techniques such as multiplex immunohistochemistry (IHC) can provide further understanding into disease biology and improve the performance of our models. We also recognize the single time point multi-omic analysis as a potential shortcoming, and hence future iterations will incorporate longitudinal analyses to enhance the predictive potential of the platform, as well as discover additional markers for minimal residual disease or treatment response prediction. We are exploring these issues in upcoming iterations of the Molecular Twin.

This proof-of-concept, yet externally validated study, examined an aggressive malignancy, PDAC, that lacks robust predictive and prognostic biomarkers. The Molecular Twin represents the way forward for the discovery of promising predictive and clinically meaningful biomarkers, targets for treatment and ultimately tools to democratize and reduce national and global health disparities in the use of precision medicine across all of types of cancer.

Methods

Sample collection, processing and classification

Specimens were obtained from consented patients enrolled in the ethically approved study by Cedars-Sinai Institutional Review Board STUDY00000806 MT-Pilot study ‘Feasibility of extensive molecular profiling of pancreatic tumors: lessons for Molecular Twin’ and stored onsite in the Cedars-Sinai Medical Center Biorepository. Tissues were procured from surgical specimens as part of the standard of care. Blood samples were collected with routine blood work. The dates in which these samples were collected ranged from April 2015 to May 2019. Follow-up data were completed based on the standard of care. All cases were pancreatic cancer with the diagnosis of ductal adenocarcinoma. This was chosen based on the availability of FFPE and frozen tissue, buffy coat and plasma. FFPE and frozen tissues were collected following tumor resection and stored in the biorepository for future research use.

The Cedars-Sinai Medical Center Biobank and Pathology Shared Resource reviewed in-house cases and histologically confirmed PDAC from an initially assembled list. Specifically, fresh frozen tissue (tumor and adjacent normal) and FFPE tissues (tumor and adjacent normal) were identified. The biobank prepared each sample for genomic analysis (ten unstained slides per sample and one H&E). These slides were de-identified and sent to Tempus via overnight shipping for genomic and transcriptomic analyses as well as H&E slide digitization.

The following set of samples were sent to Tempus:

93 FFPE tumor samples (ten unstained slides and one H&E)
93 FFPE normal samples (ten unstained slides and one H&E)
93 blood samples (buffy coat at 500-μl aliquots)

Clinical data variables for the cohort

Cedars-Sinai Medical Center Proteomics and Metabolomic Proteomics Core analyzed:

60 frozen tissue normal
60 frozen tissue tumor
61 tumor plasma samples with 81 unpaired normal samples. Normal samples were provided by IRB-approved study 00001316.

Stage III and IV patients were excluded. Due to the limited number of samples in this pilot cohort, we trained models in an LOO fashion for every analyte separately. During the train phase, we performed feature selection, missing data imputation and normalization; the same transformations were then applied to the validation sample (the LOO sample) using the means and variance learned on the train data. For certain analytes, we performed preliminary, analyte-specific transformations and feature selection. We utilized binary end points at the time of our analysis, 21 October 2021: DS, deceased at the time of analysis.

Clinical data analysis

We collected 74 plasma and tissue samples of patients with clinical stage Ia, Ib, IIa and IIb, resectable pancreatic adenocarcinoma. We obtained clinical characteristics and longitudinal clinical and surgical pathology information for each patient whose sample was analyzed for our multi-omic analysis (Supplementary Table 1). Our baseline model for the clinical and surgical pathology analytes included general features such as self-reported sex, age, body mass index/weight/height, tumor stage/size, histological grade, pathological variable (margin status, grade, pathological staging, perineural invasion and lymphovascular invasion), treatment duration and type, family history and personal history of comorbid conditions, including other cancers.

NGS targeted genomics

Bulk tissue samples were processed via NGS Tempus xT oncogene panel, specifically a v4 xT assay covering 648 genes, spanning ~3.6 Mb of genomic space at 500× coverage. An industry-standard bioinformatics pipeline was run on the NGS data for alignment, quality control and calling of somatic SNVs, INDELs and CNVs. SNVs were counted per gene in the target panel, generated via Freebayes snp calling pipeline with matched tumor–normals, resulting in 611 gene-level SNV features. INDELs were counted per gene in the target panel, with INDEL calling via the Pindel pipeline using matched tumor–normals, resulting in 126 gene-level INDEL features. Additionally, called CNVs were counted per gene in the target panel, resulting in 648 CNV features. Upon obtaining gene-level somatic SNVs, INDELs and CNV features, further feature preprocessing was performed, specifically univariate normalization, pruning of low-variance features (with variance threshold <0.05) and dropout of highly correlated features (Spearman correlation coefficient <0.95). Processed genomic features consisted of 337 somatic SNV, 219 CNV and 72 INDEL gene-level features, respectively considered for predictive patient survival outcome models.

RNA sequencing

Whole-transcriptome sequencing (RNA-seq) was performed on 72 tumor tissue samples. In addition, we used 204 (out of 382 total) RNA-seq pancreatic tissues samples from the GTex consortium as controls. The GTex samples were selected using the following criteria: participant did not have a cancer diagnosis and participant’s age was matched to the age range of the pilot cohort. We then derived two types of RNA-seq features:

1.
Gene-level estimated read counts for a set of genes that we found to be differentially expressed between cancer and noncancer samples.
2.
Read counts per gene for a set of fusion genes.

We obtained estimated transcript read counts by running Kallisto tool (v.0.46.1) on the fastq files for cancer and noncancer samples. We aggregated transcript-level read counts to gene-level counts using tximport R package (v.1.14.2, Bioconductor v.3.10); this step reduced the number of features from 169,000 transcripts to 30,427 genes.

To further reduce the feature space and retain only the most promising features, we ran a differential expression analysis between cancer and noncancer samples. First, we removed all counts below 2 and then removed any genes (separately for cancer and noncancer datasets) for which fewer than 25% of samples in the set had nonzero values. This left us with 16,470 genes for the cancer set and 10,478 genes for the noncancer set. We then only kept genes in the intersection of noncancer and cancer gene sets, leaving us with 10,185 genes in total. We selected 2,000 genes with the lowest adjusted P values using the default analysis in DESeq2 package (v.1.26.0). Finally, we trained our classifiers using log₁₀ estimated read counts for these 2,000 genes as features. Unpaired differential expression was conducted via Mann–Whitney U-test with P value < 0.05, from which the 2,000 most differentially expressed RNA gene transcripts were selected.

Fusion gene derivation from RNA-seq data was another category of omic features considered in the study to capture translocations, interstitial deletions or chromosomal inversions of two distant, independent genes. Fusion gene features were derived from RNA-seq data using an alignment-free algorithm⁵². The number of reads mapping to each fusion gene was aggregated, then limited to known COSMIC fusion pairs. In total 29 fusion gene features were derived from tumor tissue RNA-seq data.

Proteomics and lipid analysis

Proteomics analyses were performed on 58 patients with paired tumor–normal tissue samples, via resection of tumor and normal samples from the same frozen tissue block and on 61 tumor plasma samples with 81 unpaired normal samples (Supplementary Table 16). Proteomics data were generated using data-independent acquisition mass spectrometry (DIA-MS) technology, with post-processing bioinformatics pipelines performing quality control, peak picking, retention time alignment, scoring and false discovery rate identification, normalization and quantitation⁵³. MS2 peak areas at both protein and peptide levels were computed as proteomics features, using a 3,777-protein panel for paired tumor–normal tissue samples and a 1,052-protein panel for unpaired plasma samples. Similarly, lipidomics analysis using the Lipidyzer Platform kit with internal lipid class standards for quantification reference was performed on plasma samples to obtain composition and concentrations for lipid species, lipid classes and fatty acids.

Further pre-processing steps for all proteomics and lipidomics data included filtering out proteins and lipids with more than 25% missing data not meeting quality control criteria, removing proteins with a low variance <0.1 threshold, followed by imputation of remaining missing values using halved median values for each column and univariate normalization of each column. Alternative strategies for imputation of missing proteomics values, specifically column mean and k-nearest neighbor imputation, were deemed too sensitive to outliers due to the small sample size.

Differential expression analysis was performed on the 58 paired tumor–normal tissue samples. A Wilcoxon rank-sum test was performed between the dependent tumor–normal proteomics samples, with a two-tailed P value < 0.05 threshold applied to further remove tumor tissue protein distributions similar to their respective paired normals.

Differential expression analysis was performed on the 61 tumor plasma samples with unpaired 81 plasma samples. A Mann–Whitney U-test was performed between unpaired tumor–normal protein distributions, with a two-tailed P value < 0.05 threshold applied to remove plasma tumor protein distributions similar to the unpaired normals.

Overall, proteomics and lipidomics analysis generated 3,777 tumor tissue proteomic, 1,051 plasma proteomic and 939 lipidomic features. Plasma proteomic features were reduced to 257 via tumor–normal plasma protein differential expression analysis (Mann–Whitney U-test, P value < 0.05). Redundancy was reduced by elimination of highly correlated features (Spearman correlation, rho < 0.95, P value < 0.05) leaving 406 lipidomic features. Tumor tissue proteomic features were pruned to 1,130 by eliminating those not expressed at higher levels in tumors compared to normal pancreas (Wilcoxon signed-rank test, P value < 0.05).

Plasma proteomics methods

Aliquots from plasma samples were depleted of 14 highly abundant proteins using the High Select Top 14 Abundant Protein Depletion Camel Antibody Resin (ThermoFisher Scientific) according to the manufacturer’s protocol. Proteins from 5 μl depleted and undepleted plasma were then processed separately for tryptic digestion on the Protifi S-Trap columns according to the manufacturer’s protocol (Protifi). Mass spectrometry data were acquired on an Evosep HPLC system in line with an Orbitrap Exploris 480 (ThermoFisher) instrument, operating in data-independent acquisition (DIA) mode, separately for the depleted and undepleted plasma samples. Full LC–MS instrument settings are provided with raw data in the public repository.

Peptide identification and quantification was performed as described previously⁵⁴ using the OpenSWATH workflow and searched against the Human Twin population plasma peptide assay library. Proteotypic peptides were aggregated into quantitative protein abundance estimates using mapDIA software⁵⁵. To generate a single table of quantified plasma proteins from each sample, we appended only unique proteins (for example, not seen in undepleted) from depleted runs to the undepleted plasma protein table.

Plasma lipidomics methods

Sample processing and lipid extraction

Lipids were extracted from plasma using the Bligh-Dyer method as previously described⁵⁶. The extracted dry lipids were resuspended in 250 µl running buffer (10 mM ammonium acetate and 50:50 methanol:dichloromethane). Lipids were analyzed on a Sciex Lipidyzer Platform with a standardized workflow for the simultaneous analysis of 1,153 lipids representing 13 lipid class. Samples were loaded by direct infusion from a Shimadzu LC-30AD LC system equipped with a SIL-30AC auto sampler. Lipid concentrations were determined by the Lipidyzer software using the ratio of the endogenous lipid to internal standard. Data are reported for each individual lipid species as an aggregated value for lipid classes and as the relative composition compared to all other measured lipid classes.

Tissue proteomics methods

Sample processing and lysis

Tumor and nontumor tissue sections were homogenized in 8 M urea with 5% SDS and 100 mM glycine. Following sonication and centrifugation (1,500g) to pellet debris, the protein concentration was determined using a Pierce BCA assay (ThermoFisher Scientific). A total of 30 μg from each sample were then processed and digested using the S-TRAP micro-elution tips (Protifi) according to the manufacturer’s protocol. Peptides were analyzed using a ThermoFisher Ultimate3000 HPLC system in line with a Fusion Lumos Orbitrap mass spectrometer operating in DIA mode. Full MS details are provided with raw files in the public repository. Data were analyzed using our established openSWATH-based⁵⁷ workflow using the pan-human library⁵⁸ as previously described^54,59. The mapDIA software⁵⁵ was used to perform protein-level quantification from proteotypic peptides.

Computational H&E slide analysis

Slides and computational pipeline overview

The 71 cases in our MT-Pilot cohort had available FFPE tumors that we used to prepare H&E slides for computational analysis. After slide digitization (Aperio GT450 scanner with ×40 objective), the resulting WSIs (n = 71) were loaded up to the slide viewer (ImageScope, Leica Biosystems) for a pathologist to box-outline random regions of interest (ROIs) with cancer cells for the analysis. Our goal was to extract architectural features of cancer cell nuclei and assess their fitness and contribution as an analyte in single- and multi-omic ML-based DS prediction models. The ROIs marked (n = 2,908) and exported from WSIs were subsequently analyzed by two semantic neural network models. The first (DeepLabV3Plus) provided a mask of cancer cells and the second (StarDist) provided a mask of all nuclei in the ROI (Fig. 2).

DeepLabV3Plus was trained and tested for the tumor cell masking task using biobanked digital H&E and IHC slides with PDAC. StarDist, a model that predicts cell nucleus instance using star-convex polygons, was already available⁶⁰. The intersection of the masks yielded by these two models was the mask of cancer cell nuclei that we overlaid onto the ROI images.

Training and evaluation of DeepLabV3Plus model

Data for the DeepLabV3Plus training were obtained from ten slides (biobanked PDAC tumors at Cedars-Sinai) that we sequentially stained with H&E, digitized, destained, restained with IHC (pan-cytokeratin and DAB) and digitized again^61,62. By overlaying the H&E WSIs onto the corresponding IHC WSIs, we obtained the ground-truth of cancer cells in the H&E WSIs. Matching ROIs with tumor cells were extracted from the corresponding H&E and IHC WSIs and aligned by image registration^61,62. The aligned ROIs (n = 416) were divided into non-overlapping 256 × 256-pixel tiles. To generate the ground-truth mask for cancer cells in the tiles, the DAB staining image was digitally deconvoluted and thresholded, and the resulting cancer cell mask was smoothened by mathematical morphology operators. The tiles were then augmented 15 times⁶³ and a set of paired H&E and tumor cell mask tiles (n = 39,840) was prepared for DeepLabV3Plus training. Training hyperparameters such as the minibatch, initial learning rate, momentum and L2-regularization for stochastic gradient descent optimizer were set to 12 tiles, 0.005, 0.9 and 0.001, respectively. At the end of 75 training epochs, the overall accuracy of 97.5% was observed.

The trained DeepLabV3Plus was tested for the tumor cell detection ability on a PDAC tissue microarray (TMA) (PA483e, TissueArray). To prepare H&E and IHC WSIs from the TMA, the slides underwent the same staining/restraining/digitization protocol as for the training slides. The H&E and IHC TMA WSIs provided 80 H&E cancer mask ROI pairs to measure the accuracy, mIoU and F1-scores (tumor and nontumor) of the DeepLabV3Plus.

Nuclear features

Nuclear feature extraction was preceded by color deconvolution⁶⁴ of the ROI to digitally separate the image of hematoxylin staining from eosin. Subsequently, the cancer cell nuclei mask was overlaid onto the hematoxylin image and features of morphology (size and shape) and hematoxylin staining texture were quantitated for each nucleus under the mask by means of the 63-feature library (Source Data Fig. 2 and Supplementary Table 7) that we assembled from available resources^65,66. Nuclear features from tumor cell nuclei across all regions in the case were aggregated by means of order statistics: maximum, minimum, average, s.d. and 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentiles, thereby yielding 819 (13 × 63) unique features for each case. The z-scored case-level features were used to develop ML models for survival prediction. All features in the library were image rotation invariant.

For validation in the TCGA cohort, 33 diagnostic WSIs with PDAC (1 WSI per case) that closely corresponded to WSI specifications (×40 scanning magnification and compression quality of 70) of the MT-Pilot WSIs were downloaded. The TCGA WSIs were annotated for cancer ROIs (624 total, 20 ROIs per WSI) and tumor cell nuclei (137,617 total, 4,170 nuclei per WSI) automatically identified and delineated in the ROIs by our pipeline. Subsequently, features (n = 819) were extracted from tumor cell nuclei in the ROIs, z-scored and classified by the ML models predicting DS that we developed using features extracted from the MT-Pilot WSIs. Before feature extraction, the H&E staining coloration in the ROIs was digitally matched to that in the MT-Pilot WSIs.

Statistics and reproducibility

There was no a priori power analysis for this study. We had 93 patient samples available, of which 74 patients had their samples pass quality control, where samples from patients were both; viable for comprehensive molecular testing across all analytes and patients had stage I and II resected PDAC. No statistical method was used to predetermine the study sample size. The experiments were not randomized. We applied LOO in-pilot cross-validation and independent dataset validations. The investigators who conducted the full multi-omic analysis were not blinded to allocation during experiments and outcome assessment; however, investigators who conducted molecular analysis of the feature sets for each analyte were blinded to the outcomes of patients. Data distributions were not formally tested; however, we have provided clinical data distributions (Supplementary Table 1).

Development of machine-learning models

The goal of our study was to train an ensemble of ML classification models, ranging from simple linear models (SVMs) to more sophisticated random forests and neural networks. The ensemble of the predetermined models’ approach was used to assess the level of dependence of multi-omic features and the extent to which subtle, nonlinear, cross-feature dependencies would provide additional signal and predictive power for nonlinear models. Additionally, the model architecture and model hyperparameters were prespecified and fixed for the study due to the limited sample size in the study and sample size to feature imbalance. This was carried out to prevent overfitting and over-tuning of models on the study dataset, instead showing relative performance across classification techniques and demonstrating directional performance of each approach. The architecture and hyperparameters for each classification model, optimization technique and hyperparameters used in the study were implemented in the Python programming language listed in code availability. Depending on the validation scenario (internal MT-Pilot cohort or external cohorts), developed models were validated using either the LOO cross-validation (internal MT-Pilot cohort only) or using analyte combinations depending on their availability in the validation cohorts (TCGA, JHU cohort 1, JHU cohort 2 and MGH cohort).

Validation cohorts

Four validation cohorts were utilized in the study. The TCGA, JHU cohort 1 and cohort 2 and MGH cohort. The TCGA and JHU cohort 1 are publicly available datasets^14,46. JHU cohort 2 is an independent prospective cohort employing identical proteomic and lipidomic analysis as our MT-Pilot and whose raw data was analyzed utilizing the Molecular Twin ML model pipeline by the JHU team for validation.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Transcriptomic, genomic and clinical data used in this study are available under National Center for Biotechnology Information/National Institutes of Health BioProject ID PRJNA889519 and the associated SRA database. Proteomic data used in this study were submitted and are available in the proteomics Identification Database (PRIDE) as ‘Profiling of pancreatic adenocarcinoma using artificial intelligence-based integration of multi-omic and computational pathology features’ under project accession no. PXD037038. Lipidomic data used in this study are submitted and are available in the MassIVE Dataset Repository project under accession no. MSV000093118. The human pancreatic adenocarcinoma genomic data were derived from the TCGA Research Network at http://cancergenome.nih.gov/. Previously published data from TCGA and JHU cohort 1 (refs. ^14,46) that were re-analyzed here are available at https://doi.org/10.1038/s43018-023-00697-7 and serve as source data for Table 2. The complete multi-omic MT-Pilot dataset utilized for Table 1 is provided and is identical to the source data for Fig. 1. Source data are provided with this paper. All other data supporting the findings of this study are available from the corresponding author upon reasonable request.

Code availability

Software resources utilized in this study are included in the reporting summary and associated code, as well as the architecture and hyperparameters for each classification model is available at https://doi.org/10.5281/zenodo.8423595.

References

Rahib L. et al. Projecting cancer incidence and deaths to 2030: the unexpected burden of thyroid, liver, and pancreas cancers in the united states). Cancer Res. https://doi.org/10.1158/0008-5472.can-14-0155 (2014).
Siegel, R. L., Miller, K. D., Wagle, N. S. & Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 73, 17–48 (2023).
Article PubMed Google Scholar
Gostimir, M., Bennett, S., Moyana, T., Sekhon, H. & Martel, G. Complete pathological response following neoadjuvant FOLFIRINOX in borderline resectable pancreatic cancer: a case report and review. BMC Cancer 16, 786 (2016).
Article PubMed PubMed Central Google Scholar
Pietrasz, D. et al. Pathologic major response after FOLFIRINOX is prognostic for patients secondary resected for borderline or locally advanced pancreatic adenocarcinoma: an AGEO-FRENCH, prospective, multicentric cohort. Ann. Surg. Oncol. 22, S1196–S1205 (2015).
Article PubMed Google Scholar
Park, W., Chawla, A. & O’Reilly, E. M. Pancreatic cancer: a review. JAMA 326, 851–862 (2021).
Article CAS PubMed PubMed Central Google Scholar
Poruk, K. E. et al. The clinical utility of CA 19-9 in pancreatic adenocarcinoma: diagnostic and prognostic updates. Curr. Mol. Med. 13, 340–351 (2013).
CAS PubMed PubMed Central Google Scholar
Ballehaninna, U. K. & Chamberlain, R. S. The clinical utility of serum CA 19-9 in the diagnosis, prognosis and management of pancreatic adenocarcinoma: an evidence based appraisal. J. Gastrointest. Oncol. 3, 105–119 (2012).
CAS PubMed PubMed Central Google Scholar
Schreyer, D., Neoptolemos, J. P., Barry, S. T. & Bailey, P. Deconstructing pancreatic cancer using next generation-omic technologies-from discovery to knowledge-guided platforms for better patient management. Front. Cell Dev. Biol. 9, 795735 (2021).
Article PubMed Google Scholar
Brar, G. et al. Multi-omic molecular comparison of primary versus metastatic pancreatic tumours. Br. J. Cancer 121, 264–270 (2019).
Article CAS PubMed PubMed Central Google Scholar
Huang, C. et al. Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma. Cancer Cell 39, 361 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sammut, S. J. et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2022).
Article ADS CAS PubMed Google Scholar
Gao, Y., Zhou, R. & Lyu, Q. Multiomics and machine learning in lung cancer prognosis. J. Thorac. Dis. 12, 4531–4535 (2020).
Article PubMed PubMed Central Google Scholar
Baek, B. & Lee, H. Prediction of survival and recurrence in patients with pancreatic cancer by integrating multi-omics data. Sci. Rep. 10, 18951 (2020).
Article CAS PubMed PubMed Central Google Scholar
Cao, L. et al. Proteogenomic characterization of pancreatic ductal adenocarcinoma. Cell 184, 5031–5052 (2021).
Article CAS PubMed PubMed Central Google Scholar
Feng, Q. et al. Recurrence and survival after surgery for pancreatic cancer with or without acute pancreatitis. World J. Gastroenterol. 25, 6006–6015 (2019).
Article PubMed PubMed Central Google Scholar
Verbeke, C. S. & Menon, K. V. Redefining resection margin status in pancreatic cancer. HPB 11, 282–289 (2009).
Article PubMed PubMed Central Google Scholar
Osipov, A. et al. Impact of margin status and lymphadenectomy on clinical outcomes in resected pancreatic adenocarcinoma: implications for adjuvant radiotherapy. J. Gastrointest. Oncol. 7, 239–247 (2016).
PubMed PubMed Central Google Scholar
Bilici, A. Prognostic factors related with survival in patients with pancreatic adenocarcinoma. World J. Gastroenterol. 20, 10802–10812 (2014).
Article PubMed PubMed Central Google Scholar
Lee, S. R., Kim, H. O., Son, B. H., Yoo, C. H. & Shin, J. H. Prognostic factors associated with long-term survival and recurrence in pancreatic adenocarcinoma. Hepatogastroenterology 60, 358–362 (2013).
CAS PubMed Google Scholar
Lu, S., Ahmed, T., Du, P. & Wang, Y. Genomic variations in pancreatic cancer and potential opportunities for development of new approaches for diagnosis and treatment. Int. J. Mol. Sci. 18, 1201 (2017).
Article PubMed PubMed Central Google Scholar
Nagathihalli, N. S. & Nagaraju, G. RAD51 as a potential biomarker and therapeutic target for pancreatic cancer. BBA Rev. Cancer 1816, 209–218 (2011).
CAS Google Scholar
Zhang, X. M., Ma, N. Y., Yao, W. Q., Li, S. & Ren, Z. G. RAD51 is a potential marker for prognosis and regulates cell proliferation in pancreatic cancer. Cancer Cell Int. 19, 356 (2019).
Article CAS PubMed PubMed Central Google Scholar
Quan, M. Y. et al. An FGFR/AKT/SOX2 signaling axis controls pancreatic cancer stemness. Front. Cell Dev. Biol. 8, 287 (2020).
Article PubMed PubMed Central Google Scholar
Zhan, H. X. et al. FoxQ1 is a novel molecular target for pancreatic cancer and is associated with poor prognosis. Curr. Mol. Med. 15, 469–477 (2015).
Article CAS PubMed Google Scholar
Cui, J. et al. A novel KDM5A/MPC-1 signaling pathway promotes pancreatic cancer progression via redirecting mitochondrial pyruvate metabolism. Oncogene 39, 1140–1151 (2020).
Article CAS PubMed Google Scholar
Roy, S., Singh, A. P. & Gupta, D. Unsupervised subtyping and methylation landscape of pancreatic ductal adenocarcinoma. Heliyon 7, e06000 (2021).
Article CAS PubMed PubMed Central Google Scholar
Cancer Genome Atlas Research Network. Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer Cell 32, 185–203 (2017).
Shugang, X. et al. Prognostic value of SMAD4 in pancreatic cancer: a meta-analysis. Transl. Oncol. 9, 1–7 (2016).
Article PubMed PubMed Central Google Scholar
Doyle, A. et al. The impact of CDKN2A mutations on overall survival in pancreatic adenocarcinoma. J. Clin. Oncol. 37, 278–278 (2019).
Article Google Scholar
Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Article ADS CAS Google Scholar
Dempke, W. C. M. & Reck, M. KEAP1/NRF2 (NFE2L2) mutations in NSCLC: fuel for a superresistant phenotype? Lung Cancer 159, 10–17 (2021).
Article CAS PubMed Google Scholar
Zhu, H. et al. KEAP1/NFE2L2 mutations of liquid biopsy as prognostic biomarkers in patients with advanced non-small cell lung cancer: results from two multicenter, randomized clinical trials. Front. Oncol. 11, 659200 (2021).
Article CAS PubMed PubMed Central Google Scholar
Yu, J. et al. Plasma protein biomarkers for early detection of pancreatic ductal adenocarcinoma. Int. J. Cancer 148, 2048–2058 (2021).
Article CAS PubMed Google Scholar
Faivre, S., Kroemer, G. & Raymond, E. Current development of mTOR inhibitors as anticancer agents. Nat. Rev. Drug Discov. 5, 671–688 (2006).
Article CAS PubMed Google Scholar
Iriana, S. et al. Targeting mTOR in pancreatic ductal adenocarcinoma. Front. Oncol. 6, 99 (2016).
Article PubMed PubMed Central Google Scholar
Populo, H., Lopes, J. M. & Soares, P. The mTOR signalling pathway in human cancer. Int. J. Mol. Sci. 13, 1886–1918 (2012).
Article CAS PubMed PubMed Central Google Scholar
Babiker, H. M. et al. Everolimus for the treatment of advanced pancreatic ductal adenocarcinoma (PDAC). Expert Opin. Investig. Drugs 28, 583–592 (2019).
Article CAS PubMed PubMed Central Google Scholar
Bettac, L., Denk, S., Seufferlein, T. & Huber-Lang, M. Complement in pancreatic disease-perpetrator or savior? Front. Immunol. 8, 15 (2017).
Article PubMed PubMed Central Google Scholar
Yang, J. et al. The enhancement of glycolysis regulates pancreatic cancer metastasis. Cell. Mol. Life Sci. 77, 305–321 (2020).
Article CAS PubMed Google Scholar
Nagy, A., Munkacsy, G. & Gyorffy, B. Pancancer survival analysis of cancer hallmark genes. Sci. Rep. 11, 6047 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 14, 128 (2013).
Article Google Scholar
Fujisawa, M. et al. Involvement of the interferon signaling pathways in pancreatic cancer cells. Anticancer Res. 40, 4445–4455 (2020).
Article CAS PubMed Google Scholar
Chen, K. et al. Loss of AMPK activation promotes the invasion and metastasis of pancreatic cancer through an HSF1-dependent pathway. Mol. Oncol. 11, 1475–1492 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Zhang, J. B., Liu, C. X., Mo, X. K., Shi, H. & Li, S. Mechanisms by which CXCR4/CXCL12 cause metastatic behavior in pancreatic cancer. Oncol. Lett. 15, 1771–1776 (2018).
PubMed Google Scholar
Blair, A. B. et al. Dual stromal targeting sensitizes pancreatic adenocarcinoma for anti-programmed cell death protein 1 therapy. Gastroenterology https://doi.org/10.1053/j.gastro.2022.06.027 (2022).
Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Lu, C. et al. Nuclear shape and orientation features from H&E images predict survival in early-stage estrogen receptor-positive breast cancers. Lab. Invest. 98, 1438–1448 (2018).
Article CAS PubMed PubMed Central Google Scholar
Saito, A. et al. Prediction of early recurrence of hepatocellular carcinoma after resection using digital pathology images assessed by machine learning. Modern Pathol. 34, 417–425 (2021).
Article CAS Google Scholar
Chen, S. T. et al. Machine learning-based pathomics signature could act as a novel prognostic marker for patients with clear cell renal cell carcinoma. Br. J. Cancer 126, 771–777 (2022).
Article CAS PubMed Google Scholar
Rittmann, M. C. et al. Plasma biomarkers for prediction of early tumor recurrence after resection of pancreatic ductal adenocarcinoma. Sci. Rep. 11, 7499 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Heger, U. et al. Induction chemotherapy in pancreatic cancer: CA 19-9 may predict resectability and survival. HPB 22, 224–232 (2020).
Article PubMed Google Scholar
Yang, X. et al. Alignment-free filtering for cfNA fusion fragments. Bioinformatics 35, i225–i232 (2019).
Article CAS PubMed PubMed Central Google Scholar
Mc Ardle, A. et al. Standardized workflow for precise mid- and high-throughput proteomics of blood biofluids. Clin. Chem. 68, 450–460 (2022).
Article Google Scholar
Parker, S. J. et al. Proteomics reveals Rictor as a noncanonical TGF-β signaling target during aneurysm progression in Marfan mice. Am. J. Physiol. Heart Circ. Physiol. 315, H1112–H1126 (2018).
Article CAS PubMed PubMed Central Google Scholar
Teo, G. et al. mapDIA: preprocessing and statistical analysis of quantitative proteomics data from data independent acquisition mass spectrometry. J. Proteomics 129, 108–120 (2015).
Article CAS PubMed PubMed Central Google Scholar
Bligh, E. G. & Dyer, W. J. A rapid method of total lipid extraction and purification. Can. J. Biochem. Physiol. 37, 911–917 (1959).
Article CAS PubMed Google Scholar
Röst, H. L. et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat. Biotechnol. 32, 219–223 (2014).
Article PubMed Google Scholar
Rosenberger, G. et al. A repository of assays to quantify 10,000 human proteins by SWATH-MS. Sci. Data 1, 140031 (2014).
Article CAS PubMed PubMed Central Google Scholar
Parker, S. J., Venkatraman, V. & Van Eyk, J. E. Effect of peptide assay library size and composition in targeted data-independent acquisition-MS analyses. Proteomics 16, 2221–2237 (2016).
Article CAS PubMed Google Scholar
Schmidt U., Weigert M., Broaddus C. & Myers G. Cell Detection with Star-Convex Polygons (Springer International Publishing, 2018).
Ing, N. et al. A novel machine-learning approach reveals latent vascular phenotypes predictive of renal cancer outcome. Sci. Rep. 7, 13190 (2017).
Article ADS PubMed PubMed Central Google Scholar
Cyprys P. et al. Rigid and Elastic Registrations Benchmark on Re-stained Histologic Human Ileum Images (Springer International Publishing, 2022).
Gertych, A. et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci. Rep. 9, 1483 (2019).
Article ADS PubMed PubMed Central Google Scholar
Ruifrok, A. C. & Johnston, D. A. Quantification of histochemical staining by color deconvolution. Anal. Quant. Cytol. Histol. 23, 291–299 (2001).
CAS PubMed Google Scholar
Spraker, M. B. et al. MRI radiomic features are independently associated with overall survival in soft tissue sarcoma. Adv. Radiat. Oncol. 4, 413–421 (2019).
Article PubMed PubMed Central Google Scholar
Young, I. T., Verbeek, P. W. & Mayall, B. H. Characterization of chromatin distribution in cell nuclei. Cytometry 7, 467–474 (1986).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank the Cedars-Sinai Medical Center Biobank and Pathology Shared Resource and the Cedars-Sinai Medical Center Proteomics and Metabolomics Shared Resource for their technical assistance. We also thank the investigators of the CORALE Study at Cedars-Sinai for providing normal unpaired samples for analysis. We thank J. Mota, C. Rosser and M. Bebawy from Cedars-Sinai Cancer for their technical assistance in data and project curation. We thank J. Gai from JHU for technical assistance with validation of JHU cohort data. A.O. is supported by National Institutes of Health grant K08 CA259456 and a Conquer Cancer Foundation ASCO Career Development Award. A.G. is supported by a US Department of Defense grant, Congressionally Directed Medical Research Programs grant no. W81XWH2210632.

Author information

These authors contributed equally: Arsen Osipov, Ognjen Nikolic, Arkadiusz Gertych.

Authors and Affiliations

Department of Medicine (Medical Oncology), Cedars-Sinai Medical Center, Los Angeles, CA, USA
Arsen Osipov & Andrew Hendifar
Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Arsen Osipov, Arkadiusz Gertych, Sarah Parker, Andrew Hendifar, Grant Dagliyan, Cristina R. Ferrone, Jason H. Moore, Warren Tourtellotte, Jennifer E. Van Eyk & Dan Theodorescu
Department of Oncology, Pancreatic Cancer Precision Medicine Center of Excellence, Johns Hopkins University, Baltimore, MD, USA
Arsen Osipov & Lei Zheng
Betteromics, Redwood City, CA, USA
Ognjen Nikolic, Pranav Singh & Darya Filippova
Department of Pathology and Laboratory Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Arkadiusz Gertych, Warren Tourtellotte, Jennifer E. Van Eyk & Dan Theodorescu
Department of Surgery, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Arkadiusz Gertych & Cristina R. Ferrone
Department of Biomedical Sciences and Smidt Heart Institute, Cedars−Sinai Medical Center, Los Angeles, CA, USA
Sarah Parker & Jennifer E. Van Eyk
Department of Computational Biomedicine, Cedars−Sinai Medical Center, Los Angeles, CA, USA
Jason H. Moore
Department of Urology, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Dan Theodorescu

Authors

Arsen Osipov
View author publications
You can also search for this author in PubMed Google Scholar
Ognjen Nikolic
View author publications
You can also search for this author in PubMed Google Scholar
Arkadiusz Gertych
View author publications
You can also search for this author in PubMed Google Scholar
Sarah Parker
View author publications
You can also search for this author in PubMed Google Scholar
Andrew Hendifar
View author publications
You can also search for this author in PubMed Google Scholar
Pranav Singh
View author publications
You can also search for this author in PubMed Google Scholar
Darya Filippova
View author publications
You can also search for this author in PubMed Google Scholar
Grant Dagliyan
View author publications
You can also search for this author in PubMed Google Scholar
Cristina R. Ferrone
View author publications
You can also search for this author in PubMed Google Scholar
Lei Zheng
View author publications
You can also search for this author in PubMed Google Scholar
Jason H. Moore
View author publications
You can also search for this author in PubMed Google Scholar
Warren Tourtellotte
View author publications
You can also search for this author in PubMed Google Scholar
Jennifer E. Van Eyk
View author publications
You can also search for this author in PubMed Google Scholar
Dan Theodorescu
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

A.O., O.N., A.G., S.P., W.T. and J.E. developed methodology. A.O., O.N., A.G., S.P., W.T., C.R.F. and L.Z. acquired data. A.O., O.N., A.G., S.P., A.H., P.S., D.F., J.H.M., W.T., J.E., L.Z., C.R.F. and T.E. analyzed and interpreted data. A.O., O.N., A.G. and D.T. wrote, and/or revised the manuscript. A.O. and D.T. supervised the study. D.T. developed the concept and overall design of the study. All authors reviewed and revised the final manuscript.

Corresponding author

Correspondence to Dan Theodorescu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Cancer thanks Jakob Kather and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Top Single-omic and Multi-omic Performance for Disease Survival Prediction.

Asymmetric violin plots showing accuracy and PPV distributions per analyte for predicting PDAC disease survival in decreasing order of accuracy (left to right) for multi-omic and single-omic analytes. PPV, Positive Predictive Value. Surg path, Surgical Pathology. Comp Path, Computational Pathology.

Source data

Extended Data Fig. 2 AI Modeling of Tumor and Stroma.

(a) H&E slide with the tumor area and regions of interest (ROIs) marked by pathologist (WT). Frames of selected regions were box-outlined and saved in a database file to assure reproducibility of the experiment.; b) Same area with the cancer cells mask (cyan) predicted by our AI model. Segmentation of cancer mask by DeepLabV3Plus was performed 10 times. Each time the exact same mask was outputted. Scale bar is 1mm.

Extended Data Fig. 3 Hierarchical co-clustering of 8 features extracted from tumor cell nuclei.

Cluster labels correspond to those in Fig. 4c. Cluster #1 represents patients homogeneous for their clinical outcome (all deceased) and multi-omic features. Cluster #2 represents a heterogeneous population with regards to clinical outcome while cluster #3 represents a more homogenous population compared to cluster #2. Notably, in cluster #3, patients noted to be alive at the time of analysis were strongly predicted to be deceased by the model.

Source data

Extended Data Fig. 4 Validation of the Single-omic and Multi-omic and Parsimonious Models on TCGA.

Validation of RNA gene signatures for disease survival: (a) 39 gene signature of poor survival (HR = 2.17, [1.28-3.66], logrank p = 0.0031) (b) 40 gene signature of improved survival (HR=0.74 [0.49-1.12], logrank p = 0.15) (c) Parsimonious model of clinical, DNA (CNV, INDEL, SNV), RNA gene expression and computational pathology in the original cohort used to select optimal 202 analytes (peak) for validation in TCGA. Multi-omic model performance across feature reduction steps by restricting the maximum selectable features during model training. Left y axis - Accuracy and PPV score: multi-omic model performance across feature reduction steps by restricting the maximum selectable features during model training. X axis – number of maximum features at each reduction step. Right y axis - Analyte Percent (%) Contribution: each analyte’s aggregated absolute feature weight contribution at each feature reduction step.

Source data

Supplementary information

Reporting Summary

Supplementary Tables 1–16

Table of contents and Supplementary Tables 1–16 in each tab.

Source data

Source Data Fig. 1

Complete multi-omic dataset: statistical source data.

Source Data Fig. 2

Nuclear features and computational pathology source data for analysis.

Source Data Fig. 3

Multi-omic modeling results with feature weights and statistical source data.

Source Data Fig. 4

Feature correlations, top multi-omic features statistical source data, UMAP multi-omic feature weights and informative nodes.

Source Data Fig. 5

Parsimonious learning curve complete data and statistical source.

Source Data Extended Data Fig. 1

Multi-omic modeling results with feature weights and statistical source data.

Source Data Extended Data Fig. 3

Nuclear features source data.

Source Data Extended Data Fig. 4

TCGA full multi-omic analysis set.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Osipov, A., Nikolic, O., Gertych, A. et al. The Molecular Twin artificial-intelligence platform integrates multi-omic data to predict outcomes for pancreatic adenocarcinoma patients. Nat Cancer 5, 299–314 (2024). https://doi.org/10.1038/s43018-023-00697-7

Download citation

Received: 31 August 2022
Accepted: 30 November 2023
Published: 22 January 2024
Issue Date: February 2024
DOI: https://doi.org/10.1038/s43018-023-00697-7

Subjects

Abstract

Similar content being viewed by others

Main

Results

Patient baseline demographics and specimen handling

Clinical and surgical pathology features influence outcomes

DNA analysis reveals alterations with prognostic importance

RNA signatures of drug resistance impact prognosis

Plasma proteins are critical biomarkers in survival prediction

Predictive nuclear morphology via computational pathology

Multi-omic analysis suggests hierarchical complementarity

Multi-omic models provide biological insights into PDAC

The parsimonious multi-omic models for disease survival

Validation of RNA markers as predictors of survival

Validation of multi-omic models as predictors of survival

Validation of plasma proteins as a preoperative biomarker

Discussion

Methods

Sample collection, processing and classification

Clinical data variables for the cohort

Clinical data analysis

NGS targeted genomics

RNA sequencing

Proteomics and lipid analysis

Plasma proteomics methods

Plasma lipidomics methods

Sample processing and lipid extraction

Tissue proteomics methods

Sample processing and lysis

Computational H&E slide analysis

Slides and computational pipeline overview

Training and evaluation of DeepLabV3Plus model

Nuclear features

Statistics and reproducibility

Development of machine-learning models

Validation cohorts

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links