Abstract

Diffuse large B-cell lymphoma (DLBCL) is a clinically and molecularly heterogeneous disease. The increasing recognition and targeting of genetically defined DLBCLs highlight the need for robust classification algorithms. We previously characterized recurrent genetic alterations in DLBCL and identified five discrete subtypes, Clusters 1-5 (C1-C5), with unique mechanisms of transformation, immune evasion, candidate treatment targets and different outcomes following standard first-line therapy. Herein, we validate the C1-C5 DLBCL taxonomy in an independent dataset, use the expanded series of 699 primary DLBCLs to develop a probabilistic molecular classifier, and confirm its performance in an independent test set. Using our previously assigned cluster labels as a reference, we systematically compared multiple machine learning models and strategies for input feature dimensionality reduction with a newly developed performance metric that captured the relationship between accuracy and confidence of class assignments. The winning neural network model, DLBclass, assigned all cases in the training/validation and independent test sets with 91% and 89% accuracies, respectively. In the 75% of cases with confidence >0.7, DLBclass assignments were accurate in 97% of the training/validation set and 98% of the test set. DLBclass enables robust prospective classification of single cases for inclusion in genetically guided clinical trials or practice and represents a framework for the development of genomic-based classification methods in other cancers.