A machine learning approach for the identification of population-informative markers from trout genotyping data

Salvatore, G.; Palombo, V.; De Zio, E.; Maiuro, L.; Rusco, G.; Di Iorio, M.; Esposito, S.; Iaffaldano, N.; D’Andrea, M. S.

Cost-effective commercial SNP arrays are now available for several species, which has had a substantial impact on livestock as well as on the fields of natural ecology, evolution and conservation biology. Genome-wide SNP analysis is now the method of choice for characterizing natural populations. In this context, identifying the minimum number of SNPs that carries the maximum information for differentiating populations is increasingly important but challenging. Such reduced panels have interesting implications for several downstream applications, such as the allocation of individuals and comparative analyses of selection signatures. Recently, machine learning approaches, notably the random forest (RF) classifier, have been proposed for identifying the most discriminating genetic markers among thousands of SNPs. Here we used the RF algorithm to analyse genotyping data obtained with the 57K Trout BeadChip array (Affymetrix) from autochthonous and allochthonous trout populations of Molise rivers and their tributaries. The 48 highest-ranked SNPs were obtained and compared with the lists of the most informative SNPs estimated using traditional statistical approaches: Delta, FST and principal component analysis. In total, 103 specimens were enrolled in the study, drawn from a larger cohort of ~300 fish caught at 30 different sites in the Volturno and Biferno basins. The samples were chosen based on PCR-RFLP results and preliminary fine-scale population structure outcomes. The trout considered in this study were representative of four different native trout subpopulations and one Atlantic species. Four reduced informative panels were obtained, and their performance was estimated as the proportion of correct predictions from RF classification. The correct assignment of specimens to their subpopulations averaged ~92% across all tested approaches. RF shared the highest number of SNPs with the FST method (19 SNPs). Chromosome 3 harboured the largest number of selected SNPs across all panels. Six SNPs were common to all tested approaches and gave a correct assignment performance of ~69%. For the first time, SNP-array technology and machine learning were combined to identify population-informative markers in trout species. Further studies with larger populations and sample sizes are required to evaluate the validity of the approach.
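As an illustration of the traditional ranking statistics named above, the following minimal sketch computes Delta (taken here as the absolute allele-frequency difference) and a simple Wright-style FST for biallelic SNPs, then compares the top-ranked markers from each method. It is not the authors' pipeline: the abstract does not specify the estimators used, the study involved five groups rather than two, and the function names, genotype coding (0/1/2 copies of one allele) and toy data are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: genotypes are coded as the number of copies (0, 1 or 2)
# of one chosen allele carried by each diploid individual.
Genotypes = List[int]


def allele_freq(genotypes: Genotypes) -> float:
    """Frequency of the counted allele in one population."""
    return sum(genotypes) / (2 * len(genotypes))


def delta(pop_a: Genotypes, pop_b: Genotypes) -> float:
    """Delta: absolute allele-frequency difference between two populations."""
    return abs(allele_freq(pop_a) - allele_freq(pop_b))


def fst(pop_a: Genotypes, pop_b: Genotypes) -> float:
    """Simple FST = (HT - HS) / HT for two equally weighted populations."""
    p_a, p_b = allele_freq(pop_a), allele_freq(pop_b)
    p_bar = (p_a + p_b) / 2
    h_t = 2 * p_bar * (1 - p_bar)                          # total expected heterozygosity
    h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2  # mean within-population value
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def rank_snps(
    data: Dict[str, Tuple[Genotypes, Genotypes]],
    score: Callable[[Genotypes, Genotypes], float],
) -> List[str]:
    """Return SNP names sorted from most to least informative under `score`."""
    return sorted(data, key=lambda snp: score(*data[snp]), reverse=True)


# Hypothetical toy data: SNP name -> (genotypes in population A, in population B).
toy = {
    "snp_same": ([1, 1, 1, 1], [1, 1, 1, 1]),   # identical frequencies
    "snp_mid": ([2, 2, 1, 1], [1, 1, 0, 0]),    # moderate differentiation
    "snp_fixed": ([2, 2, 2, 2], [0, 0, 0, 0]),  # fixed difference
}

by_delta = rank_snps(toy, delta)
by_fst = rank_snps(toy, fst)

# Markers shared by the top-2 panels of both methods, analogous to
# counting the SNPs that different selection approaches have in common.
shared = set(by_delta[:2]) & set(by_fst[:2])
print(by_delta, by_fst, sorted(shared))
```

In a real analysis the per-SNP scores would be computed over tens of thousands of markers and several populations, and the resulting panels would then be evaluated by how well a classifier trained on them assigns individuals to their populations.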

IRIS Catalogo Istituzionale della Ricerca dell'Università degli Studi del Molise