A phylogenomic analysis of the role and timing of molecular adaptation in the aquatic transition of cetartiodactyl mammals

Recent studies have reported multiple cases of molecular adaptation in cetaceans related to their aquatic abilities. However, none of these has included the hippopotamus, precluding an understanding of whether molecular adaptations in cetaceans occurred before or after they split from their semi-aquatic sister taxa. Here, we obtained new transcriptomes from the hippopotamus and humpback whale, and analysed these together with available data from eight other cetaceans. We identified more than 11 000 orthologous genes and compiled a genome-wide dataset of 6845 coding DNA sequences among 23 mammals, to our knowledge the largest phylogenomic dataset to date for cetaceans. We found positive selection in nine genes on the branch leading to the common ancestor of hippopotamus and whales, and 461 genes in cetaceans compared to 64 in hippopotamus. Functional annotation revealed adaptations in diverse processes, including lipid metabolism, hypoxia, muscle and brain function. By combining these findings with data on protein–protein interactions, we found evidence suggesting clustering among gene products relating to nervous and muscular systems in cetaceans. We found little support for shared ancestral adaptations in the two taxa; most molecular adaptations in extant cetaceans occurred after their split with hippopotamids.

! Supplementary material. Contains information concerning taxon sampling, sequencing and RNA-Seq de novo assembly, as well as ortholog identification and data set assembly. It also provides additional information for natural selection analyses, GC content estimation, Gene Ontology (GO) enrichment analysis and, finally, network analysis of protein-protein interactions. Table S1. RNA extraction QC, RNA-Seq and assembly statistics. Table S2. Ortholog identification. Table S3. Genome-wide analysis for bursts of divergent selection Table S4. Pearson correlation test between MA model fit (LRT p-values) and ΔGC3 at the third codon position of the branch. Table S5-S6. Separate .xls file with complete lists of genes with PSSs in hippos and whales under branch-site codon and clade models. Table S7. GO terms enriched for positively selected genes in cetaceans, hippo and/or in both.

Supplementary material
Taxon sampling, sequencing and RNA-Seq de novo assembly. We obtained new RNA-Seq data from the common hippo H. amphibius as well as from the humpback whale M. novaeangliae. For the hippo, we used a pool of tissue types (muscle, skin, heart, liver, spleen, kidney, lung and neural tissue) obtained from an adult female euthanized at Copenhagen Zoo. For the humpback whale, we pooled skin biopsies from three adult males. RNA was isolated at BGI (hippo) and the Australian Marine Mammal Centre (whale). cDNA library construction followed by pair-end Illumina HiSeq sequencing was performed at BGI (see Additional file 2). CDS data from three additional cetacean species, the minke whale B. acutorostrata, the fin whale B. physalus, and the finless porpoise Neophocaena phocaenoides were from Yim et al (2014). RNA-Seq sequences for the sperm whale Physeter macrocephalus and the Indo-Pacific humpback dolphin Sousa chinensis, were downloaded from the SRA (Short Read Archive) of GenBank (Acc. Nos. SRX220350-SRX220358 and ERX283216, respectively). Similar to the Illumina data from the humpback whale and hippo, all RNA-Seq data were assembled into transcripts de novo using the program Trinity (trinityrnaseq-r2013-02-25) under the default parameters [1]. For the bottlenose dolphin T. truncatus and the killer whale Orcinus orca, assembled and annotated RNA transcripts were previously available and downloaded directly from the Ensembl database (release 70) and Genbank, respectively. In the case of the river dolphin Lipotes vexillifer, there were no transcriptome data available at the time, so we used the software MAKER2 [2] to perform a de novo gene annotation on its genome available in Genbank.

Ortholog identification and data set assembly. To build clusters of orthologous
CDSs across all sampled cetaceans and the hippo, we performed a series of reciprocal similarity searches using blastx and tblastn. We used the bottlenose dolphin and human genomes as a reference and blasted the longest protein product of each locus (~16,500 sequences in the dolphin and ~23,600 in the human genomes) against our assembled sequences for each species. Hits in the targeted sequence pool were next blasted back to the initial T. truncatus and H. sapiens proteins, and only reciprocal matches were retained as putatively orthologous sequences. Based on orthologs that we successfully identified in at least one Mysticeti and one Odontoceti species, we next populated our gene sets with single copy orthologous (one-to-one) coding sequences of other (lauriasiatherian) mammals using a custom perl script in Ensembl API [3]. The number of candidate one-to-one orthologs across eight cetancodontan species ranged from 6,249 to 16,047 genes (see Additional file 2). More orthologs were obtained in humananchored similarity searches, despite the fact that the bottlenose dolphin is a cetacean; this reflects the better quality of the human genome. One exception was the finless porpoise, due to the CDS data for this species being generated indirectly by SNP calling against bottlenose dolphin genomic sequence [4]. Overall, transcriptome sequencing yielded fewer complete CDS sequences compared to genome-based annotation methods (gene prediction or SNP calling). For example, killer whale CDSs derived from genomic data were identified for 7,981 human orthologs, whereas humpback whale CDSs derived from the transcriptome sequenced here matched 2,396 human CDSs.
The collected CDSs of each locus were next aligned as codons using PRANK v.130820 [5] and multiple sequence alignment (MSA) reliability was assessed using GUIDANCE [6]. GUIDANCE assigns scores (from 0 to 1) by comparing a set of MSAs generated by the original alignment algorithm based on bootstrap guide-trees, here 10. When a CDS alignment contained sequences with a score lower than 0.6, these were pruned from the dataset as unreliable and a new MSA was re-built from the reduced data under the original parameters, as above. Codon sites below a score of 0.93 (GUIDANCE default value) were discarded, as well as sites with gaps in more than 50% of the sequences sampled in the data set. We observed that, in many alignments, the CDS of the minke whale presented long internal deletions, possibly the result of missing exons excluded from the original annotations using gene prediction analyses [4], the accuracy of which for identifying genes in draft genomes is known to be prone to errors [7]. To account for this type of error in our data, regions with deletions in the B. acutorostrata sequences were removed from the alignment, plus 10 flanking codon positions upstream and downstream of the indel. Finally, all sequences were trimmed by 10 codons at both the 5' and 3' ends.
Natural selection analyses. In addition to the branch-site model MA, we also implemented the clade model C [8,9] to test for divergent selection pressures acting on one of the following groups in the Laurasiatheria tree [10,11]: (I) Cetancodonta (Hippopotamidae + Cetacea); (II) Cetacea; (III) Mysticeti and (IV) Odontoceti ( Figure  1). Clade model C assumes three classes of sites, which differ in their selection pressure, as measured by dN/dS or ω. In the first two site classes, ω was constrained to be under purifying selection (0 < ω 0 < 1) or neutral (ω 1 = 1), while in the third class, ω was estimated separately in foreground (ω 2 ) and background (ω 3 ) branches without constraint. Likelihood ratio tests (LRT) were used to assess the model fit of the clade models over the null model M1a (Nearly Neutral), in which two sites are constrained to fall into two classes, negatively selected (ω < 1) or neutral (ω = 1). In all LRTs log-likelihood differences were compared to the χ 2 distribution for a critical value α = 0.05 with three degrees of freedom (df). In cases where Model C had a significantly better fit over the null, only sites with a Bayes empirical Bayes (BEB) posterior probability > 0.80 were considered as significant.
We first applied the clade model C to test for divergent selection between Cetancodonta and the rest of the tree (Clade I in Figure 1). Among 6,894 loci tested, we identified 3,180 for which model C had a better overall fit when compared to the null model prior to any correction for multiple testing (see Additional file 2). Likewise, we also applied model C to each of the Cetacea, Mysticeti and Odontoceti clades (Clades II, III and IV, respectively in Figure 1) and found divergent selection pressures acting on between 4,402 and 5,635 CDSs. Overall, between 1,132 and 2,123 genes had sites with a dN/dS or ω>1 in each of the four clades of interest (see Additional file 2). After filtering for potential false positives (see Material and Methods and Additional file 2), we report 2,169 coding gene sequences under divergent selection in Cetancodonta using Model C. Of these genes, 392 show positively selected sites (PSSs), for which ω>1 in the foreground clade (see Additional files 2-3). We also recovered ~3,600 genes showing evidence of divergent selection acting only in cetaceans, among which 487 to 732 genes exhibited sites with ω>1 on the ancestral cetacean, ancestral odontocete, or ancestral mysticete branches (see Additional files 2-3).

GC content estimation.
To control for CpG islands, also known to inflate inferences of natural selection in mammalian genome data, we used the program nhPhyml [12,13] to infer the GC content at the third codon position (GC3) at each node of the laurasiatherian tree for every CDS dataset (topology was fixed, alpha parameter estimated with four distinct categories and GC equilibrium optimized). We next calculated the shift in GC3 along the five branches (ΔGC3) tested for positive selection (Figure 1), by subtracting the GC3 content between the nodes delineating the branch (as in [14]). To test whether the GC content may have affected our selection results, we calculated the Pearson correlation coefficient between ΔGC3 and the P value obtained by the LRTs for a given branch. We found no significant correlation between strength of selection and shift in GC, with the exception of a slight negative correlation detected for the ancestral branch of Odontoceti (P = 0.0159). In this case the correlation coefficient was very low (r = -0.0251), indicating that GC content is unlikely to have biased our selection analyses (see Additional file 2).
Gene ontology (GO) enrichment analysis. To investigate whether genes showing evidence of molecular adaptation shared similar functions, we used the topGO package [15] to perform an enrichment analysis for GO terms. Gene annotations for the three major GO domains -cellular component (CC), biological process (BP) and molecular function (MF)-were retrieved using Ensembl BioMart [3]. We asked whether genes with PSSs shared specific functions and performed a series of gene ontology (GO) enrichment analyses of genes showing evidence for molecular adaptations in the ancestral branches of Cetacea (pooling branches ii, iii, iv), Cetancodonta and H. amphibius. We used two statistics to look for overrepresented terms: (1) the parametric Fisher's exact test and (2) the non-parametric Kolmogorov-Smirnov or else widely known as gene set enrichment analysis (GSEA). In Fisher's test, we set the significance threshold at genes with Q ≤ 0.10. Both tests were performed following the classic approach, where significance for enrichment was calculated independently for each GO term, as well as using the elim and weight01 algorithms introduced by Alexa et al. 2006 and implemented in the topGO package. Both elim and weight algorithms were shown to improve the explanatory power of GO group scoring, by eliminating local dependencies between GO terms in the GO graph structure [15].
With respect to the last common ancestor of Cetancodonta, we found 35 significantly enriched GO terms for genes showing evidence of positive selection in Cetancondonta (branch i in Figure 1 Figure 1, see Additional file 4) recovered the term "vesicle", with roles in vesicle coating (GO:0048208, GO:0048199, GO:0006901) and vesicle targeting (GO:0048207, GO:0048199, GO:0006903). Additional GO terms detected in the hippo were associated with immunity, e.g. "antigen processing and presentation via MHC II" (GO:0002495, GO:0002474). Note again our GO results are reflective of the very low number of positively selected genes retained after our filtering steps for false positives. Therefore, we cannot exclude the possibility that a wider range of gene sets have undergone Darwinian selection in the last common ancestor of hippos and whales.
Next we employed a relaxed approach for identifying functional enrichment associated with increased ω in the foreground branch. Applying Kolmogorov-Smirnov (KS) test coupled with the elim method of the TopGO package [15], we ranked all genes based on a score drawn by the goodness-of-fit of the MA selection model (Q) and looked for outlier GO sets, as compared to a null distribution GO scores of all loci. This analysis allowed us to detect over-or underrepresented genes in 509 GO categories in Cetacea, 218 terms in Cetancodonta and 213 in H. amphibius branches within BP (see Additional file 4). We also found enrichment in 97 and 141 GO terms in all cetaceans within CC and MF domains, respectively. Finally, 44 and 49 GO groups were significantly enriched in hippo and Cetancodonta based on CC domain, whereas 46 in MF in total. A more detailed presentation of all GO terms for each branch and method is provided in Additional file 5. In cetaceans, the majority of GO terms obtained via the KS-elim method based on selection screens seemed to be related to immune response (e.g., GO:0002204, GO:0002208, GO:0002285) and metabolic process (e.g., GO:0006475, GO:0006520, GO:0006541). Analysis of cetacean gene selection results recovered enrichment also in categories related to lipid metabolism (e.g., GO:0042632, GO:0019217, GO:0016126, GO:0019433), blood clotting or platelet formation (e.g.,GO:0007596, GO:0050817), muscle, heart and brain development (e.g., GO:0051145, GO:0003205, GO:0030901), response to stress (e.g.GO:0006950, GO:0033554), hypoxia (GO:0036294) and visual perception (GO:0007601), none of which was obtained by gene set enrichment analysis of selection screen at the hippo branch or at the common ancestor of hippos and whales. Some of these GO terms however may include gene clusters for which we had not sampled the hippo sequences.