Library: Folta Fragaria FA_SEa
The Fragaria EST data were processed at CUGI using publicly available software incorporated in a fully automated script developed in-house (ProcEST.pl). Processing occurred in three stages:
Stage I: Trace File Processing
Sequence trace files were converted into FASTA files and quality score files using the phred base-calling program (Ewing et al., 1998). Vector and host contamination was identified and masked using the sequence comparison program cross_match (Gordon et al., 1998). Vector trimming excised the longest non-masked sequence, and further trimming removed low-quality bases (phred score below 20) from both ends of each read. Sequences were discarded if they had more than 5% ambiguous bases, more than 40 polyA or polyT bases, or fewer than 100 high-quality bases (minimum phred score of 20). At this stage of processing the script generated an overall summary report file, clone report tables, a GenBank submission file, and FASTA-formatted library files of the high-quality trimmed sequences and their associated quality values. The FASTA library was not filtered to remove reads with significant similarity to species-specific mitochondrial, rRNA, tRNA or snoRNA sequences, as no Fragaria RNA sequences are currently available in GenBank.
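The end trimming and discard rules described above can be sketched in a few lines of Python. This is an illustration only, not the ProcEST.pl code: the function names are hypothetical, ambiguous bases are assumed to be recorded as N, and "more than 40 polyA or polyT bases" is assumed to mean a single uninterrupted run longer than 40.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases with a phred score below min_q from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def longest_run(seq, base):
    """Length of the longest uninterrupted run of `base` in seq."""
    best = current = 0
    for ch in seq:
        current = current + 1 if ch == base else 0
        best = max(best, current)
    return best


def passes_filters(seq, quals, min_q=20, max_ambiguous=0.05,
                   max_poly=40, min_hq_bases=100):
    """Apply the three discard rules to an already trimmed read."""
    if not seq:
        return False
    seq = seq.upper()
    # Rule 1: more than 5% ambiguous bases (assumed to be written as N).
    if seq.count("N") / len(seq) > max_ambiguous:
        return False
    # Rule 2 (assumption): a polyA or polyT run longer than max_poly.
    if longest_run(seq, "A") > max_poly or longest_run(seq, "T") > max_poly:
        return False
    # Rule 3: fewer than min_hq_bases bases at or above the phred cutoff.
    high_quality = sum(1 for q in quals if q >= min_q)
    return high_quality >= min_hq_bases
```

A read would be passed through trim_low_quality_ends first and kept only if passes_filters returns True for the trimmed result.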
Stage II: Assembly of High Quality Sequences
In stage II processing, the filtered library file was assembled using the contig assembly program CAP3 (Huang and Madan, 1999). More stringent parameters (-p 90 -d 60) were used to prevent over-assembly and to help identify potential paralogs. The assembly was refined where possible using homology to the Swiss-Prot database as an indicator of contig accuracy. Homology was determined by comparing the contigs and clones against Swiss-Prot using the fastx3.4 algorithm (Pearson and Lipman, 1988) with EXP < 1e-6. Contigs whose clones showed differing homology were deconstructed, and contigs with the same homology as other contigs were joined using default CAP3 parameters. The unigene data set was derived by combining the contig and singleton data sets.
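As a rough illustration of the refinement step, the check for contigs whose clones disagree in homology might look like the Python sketch below. The data structures and the function name are hypothetical, and the source does not say how ProcEST.pl makes this decision. Here a contig is flagged when its member clones have significant best hits (E-value below 1e-6) to more than one Swiss-Prot entry.

```python
E_CUTOFF = 1e-6  # significance threshold quoted in the text


def contigs_to_deconstruct(contig_members, best_hit):
    """Return the contig IDs whose member clones disagree on their best hit.

    contig_members: dict mapping contig ID -> list of clone IDs (hypothetical)
    best_hit: dict mapping clone ID -> (subject ID, E-value) (hypothetical)
    """
    flagged = []
    for contig, clones in contig_members.items():
        # Collect the distinct subjects hit significantly by this contig's clones.
        subjects = {
            best_hit[clone][0]
            for clone in clones
            if clone in best_hit and best_hit[clone][1] < E_CUTOFF
        }
        if len(subjects) > 1:
            flagged.append(contig)
    return sorted(flagged)
```

Clones with no significant hit are ignored in this sketch, so a contig is only flagged when there is positive evidence of conflicting homology.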
Stage III: Annotation
Annotation of the unigene data set consisted of pairwise comparison of both the filtered library and the contig consensus library file against the GenBank nr protein database using the fastx3.4 algorithm (Pearson and Lipman, 1988). The sequences were also characterized by comparison with the GenBank Rosaceae EST dataset (160,000 sequences as of 07/24/04) and the Genome Database for Rosaceae (GDR) mapped peach ESTs using the BLAST software package (Altschul et al., 1997). The most significant matches (EXP < 1e-7) for each contig and individual clone in the library were recorded. Simple Sequence Repeats (SSRs) were identified in the unigene data set using the CUGISSR.pl script and further filtered for optimal primer development according to GC content. The sequence, assembly, homology and SSR data will be stored in the GDR, facilitating efficient data querying and display. Users can view contig assemblies, clones and annotation, download the library and unigene sequence libraries, and search their own sequences against the Fragaria EST database using our BLAST/FASTA server facility.
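A minimal Python sketch of SSR detection and GC content calculation is given below for illustration. It is not the CUGISSR.pl script, whose parameters are not given here: the motif lengths (2 to 6 bases), the minimum of five repeats, and the function names are all assumptions.

```python
import re


def find_ssrs(seq, min_repeats=5, min_motif=2, max_motif=6):
    """Return (start, motif, repeat_count) for each perfect tandem repeat."""
    seq = seq.upper()
    # A short motif followed by at least (min_repeats - 1) further copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

For example, find_ssrs("TTGG" + "AC" * 6 + "GGCTC" + "GAT" * 5 + "CC") reports an AC repeat of six units starting at position 4 and a GAT repeat of five units starting at position 21. The gc_content function could then be applied to the flanking regions when screening candidates for primer design.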
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17), 3389-3402.
Ewing, B., Hillier, L., Wendl, M., and Green, P. (1998). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8, 175-185.
Gordon, D., Abajian, C., and Green, P. (1998). Consed: A graphical tool for sequence finishing. Genome Research 8, 195-202.
Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Research 9, 868-877.
Pearson, W.R. and Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, USA 85,
Copyright © 2004 | Clemson University Genomics Institute. Last updated July 28, 2004.