GenBank Research Reference Overviews

Background Reference

General Strategies Reference

Potential Research Reference

Syntax Reference

Semantics Reference

Redundancy Reference

Inconsistency Reference

Irrelevancy Reference

Development Reference


Background Reference
GenBank (1999), Dennis A. Benson, Mark S. Boguski, David J. Lipman, James Ostell, B. F. Francis Ouellette, Barbara A. Rapp, et al. Nucleic Acids Research
Data cleaning paper and research group
GenBank Documentation

Sample records

Bad data warning over public gene databases

Journal article on the need to clean up GenBank
other BioDB collections

GeneDB (curated)

Swiss-Prot (curated)

Peter Sterk and Stephan Beck

The Up-to-Date Status of Major Genome Sequencing Projects: The Genome MOT
Pursuant to agreements made at their 2002 Collaborative Meeting, DDBJ/EMBL/GenBank have undertaken the collection of a new class of sequence data: Third-Party Annotation (TPA).

\Document GenBank.htm

In order to assure that the sequence annotation is of high quality, it is required that TPA records be associated with a study published in a peer-reviewed journal before the data is released to the public.

\Document GenBank.htm

FASTA format description

>gi|22136741|gb|AY133756.1| Arabidopsis thaliana clone U18350 putative copper/zinc superoxide dismutase (At2g28190) mRNA, complete cds

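The header line shown above follows the older NCBI layout of pipe-separated fields: the literal tag gi, the GI number, a database tag, the accession with its version, and then a free-text description. A minimal Python sketch of splitting such a line is given here; the names Defline and parse_defline are hypothetical helpers, not part of any NCBI tool, and the sketch assumes this exact five-field layout.

```python
from typing import NamedTuple


class Defline(NamedTuple):
    gi: str
    database: str
    accession: str
    description: str


def parse_defline(line: str) -> Defline:
    """Split a '>gi|<number>|<db>|<accession>|<description>' header line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # maxsplit=4 keeps any '|' inside the free-text description intact.
    fields = line[1:].strip().split("|", 4)
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("not a gi|number|db|accession|description header")
    _, gi, database, accession, description = fields
    return Defline(gi, database, accession, description.strip())


header = (">gi|22136741|gb|AY133756.1| Arabidopsis thaliana clone U18350 "
          "putative copper/zinc superoxide dismutase (At2g28190) mRNA, complete cds")
record = parse_defline(header)
print(record.gi, record.database, record.accession)
```

Headers in other layouts are rejected with a ValueError rather than guessed at, which suits a cleanup setting where malformed lines should be flagged.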
A New File Format and Tools for the Large-Scale Data Submission to DNA Data Bank of Japan (DDBJ)

Sequence Databases: GenBank

Entrez based resource

Steps and tips for downloading GenBank

NCBI's Genome Annotation Pipeline

Biological database fundamentals

The BioCatalog

Dr Ian Collet, bioinformatics lecturer at Queensland University of Technology

More than 71% of all GenBank entries and 40% of the individual nucleotides in the database are derived from EST sequences

Schuler, G.D. 1997. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 75:694-698.

General Strategies Reference

Rich links to resources

related tool:
Bioinformatics Laboratory,

BioInfoBank Institute

BioInfo.PL is the home page of a group of Polish scientists working in the field of Bioinformatics. The site is meant to promote our scientific and academic activity. It contains several useful bioinformatics links and local services focused mainly on the prediction and analysis of the structure and function of proteins or genes.
At the beginning of 2002, a team of biologists and programmers launched a new free bioinformatics resource. This site offers:

- Collected information in searchable databases [incl. GBK, SPRT, PIR and many of the other major databases available];

- Algorithms [BLAST, ClustalW, 3D modeler, 2D prediction and many others];

- Storage for the files that users generate with the algorithms and search processes.

Servers are located in Bulgaria, at the following address:
DNannotator (Chunyu Liu, 2001)

Tools for integration of annotation for regional genomic sequences

Special uses of terms by DNannotator

Annotation: Used in its narrow sense meaning mapping of features to genomic DNA sequences.

Customized: Users supply their own annotation source data, such as SNPs, genes, STSs, oligos etc., and their preferred target gDNA sequence for annotation.

High Throughput: Maps batches of source data (prepared by users) onto one gDNA sequence.

Genomic region: A genomic region smaller than about 30 Mb.

DNannotator is a supplement to public annotation efforts such as NCBI's Map Viewer, UCSC's Genome Browser or Sanger's Ensembl. The user can merge annotation from all sources of public annotation, and from their own findings, onto the genomic region of interest.
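
The idea of mapping a batch of user-supplied features onto one target gDNA sequence can be illustrated with a small Python sketch. This is not DNannotator's method: it assumes short features such as oligos, uses exact string matching only, and the names map_features, find_all and reverse_complement, as well as the sequences, are made up for the example.

```python
def reverse_complement(seq: str) -> str:
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(target: str, query: str) -> list[int]:
    """Return 0-based start positions of every exact match, overlaps included."""
    positions, start = [], target.find(query)
    while start != -1:
        positions.append(start)
        start = target.find(query, start + 1)
    return positions


def map_features(target: str, features: dict[str, str]) -> list[tuple[str, int, int, str]]:
    """Map each named feature onto the target; report 1-based inclusive coordinates."""
    target = target.upper()
    hits = []
    for name, seq in features.items():
        seq = seq.upper()
        for pos in find_all(target, seq):
            hits.append((name, pos + 1, pos + len(seq), "+"))
        rc = reverse_complement(seq)
        if rc != seq:  # skip the second pass for palindromic features
            for pos in find_all(target, rc):
                hits.append((name, pos + 1, pos + len(seq), "-"))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up target region and oligo batch, for illustration only.
target = "ACGTTGCAAGGCCTTAACG"
batch = {"oligo1": "GTTGCA", "oligo2": "TTAAGG"}
hits = map_features(target, batch)
for hit in hits:
    print(hit)
```

A real annotation tool would need to tolerate mismatches and gaps, for example by running an alignment program, so exact matching here only conveys the batch-to-one-target shape of the task.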

Potential Research Reference
R. Apweiler, P. Kersey, V. Junker, A. Bairoch (AKJB01)

Technical comment to "Database verification studies of SWISS-PROT and GenBank" by Karp et al.

Bioinformatics, 2001, 17, 6, 533-534

P.D. Karp, S. Paley, J. Zhu (KPZ01)

Database verification studies of SWISS-PROT and GenBank.

Bioinformatics, 2001, 17, 6, 526-532
Late-Night Thoughts on the Sequence Annotation Problem

Sarah J. Wheelan and Mark S. Boguski

Syntax Reference

Sequence tools

GI Retrieval - A script to extract GI numbers from BLAST output

Batch Entrez - Get GenBank records using GI

Name Formatter - Format GenBank DEFINITION entry

NN - Secondary structure prediction. NOTE: This method is in development, so confidence is very limited.

GB Format - GenBank data formatting

get UNF - Get sequence from unfinished genomes
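
As an illustration of the first tool in this list, extracting GI numbers from BLAST text output can be sketched in a few lines of Python. This is not the listed script: it assumes hit lines carry identifiers in the gi|number| form, the function name extract_gi_numbers is hypothetical, and the second hit in the sample text is invented.

```python
import re

# A GI appears as 'gi|<digits>|' at the start of an NCBI-style identifier.
GI_PATTERN = re.compile(r"\bgi\|(\d+)\|")


def extract_gi_numbers(blast_text: str) -> list[str]:
    """Return GI numbers in order of first appearance, without repeats."""
    seen, ordered = set(), []
    for gi in GI_PATTERN.findall(blast_text):
        if gi not in seen:
            seen.add(gi)
            ordered.append(gi)
    return ordered


# Sample text: the first identifier is real, the second is made up.
sample = """\
gi|22136741|gb|AY133756.1| Arabidopsis thaliana clone U18350 ...   120   3e-25
gi|11111111|gb|XX000001.1| hypothetical second hit ...              98   1e-18
>gi|22136741|gb|AY133756.1| Arabidopsis thaliana clone U18350 ...
"""
gi_numbers = extract_gi_numbers(sample)
print("\n".join(gi_numbers))
```

The resulting list, one GI per line, is the kind of input a batch retrieval service expects.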

related tool:

GenBank tool

Genome Project Submission Account guidelines
Comments and tips for Java XML-based GenBank parsers: BioJava, Sun's JAXP API, jaxp.jar, parser.jar, crimson.jar, Xerces
GenBank parser problem in Biopython

GenBank parser problem in BioPerl

General GenBank parser in Perl

GenBank tool Genquire

Sequin is a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. It is capable of handling simple submissions which contain a single short mRNA sequence, and complex submissions containing long sequences, multiple annotations, segmented sets of DNA, or phylogenetic and population studies.

Data cleanup before submitting to GenBank.

Semantics Reference

PubCrawler - Automated Retrieval of PubMed and GenBank Reports

Redundancy Reference

SPTR - A comprehensive, non-redundant and up-to-date view of the protein sequence world
J. Gorodkin, C. Zwieb, B. Knudsen (GZK01)

Semi-automated update and cleanup of structural RNA alignment databases.

Bioinformatics, 2001, 17, 7, 642-645
DNannotator (Chunyu Liu, 2001)

CLEANUP (Grillo G., Attimonelli M., Liuni S., and Pesole G.)

Grillo, G., Attimonelli, M., Liuni, S., and Pesole G. (1996). CABIOS 12, 1-8.

CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases

NRDB (Warren Gish )

ICAass (Jeremy Parsons)

ICAtools: Medium-to-large scale DNA sequencing analysis
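
The redundancy that tools in this section target can be shown in its simplest form with a Python sketch that drops sequences identical to, or fully contained in, a longer sequence. This is a deliberate simplification and not the algorithm of CLEANUP, NRDB or ICAtools, which handle similarity and scale very differently; the function name remove_redundant and the sample records are assumptions for the example.

```python
def remove_redundant(records: dict[str, str]) -> dict[str, str]:
    """Keep only sequences that are not contained in an already kept sequence."""
    # Longest first, so each sequence is compared only against equal or longer ones.
    by_length = sorted(records.items(), key=lambda item: len(item[1]), reverse=True)
    kept: dict[str, str] = {}
    for name, seq in by_length:
        seq = seq.upper()
        if any(seq in longer for longer in kept.values()):
            continue  # identical to, or a substring of, a kept sequence
        kept[name] = seq
    return kept


# Made-up records: 'c' duplicates 'a' in lower case, 'b' is a fragment of 'a'.
records = {
    "a": "ACGTACGTAA",
    "b": "CGTACG",
    "c": "acgtacgtaa",
    "d": "TTTTGGGG",
}
non_redundant = remove_redundant(records)
print(sorted(non_redundant))
```

The sketch ignores reverse-complement and near-identical matches, and its pairwise substring test is quadratic in the number of records, so it only conveys the idea on small inputs.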

Inconsistency Reference

DNannotator (Chunyu Liu, 2001)

A utility that prepares raw DNA sequence fragments for sequence assembly. This sequence cleanup program includes quality assessment, confidence reassurance, vector trimming and vector removal. The software is freely available.
M.Y. Galperin, E.V. Koonin (GaKo98)

Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption.

In Silico Biology, 1998
S.E. Brenner (Bre99)

Errors in genome annotation

Trends in Genetics, 1999, 15, 4, 132-133
A. Felsenfeld, J. Peterson, J. Schloss, M. Guyer (FPSG99)

Assessing the quality of the DNA sequence from The Human Genome Project.

Genome Research, 1999, 9, 1-4
C. Medigue, M. Rose, A. Viari, A. Danchin (MRVD99)

Detecting and Analyzing DNA sequencing errors: Toward higher quality of the Bacillus subtilis genome sequence.

Genome Research, 1999, 9, 1116-1127
P. Bork (Bor00)

Power and pitfalls in sequence analysis: The 70% hurdle

Genome Research, 2000, 10, 398-400
R. Guigo, P. Agarwal, J.F. Abril, M. Burset, J.W. Fickett (GAABF00)

An assessment of gene prediction accuracy in large DNA sequences.

Genome Research, 2000, 10, 1631-1642
D. Devos, A. Valencia (DeVa01)

Intrinsic errors in genome annotation.

Trends in Genetics, 2001, 17, 8, 429-431
Graziano Pesole, Sabino Liuni, Giorgio Grillo and Cecilia Saccone

UTRdb: a specialized database of 5'- and 3'-untranslated regions of eukaryotic mRNAs

J. Posfai, R.J. Roberts (PoRo92)

Finding errors in DNA sequences.

Proc. Natl. Acad. Sci. USA, 1992, 89, 4698-4702
J.-M. Claverie (Cla93)

Detecting frame shifts by amino acid sequence comparison.

J. Mol. Biol., 1993, 234, 1140-1157
G.A. Fichant, Y. Quentin (FiQu95)

A frameshift error detection algorithm for DNA sequencing projects.

Nucleic Acids Research, 1995, 23, 15, 2900-2908
S. Schweigert, P.V.G. Herde, P.R. Sibbald (SHS95)

Issues in incorporating semantic integrity in molecular biological object-oriented databases.

Comp. Appl. Biosci., 1995, 11, 4, 339-347
P. Bork, A. Bairoch (BoBa96)

Go hunting in sequence databases but watch out for the traps.

Trends in Genetics, 1996, 12, 10, 425-427
U. Bhatia, K. Robinson, W. Gilbert (BRG97)

Dealing with Database Explosion: A cautionary note.

Science, 1997, 276, 1724-1725

Irrelevancy Reference
QIAGEN product line

PCR (Polymerase Chain Reaction) cleanup

Gel extraction, enzymatic reaction cleanup

Nucleotide removal

Dye-terminator removal.

reaction cleanup

A concise guide to cDNA microarray analysis, BioTechniques, 29(3), Sept. 2000, 548-562

Qbiogene product line

PerkinElmer product line


MagneSil™ Sequencing CleanUp

Ultra Clean PCR Cleanup kit (MoBio Laboratories), free kit

Development Reference

A set of Unix utilities called filtersites, for genome data manipulation and cleanup processing, was found on

Some cleanup software can be downloaded for free at
Bioinformatics free software


R. Kimball (Kim96)

Dealing with dirty data. DBMS, September 1996

A. Maydanchik (May99)

Challenges of Efficient Data Cleansing.

Published in DM Direct in September 1999

J.I. Maletic, A. Marcus (MaMa00)

Data Cleansing: Beyond Integrity Analysis.

Proceedings of the Conference on Information Quality, October 2000
E. Rahm, Hong Hai Do (RaDo00)

Data Cleaning: Problems and current approaches.

IEEE Bulletin of the Technical Committee on Data Engineering, 2000, 23, 4
D. Bitton, D.J. DeWitt (BDeW83)

Duplicate record elimination in large data files.

ACM Transactions on Database Systems, 1983, 8, 2, 255-265
M.A. Hernandez, S.J. Stolfo (HeSt95)

The merge/purge problem for large databases.

Proceedings of the ACM SIGMOD Conference, 1995
A.E. Monge, C.P. Elkan (MoEl97)

An efficient domain-independent algorithm for detecting approximately duplicate database records.

Proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, 1997
Mong Li Lee, Hongjun Lu, Tok Wang Ling, Yee Teng Ko (LLLK99)

Cleansing data for mining and warehousing.

Proceedings of the 10th International Conference on Database and Expert Systems Applications, Florence, Italy, August 1999
H. Galhardas, D. Florescu, D. Shasha, E. Simon (GFSS99)

An extensible framework for data cleaning.

INRIA Technical Report, 1999
H. Galhardas, D. Florescu, D. Shasha, E. Simon (GFSS00a)

Declaratively cleaning your data using AJAX.

16èmes Journées Bases de Données Avancées (BDA), Blois, France, October 2000
H. Galhardas, D. Florescu, D. Shasha, E. Simon (GFSS00b)

AJAX: An extensible data cleaning tool.

Proceedings of the ACM SIGMOD on Management of data, Dallas, TX USA, May 2000

H. Galhardas, D. Florescu, D. Shasha, E. Simon, C.-A. Saita (GFSSS01a)

Improving data cleaning quality using a data lineage facility.

Proceedings of the 3rd International Workshop on Design and Management of Data Warehouses, Interlaken, Switzerland, June 2001

H. Galhardas, D. Florescu, D. Shasha, E. Simon, C.-A. Saita (GFSSS01b)

Declarative data cleaning: Language, model, and algorithms.

Proceedings of the 27th VLDB Conference, Roma, Italy, 2001
Mong Li Lee, Tok Wang Ling, Wai Lup Low (LLL00)

IntelliClean: A knowledge-based intelligent data cleaner.

Proceedings of the ACM SIGKDD, Boston, USA, 2000

VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases. This web page is designed to help researchers identify and remove any segments of vector origin prior to sequence analysis or submission.
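
To make the idea of vector screening concrete, here is a minimal Python sketch that flags segments of a query sharing exact k-mers with a known vector sequence. It is not how VecScreen works internally and tolerates no mismatches; the function name vector_segments, the choice of k = 12, and both sequences are assumptions made for the example.

```python
def vector_segments(query: str, vector: str, k: int = 12) -> list[tuple[int, int]]:
    """Return 1-based inclusive query intervals covered by k-mers found in the vector."""
    query, vector = query.upper(), vector.upper()
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    segments: list[tuple[int, int]] = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            start, end = i + 1, i + k
            if segments and start <= segments[-1][1] + 1:
                # Overlapping or adjacent hit: extend the previous segment.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments


# Made-up sequences: the query starts with the last 16 bases of the vector.
vector = "GGATCCGAATTCAAGCTTGCGGCCGC"
insert = "ATGACCATGATTACGCCAAGC"
query = vector[-16:] + insert
flagged = vector_segments(query, vector)
print(flagged)
```

Flagged intervals would then be trimmed or masked before analysis or submission. Screening the reverse complement of the query as well is left out for brevity.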

