
                                 wordcount 



Function

   Count and extract unique words in DNA sequence(s)

Description

   wordcount counts and extracts all possible unique sequence words of a
   specified size in one or more DNA sequences. It writes an output file
   giving all possible words for that word size with a count of each word
   in the input sequences. Optionally, only words occurring a specified
   minimum number of times are reported.
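
   The counting described above can be sketched in Python. This is a
   minimal illustration, not wordcount's actual implementation: the
   function name count_words and its parameters are hypothetical. It
   assumes that words are read from every overlapping position, which is
   consistent with the usage example in this document (the counts for the
   1218-base entry at word size 3 sum to 1216, that is 1218 - 3 + 1). The
   order of words with equal counts is an arbitrary choice in this sketch.

```python
from collections import Counter


def count_words(sequence, wordsize, mincount=1):
    """Count overlapping words of length `wordsize` in `sequence`.

    Hypothetical helper, not part of EMBOSS. Returns (word, count) pairs
    whose count is at least `mincount`, highest count first.
    """
    if wordsize < 1 or mincount < 1:
        raise ValueError("wordsize and mincount must be 1 or more")
    seq = sequence.lower()
    # Slide a window of `wordsize` characters one position at a time.
    counts = Counter(seq[i:i + wordsize] for i in range(len(seq) - wordsize + 1))
    kept = [(word, n) for word, n in counts.items() if n >= mincount]
    # Assumption: ties between equal counts are broken alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Print in the same two-column layout as the wordcount output file.
    for word, n in count_words("atgatgcatg", 3):
        print(f"{word}\t{n}")
```

   Passing mincount=2 to the same call would keep only the words seen at
   least twice, mirroring the -mincount qualifier.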

Usage

   Here is a sample session with wordcount


% wordcount tembl:u68037 -wordsize=3 
Count and extract unique words in DNA sequence(s)
Output file [u68037.wordcount]: 


Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
   -wordsize           integer    [@($(acdprotein)? 2 : 4)] Word size (Integer
                                  1 or more)
  [-outfile]           outfile    [*.wordcount] Output file name

   Additional (Optional) qualifiers:
   -mincount           integer    [1] Minimum word count to report (Integer 1
                                  or more)

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
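
   For use from a script, the qualifiers listed above can be assembled
   into an argument list. The sketch below only builds that list; the
   helper name build_wordcount_command is hypothetical, and it uses only
   qualifiers documented in this section (-wordsize, -mincount, -outfile
   and -auto). Actually running the command assumes an EMBOSS
   installation with wordcount on the PATH.

```python
def build_wordcount_command(usa, wordsize, outfile, mincount=None):
    """Build an argument list for wordcount (hypothetical helper).

    `usa` is the input sequence USA, given positionally as in the
    sample session. `mincount` is added only when it is supplied.
    """
    cmd = ["wordcount", usa, f"-wordsize={wordsize}", f"-outfile={outfile}"]
    if mincount is not None:
        cmd.append(f"-mincount={mincount}")
    # -auto turns off prompts so the command can run unattended.
    cmd.append("-auto")
    return cmd


if __name__ == "__main__":
    cmd = build_wordcount_command("tembl:u68037", 3, "u68037.wordcount")
    print(" ".join(cmd))
    # To execute it (requires EMBOSS):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```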

Input file format

   wordcount reads one or more sequences specified by any valid USA
   (Uniform Sequence Address).

  Input files for usage example

   'tembl:u68037' is a sequence entry in the example nucleic acid
   database 'tembl'

  Database entry: tembl:u68037

ID   U68037; SV 1; linear; mRNA; STD; ROD; 1218 BP.
XX
AC   U68037;
XX
DT   23-SEP-1996 (Rel. 49, Created)
DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE   Rattus norvegicus EP1 prostanoid receptor mRNA, complete cds.
XX
KW   .
XX
OS   Rattus norvegicus (Norway rat)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea;
OC   Muridae; Murinae; Rattus.
XX
RN   [1]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   "Cloning of the rat EP1 prostanoid receptor";
RL   Unpublished.
XX
RN   [2]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   ;
RL   Submitted (26-AUG-1996) to the EMBL/GenBank/DDBJ databases.
RL   Biochemistry & Molecular Biology, Merck Frosst Center for Therapeutic
RL   Research, P. O. Box 1005, Pointe Claire - Dorval, Quebec H9R 4P8, Canada
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1218
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:10116"
FT   CDS             1..1218
FT                   /codon_start=1
FT                   /product="EP1 prostanoid receptor"
FT                   /note="family 1 G-protein coupled receptor"
FT                   /db_xref="GOA:P70597"
FT                   /db_xref="InterPro:IPR000276"
FT                   /db_xref="InterPro:IPR000708"
FT                   /db_xref="InterPro:IPR001244"
FT                   /db_xref="InterPro:IPR008365"
FT                   /db_xref="UniProtKB/Swiss-Prot:P70597"
FT                   /protein_id="AAB07735.1"
FT                   /translation="MSPYGLNLSLVDEATTCVTPRVPNTSVVLPTGGNGTSPALPIFSM
FT                   TLGAVSNVLALALLAQVAGRLRRRRSTATFLLFVASLLAIDLAGHVIPGALVLRLYTAG
FT                   RAPAGGACHFLGGCMVFFGLCPLLLGCGMAVERCVGVTQPLIHAARVSVARARLALALL
FT                   AAMALAVALLPLVHVGHYELQYPGTWCFISLGPPGGWRQALLAGLFAGLGLAALLAALV
FT                   CNTLSGLALLRARWRRRRSRRFRENAGPDDRRRWGSRGLRLASASSASSITSTTAALRS
FT                   SRGGGSARRVHAHDVEMVGQLVGIMVVSCICWSPLLVLVVLAIGGWNSNSLQRPLFLAV
FT                   RLASWNQILDPWVYILLRQAMLRQLLRLLPLRVSAKGGPTELSLTKSAWEASSLRSSRH
FT                   SGFSHL"
XX
SQ   Sequence 1218 BP; 162 A; 397 C; 387 G; 272 T; 0 other;
     atgagcccct acgggcttaa cctgagccta gtggatgagg caacaacgtg tgtaacaccc        60
     agggtcccca atacatctgt ggtgctgcca acaggcggta acggcacatc accagcgctg       120
     cctatcttct ccatgacgct gggtgctgtg tccaacgtgc tggcgctggc gctgctggcc       180
     caggttgcag gcagactgcg gcgccgccgc tcgactgcca ccttcctgtt gttcgtcgcc       240
     agcctgcttg ccatcgacct agcaggccat gtgatcccgg gcgccttggt gcttcgcctg       300
     tatactgcag gacgtgcgcc cgctggcggg gcctgtcatt tcctgggcgg ctgtatggtc       360
     ttctttggcc tgtgcccact tttgcttggc tgtggcatgg ccgtggagcg ctgcgtgggt       420
     gtcacgcagc cgctgatcca cgcggcgcgc gtgtccgtag cccgcgcacg cctggcacta       480
     gccctgctgg ccgccatggc tttggcagtg gcgctgctgc cactagtgca cgtgggtcac       540
     tacgagctac agtaccctgg cacttggtgt ttcattagcc ttgggcctcc tggaggttgg       600
     cgccaggcgt tgcttgcggg cctcttcgcc ggccttggcc tggctgcgct ccttgccgca       660
     ctagtgtgta atacgctcag cggcctggcg ctccttcgtg cccgctggag gcggcgtcgc       720
     tctcgacgtt tccgagagaa cgcaggtccc gatgatcgcc ggcgctgggg gtcccgtgga       780
     ctccgcttgg cctccgcctc gtctgcgtca tccatcactt caaccacagc tgccctccgc       840
     agctctcggg gaggcggctc cgcgcgcagg gttcacgcac acgacgtgga aatggtgggc       900
     cagctcgtgg gcatcatggt ggtgtcgtgc atctgctgga gccccctgct ggtattggtg       960
     gtgttggcca tcgggggctg gaactctaac tccctgcagc ggccgctctt tctggctgta      1020
     cgcctcgcgt cgtggaacca gatcctggac ccatgggtgt acatcctgct gcgccaggct      1080
     atgctgcgcc aacttcttcg cctcctaccc ctgagggtta gtgccaaggg tggtccaacg      1140
     gagctgagcc taaccaagag tgcctgggag gccagttcac tgcgtagctc ccggcacagt      1200
     ggcttcagcc acttgtga                                                    1218
//

Output file format

  Output files for usage example

  File: u68037.wordcount

ctg     54
gcc     53
tgg     53
ggc     51
gct     47
cgc     47
gtg     40
tgc     39
cct     38
gcg     36
cca     29
ggg     26
tcc     25
ctt     25
cag     25
ccc     24
ggt     24
ctc     23
tgt     23
ccg     22
gca     22
cgt     22
cac     22
agc     21
ttg     19
acg     19
cgg     19
tcg     18
ttc     17
cat     17
agg     17
gag     16
act     16
gtc     16
aac     15
tct     14
atc     14
gga     14
tca     13
cta     13
atg     12
acc     11
gta     11
gtt     11
aca     10
tga     10
caa     10
tac     10
gac     9
tag     9
agt     9
ttt     8
cga     7
gat     6
taa     6
aga     5
tat     5
gaa     4
aat     3
tta     3
ata     3
att     3
aag     2
aaa     1

   The file consists of two columns separated by spaces or TAB
   characters.

   The first column lists all the possible words of size wordsize. The
   second column gives the number of times each word occurs in the input
   sequence(s). In the example above, the rows are ordered by decreasing
   count.
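
   A file in this two-column layout can be read back into a script with a
   few lines of Python. The sketch below is an illustration only; the
   function name parse_wordcount is hypothetical, and it assumes every
   data line holds exactly one word and one integer count separated by
   whitespace.

```python
def parse_wordcount(text):
    """Parse wordcount-style output text into a {word: count} dict.

    Hypothetical helper, not part of EMBOSS.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()  # splits on runs of spaces or TABs
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        word, count = fields
        counts[word] = int(count)
    return counts


if __name__ == "__main__":
    # First three rows of the example output shown above.
    sample = "ctg     54\ngcc     53\ntgg     53\n"
    for word, n in parse_wordcount(sample).items():
        print(word, n)
```

   To read a real output file, pass the file contents to the function,
   for example open("u68037.wordcount").read().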

Data files

   None.

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   0 if successful.

Known bugs

   None.

See also

   Program name    Description
   backtranambig   Back-translate a protein sequence to ambiguous
                   nucleotide sequence
   backtranseq     Back-translate a protein sequence to a nucleotide
                   sequence
   banana          Plot bending and curvature data for B-DNA
   btwisted        Calculate the twisting in a B-DNA sequence
   chaos           Draw a chaos game representation plot for a
                   nucleotide sequence
   charge          Draw a protein charge plot
   checktrans      Reports STOP codons and ORF statistics of a protein
   compseq         Calculate the composition of unique words in
                   sequences
   dan             Calculates nucleic acid melting temperature
   density         Draw a nucleic acid density plot
   emowse          Search protein sequences by digest fragment
                   molecular weight
   freak           Generate residue/base frequency table or plot
   iep             Calculate the isoelectric point of proteins
   isochore        Plots isochores in DNA sequences
   mwcontam        Find weights common to multiple molecular weights
                   files
   mwfilter        Filter noisy data from molecular weights file
   octanol         Draw a White-Wimley protein hydropathy plot
   pepinfo         Plot amino acid properties of a protein sequence in
                   parallel
   pepstats        Calculates statistics of protein properties
   pepwindow       Draw a Kyte-Doolittle hydropathy plot for a protein
                   sequence
   pepwindowall    Draw Kyte-Doolittle hydropathy plot for a protein
                   alignment
   sirna           Finds siRNA duplexes in mRNA

Author(s)

   Ian Longden (il  sanger.ac.uk)
   Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge,
   CB10 1SA, UK.

History

   Completed 27th November 1998.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None
