Test sets from random trees
===========================


For the data sets 24_tax.tar.gz and 96_tax.tar.gz
-------------------------------------------------

The trees were generated by the stochastic speciation process
described by Kuhner and Felsenstein (1994). Deviation from the
molecular clock was obtained by multiplying every branch length by an
exponentially distributed variable. The parameter of this exponential
distribution was tuned to produce a realistic deviation. The following
values are the distribution quantiles of the ratio between the lengths
of the longest and the shortest lineages from the root to the taxa;
under a strict molecular clock this ratio is equal to 1.

24 taxa 
0%: 1.28, 25%: 1.78, 50%: 2.12, 75%: 2.51, 100%: 5.31

96 taxa
0%: 1.46, 25%: 2.13, 50%: 2.37, 75%: 2.62, 100%: 4.22
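
The ratio summarised above can be recomputed from any tree file. The
following Python sketch is only an illustration, not the code used to
build these data sets: it assumes plain rooted NEWICK strings without
quoted labels or comments, and the names parse_newick, perturb and
lineage_ratio, as well as the "mean" parameter, are hypothetical (the
value actually used for the exponential distribution is not given
here).

```python
import random
import re

DELIMS = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse one Newick string into nested (name, length, children) tuples."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in DELIMS:
            label = tokens[pos]           # "name:length", either part optional
            pos += 1
        name, _, length = label.partition(":")
        return name.strip(), float(length) if length.strip() else 0.0, children

    return node()


def perturb(tree, rng, mean=1.0):
    """Multiply every branch length by its own exponential variate.

    `mean` is a placeholder: the tuned value is not documented.
    """
    name, length, children = tree
    factor = rng.expovariate(1.0 / mean)  # expovariate takes the rate 1/mean
    return name, length * factor, [perturb(c, rng, mean) for c in children]


def root_to_tip_depths(tree, depth=0.0):
    """Return the path length from the starting node down to every leaf."""
    _, length, children = tree
    depth += length
    if not children:
        return [depth]
    return [d for c in children for d in root_to_tip_depths(c, depth)]


def lineage_ratio(tree):
    """Longest over shortest root-to-leaf path (root branch ignored)."""
    _, _, children = tree
    depths = [d for c in children for d in root_to_tip_depths(c)]
    return max(depths) / min(depths)
```

On a clock-like tree such as "((A:1,B:1):1,C:2);" lineage_ratio
returns 1; after perturb it is at least 1, and the quantiles above
describe how far it departs from 1 in the distributed trees.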

The average maximum pairwise divergence is close to 0.4 in both the
24-taxon and the 96-taxon trees. The following values are the
distribution quantiles of the maximum pairwise divergence.

24 taxa
0%: 0.26, 25%: 0.36, 50%: 0.40, 75%: 0.45, 100%: 0.70

96 taxa
0%: 0.30, 25%: 0.37, 50%: 0.41, 75%: 0.45, 100%: 0.65

Trees are given in NEWICK format; there is one tree per line.

Sequences of length 500 were obtained using Seq-Gen 1.1 (Rambaut and
Grassly 1997). We used the Kimura two-parameter model (Kimura 1980),
with a transition/transversion ratio equal to 2. Sequences are given
in PHYLIP interleaved format, one data set after the other.
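
Since each sequence file holds many data sets one after the other, a
reader has to split them on the "taxa sites" header lines. The Python
sketch below is one possible way to do this, under stated assumptions:
taxon names contain no spaces and are separated from the sequence by
whitespace, and blank lines only separate blocks. The function name
read_phylip_sets is hypothetical.

```python
def read_phylip_sets(lines):
    """Yield one {name: sequence} dict per data set.

    `lines` is any iterable of text lines holding interleaved PHYLIP
    data sets written one after the other.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]   # drop blank lines
    i = 0
    while i < len(rows):
        # Header line: number of taxa, then number of sites.
        n_taxa, n_sites = (int(x) for x in rows[i].split()[:2])
        i += 1
        names, seqs = [], []
        # First block: every row starts with the taxon name.
        for _ in range(n_taxa):
            name, *chunks = rows[i].split()
            names.append(name)
            seqs.append("".join(chunks))
            i += 1
        # Later blocks: sequence only, in the same taxon order.
        while len(seqs[0]) < n_sites:
            for k in range(n_taxa):
                seqs[k] += "".join(rows[i].split())
                i += 1
        yield dict(zip(names, seqs))
```

The function accepts an open file handle directly (for example one
opened on 24_seq), so the data sets can be processed one at a time
without loading the whole file.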


For the data sets 24_xxx_seq.gz and 96_xxx_seq.gz
-------------------------------------------------

These data sets contain homologous sequences only. For the "small"
data sets, every branch length of the original trees was divided by
2.0. These trees were then used to generate new sequence data sets
(500 bp long sequences generated with Seq-Gen 1.1 under the K2P model,
with a transition/transversion ratio equal to 2). For the "large" data
sets, every branch length of the original trees was divided by 0.4
(i.e. multiplied by 2.5).
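
The rescaling described above can be reproduced on the original tree
files. The following Python sketch shows one way to do it, assuming
plain NEWICK strings in which every branch length follows a colon; it
is not the script that produced these data sets, and the name
scale_branch_lengths is hypothetical.

```python
import re

# A colon followed by a decimal number, with an optional exponent.
_BRANCH = re.compile(r":\s*((?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def scale_branch_lengths(newick, divisor):
    """Return the Newick string with every branch length divided by divisor."""
    def repl(match):
        return ":" + repr(float(match.group(1)) / divisor)
    return _BRANCH.sub(repl, newick)


# "small" trees: divide by 2.0; "large" trees: divide by 0.4.
print(scale_branch_lengths("((A:1.0,B:3):0.5,C:2e-1);", 2.0))
```

Only the branch lengths change; the topology and the taxon names are
left untouched.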


Files:
------

The files 24tax.tar.gz and 96tax.tar.gz have to be decompressed. On
UNIX systems, use the following commands:

tar -zxvf 24tax.tar.gz; tar -zxvf 96tax.tar.gz ;

The following directories will appear: 24tax and 96tax

In  each of these  directories you'll  find trees  (24_tree, 96_tree),
and sequences (24_seq, 96_seq).

For the files 24_xxx_seq.gz and 96_xxx_seq.gz, type:
gzip -d 24_xxx_seq.gz
or
gzip -d 96_xxx_seq.gz
You can then use the sequence files 24_xxx_seq or 96_xxx_seq.


----------------------------------------------------------------------
BIBLIOGRAPHY

Kuhner, M and Felsenstein, J (1994). A simulation comparison of
phylogeny algorithms under equal and unequal evolutionary
rates. Mol. Biol. Evol., 11:459-468.

Rambaut, A and Grassly, N (1997). Seq-Gen: an application for the
Monte Carlo simulation of DNA sequence evolution along phylogenetic
trees. Comput. Appl. Biosci., 13:235-238.

Kimura, M (1980). A simple method for estimating evolutionary rates
of base substitutions through comparative studies of nucleotide
sequences. J. Mol. Evol., 16:111-120.
----------------------------------------------------------------------


These data sets were generated by Stephane Guindon:
guindon@lirmm.fr.