Table 1.

Comparison of times required to cluster sequences into OTUs for distance cutoffs ranging between 0.00 and 0.10 for various clustering algorithms and input data formats when applied to full-length, V13, and V35 16S rRNA gene sequences^a

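| Algorithm | Approach^b | Wall time (min), full length | Wall time (min), V13 | Wall time (min), V35 |
|---|---|---|---|---|
| Average neighbor | Traditional | 61.63 | 59.22 | 65.77 |
| Average neighbor | Unique | 61.63 | 42.68 | 38.17 |
| Average neighbor | Sparse | 27.25 | 8.12 | 30.58 |
| Average neighbor | Split-8 | 24.82 | 11.43 | 30.90 |
| Average neighbor | On-the-fly | 6,085.97 | 2,848.80 | 6,035.52 |
| Weighted neighbor | Traditional | 63.87 | 59.63 | 63.67 |
| Weighted neighbor | Unique | 63.87 | 43.17 | 38.28 |
| Weighted neighbor | Sparse | 20.30 | 7.75 | 24.28 |
| Weighted neighbor | Split-8 | 24.70 | 11.50 | 28.73 |
| Weighted neighbor | On-the-fly | 7,597.98 | 3,396.17 | 7,852.87 |
| Furthest neighbor | Traditional | 61.27 | 56.50 | 62.85 |
| Furthest neighbor | Unique | 61.27 | 43.23 | 39.00 |
| Furthest neighbor | Sparse | 0.53 | 0.15 | 0.25 |
| Furthest neighbor | Split-8 | 2.80 | 1.32 | 1.92 |
| Furthest neighbor | Online | 3.28 | 1.33 | 2.57 |
| Nearest neighbor | Traditional | 65.30 | 61.90 | 66.72 |
| Nearest neighbor | Unique | 65.30 | 45.38 | 39.83 |
| Nearest neighbor | Sparse | 0.53 | 0.15 | 0.25 |
| Nearest neighbor | Split-8 | 2.80 | 1.35 | 1.92 |
| Nearest neighbor | On-the-fly | 3.25 | 1.28 | 2.50 |
| CD-HIT | UniqSeq | 88.13 | 15.90 | 10.00 |
| UClust | UniqSeq | 11.85 | 2.98 | 2.63 |
| ESPRIT | UniqSeq | 6,361.85 | 228.45 | 390.70 |
| BlastClust | UniqSeq | 919.52 | 165.67 | 187.47 |
| Phylotype | UniqSeq | 46.38 | 10.38 | 12.08 |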
  • a Although the V13 and V35 16S rRNA gene sequences are comparable in length, the V35 16S rRNA gene sequences took longer to cluster because there were more pairwise distances among sequences in that region that were smaller than 0.10 than were found in the other data sets. All times represent the “wall time” in minutes required for each analysis using the computer system described in Materials and Methods.

  • b The “traditional” approach represented all 14,956 sequences according to a PHYLIP-formatted lower-triangular distance matrix. The “unique” approach used only the sequences that were not identical to each other over their full length, again according to a PHYLIP-formatted lower-triangular distance matrix. The “sparse” approach used only the sequences that were not identical to each other over their full length according to a sparse matrix format. The “split-8” approach split the sparse data format into mutually exclusive submatrices and clustered the submatrices in parallel by using 8 processors. The “on-the-fly” data format used the sparse data format but processed the distance matrix without reading the entire matrix into memory. The “UniqSeq” approach represented the data by using only unique, unaligned, FASTA-formatted sequences.
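
The “sparse” and “split” data formats described in footnote b can be illustrated with a minimal sketch. It is not the implementation that was timed in Table 1. The function names `to_sparse` and `split_groups` are hypothetical, and two points are assumptions: that a sparse entry is kept when its distance is at or below the cutoff, and that “mutually exclusive submatrices” means groups of sequences with no retained distance linking one group to another.

```python
from collections import defaultdict


def to_sparse(names, lower_tri, cutoff=0.10):
    """Convert a lower-triangular distance matrix to a sparse pair list.

    lower_tri[i] holds the distances from sequence i to sequences 0..i-1.
    Assumption: a pair is retained when its distance is <= cutoff.
    """
    sparse = []
    for i, row in enumerate(lower_tri):
        for j, dist in enumerate(row):
            if dist <= cutoff:
                sparse.append((names[i], names[j], dist))
    return sparse


def split_groups(names, sparse):
    """Group sequences so that no retained distance links two groups.

    Assumption: the split is by connected components, found here with
    a simple union-find. Each group could then be clustered on its own.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in sparse:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    names = ["A", "B", "C", "D"]
    lower_tri = [[], [0.02], [0.30, 0.25], [0.40, 0.35, 0.05]]
    pairs = to_sparse(names, lower_tri)
    print(pairs)
    print(split_groups(names, pairs))
```

In this toy case only two of the six pairwise distances survive the cutoff, and the four sequences fall into two independent groups. Under the stated assumptions, that is the kind of reduction that would let a sparse input be read and clustered faster than a full matrix, and let the groups be processed in parallel.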