This page contains information about the software for Bayesian estimation of bacterial communities (BEBaC), which is currently available for Linux environments.


Source code is available upon request.

A bug was found in the fine clustering phase of the 12.06.2012 BEBaC version. The compiled 64-bit Linux version has been updated. The 32-bit Linux version is no longer supported due to a lack of 32-bit computers.

1. You can now check the quality scores of the consensus sequences with the command: viewQuality.
2. Sequencing reads from different groups of samples, such as healthy and diseased people, are often combined. You can run the BEBaC analysis on the combined dataset and then check how the derived OTUs are distributed over the different groups with the command: calGroupDis. Please note that a pre-defined group file "testseq.groups" has been added to the test dataset.
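
The idea behind a per-group OTU distribution can be sketched in a few lines of Python. This is only an illustration of the tabulation, not the calGroupDis implementation: the function name, the in-memory inputs (one OTU label and one group name per read) and the toy data are all assumptions, and the real layout of "testseq.groups" and of the output file may differ.

```python
from collections import Counter, defaultdict


def group_otu_counts(otu_labels, read_groups):
    """Count how many reads of each group fall into each OTU.

    otu_labels  -- one OTU label per read, in input order
    read_groups -- one group name per read, in the same order (assumed layout)
    Returns {group: {otu: number of reads}}.
    """
    if len(otu_labels) != len(read_groups):
        raise ValueError("need exactly one group per read")
    counts = defaultdict(Counter)
    for otu, group in zip(otu_labels, read_groups):
        counts[group][otu] += 1
    # Convert to plain dicts with OTUs in sorted order for stable printing.
    return {group: dict(sorted(c.items())) for group, c in counts.items()}


# Toy data, invented for illustration only.
otus = [1, 1, 2, 2, 2, 3]
groups = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]
dist = group_otu_counts(otus, groups)
for group in sorted(dist):
    print(group, dist[group])
```

Raw counts are returned here; dividing each count by the group total would give proportions instead.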


You can find the user manual here.

BEBaC(32bit, Linux) is no longer available!

BEBaC(64bit, Linux) is available here.


Test dataset (4 taxa, 1600 reads) is available here.

Simulated dataset (11 taxa, 22K reads) is available here.

eMC dataset (21 species, 91K reads) is available here.
IMPORTANT: This dataset was preprocessed from the eMC dataset of Haas et al. Please cite their paper if you use it.


1. Add the BEBaC directory to your path

   Here we take Ubuntu as an example. Open a terminal and type:

       gedit .bashrc

   Append the following line to the end of the file (you have to replace "PATH_TO_BEBaC" with your BEBaC directory):

       export PATH=PATH_TO_BEBaC:$PATH

   Now save the file and close the terminal, then start a new terminal.

2. Install MCR

   If you have Matlab (2010a), you do not need to install MCR. To install MCR, type:

       chmod u+x MCRInstaller_unix_2010a_64bit.bin
       ./MCRInstaller_unix_2010a_64bit.bin

3. Install MUSCLE

   See instructions here:

4. Configure the file

       gedit

   Specify the installation directory of your MCR (or MATLAB) in the file.
   Specify the path to the MUSCLE software on your machine.

5. Run the BEBaC analysis

   Here we use the Test dataset (4 taxa, 1600 reads) for illustration. Open a new terminal and type:

       mkdir OUTPUT_DIRECTORY
       cd OUTPUT_DIRECTORY

   Then copy testseq.fasta to OUTPUT_DIRECTORY. The sequences in testseq.fasta should contain only "ACGT"; other characters such as "N" and "-" are not allowed.

   Type:

       preprocSeq testseq.fasta .

   You will get reads.mat.

   Type:

       preGroup reads.mat . initCluster

   You will get pregroup_initial_K=4.mat and a folder "pgdist".

   Type:

       preGroup reads.mat . calDisMat 1

   You will get the distance matrix file pgdist/dismat1.mat of initial cluster 1.

   Type:

       preGroup reads.mat . calDisMat 2:4

   You will get the distance matrix files for initial clusters 2, 3 and 4.

   Type:

       preGroup reads.mat . pregroup

   You will get the pregroup result file pregroup_final_K=102.mat.

   Type:

       clusterL1 20 pregroup_final_K\=102.mat .

   This performs crude clustering. You will get 4 crude clusters, and the result file is L1_clusters_K=4.mat. "20" is the maximum number of crude clusters.

   Type:

       clusterL2 1 L1_clusters_K\=4.mat .

   This performs fine clustering for crude cluster 1. You will get a new folder "L1-clusters", and its subfolder "1" contains the fine clustering results.

   Type:

       clusterL2 2:4 L1_clusters_K\=4.mat .

   This performs fine clustering for crude clusters 2 to 4. You will get subfolders "2", "3" and "4" under the folder "L1-clusters".

   Type:

       fetchConsensus 4 .

   This fetches the consensus sequences of the OTUs. You will get a folder "results":
   "conseq.fasta" and "conseq.qual" are the consensus sequence and quality files.
   "" are a graph which shows the quality of each consensus sequence.
   "crudeLabels.txt" and "fineLabels.txt" show the partition of the input sequences.

   The BEBaC analysis is now finished.

6. Examples of the extra commands

   Type:

       viewQuality final_result.mat "1 3:4"

   This views the quality scores of consensus sequences 1, 3 and 4. You will see a figure displaying the quality scores and other information.

   Type:

       calGroupDis final_result.mat testseq.groups groupOTUdistribution.txt

   This calculates the OTU distribution for each pre-defined group.
   "testseq.groups" contains the pre-defined group information of each read.
   "groupOTUdistribution.txt" is the output file; each column contains the OTU distribution of one pre-defined group.

   Type:

       seqAlnCluster L1-clusters/1/seqs.aln 4 tmp.mat

   This applies CT's clustering algorithm (see the fine clustering section in our paper) to the sequences in crude cluster 1 in this example.
   "L1-clusters/1/seqs.aln" is the multiple sequence alignment file in FASTA format.
   "4" is the maximum number of clusters.
   "tmp.mat" stores the output information in MATLAB format. You will also get a file "tmp.mat.txt", which shows the partition of the input sequences.
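
Step 5 requires that the input sequences contain only "ACGT". The sketch below shows one possible way to check a FASTA file for other characters before running preprocSeq. It is not part of BEBaC: the function name is hypothetical, and the strict, case-sensitive check is an assumption, since this page does not say how BEBaC treats lowercase letters.

```python
def find_invalid_reads(fasta_lines, allowed="ACGT"):
    """Return (header, disallowed_characters) for each read that breaks the rule.

    fasta_lines -- any iterable of text lines, e.g. an open file object
    allowed     -- characters a sequence may contain (case-sensitive here)
    """
    allowed_set = set(allowed)
    bad_by_read = {}  # header -> set of disallowed characters, in input order
    header = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            # Sequence line (possibly one of several for the same read).
            bad = set(line) - allowed_set
            if bad:
                bad_by_read.setdefault(header, set()).update(bad)
    return [(h, "".join(sorted(b))) for h, b in bad_by_read.items()]


# Toy input, invented for illustration only.
example = [">read1", "ACGTACGT", ">read2", "ACGNNT", "AC-T", ">read3", "TTGA"]
problems = find_invalid_reads(example)
for name, chars in problems:
    print(name, "contains disallowed characters:", chars)
```

To check a real file, pass an open file object, for example `with open("testseq.fasta") as fh: find_invalid_reads(fh)`.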


1. How should I select the maximum number of crude clusters?

   In theory you can set it as large as you want, so that the resulting partition has fewer OTUs than the number you specified. However, setting it to a large number will make the computation too slow. We therefore suggest that you set it to a reasonable number (less than 200). BEBaC will automatically double your input maximum number if it thinks it is too small, up to 16 times the input number.

2. How do I interpret the numbers in crudeLabels.txt and fineLabels.txt?

   Each line corresponds to one read in the input sequence file, i.e. the ith line corresponds to the ith read. The number is the crude/fine cluster label of that read, i.e. reads with the same label belong to the same crude/fine cluster.

3. How many reads can BEBaC analyze? What about the read length?

   Usually a dataset with 20~500K reads can be handled by BEBaC. For larger datasets, please contact the author for possible solutions. The read length is expected to be longer than 200 bp.
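
As an illustration of the label files described above, the following Python sketch reads one label per line and summarizes the partition. It is not part of BEBaC; the function names are hypothetical, and it assumes that each non-empty line holds a single integer label and that blank lines occur only at the end of the file.

```python
from collections import Counter


def read_labels(lines):
    """Parse one integer cluster label per line; the ith label belongs to the ith read."""
    return [int(line) for line in lines if line.strip()]


def cluster_sizes(labels):
    """Return {cluster label: number of reads}, sorted by label."""
    return dict(sorted(Counter(labels).items()))


def reads_in_cluster(labels, cluster):
    """Return the 1-based positions of the reads assigned to one cluster."""
    return [i for i, label in enumerate(labels, start=1) if label == cluster]


# Toy label lines, invented for illustration only.
fine = read_labels(["1\n", "1\n", "2\n", "3\n", "2\n", "1\n"])
sizes = cluster_sizes(fine)
members = reads_in_cluster(fine, 2)
print(sizes)
print(members)
```

For a real run, the lines would come from a file such as `open("fineLabels.txt")`, and the read positions can be matched against the order of the reads in the input FASTA file.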