Note: This entry assumes that you are using a Unix-like operating system, ideally GNU/Linux or alternatively OSX. It has not been tested on BSD or others.
I’ve been using Beast2 for a while as the Bayesian platform for a study on the effect of calibration priors, with a Neotropical fish group as the study model because of its reasonable fossil record and the availability of DNA sequences. The study requires at least 18 different analyses, with runtimes from one to four days in order to reach convergence1, so it can fairly be called computationally exhaustive. As I had access to a server with a good number of processors, I could “parallelize” the work by running several instances of beast2 at the same time, each with several threads. Even with this approach, however, I had to limit the analyses to batches of four or five at a time so that I did not use up all of the resources, because others were also running their own analyses on the server.
From time to time I noticed that a given analysis had failed without producing any useful signal that would let me catch the problem in time, so I had to wait until the end of a run only to discover that a given analysis had not even started because of problems in the definition of some priors. Until then I didn’t know about the `-validate` option of Beast2, which checks that most of the xml file definition is OK and that the analysis will therefore run. Plainly, this option will attempt to run a given xml file and report any error, or exit without completing the analysis. Its usage is as follows:
```shell
beast -validate myFile.xml
```
Here `myFile.xml` is the name of the xml file to be tested. But what if we wanted to check a bunch of files automatically? We can use the command-line options of Beast2 along with bash control structures in order to iterate over each file meeting some condition (e.g., a given file extension or file name) and then check whether it is a valid xml file that Beast2 will run without problems. For the simple case, this approach assumes that all of your xml files are in the same directory (though alternative versions can avoid this requirement and traverse directory structures recursively, that is, navigate sub-directories).
We will use six main building blocks:
- `for` is an iterative control structure already available in any Unix-based system (e.g., GNU/Linux or OSX operating systems).
- `ls` will show a list of file names; `ls *.xml` restricts the list to names composed of any text ending in .xml (any text, then a dot, then x, m, and l at the end of the name).
- `beast` is the program that we will use for Bayesian inference of divergence times (not covered in this post, though).
- `2>&1` redirects the standard error stream (2) into standard output (1), so that error messages printed to the console can be passed on to another command.
- `|` is the “pipe” operator; it redirects the output of the commands on its left as input to the commands on its right. It reads “evaluate the code to the left of `|`, take the result, and feed it to the commands to the right of `|`”. In our case, it will take the text merged by `2>&1` and use it as the input of `grep` (as its last argument, actually).
- `grep` is a powerful command-line tool for searching either simple text or regular expressions (not covered in this post either); in this case, it searches the content redirected by `2>&1`.
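Before involving Beast2, the interplay of `2>&1`, the pipe, and `grep` can be seen with any command that writes to standard error. Here is a minimal stand-alone demonstration, using `ls` on a file that (presumably) does not exist:

```shell
# ls complains on stderr about the missing file; 2>&1 merges that message
# into stdout so the pipe can hand it to grep, which counts matching lines.
# Without 2>&1, grep would receive nothing and the error would hit the screen.
ls no_such_file.xml 2>&1 | grep -c "no_such_file"
```

This prints `1`: the error message never reaches the screen on its own, because `grep` consumed it through the pipe.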
Basically, our code will pick one xml file at a time in a `for` loop, test with `beast -validate` whether it is a correct xml file, and then redirect the console text as the input for `grep`, which looks for specific text indicating that something is wrong with our files. The code is then:
```shell
# one-liner
for i in `ls *.xml`; do beast -validate "$i" 2>&1 | grep "Error"; done

# several-lines version
for i in `ls *.xml`
do
    beast -validate "$i" 2>&1 | grep "Error"
done
```
The expected output in the presence of corrupted xml files is text containing the word `Error` printed to the console, indicating that a given xml file contains errors. Otherwise, our code will finish silently, indicating that we can run our analyses. This quality-control step is crucial when you are submitting processes to a cluster or a server where, for whatever reason, you have to wait until the analyses complete.
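As mentioned earlier, a variant of this check can also traverse sub-directories. Below is a sketch of one such recursive version built on `find` instead of `ls`; it also echoes which file failed, which the plain `grep` output alone does not tell you. The `validate_tree` function name is my own invention, and the validator command is passed as an argument (e.g., `validate_tree beast`) rather than hard-coded, so adapt it to your setup:

```shell
# Sketch of a recursive quality-control pass, assuming the validator
# behaves like `beast -validate` (prints "Error" for a bad xml file).
validate_tree() {
    validator=$1
    # find walks the current directory and all sub-directories for .xml files
    find . -type f -name '*.xml' | while IFS= read -r f; do
        # same trick as before: merge stderr into stdout, look for "Error"
        if "$validator" -validate "$f" 2>&1 | grep -q "Error"; then
            echo "problem in $f"
        fi
    done
}
```

As with the one-directory loop, silence means all files passed. One caveat of this sketch: the `read` loop breaks on file names that contain newlines, which are rare in practice.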
- Assessed primarily from the ESS values, but also by comparing different runs of the same analysis, and by examining the behavior of the estimates along the run so that no trend is detected. ↩