Checking Beast2 .xml files for errors

Note: This entry assumes that you are using a Unix-like operating system, ideally GNU/Linux or alternatively OSX. It has not been tested on BSD or other systems.

I’ve been using Beast2 for a while as the Bayesian platform for a study on the effect of calibration priors, using a Neotropical fish group as the study model because of its reasonably good fossil record and the availability of DNA sequences. The study requires at least 18 different analyses to be run, with runtimes varying from one to four days in order to reach convergence1, so it can be considered computationally demanding. As I had access to a server with a good number of processors, I could “parallelize” the work by running several instances of Beast2 at the same time, each with several threads. However, even with this approach I had to limit the analyses to batches of four or five at a time so that I did not use all of the resources, because others were also running their own analyses on the same server.
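
As an illustration, such a batch can be launched from the shell; here is a minimal sketch (the file names, the thread count, and the log names are assumptions for the example, using Beast2’s -threads option):

# hypothetical batch: launch four analyses in the background, each with its own threads,
# then wait for all of them to finish before starting the next batch
for xml in run1.xml run2.xml run3.xml run4.xml
do
    beast -threads 4 "$xml" > "${xml%.xml}.screen.log" 2>&1 &
done
wait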

From time to time I noticed that a given analysis failed without producing any useful signal that I could use to catch the problem in time, so I had to wait until the end of the batch only to discover that a given analysis did not run at all because of problems in the definition of some priors. Until then I didn’t know about the -validate option of Beast2, which allows you to check that most of the xml file definition is correct and that the analysis will actually run. Plainly, this option parses a given xml file and reports any error, then exits without running the full analysis. Its usage is as follows:

beast -validate myFile.xml

where myFile.xml is the name of the xml file to be tested. However, what if we wanted to check a bunch of files automatically? Here we can combine the command-line options of Beast2 with bash control structures in order to iterate over each file matching some condition (e.g., the file extension or the file name) and check whether it is a valid xml file that Beast2 will run without problems. For simplicity, this approach assumes that you have all of your xml files in the same directory (alternative versions can avoid this requirement and traverse directory structures recursively, that is, navigate sub-directories; a recursive variant is sketched further below).

We will use six main building blocks: for, ls, beast, 2>&1, |, and grep:

  • for is an iterative control structure (a loop) already available in any Unix-like shell (e.g., on GNU/Linux or OSX operating systems).

  • ls will list the file names matching *.xml, that is, any text ending in .xml (the * matches any text, followed by the literal dot, x, m, and l at the end of the name).

  • beast is the program that we will use for Bayesian inference of divergence times (the inference itself is not covered in this post, though).

  • 2>&1 redirects standard error to standard output, so that the error messages printed to the console can be passed on to another command.

  • | is the “pipe” operator; it redirects the output of the commands on its left as input to the commands on its right. It reads “evaluate the code to the left of |, take the result, and feed it to the commands to the right of |”. In our case, it takes the combined output (including the error messages merged in by 2>&1) and uses it as the input of grep (via grep’s standard input).

  • grep is a powerful command-line tool for searching either plain text or regular expressions (not covered in this post either); in this case, it searches the output redirected by 2>&1 from beast -validate.
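
Before looping over many files, the pieces above can be combined for a single file (the same myFile.xml used earlier):

# validate one file and keep only the lines that mention an error
beast -validate myFile.xml 2>&1 | grep "Error"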

Basically, our code will pick one xml file at a time in a for loop, test with beast -validate whether it is a correct xml file, and then redirect the console output as the input for grep, which looks for specific text indicating that something is wrong with our files. The code is:

# one-liner
for i in `ls *.xml`; do beast -validate "$i" 2>&1 | grep "Error"; done

# several-lines version
for i in `ls *.xml`
do
    beast -validate "$i" 2>&1 | grep "Error"
done

If any of the xml files is corrupted, the expected output will be text containing the word Error printed to the console, indicating that a given xml file has problems. Otherwise, our code will finish silently, indicating that we can run our analyses. This quality-control step is crucial when you are submitting processes to a cluster or a server where you would otherwise have to wait until the analyses finish to find out that something went wrong.
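
As mentioned earlier, if the xml files are spread across sub-directories, a recursive variant can be built around find instead of ls; a minimal sketch:

# check every .xml file below the current directory, whatever its depth
find . -type f -name "*.xml" | while read -r i
do
    beast -validate "$i" 2>&1 | grep "Error"
done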


  1. Assessed primarily from the ESS values, but also by comparing different runs of the same analysis and by examining the behavior of the estimates along the run, checking that no trend is visible.

Bibtex and Mendeley-Desktop

 

Today I came across a weird behavior of the mendeley-desktop-generated bibtex files while formatting references for my thesis. As my PhD field is zoology, I use italics a lot for scientific names, and formatting all of them in the references by hand can be tedious, all the more so because each time mendeley syncs its database it overwrites the previously generated .bib files. Manual modification therefore requires placing the \textit command wherever it is needed, compiling with bibtex, and then compiling the whole document with pdflatex (actually with C-c C-c LaTeX RET, since I use emacs). However, as soon as I edit, add, or remove any reference in the database, the bibtex files automatically generated by mendeley are overwritten and all hand-made changes go away with them.
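
Outside emacs, that cycle boils down to roughly the following shell sequence (thesis.tex is an assumed name for the main file):

# bibtex reads the .aux file produced by the first run; the two extra
# pdflatex runs resolve the citations and the reference list
pdflatex thesis
bibtex thesis
pdflatex thesis
pdflatex thesis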

My first guess was that if I included the tag in the reference title (e.g., Phractocephalus nassi as \textit{Phractocephalus nassi}), the compiler would correctly interpret it as a latex command, assuming that mendeley desktop was exporting a plain-text version of the metadata. In fact I changed a lot of references before even testing whether it would work (shame on me), only to notice that the PDF output was showing the latex command verbatim! After a quick search I found that the mendeley user community has been asking the developers to support italics in metadata for eight years (see here), and that later on, during 2015, mendeley stated that mendeley desktop (md from here on) does not support latex/bibtex. This issue has already been addressed by Kathy Lam; her solution was implemented in python in order to automatically replace <i> with \textit{, and, in a second version, to do the same with its escaped form {\\textless}i{\\textgreater}. I first made a “manual” replacement in emacs with M-%, changing the tags to “\textit{” or “}”, but then realized that I would need to do it every single time I wanted to correct the bibtex formatting. So I took it as a good chance to learn a bit more bash and encapsulate the sed replacements into a bash script that accepts arguments, the latter being a mystery to me until then.

My implementation searches for both <i> and {\\textless}i{\\textgreater} tags (more on this later) and uses sed -i -e 's/tag1/tag2/g' file to correct the tags in place in the bibtex files. I also wanted the script to handle multiple files, since md has the option of creating one bibtex file per collection of references in the library (my current setting), so I actually have several files to convert, not just one. The argument of the script is therefore the path to the directory where md creates these files. One of them (PhDThesis.bib) is already defined as the references file in my LaTeX main document, so my thesis automatically uses the file generated by md; the format conversion thus needs to take place between the creation of the .bib files and the compilation of the .tex file, and it won’t hurt to convert the remaining files as well, for sharing or for compiling other documents.

#!/bin/bash

# name the bibtex file or path-to-file for ease of understanding
bibtexPath=$1

# replace using sed with the in-place and expression arguments, looking for the html tags
# mistranslated to latex ('<i>' = '{\textless}i{\textgreater}' and '</i>' = '{\textless}/i{\textgreater}')
# and converting them to '\textit{' and '}' respectively; the plain <i> and </i> tags are handled too.
# Also, please note that find ... | will find only the files in $bibtexPath and then feed them
# one by one to read FILE, so that each name is stored in the iterative variable $FILE
# (quoted below so that the content of the variable is taken literally, without breaking at spaces)
find "$bibtexPath" -type f | while read -r FILE
do
    echo Processing file "$FILE"
    # escaped form, produced when "Escape LaTeX special characters" is active
    sed -i -e 's/{\\textless}i{\\textgreater}/\\textit{/g' "$FILE"
    sed -i -e 's/{\\textless}\/i{\\textgreater}/}/g' "$FILE"
    # plain html form, produced when the escaping option is off
    sed -i -e 's/<i>/\\textit{/g' "$FILE"
    sed -i -e 's/<\/i>/}/g' "$FILE"
    echo Success!
done

Two things are noteworthy here: first, the script tells the user which file is being visited and whether it was successfully converted; and second, it replaces both types of tags, since mendeley can generate either depending on whether the option “Escape LaTeX special characters” is active under ‘Bibtex’ in the options. Because of the latter, this script differs from Kathy Lam’s implementation in that it applies regardless of whether the file was created with special characters escaped or not. The script is hosted in my general_scripts repository on github. To use it, either run it directly with bash, or make it executable and place it in a directory that is on your PATH:

# Option 1: run the script directly with bash
user@computer:~/path/to/script$ bash html2bibtex path/to/bibtex/files

# Option 2: make it executable and place it in a directory on your PATH
user@computer:~/path/to/script$ chmod +x html2bibtex # this makes the script executable
user@computer:~/path/to/script$ sudo cp html2bibtex path/to/executable/files/dir # such a path can be something like /usr/bin, for instance
user@computer:~$ html2bibtex path/to/bibtex/files # run the script with the path as argument

This solves the problem of the html tags, but what if we already have some reference titles with the \textit command (as was my case)? There are still two options: convert them to properly formatted latex, or deactivate the “Escape LaTeX…” option. The latter turned out to work directly after compilation, since the tags are not escaped when md generates the bibtex file; however, this approach is risky if there is any special character in a reference title (e.g., %), as it will cause problems during compilation (so far I have not run into any of these problems, though). Therefore, keeping the escaping option active is the safest way to deal with bibtex file creation, but it requires either re-converting the commands to valid latex or changing all italics in the reference titles to html tags so that the html2bibtex script can deal with them properly. I suggest using html tags instead of latex commands, since the libreoffice plugin does not understand the latter, though it formats the html tags properly. So, the most general setting is to use html tags in the metadata: that way you can either use bibtex (after converting the html tags to latex commands with html2bibtex, before compiling the .tex file) or the libreoffice plugin for generating bibliographies.
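
As a quick sanity check before compiling, you can list any .bib files that still contain unconverted tags; a minimal sketch (the path is a placeholder):

# list files still containing raw <i> tags or the escaped {\textless} form
grep -l '<i>' path/to/bibtex/files/*.bib
grep -l 'textless' path/to/bibtex/files/*.bib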

Hope this helps anyone having problems with bibtex and mendeley-generated files, especially in the biological sciences, where italicized terms are widespread.