Searching for text content in GNU/Linux files (and possibly OSX too)

Machines running GNU/Linux are very powerful tools once we get used to the command line. As an example, let us imagine that we are looking for a file named importantScript.py, a text file containing python code; we forgot where we put it but are sure about its name. The standard file search engine, called catfish, is the tool for looking for files by filename and path on the hard drive. It is similar to the search engines in Windows and OSX, possibly with the difference that Finder in OSX can apparently look for keywords inside the files (the latter property of Finder is according to my wife; I honestly stopped using OSX long, long ago).

But how about keywords inside files? For instance, we were working on a project with a lot of files written in \LaTeX with the knitr package, so that we could embed R code to be executed during document compilation. In just one out of a whole bunch of text files we used a special package but forgot how to use it (it was useful, but we have needed it just once so far), and the internet connection went down, so we cannot rely on good ol’ google to look up the usage. For reasons that don’t matter for now (probably ignorance) we also overlooked the local documentation. Are we lost? Probably not.

Let us assume we remember a keyword, maybe a package name, or another sort of keyword, for instance a macro (\mymacro in \LaTeX). We can use the command grep with the -r option in order to visit all of the files and subdirectories in the current directory looking for the keyword(s) of interest. First we open an instance of the command line (e.g., bash) and navigate to the directory from where we want to search. Passing an absolute or relative path to grep is also possible, but I leave that to the reader for exploration. After calling grep, some output will appear indicating file names followed by a colon (:) and some text; this means that the command found our keywords in some of the files in the search path. On the other hand, if the terminal gives us control again by showing the prompt ($) without printing anything, grep failed to find the keywords inside any of the files in our current path or in its subdirectories. For instance:

user@machine:~$ cd ~/Documents/millionsOfScripts
# now let's see its content
# by the way, this directory is completely hypothetical!
user@machine:~/Documents/millionsOfScripts$ ls
block  class  devices   fs          kernel  power
bus    dev    firmware  hypervisor  module
# search recursively the files in each of the directories shown above for
# the word 'subfigure'
user@machine:~/Documents/millionsOfScripts$ grep -r subfigure
module/project1/scripts4/toCompile/blocks.Rnw:\begin{subfigure}

The output above indicates that our recursive search for the term ‘subfigure’ found a single file, blocks.Rnw in the subdirectory module/project1/scripts4/toCompile, containing a line with the text \begin{subfigure}. We can then open that file and see how we used that tool in the past when working with \LaTeX. Pretty useful when we just have a bare idea of what we are looking for.
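A few extra grep options make this kind of search easier to read. The following is only a minimal sketch on a throwaway directory (the file and directory names are made up for illustration): -n adds line numbers, -F treats the pattern as literal text (handy for \LaTeX macros, whose backslash is otherwise special), -l prints only file names, and the exit status tells a script whether anything was found.

```shell
# Build a small throwaway tree to search in (hypothetical names)
workdir=$(mktemp -d)
mkdir -p "$workdir/module/project1"
printf '%s\n' 'some text' '\begin{subfigure}' > "$workdir/module/project1/blocks.Rnw"
cd "$workdir" || exit 1

# -r: recurse, -n: show line numbers, -F: pattern is literal text, not a regex
with_lines=$(grep -rnF '\begin{subfigure}' .)
printf '%s\n' "$with_lines"

# -l: print only the names of the files that contain the keyword
only_names=$(grep -rlF 'subfigure' .)
printf '%s\n' "$only_names"

# grep exits with status 0 when it finds something and 1 when it does not;
# -q keeps it quiet so that only the status is used
if grep -rqF 'nosuchkeyword' .; then found=yes; else found=no; fi
printf 'keyword found: %s\n' "$found"
```

The explicit . at the end is the directory to search; it makes the starting point visible in the output as ./module/...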

I found it useful for exactly that case: I did not remember how to use the subfigure \LaTeX package and did not want to search in google, so I just went to a directory where I had some indication that files using the package could be found (e.g., my folder with beamer presentations or my PhD folder with some of my scholarship report files, all of them in \LaTeX) and let grep look for them. Hope you find it useful and fun; it could save your life some day (like most bash commands, so learn to love the command line!)

PS: grep can take a very long time to execute if your directory has a lot of files and subdirectories, and if they are large.
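One way to keep the search short is to tell grep which files to look into and which directories to skip. This is a sketch assuming GNU grep, which provides --include and --exclude-dir; the directory layout below is invented for the example.

```shell
# Throwaway tree: one relevant .tex file, one log, one copy inside .git
workdir=$(mktemp -d)
mkdir -p "$workdir/thesis" "$workdir/.git"
printf '%s\n' '\usepackage{subfigure}' > "$workdir/thesis/main.tex"
printf '%s\n' 'subfigure in a log' > "$workdir/thesis/main.log"
printf '%s\n' 'subfigure in git' > "$workdir/.git/notes.tex"
cd "$workdir" || exit 1

# --include: only open files whose name matches the glob
# --exclude-dir: do not descend into directories with this name
narrow=$(grep -rl --include='*.tex' --exclude-dir='.git' 'subfigure' .)
printf '%s\n' "$narrow"
```

Only ./thesis/main.tex is reported: the log is skipped because of its extension and the .git copy because of its directory.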

New species (to science, sometimes actually)

This blog post is about the real meaning of an expression commonly found in technical biodiversity jargon that tends to be misunderstood by people outside the field, commonly journalists and media consumers: the concept of new species.

Despite its literal meaning, what scientists mean by “new” is not what the public expects. I’ve found myself in the situation of being criticized by media consumers (people outside biology) for claiming that a species is “new”. I can remember a specific case when Farlowella yarigui was described (Ballen & Mojica, 2014). This work received a lot of media coverage thanks to the interest of journalists from Unimedios in my work with fishes collected during one of the courses at the Universidad Nacional when I was still an undergrad student. For some reason, that news post was “recycled” by some other newspapers/journals in their news section on environment (e.g., El Espectador) and posted on their social media sites.

Figure 1
Farlowella yarigui, a species from the Magdalena basin in Colombia.

Some of the readers claimed that the species can by no means be new since they have seen that fish before, sometimes since they were young, and in a few cases implied that scientists lie to the public by saying that the species is new! I did not take it as an attack because I acknowledge the ambiguity of saying that a species is “new”; in fact, at least three different cases fall within the concept of “new” to science as opposed to “new” to everyone in the world:

  • A species can be mistakenly called by a scientific name that in fact belongs to another species. This happens when biologists fail to recognize that the scientific name applied to a given species is used erroneously, usually because we are confusing two species under the same name. This is analogous to calling the music of both Nirvana and Death “metal”: we are erroneously calling Nirvana ‘metal’ when it is in fact Grunge. A good example is calling the tiger catfish of the Magdalena river basin in Colombia Pseudoplatystoma fasciatum when it is not that species but another one, which was given the name Pseudoplatystoma magdaleniatum after the differences between both species were recognized (Buitrago-Suarez & Burr, 2007).

  • A species can be known to be around but without a scientific name to date. In this case, scientists recognize that it is not any of the species already properly named, but further work could still be needed in order to apply a different name to it. This happens when the differences among species of a group are difficult to ascertain, or when more specimens in museums are needed in order to be confident that we are not naming a species that already has a name. This is similar to finding something striking and unfamiliar to us, such as when astronomers name a planet that had already been seen but did not have a name yet.

  • A species can be completely unknown to everyone and, consequently, lack a proper scientific name. This happens with rare, cryptic, or difficult-to-find living things, such as fishes from deep waters that even fishermen usually cannot catch; these fishes can spend their whole lives at incredible depths, living and dying without the slightest chance of reaching the surface. In these cases, neither the public nor the scientific community has been in contact with such organisms before. They are, in a sense, really new species.

In all three cases the technical name in taxonomy (the biological field that aims at documenting living things and naming them scientifically) is “new species”, sometimes even species nova in Latin. We need to understand that taxonomic rules started formally during the 18th century with the works of Carolus Linnaeus, and by that time Latin was the mandatory language of the natural sciences. As a consequence, even today we call them by the translation of the expression species nova into modern languages such as English, the current standard language in science.

From time to time different scientists decide to avoid such ambiguity by stating that the species is not a new one but an undescribed one, or even avoid such implications altogether by saying that a scientific paper proposes a scientific name for a given species. As an example I can think of Grant et al.’s 2007 paper describing Allobates niputidea, a frog species from Colombia, entitled in fact “A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia”.

That said, a scientific name is actually defined as a special name that we apply under certain clear rules in order to avoid ambiguity among scientists, and whose meaning cannot change among languages, therefore being stable. The advantages of using such a system are that we avoid calling the same thing by several names, sometimes even common names, and that once we use a scientific name our peers know exactly which kind of organism we are referring to. It is in some sense like using our given-and-family name combination (e.g., John Doe) in order to ascertain our own uniqueness among people sharing our given or family name (e.g., John Doe is different from both John Jameson and Evelyn Doe); however, even such a combination is subject to synonyms, something that biological nomenclature tries to avoid by using such rules.

This ambiguity suggests at least two possible solutions: to avoid using “new species” in article names in favor of “undescribed species”, or to undertake the titanic labour of explaining to people what we mean by new species (titanic given that there are many more people out there who are not part of the scientific community; actually these groups differ by orders of magnitude). This blog post is an effort in the latter direction, as the term “new species” is so widespread in scientific literature that it might not be practical to abandon it at all. A second benefit of the latter alternative is that it forces us scientists to interact with people outside our field in order to make science accessible to the public.

References

Ballen, G. A., & Mojica, J. I. (2014). A new trans-Andean Stick Catfish of the genus Farlowella Eigenmann & Eigenmann, 1889 (Siluriformes: Loricariidae) with the first record of the genus for the río Magdalena Basin in Colombia. Zootaxa, 3765(2), 134-142.

Buitrago-Suarez, U. A., & Burr, B. M. (2007). Taxonomy of the catfish genus Pseudoplatystoma Bleeker (Siluriformes: Pimelodidae) with recognition of eight species. Zootaxa, 1512(1), 1-38.

Grant, T., Acosta, A., & Rada, M. (2007). A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia. Copeia, 2007(4), 844-854.

Checking .xml Beast2 files for errors

Note: This entry assumes that you are using a Unix-like operating system, ideally GNU/Linux and alternatively OSX. Not tested on BSD or others.

I’ve been using Beast2 for a while as the bayesian platform for a study on the effect of calibration priors, using a Neotropical fish group as study model because of its reasonable fossil record and the availability of DNA sequences. The study requires at least 18 different analyses to be run, with runtime varying from one to four days in order to reach convergence [1], so it can be described as computationally intensive. As I had access to a server with a good number of processors, I could “parallelize” the analysis by running several instances of beast2 at the same time, each with several threads. However, even with this approach I had to limit the analyses to batches of four to five at a time so that I did not use all of the resources, because others were also running their own analyses on the server.

From time to time I noticed that a given analysis failed without producing any useful signal that I could use for spotting the problem in time, so I had to wait until the end of the batch to discover that a given analysis did not even run because of problems in the definition of some priors. Until then I didn’t know about the -validate option of Beast2, which checks that most of the xml file definition is OK so that the analysis will then run. Plainly, this option will attempt to run a given xml file and either report any error or exit without completing the analysis. Its usage is as follows:

beast -validate myFile.xml

where myFile.xml is the name of the xml file to be tested. However, what if we wanted to check a bunch of files automatically? Here we can use the command-line options of Beast2 along with bash control structures in order to iterate over each file under some condition (e.g., the file extension, or the file name) and then check whether it is a valid xml file that Beast2 will run without problems. For simplicity, this approach assumes that you have all of your xml files in the same directory (alternative versions can avoid this requirement and traverse path structures recursively, that is, navigate sub-directories).

We will use six main software pieces: for, ls, beast, 2>&1, |, and grep:

  • for is an iterative control structure already available in the shell of any unix-based system (e.g., GNU/Linux or OSX operating systems).

  • ls will show a list of file names composed of any text terminating in .xml (*.xml reads any text and then dot, x, m, and l at the end of the name).

  • beast is the program that we will use for bayesian inference of divergence times (not covered in this post though).

  • 2>&1 redirects standard error (file descriptor 2) to standard output (file descriptor 1), so that error messages, which would otherwise go straight to the console, can also be passed as input to another command.

  • | is the “pipe” operator; it feeds the output of the command on its left as input to the command on its right. It reads “evaluate the code to the left of |, take the result and input it to the command to the right of |”. In our case, it takes the text merged by 2>&1 and hands it to grep through its standard input.

  • grep is a powerful command-line tool for searching either simple text or regular expressions (not covered in this post either), in this case in the content that reaches it from beast -validate through 2>&1 and |.
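To see what 2>&1 changes, here is a small sketch that does not need Beast2 at all. The function noisy is a hypothetical stand-in for any program that prints progress on standard output and problems on standard error.

```shell
# Hypothetical stand-in for a program that reports problems on standard error
noisy() {
    echo "Reading file"            # goes to standard output
    echo "Error: bad prior" >&2    # goes to standard error
}

# Without 2>&1 only standard output enters the pipe, so grep never sees the
# error line (standard error is discarded here to keep the console clean);
# '|| true' just ignores grep's "nothing found" exit status
without=$(noisy 2>/dev/null | grep "Error" || true)

# With 2>&1 standard error is merged into standard output before the pipe
with=$(noisy 2>&1 | grep "Error")

echo "without 2>&1: '$without'"
echo "with 2>&1: '$with'"
```

The first variable ends up empty, the second one holds the error line.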

Basically our code will pick one xml file at a time in a for loop, test with beast -validate whether it is a correct xml file, and then pass the text that would otherwise be printed on the console as the input for grep, which looks for specific text indicating that something is wrong with our files. The code is then:

# one-liner
for i in $(ls *.xml); do beast -validate "$i" 2>&1 | grep "Error"; done

# several-lines version
for i in $(ls *.xml)
do
    beast -validate "$i" 2>&1 | grep "Error"
done

The expected output in the presence of corrupted xml files will be text containing the word Error printed to the console, indicating that a given xml file contained errors. Otherwise, our code will finish silently, indicating that we can run our analyses. This quality-control step is crucial for cases where you are submitting processes to a cluster or a server and, for whatever reason, need to wait until completion of the analyses to see their results.
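A possible refinement, sketched under assumptions: looping over the glob *.xml directly (instead of the output of ls) keeps file names with spaces in one piece, and printing the file name tells us which file failed. The function validate below is a hypothetical stand-in so that the sketch runs without Beast2 installed; in practice its call would be replaced by beast -validate.

```shell
# Hypothetical stand-in for `beast -validate`: it complains on standard error
# when the file has no <beast tag. Swap in the real command in practice.
validate() {
    if grep -q '<beast' "$1"; then
        echo "Done validating $1"
    else
        echo "Error: $1 is not a valid file" >&2
    fi
}

# Two throwaway files (made-up names), one good and one broken
workdir=$(mktemp -d)
printf '%s\n' '<beast version="2.0"></beast>' > "$workdir/good run.xml"
printf '%s\n' '<oops>' > "$workdir/bad.xml"
cd "$workdir" || exit 1

failed=""
for f in *.xml; do
    # -q: grep prints nothing, we only use its exit status
    if validate "$f" 2>&1 | grep -q "Error"; then
        echo "FAILED: $f"
        failed="$failed$f "
    fi
done
```

After the loop, the variable failed lists the offending files, which can be fixed before submitting the batch.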


  1. Assessed primarily from the ESS value but also comparing different runs of the same analysis, and also by examining the behavior of the estimation along the analysis so that no pattern is detected. 

Bibtex and Mendeley-Desktop

 

Today I came across a weird behavior of mendeley-desktop-generated bibtex files while formatting references for my thesis. As my PhD field is zoology, I use italics a lot for scientific names, and formatting all of them in the references by hand can be boring, even more so because each time mendeley syncs its database it overwrites the previously generated .bib files. So, manual modification requires placing the textit command wherever it is needed, compiling with bibtex, and then compiling the whole document with pdflatex (actually with C-c C-c LaTeX RET since I use emacs). However, as soon as I edit, add or remove any reference in the database, the automatic bibtex files generated by mendeley are overwritten and all hand-made changes go away with them.

My first guess was that if I included the command in the reference title (e.g., Phractocephalus nassi as \textit{Phractocephalus nassi}) the compiler would understand it correctly as a latex command, assuming that mendeley desktop was generating a plain-text version of the metadata. In fact I changed a lot of references before even testing whether it would work or not (shame on me), just to note that the PDF output was showing the latex command too! After a quick search I found that the mendeley user community has been asking the developers to support italics in metadata for eight years now (see here). Later, in 2015, mendeley stated that mendeley desktop (md from here on) does not support latex/bibtex.

This issue has already been addressed by Kathy Lam; her solution was implemented in python in order to automatically replace <i> with \textit{ and, in a second version, its escaped form {\textless}i{\textgreater}. I first made a “manual” replacement in emacs with M-% “tags” to “\textit{” or “}”, but then realized that I would need to do it every single time I wanted to correct the bibtex formatting; I then found it a good chance to learn a bit more bash so that I could encapsulate sed replacements into a bash script that accepts arguments, the latter being until then a mystery to me.

My implementation searches for both <i> and {\textless}i{\textgreater} tags (more on this later) and uses sed -i -e 's/tag1/tag2/g' file to correct the tags in place in the bibtex files. I also wanted the script to manage multiple files, since md has the option of creating a bibtex file for each collection of references in my library (my current setting), so I actually have multiple files to be converted, not just one. The argument of my script is then the path to the directory where md creates these files. One of them (PhDThesis.bib) is already defined as the references file in my LaTeX main document; therefore, my thesis document automatically uses the file generated by md, so format conversion needs to take place between creation of the .bib file and compilation of the .tex file, and it won’t hurt to convert the remaining ones as well for sharing or for compilation of other documents.
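The substitutions can be tried on a single line before touching any file. This is only a sketch: the two sample titles are made up, and sed is run without -i so that it reads standard input and prints the result instead of editing in place.

```shell
# Two sample titles (made up for illustration), one per tag style
plain='title = {A new species of <i>Phractocephalus</i>}'
escaped='title = {A new species of {\textless}i{\textgreater}Phractocephalus{\textless}/i{\textgreater}}'

# One -e expression per tag; \\ stands for a literal backslash
fix_tags() {
    sed -e 's/{\\textless}i{\\textgreater}/\\textit{/g' \
        -e 's/{\\textless}\/i{\\textgreater}/}/g' \
        -e 's/<i>/\\textit{/g' \
        -e 's/<\/i>/}/g'
}

from_plain=$(printf '%s\n' "$plain" | fix_tags)
from_escaped=$(printf '%s\n' "$escaped" | fix_tags)
printf '%s\n' "$from_plain" "$from_escaped"
```

Both inputs should come out as the same line, with the genus name wrapped in \textit{...}.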

#!/bin/bash

# name the bibtex file or path-to-file for ease of understanding
bibtexPath=$1

# Replace using sed with the in-place (-i) and expression (-e) arguments, looking for
# html italics tags in both forms that mendeley can produce: plain ('<i>' and '</i>')
# and mistranslated to latex ('{\textless}i{\textgreater}' and '{\textless}/i{\textgreater}').
# Opening tags become '\textit{' and closing tags become '}'.
# Also, please note that find ... | will find only files in $bibtexPath and then feed
# them one by one to read so that each name passes to the iterative variable $FILE
# (between quotation marks so that names are not broken at spaces).
find "$bibtexPath" -type f | while IFS= read -r FILE
do
    echo "Processing file $FILE"
    # report success only if sed finished without errors
    sed -i \
        -e 's/{\\textless}i{\\textgreater}/\\textit{/g' \
        -e 's/{\\textless}\/i{\\textgreater}/}/g' \
        -e 's/<i>/\\textit{/g' \
        -e 's/<\/i>/}/g' \
        "$FILE" && echo "Success!"
done

Two things are noteworthy here: first, the script tells the user which file is being visited and whether it was successfully converted; and second, it replaces both types of tags, since mendeley can generate either depending on whether the option “Escape LaTeX special characters” is active under ‘Bibtex’ in the options. Because of the latter, this script differs from Kathy Lam’s implementation, as it applies regardless of whether the file was created escaping special characters or not. The script is housed in my general_scripts repository on github. To use it, either run it through bash directly, or make it executable and place it in a directory listed in your PATH:

# Option 1
user@computer:~/path/to/script$ bash html2bibtex path/to/bibtex/files

# Option 2
user@computer:~/path/to/script$ chmod +x html2bibtex # this makes the script executable
user@computer:~/path/to/script$ sudo cp html2bibtex path/to/executable/files/dir # such a path can be something like /usr/local/bin for instance
user@computer:~$ html2bibtex path/to/bibtex/files # run the script with the path as argument

This solves the problem of html tags, but what if we already have some reference titles with the \textit command (as was my case)? There are still two options: convert them to properly formatted latex, or deactivate the “Escape LaTeX…” option. The latter proved to work directly after compilation, since commands are not escaped when md generates the bibtex file; however, this approach is dangerous if there is any special character in the reference name (e.g., %), as it will cause problems during compilation (so far I have not run into any of these problems, though). Therefore, activating the escaping option is the safest way to deal with bibtex file creation, but it requires either re-converting the commands to valid latex or changing all italics in reference names to html tags so that the html2bibtex script can deal with them properly. I suggest using html tags instead of latex commands, since the libreoffice plugin does not understand the latter, though it formats the html tags properly. So, the most general setting is to use html tags in the metadata so that you can use either bibtex (after conversion of html tags to latex commands with html2bibtex and before compilation of the .tex file) or the libreoffice plugin for generating bibliographies.

Hope this helps anyone having problems with bibtex and mendeley-generated files, especially in the biological sciences, where italicized terms are widespread.