Referencing zoological name authorships in LaTeX

Motivation

While preparing a manuscript in \LaTeX I came across an issue with references for the authors of taxonomic entities (e.g., species and genus names). Initially I wrote taxonomic authors as plain text, hoping to solve the cross-referencing later by citing the same work elsewhere in the text. However, some references did not need to be cited anywhere except at the name itself, and this pushed me to look for a solution. My first idea was to code a variant of the \citet and \citep commands of the natbib package, which is very useful for handling bibliographies and in-text references.

The problem with referencing the authors of names is that the International Code of Zoological Nomenclature (ICZN) has specific rules for citing such information in taxonomic works, and to the uninitiated the result may look like a regular in-text reference. Also, several pre-ICZN works handled these cases inconsistently, as there was no Code to rule them all (fortunately we now have one).

This case is specific to zoological nomenclature and may not be directly useful for botanical nomenclature, which uses author abbreviations (e.g., Planch. for Planchon, or even L. for Linnaeus); the instructions below would therefore need modification to render the author in abbreviated form while retaining the full name in the bibliography. I don’t even know whether that is possible at all.

Please note that all instances of Author1990 are examples of a reference key that you need to define in your BibTeX bibliography file (in this example, manuscript.bib). A typical BibTeX entry would look like this:

@article{Author1990,
author = {Author, Bob A},
journal = {Tertiary Research},
number = {4},
pages = {131--137},
title = {{Fossils from the Bracklesham Group exposed in the M27 Motorway excavations, Southampton, Hampshire}},
volume = {12},
year = {1990}
}

Regular citations

Basically, natbib distinguishes two kinds of in-text citation: textual citations, where the author is part of the sentence and only the year goes between parentheses, and parenthetical citations, where both author and year go between parentheses:

  • Textual citation (author outside the parentheses): \citet{Author1990}. Example: Author (1990) suggested that this genus resembles the other genus.
  • Parenthetical citation (author and year between parentheses): \citep{Author1990}. Example: Genus A resembles genus B (Author, 1990).

Taxonomic authorship

The ICZN (Chapter 11, Article 51, p.54) considers two cases: either we want to cite the author of the name but not the year of publication, or we want to cite the author and year of publication in full; in the latter case the author name and the year must be separated by a comma and a space, and there must be no mark or character between the specific epithet and the author. The style of citation also depends on whether the name is used as first proposed by the author, that is, a species combined with its original genus, or whether the specific epithet has been allocated to a genus other than the original. In the first case, the author (or author-and-year) is never put between parentheses (Article 51.2), whereas in the latter, parentheses around the author (or again, author-and-year) are mandatory (Article 51.3). This latter usage confuses people who are unaware of the correct citation of authors in zoological nomenclature and who tend to view these as bibliographical citations, that is, as something always put between parentheses. As is clear from the Code, the parentheses indicate that the specific name was moved from its original genus, that is, someone proposed a new combination with another genus.

The proper ways to get the desired, ICZN-compliant results are as follows:

  • Only the author without year, as originally proposed (sensu Art. 51.1): \citeauthor*{Author1990}. Please note that for multi-authored names the Code requires all the author names to be written in full (i.e., without abbreviating to et al. for three or more authors). For one or two authors there is no difference between \citeauthor{Author1990} and \citeauthor*{Author1990}; both produce the author name without parentheses or year.
  • Both author and year, as originally proposed (sensu Art. 51.2): \citealp{Author1990} and \citealp*{Author1990}. In this case, the author and year are produced without parentheses, hence the “al” in the command. These are identical to \citep{Author1990} and \citep*{Author1990} except that they drop the parentheses around the citation.
  • Only the author without year, for a name that underwent some sort of recombination or allocation to a genus other than the original one (sensu Arts. 51.1 and 51.3): just use \citeauthor{Author1990} or \citeauthor*{Author1990} and type the parentheses around the command by hand. This renders the output of the \citeauthor command between parentheses. Example: (\citeauthor*{Author1990}) will produce (Author) instead of just Author.
  • Both the author and year, recombined or allocated to a genus other than the original one (sensu Art. 51.3): \citealp{Author1990} and \citealp*{Author1990}, again between parentheses. As above, parentheses typed by hand around the command produce the desired output. Example: (\citealp*{Author1990}) will produce (Author, 1990) instead of just Author, 1990.

It is noteworthy that this way you will not need to worry about journals that ask you to cite all of the name authorships in taxonomic works (e.g., Neotropical Ichthyology). With this approach you can produce a document with cross-references and formatted authorship citations without inserting them into the document by hand (which, as far as I know, is impossible in \LaTeX anyway). Once you use any of the natbib commands, the reference is inserted and formatted automatically.

A hypothetical example

The \LaTeX code block below generates the PDF shown after it, with regular citations and taxonomic references that handle the use cases discussed above.

 
\documentclass{article}

\usepackage{natbib} % this package will manage the citations and provides the \citeX commands
\usepackage[utf8]{inputenc} % allows to input with other encodings
\usepackage[colorlinks=true,citecolor=blue]{hyperref} % this colors the references and links
\usepackage[noblocks]{authblk} % for the affil block in authorships

\author{Gustavo A. Ballen}
\affil{Museu de Zoologia da Universidade de São Paulo, gaballench@gmail.com}
\title{On the genus \textit{Sphyraena}}

\begin{document}

\maketitle

This article reviews the nomenclatorial status of \textit{Sphyraena bolcensis} 
\citealp{Agassiz1843}, or alternatively if we want, of \textit{Sphyraena bolcensis} 
\citeauthor{Agassiz1843} without the year. A species originally described in 
\textit{Sphyraena} but currently removed from it by \citet{Woodward1901} is 
\textit{Sphyraenodus speciosus} (\citeauthor{Leidy1877}), or with the year, 
\textit{Sphyraenodus speciosus} (\citealp{Leidy1877}).

These usages are compliant with the relevant articles of the \citet{ICZN1999}. This 
kind of bibliographical reference has been used erroneously in the past, especially 
in pre-ICZN works \citep[e.g.,][p.]{Rapp1946}. In the latter example, 
\textit{Sphyraenodus silovianus} (\citeauthor{Cope1875}) should have been written 
down as \textit{Sphyraenodus silovianus} \citeauthor{Cope1875} instead as this 
species was originally described in \textit{Sphyraenodus} and not 
\textit{Sphyraena}.

\bibliographystyle{apa} % type of citation style to be used, here 'APA'
\bibliography{manuscript}  % the BibTeX file 'manuscript.bib', given without its extension

\end{document}

[Figure: screenshot of the PDF produced by the code above]
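
Every key used in the commands above must exist in manuscript.bib, otherwise the citation is rendered as a question mark. The following is a minimal shell sketch for spotting missing keys before compiling; missing_keys is a hypothetical helper name of my own, and the pattern assumes citations are written on a single line with the natbib-style commands used in this post.

```shell
# Sketch: list citation keys used in a .tex file that are missing from a .bib file.
# missing_keys is a hypothetical helper, not part of natbib or BibTeX.
missing_keys() {
    tex_file=$1
    bib_file=$2
    # 1. extract every \cite...{...} command (optional [..] arguments allowed)
    # 2. keep only the text inside the braces
    # 3. split comma-separated keys, trim spaces, de-duplicate
    grep -o '\\cite[a-z]*\*\{0,1\}\(\[[^]]*]\)*{[^}]*}' "$tex_file" |
        sed 's/.*{\([^}]*\)}/\1/' |
        tr ',' '\n' |
        sed 's/^ *//; s/ *$//' |
        sort -u |
        while read -r key; do
            # a BibTeX entry starts with @type{key,
            grep -q "^@[A-Za-z]*{$key," "$bib_file" || echo "$key"
        done
}

# usage: missing_keys manuscript.tex manuscript.bib
```

The function prints one missing key per line and prints nothing when every key is defined.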

Final comment

As this post shows, we can use \LaTeX to prepare a taxonomic manuscript that complies with the ICZN while taking advantage of the automation of this typesetting system and its huge package ecosystem. Now we need to convince the publishers and scientific societies that edit and publish scientific periodicals to better support (or just support, to begin with) \LaTeX submissions. In the end, most (if not all) manuscript submission systems send the reviewers a PDF, not the Word file submitted by the authors. This would be a further step towards supporting free software, since authors would no longer be restricted to a commercial package such as MS Office.


Assorted tools for interacting with servers

Personal list of commands to remember when running analyses on servers. This post is expected to grow from time to time as new commands/command-combinations are found to be useful.

# login with ssh
ssh -l user ip # e.g., myUser 186.333.444.111

# create a virtual screen
screen -S screenName # use something identifying the analysis in screenName

# back to virtual screen called "screenName"
screen -r screenName

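# list screens
screen -ls

# detach virtual screen keeping the analysis running
# (press Ctrl + a, release, then press d)
Ctrl + a, then d

# kill the job running inside the current screen
# (the screen itself stays open; type exit to close it)
Ctrl + c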

# copy files through ssh from and to the server
# supports the -r recursive flag for directories
scp user@ip:/path/to/files path/to/local/directory # from directory in user at ip to local directory 
scp path/to/local/file user@ip:/destination/directory # from local file to directory in user at ip
scp -r path/to/local/directory user@ip:/destination/directory # whole local directory to server

# monitor processes while they are running based on the filetype and modification time
# (the pattern is quoted so that the shell does not expand it before find sees it)
find . -name "*.log" -ls | grep "date-right-now" # e.g., "Apr  06" or "Apr 21", note the space in the former
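
An alternative to grepping for today's date is to let find filter by modification time itself. This is only a sketch: recent_logs is a hypothetical helper name of my own, and -mmin is an extension offered by both GNU and BSD find rather than a POSIX option.

```shell
# Sketch: list .log files modified within the last N minutes (default 60).
# recent_logs is a hypothetical helper name, not a standard command.
recent_logs() {
    dir=${1:-.}
    minutes=${2:-60}
    # -mmin -N : modified less than N minutes ago
    # -type f  : regular files only
    find "$dir" -name '*.log' -type f -mmin "-$minutes"
}

# usage: recent_logs . 30   # logs touched in the last half hour
```

A log file that stops appearing in this list while its analysis should still be running is a hint that the job has stalled or died.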

Common compilation errors in Beamer and possible solutions (part 1)

This blog post is the first of a series documenting the most common errors I’ve found when preparing presentation slides in beamer while using the knitr package to compile both \LaTeX and R code in the same document, together with the explanations and solutions I found for them. Most of these are, of course, based on a huge pile of Stack Overflow answers, and most were found about a year ago when I was preparing the slides for my quals; unfortunately, I did not save any of those sources at the time. From the date of this post onwards I will link the solutions found elsewhere, which are frequently used only in part or in combination with other sources and tests. As a final note, some of the problems and solutions apply only when authoring slides with R code evaluation, so you are unlikely to run into them unless you use knitr. Please note that the error messages come from Emacs+ESS, so I’m not sure whether you will find them spelled exactly the same in other tools such as RStudio; I’m not even sure that you can compile as compactly in tools other than Emacs. I will try to reproduce some examples as code so that you can compare them to your own. That said, it suffices to point out that Emacs rules!

Caveat. I am testing the \LaTeX code with Emacs+ESS, compiling with the keystrokes M-n r and then M-n P (uppercase) and RET in the buffer of the .Rnw file in order to produce the PDF slides. I haven’t tried other \LaTeX environments but would love to hear about alternative error messages for the same cases highlighted here.

'Missing $ inserted'

Most likely you used, in regular text, a character or symbol reserved for math mode (e.g., the underscore _). Please note that these characters need to be escaped with \ or replaced by their respective reserved word, or enclosed in an inline math “phrase”, for instance when using Greek letters:

Good: $\alpha$-diversity; Bad: \alpha-diversity
Good: filename\_without\_spaces; Bad: filename_without_spaces

Example:

\documentclass[svgnames,mathserif,serif]{beamer}

\title{Awesome beamer presentation}
%\subtitle{}
\author{Gustavo A. Ballen, D.Sc.(c)}
\institute{University of Sao Paulo \\ 
           Ichthyology \\
           \texttt{myEmail@usp.email.com}}
\date{\today}

\begin{document}

\frame{\titlepage}

\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
    \item First item with good use\_of\_subscripts
    \item Second item with bad use_of_subscripts 
  \end{itemize}
\end{frame}

\end{document}

The code above will produce the error:

./mathSymbol.tex:71: Missing $ inserted.
<inserted text> 
                $
l.71 \end{frame}
                
? 

./myFile.tex:71: Emergency stop.
<inserted text> 
                $
l.71 \end{frame}
                
./myFile.tex:71:  ==> Fatal error occurred, no output PDF file produced!
Transcript written on myFile.log.
/usr/bin/texi2dvi: pdflatex exited with bad status, quitting.

Please also note that the error message refers to the .tex file with the line number of the problem, not to the .Rnw file. The error message does not actually tell us that the problem is the underscore, yet it gives a clue about math mode: a $ is inserted, and this character is used to open and close in-line math text (e.g., formulae).

Whenever this error happens, check whether you are using uncommon characters in your text that might be expected to play a reserved role in math mode. For instance:

% Let's assume you have exactly the same code 
% before this point as lines 1-14
\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
    \item First item with good use of $\alpha$-diversity in-line math mode
    \item Second item with bad \alpha-diversity in-line math mode
  \end{itemize}
\end{frame}

will produce the same error as the underscore case, but this time associated with \alpha, the command for inputting the Greek letter alpha. Enclosing the \alpha command between $ delimiters solves the problem.
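
Since the error message does not point at the offending character, a quick search from the command line can help locate candidates. The sketch below only covers the underscore case; unescaped_underscores is a hypothetical helper name of my own, and the check is deliberately crude: it also flags underscores that are perfectly legal, such as those inside math mode, labels, file names, or knitr R chunks.

```shell
# Sketch: print the lines (with line numbers) that contain an underscore
# not preceded by a backslash. Expect false positives in math mode,
# \label{...}, file names and R chunks.
unescaped_underscores() {
    # [^\\]_ : an underscore preceded by any character other than a backslash
    # ^_     : an underscore at the very start of a line
    grep -n -e '[^\\]_' -e '^_' "$1"
}

# usage: unescaped_underscores myFile.tex
```

Running it on the generated .tex file rather than the .Rnw file keeps the reported line numbers comparable with those in the error message.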

Searching for text content into GNU/Linux files (and possibly OSX too)

Machines running GNU/Linux are very powerful tools once we are used to the command line. As an example, let us imagine that we are looking for a file named importantScript.py, a text file containing python code; we forgot where we put it but are sure about its name. The file search tool called catfish can look for files by file name and path on the hard drive. It is similar to the search tools in Windows and OSX, with the possible difference that Finder in OSX can apparently look for keywords inside the files (this property of Finder is according to my wife; I honestly stopped using OSX long, long ago).

But how about keywords inside files? For instance, say we were working on a project with a lot of files written in \LaTeX with the knitr package, so that we can embed R code to be executed during document compilation. In just one of that whole bunch of text files we used a special package but have forgotten how to use it (it was useful, but we have needed it only once so far), and the internet connection is down, so we cannot rely on good ol’ google to look up the usage. For reasons that don’t matter for now (probably ignorance) we also overlook the local documentation. Are we lost? Probably not.

Let us assume we remember a keyword, maybe a package name or some other sort of keyword, for instance a macro (\mymacro in \LaTeX). We can use the command grep with the -r option in order to visit all of the files and subdirectories in the current directory looking for the keyword(s) of interest. First we open an instance of the command line (e.g., bash) and navigate to the directory from which we want to search. Using explicit or relative paths is also possible, but I leave that to the reader for exploration. After calling grep, some output will appear showing file names followed by a colon (:) and some text; this means that the command found our keywords in some of the files in the search path. On the other hand, if the terminal hands control back to us by showing the prompt ($) without printing anything, grep failed to find the keywords inside any of the files and files-within-subdirectories of our current path. For instance:

user@machine:~$ cd ~/Documents/millionsOfScripts
# now let's see its content
# by the way, this directory is completely hypothetical!
user@machine:~/Documents/millionsOfScripts$ ls
block  class  devices   fs          kernel  power
bus    dev    firmware  hypervisor  module
# search recursively the files in each of the directories shown above for 
# the word 'subfigure'
user@machine:~/Documents/millionsOfScripts$ grep -r subfigure
module/project1/scripts4/toCompile/blocks.Rnw:\begin{subfigure}

The output above indicates that our recursive search for the term ‘subfigure’ found a single file called blocks.Rnw in the subdirectory module/project1/scripts4/toCompile, and that this file contains a line with the text \begin{subfigure}. We can then go to that file, open it, and see how we used that tool in the past when working with \LaTeX. Pretty useful when we have just a vague idea of what we are looking for.

I found it useful for exactly that case, when I did not remember how to use the subfigure \LaTeX package and did not want to search in google; I just went to a directory where I had some indication that files using the package could be found (e.g., my folder with beamer presentations or my PhD folder with some of my scholarship report files, all of them in \LaTeX) and then let grep look for them. Hope you find it useful and fun; it could save your life some day (like most bash commands, so learn to love the command line!)

PS: grep can take a very, very long time to execute if your directory has a lot of files and subdirectories, or if the files are large.
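
One way to shorten such searches is to tell grep which files are worth opening. The sketch below restricts the search to .Rnw files and also prints line numbers; search_rnw is a hypothetical helper name of my own, and --include is an option of GNU grep (BSD grep offers it too) rather than part of POSIX.

```shell
# Sketch: recursive search limited to one file type, with line numbers.
# search_rnw is a hypothetical helper name.
search_rnw() {
    keyword=$1
    dir=${2:-.}
    # -r recursive, -n show line numbers,
    # -F treat the keyword as fixed text instead of a regular expression,
    # --include only open files whose name matches the pattern
    grep -rnF --include='*.Rnw' -- "$keyword" "$dir"
}

# usage: search_rnw subfigure ~/Documents/millionsOfScripts
```

The output has the form path:line:matching text, so the line number tells us where to jump once the file is open.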

New species (to science, sometimes actually)

This blog post is about the real meaning of an expression commonly found in technical biodiversity jargon that tends to be misunderstood by people outside the field, commonly journalists and media consumers: the concept of new species.

Despite its literal meaning, what scientists mean by “new” is not what the public expects. I’ve found myself being criticized by media consumers (people outside biology) for claiming that a species is “new”. I remember a specific case when Farlowella yarigui was described (Ballen & Mojica, 2014). This work received a lot of media coverage thanks to the interest of journalists from Unimedios in my work with fishes collected during one of the courses at the Universidad Nacional when I was still an undergrad student. For some reason, that news post was “recycled” by other newspapers in the environment section of their news (e.g., El Espectador) and posted on their social media sites.

Figure 1
Farlowella yarigui, a species from the Magdalena basin in Colombia.

Some of the readers claimed that the species can by no means be new since they had seen that fish before, sometimes since they were young, and a few implied that scientists lie to the public by saying that the species is new! I did not take it as an attack because I acknowledge the ambiguity of saying that a species is “new”; in fact, at least three different cases fall within the concept of “new” to science as opposed to “new” to everyone in the world:

  • A species can be mistakenly identified under the scientific name of a different species. This happens when biologists fail to recognize that a scientific name is being applied erroneously, usually because two species are being confused under the same name. This is analogous to calling the music of both Nirvana and Death “metal”: we are erroneously labelling Nirvana as metal when it is in fact grunge. A good example is the tiger catfish of the Magdalena river basin in Colombia, which used to be called Pseudoplatystoma fasciatum; it is not that species but a different one, which was given the name Pseudoplatystoma magdaleniatum once the differences between the two species were recognized (Buitrago-Suarez & Burr, 2007).

  • A species can be known to be around but still lack a scientific name. In this case, scientists recognize that it is none of the species already named, but further work may be needed before a new name can be applied to it. This happens when the differences among the species of a group are difficult to ascertain, or when more museum specimens are needed in order to be confident that we are not naming a species that already has a name. This is similar to finding something striking and unfamiliar to us, such as when astronomers name a planet that had already been seen but did not yet have a name.

  • A species can be completely unknown to everyone and, consequently, lack a proper scientific name. This happens with rare, cryptic, or difficult-to-find living things, such as fishes from deep waters that even fishermen usually cannot catch; these fishes can spend their whole lives at incredible depths, living and dying without the slightest chance of reaching the surface. In these cases, neither the public nor the scientific community has been in contact with such organisms before. They are, in a sense, really new species.

In all three cases the technical term in taxonomy (the biological field that aims at documenting living things and naming them scientifically) is “new species”, sometimes even species nova in Latin. We need to understand that taxonomic rules formally started during the 18th century with the works of Carolus Linnaeus, and at that time Latin was the mandatory language of the natural sciences. As a consequence, even today we use the translation of the phrase species nova into modern languages such as English, the current standard language of science.

From time to time scientists decide to avoid this ambiguity by stating that the species is not a new one but an undescribed one, or they sidestep the implication altogether by saying that a scientific paper proposes a scientific name for a given species. As an example I can think of Grant et al.’s 2007 paper describing Allobates niputidea, a frog species from Colombia, entitled in fact “A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia”.

That said, a scientific name is actually defined as a special name that we apply under certain clear rules in order to avoid ambiguity among scientists, and whose meaning cannot change among languages, which makes it stable. The advantage of using such a system is that we avoid calling the same thing by several names, sometimes even common names, and that once we use a scientific name our peers know exactly which kind of organism we are referring to. It is in some sense like using our given-and-family name combination (e.g., John Doe) in order to tell ourselves apart from people sharing our given or family name (e.g., John Doe is different from both John Jameson and Evelyn Doe); however, even such a combination is subject to homonyms (different people sharing exactly the same name), something that biological nomenclature tries to avoid through its rules.

This ambiguity suggests at least two possible solutions: to avoid “new species” in article titles in favor of “undescribed species”, or to undertake the titanic labour (given that there are many more people out there who are not part of the scientific community; the two groups actually differ by orders of magnitude) of explaining to the public what we mean by new species. This blog post is an effort in the latter direction, as the term is so widespread in the scientific literature that it might not be practical to abandon it at all. A second benefit of the latter alternative is that it forces us scientists to interact with people outside our field, which helps make science accessible to the public.

References

Ballen, G. A., & Mojica, J. I. (2014). A new trans-Andean Stick Catfish of the genus Farlowella Eigenmann & Eigenmann, 1889 (Siluriformes: Loricariidae) with the first record of the genus for the río Magdalena Basin in Colombia. Zootaxa, 3765(2), 134-142.

Buitrago-Suarez, U. A., & Burr, B. M. (2007). Taxonomy of the catfish genus Pseudoplatystoma Bleeker (Siluriformes: Pimelodidae) with recognition of eight species. Zootaxa, 1512(1), 1-38.

Grant, T., Acosta, A., & Rada, M. (2007). A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia. Copeia, 2007(4), 844-854.

Checking .xml Beast2 files for errors

Note: This entry assumes that you are using a Unix-like operating system, ideally GNU/Linux and alternatively OSX. Not tested on BSD or others.

I’ve been using Beast2 for a while as the Bayesian platform for a study on the effect of calibration priors, using a Neotropical fish group as study model because of its reasonable fossil record and availability of DNA sequences. The study requires at least 18 different analyses to be run, with runtimes varying from one to four days in order to reach convergence [1], so it can be described as computationally intensive. As I had access to a server with a good number of processors I could “parallelize” the analysis by running several instances of beast2 at the same time, each with several threads. However, even with this approach I had to limit the analyses to batches of four to five at a time so that I did not use all of the resources, because others were also running their own analyses on the server.

From time to time I noticed that a given analysis failed without producing any useful signal that I could use for catching the problem in time, so I had to wait until the end of the batch to discover that a given analysis did not even run at all because of problems in the definition of some priors. Until then I didn’t know about the -validate option of Beast2, which checks that most of the xml file definition is OK so that the analysis will run. Put plainly, this option attempts to parse a given xml file and reports any error, exiting without carrying out the analysis. Its usage is as follows:

beast -validate myFile.xml

where myFile.xml is the name of the xml file to be tested. However, what if we wanted to check a bunch of files automatically? Here we can use the command-line options of Beast2 along with bash control structures in order to iterate over each file with some conditions (e.g., the file extension, or the file name) and then check whether it is a valid xml file that will be run by Beast2 without problem. This approach assumes that you have all of your xml files in the same directory (but alternative versions can avoid this requirement and traverse recursively path structures, that is, navigate sub-directories) for a simple case.

We will use six main software pieces: for, ls, beast, 2>&1, |, and grep:

  • for is an iterative control structure available in the shell of any Unix-like system (e.g., GNU/Linux or OSX operating systems).

  • ls will show a list of file names composed of any text terminating in .xml (*.xml reads any text and then dot, x, m, and l to the end of the name).

  • beast is the program that we will use for Bayesian inference of divergence time (not covered in this post though).

  • 2>&1 redirects the error stream (stderr, file descriptor 2) to the standard output (stdout, file descriptor 1), so that everything the program prints to the console can be passed as input to another command.

  • | is the “pipe” operator; it passes the output of the command on its left as input to the command on its right. It reads “evaluate the code to the left of |, take the result and feed it to the command to the right of |”. In our case, it will take the text merged by 2>&1 and use it as the input of grep (through its standard input, taking the place of the file name that would otherwise be its last argument).

  • grep is a powerful command-line tool for searching either simple text or regular expressions (not covered in this post either), in this case in the content redirected by 2>&1 from beast -validate to grep.
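
The effect of 2>&1 is easier to see with a toy command than with Beast2 itself. In the sketch below, noisy is a hypothetical stand-in that prints one line to each stream; which stream Beast2 really uses for its messages is not something I checked here, which is precisely why merging both is the safe choice.

```shell
# Sketch: a stand-in command that prints one line to stdout and one to stderr.
# noisy is a hypothetical function used only for this demonstration.
noisy() {
    echo "Info: written to stdout"
    echo "Error: written to stderr" >&2
}

# Without 2>&1 only stdout enters the pipe, so grep never sees the error line.
# (stderr is discarded with 2>/dev/null just to keep the demo quiet, and
# "|| true" keeps a non-matching grep from counting as a failure.)
without_redirect=$(noisy 2>/dev/null | grep "Error" || true)

# With 2>&1 stderr is merged into stdout before the pipe, so grep can match it.
with_redirect=$(noisy 2>&1 | grep "Error" || true)

echo "without 2>&1: '$without_redirect'"
echo "with 2>&1: '$with_redirect'"
```

The first variable ends up empty and the second one holds the error line, which is the behaviour the loop in this post relies on.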

Basically our code will pick one xml file at a time in a for loop, test with beast -validate whether it is a correct xml file, and then redirect the text on the console as the input for grep to look for specific text indicating that something is wrong with our files. The code is then:

# one-liner
for i in `ls *.xml`; do beast -validate $i 2>&1 | grep "Error"; done

# several-lines version
for i in `ls *.xml`
    do beast -validate $i 2>&1 | grep "Error"
done

The expected output in the presence of corrupted xml files will be text containing the word Error printed to the console, indicating that a given xml file contained errors. Otherwise, our code will finish silently, indicating that we can run our analyses. This quality-control step is crucial for cases where you are submitting processes to a cluster or a server and, for whatever reason, need to wait until completion of the analyses to see their outcome.
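
A variant of the loop that also says which file failed can be sketched as follows. It iterates over the *.xml pattern directly instead of over the output of ls, which keeps file names with spaces intact. validate_all is a hypothetical helper name of my own, and the function assumes, as above, that beast is on the PATH and that a failed validation prints a line containing the word Error.

```shell
# Sketch: validate every .xml file in a directory and name the files that fail.
# validate_all is a hypothetical helper; it assumes `beast` is on the PATH and
# that a failed validation prints a line containing "Error".
validate_all() {
    dir=${1:-.}
    for f in "$dir"/*.xml; do
        # skip the unexpanded pattern when the directory has no .xml files
        [ -e "$f" ] || continue
        # grep -q prints nothing; only its exit status is used
        if beast -validate "$f" 2>&1 | grep -q "Error"; then
            echo "FAILED: $f"
        fi
    done
}

# usage: validate_all /path/to/xml/files
```

As with the original loop, silence means that no file produced a line containing Error.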


  1. Assessed primarily from the ESS value but also comparing different runs of the same analysis, and also by examining the behavior of the estimation along the analysis so that no pattern is detected.