First release of the ‘bayestools’ R package

After some time developing functions for my own research on using fossil information as a source of priors for divergence time estimation (in prep.), I decided to release the first version of an R package with functions that I have found useful for pre- and post-processing of data in evolutionary analyses using Bayesian methods. It has first been released to its development repository on GitHub (https://github.com/gaballench/bayestools/), and I expect to implement further tools, so this release should be understood as containing a minimal set of functions.

I expect to continue active development and eventually submit the package to CRAN, but until then you can install the development version with the install_github function from devtools:

# install devtools if needed
install.packages("devtools")

# load devtools to access the install_github function
library(devtools)

# install bayestools from the github repository
install_github(repo = "gaballench/bayestools")

What can the package do?

bayestools provides two main sets of functions: pre-processing and post-processing tools. In the first set we have functions that allow us to specify priors and to plot probability density functions from Beast2 parameters in order to check their correctness visually. In the second group we have a tool for measuring interdependence between empirical densities, along with some ways to visualize it.

The main place in evolutionary biology where the features of bayestools belong is divergence time estimation: the incorporation of paleontological and geological information into the estimation of the time component of phylogenies and diversification analyses. Here, the tools in the package help implement information from these fields in Bayesian analyses, mainly through the specification of priors, the most conspicuous feature of Bayesian inference. However, prior specification is a very difficult task with few if any rules to follow, surrounded by plenty of misunderstanding and some clearly unjustified practices.

Priors

Prior specification depends strongly on what we are trying to use in order to calibrate a given phylogeny. For instance, node calibration and tip calibration work in different ways, and prior specification will not only depend on this but also on the nature of the information we are trying to use as priors.

In the most common case, we have a fossil taxon (or occurrence) that could inform us about the age of a given node (node calibration). However, we lack exact information on two things: when the organism lived (measured without error), and when a given diversification event took place (i.e., the exact age of a node). What we have is a fossil whose age has been inferred or measured with error or uncertainty. It is this uncertainty that we expect to model through priors (or at least what should be done in practice). A prior is in itself a probability density function (PDF hereafter), that is, a mathematical function describing the probability that a variable takes a value within a given interval. More or less precise statements can be made with the aid of PDFs, such as that a variable X is more likely to take a value between a and b than between c and d, or what the expected value of the variable of interest is.
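As a minimal illustration in base R of the kind of statement a PDF supports (the lognormal parameters below are arbitrary, chosen only for this example):

# probability that a lognormally-distributed age falls in (1, 5) versus (5, 10);
# comparing the two values tells us which interval is more probable a priori
plnorm(5, meanlog = 1.7, sdlog = 0.3) - plnorm(1, meanlog = 1.7, sdlog = 0.3)
plnorm(10, meanlog = 1.7, sdlog = 0.3) - plnorm(5, meanlog = 1.7, sdlog = 0.3)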

Now, what we have is age information such as the following instances:

  • The fossil comes from a volcanic layer that has been dated with radiometric techniques with a specific value ± uncertainty, for instance, 40.6 ± 0.5 Ma, where 0.5 is the standard deviation σ.

  • The fossil comes from a sedimentary unit that is said to be of Miocene age (5.33 to 23.03 Ma).

  • The fossil comes from a layer bracketed by two levels with volcanic ash that have been dated as 10.2 ± 0.2 and 12.4 ± 0.4 Ma, respectively. The age of the fossil itself is thus determined by interpolation between the layers that carry the actual age information.

How can we convert this information into legitimate priors?

The findParams function

Given that we have a prior on the age of a fossil of 1–10 Ma and that we want to model it with a lognormal distribution, find the parameters of the PDF that best reflect the uncertainty in question (i.e., the parameters for which the observed quantiles are 1, 5.5, and 10, assuming that we want the midpoint to reflect the median of the PDF):

bayestools::findParams(q = c(1, 5.5, 10),
          p = c(0.025,  0.50, 0.975),
          output = "complete",
          pdfunction = "plnorm",
          params = c("meanlog", "sdlog"))
## $par
## [1] 1.704744 0.305104
## 
## $value
## [1] 0.0006250003
## 
## $counts
## function gradient 
##      101       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
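Here $par contains the fitted meanlog and sdlog, and $value is the residual sum of squares left by the optimizer. As a quick sanity check with base R (assuming, as the $value of 0.000625 ≈ 0.025² suggests, that findParams minimizes the squared differences between target and fitted cumulative probabilities):

# cumulative probabilities at the target quantiles under the fitted parameters
plnorm(q = c(1, 5.5, 10), meanlog = 1.704744, sdlog = 0.305104)
# returns approximately 0.000, 0.500, and 0.975: the median and the upper
# quantile are matched almost exactly, while the lower tail carries
# essentially all of the residual error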

Now suppose we have a published study that specified a lognormal prior (with or without justification, but that's another question) and we want to plot it outside of BEAUti, for instance in order to assess the sensitivity of the posterior to the prior. How can we plot it in R and use its data as any other statistical density?

The lognormalBeast function

Generate a matrix for the lognormal density with mean 1 and standard deviation 1, with mean in real space, and spanning values in x from 0 to 10, and then plot it:

lnvals <- bayestools::lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, from = 0, to = 10)
plot(lnvals, type = "l", lwd = 3)

[Figure: lognormal density plotted from the lognormalBeast output]
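As a cross-check, the same curve can be overlaid using dlnorm from base R. My understanding of BEAST's parameterization (an assumption worth verifying) is that with the mean in real space, meanlog = log(M) - S^2/2:

# overlay the equivalent base-R density on the plot above (dashed red line);
# meanlog = log(M) - S^2/2 is assumed from BEAST's real-space convention
curve(dlnorm(x, meanlog = log(1) - 1^2/2, sdlog = 1),
      from = 0, to = 10, add = TRUE, lty = 2, col = "red")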

Sensitivity

One hot topic in model and tool comparisons is the sensitivity of the posterior to the prior, that is, how much our results may be determined by the prior. For instance, if the posterior depends entirely on the prior, the likelihood (i.e., the data) is not providing information, and this is very risky because it opens the door to manipulating the results of Bayesian analyses.

The measureSensit function

Measure and plot the sensitivity between two partially overlapping distributions:

set.seed(1985)
colors <- c("red", "blue", "lightgray")
below <- bayestools::measureSensit(d1 = rnorm(1000000, mean = 3, 1),
                       d2 = rnorm(1000000, mean = 0, 1),
                       main = "Partial dependence",
                       colors = colors)
legend(x = "topright", legend = round(below, digits = 2))

[Figure: overlapping densities plotted by measureSensit]

The number in the legend indicates an overlap of 0.13, which serves as a measure of the interdependence between the densities (Ballen in prep.).
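If we assume the statistic is the area shared by the two density curves (the integral of their pointwise minimum), the value above can be reproduced analytically for these two unit-variance normals centered 3 units apart:

# N(0, 1) and N(3, 1) cross at their midpoint, 1.5, so the shared area
# is twice the tail probability beyond that point
2 * pnorm(-1.5)
## [1] 0.1336144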

Geochronology

In the third case of possible age estimates for our calibrations, can we pool a number of age estimates from several samples in order to build a general uncertainty for the interval?

Do the age estimates for the boundaries of the Honda Group (i.e., samples at meters 56.4 and 675.0) conform to the isochron hypothesis?

data(laventa, package = "bayestools")
hondaIndex <- which(laventa$elevation == 56.4 | laventa$elevation == 675.0) 
bayestools::mswd.test(age = laventa$age[hondaIndex], sd = laventa$one_sigma[hondaIndex])

The p-value is smaller than the nominal alpha of 0.05, so we can reject the null hypothesis of isochron conditions.

Do the age estimates for the samples JG-R 88-2 and JG-R 89-2 conform to the isochron hypothesis?

twoLevelsIndex <- which(laventa$sample == "JG-R 89-2" | laventa$sample == "JG-R 88-2")
dataset <- laventa[twoLevelsIndex, ]
# Remove the values 21 and 23 because of their abnormally large standard deviations
bayestools::mswd.test(age = dataset$age[c(-21, -23)], sd = dataset$one_sigma[c(-21, -23)])

The p-value is larger than the nominal alpha of 0.05, so we cannot reject the null hypothesis of isochron conditions.
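For reference, here is a minimal sketch of the statistic behind mswd.test, assuming it follows the standard mean square of weighted deviates used in geochronology (e.g., Wendt & Carl, 1991); the actual implementation in bayestools may differ in its details:

# MSWD: scatter of the ages around their error-weighted mean, relative to
# the scatter expected from the analytical uncertainties alone
mswd <- function(age, sd) {
    w <- 1 / sd^2                   # inverse-variance weights
    wmean <- sum(w * age) / sum(w)  # weighted mean age
    sum(w * (age - wmean)^2) / (length(age) - 1)
}
# values near 1 are consistent with a single isochron age; values much larger
# than 1 indicate excess scatter, leading the test to reject isochron conditions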


R package development in Emacs with ESS

This post is partly a reference for my future self, as I only develop packages from time to time and tend to forget things easily; it is also a quick reference, since several of these topics are covered in varying detail in Hadley Wickham's "R Packages: Organize, Test, Document, and Share Your Code". The following questions are answered with code below, with comments wherever appropriate. The two most important tools we will need are rmarkdown and devtools. I assume that important dependencies such as knitr or pandoc are already installed, so their installation will not be covered here (Google will surely point you to the proper answer).

Why do I write this if Wickham's book is already a reference? Because I don't like/use RStudio (reasons will not be discussed here) but instead use Emacs+ESS, and information on the workflow under that beautiful platform in the specific context of R package development is scarce. To mimic RStudio's behavior, the secret is simply to run devtools and rmarkdown functions from R's command line; that's it.

Before starting, install and load the packages of interest:

# install the packages of interest
install.packages("rmarkdown")
install.packages("devtools")

# once installed, load the packages
library(rmarkdown)
library(devtools)

How do I create a package from scratch?

The code below will create the layout of a package with the minimum you will need to care about for a very simple package: the documentation and code directories, and the description files that you will need to fill in with your own information.

create("myPackageName")
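A quick way to confirm what was generated (the exact set of files varies between devtools versions, so take the listing in the comments as an approximation):

# inspect the newly created package directory; expect roughly a DESCRIPTION
# file, a NAMESPACE file, and an R/ directory for your function scripts
list.files("myPackageName")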

I have already coded a function, how do I include it in my package?

Simply copy it into the R/ folder in a file with extension .R so that R knows it contains code.

How do I document my function?

Take a look at the help file for any function and see what information you usually find there: a description of what the function does, which arguments it has, and what kinds of values are allowed for each argument are a good place to start. Take a look at Karl Broman's material on writing documentation for further details and tag options.

Basically, documentation with roxygen2 lives as text inside the script that contains the function. It uses special comments (#') that are skipped when R loads the code but are used by roxygen2 to build the documentation files. The advantage is that in the very same file you put your code, describe it, and document its usage. roxygen2 uses special tags starting with @ that allow the user to fill in the several sections of a documentation file. A simple example would be:

#' A function for adding up two numbers
#'
#' @usage addTwo(x, y)
#'
#' @param x The first number to add
#'
#' @param y The second number to add
#'
#' @return A numeric vector with the value of the sum x + y
#'
#' @examples
#' # Adding up two positive numbers
#' addTwo(x = 4, y = 6)
#' # Adding up one positive and one negative numbers
#' addTwo(x = -2, y = 7)
#'
addTwo <- function(x, y) {
    output <- x + y
    return(output)
}

The code above contains the code for our function addTwo along with a bunch of lines starting with #'; these make up the documentation. I would save this in a file named after the function: addTwo.R.

I managed to document my functions, how do I build the documentation?

Open a console with the working directory pointing to the package directory, maybe navigating to such folder with Dired (C-x d) and then opening an R session there (M-x R). After loading devtools you can use the function document to build the documentation files:

# check that we are where we need to
> getwd()
"myPackage"
> document()
Updating myPackage documentation
Loading myPackage

The function document will produce a file with extension .Rd in the man/ directory, corresponding to the documentation of our function. We cannot access the documentation as with regular functions because the package is not formally installed, so the documentation files cannot be found by R. Instead, we can preview them from the command line by typing ? and pressing Enter; R will prompt with a + sign for completing the command, and then we can write the function name. It looks like the following when called from the command line:

> ?
+ addTwo

addTwo                    package:myPackage                  R Documentation

A function for adding up two numbers


Usage:

     addTwo(x, y)

Arguments:

       x: The first number to add

       y: The second number to add

Value:

     A numeric vector with the value of the sum x + y

Examples:

     # Adding up two positive numbers
     addTwo(x = 4, y = 6)
     # Adding up one positive and one negative numbers
     addTwo(x = -2, y = 7)

I’ve already documented all of my functions, how can I build a pdf with all the documentation as in famous packages?

Open a shell (e.g., M-x eshell), and run:

# If eshell is inside the package directory...
R CMD Rd2pdf ../myPackage
# If eshell is just above the package directory...
R CMD Rd2pdf myPackage

Hmm ... looks like a package
Converting Rd files to LaTeX 
Creating pdf output from LaTeX ...

This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
...
Output written on Rd2.pdf (8 pages, 106389 bytes).
Transcript written on Rd2.log.
Saving output to ‘myPackage.pdf’ ...
Done

This will generate a nicely formatted PDF with all your documentation in one place.

I’ve changed some code, how do I see the result interactively?

You need to re-load your package in order to have access to these changes, and you need to be in the package main directory in order to use the function load_all. Do not forget to save all changes in all code files before re-loading:

# check that we are where we need to
> getwd()
"myPackage"
> load_all()
Loading myPackage

It is a good idea to re-build the documentation as well, just to be sure that everything is up to date (see above).
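The typical cycle from the package root therefore looks like this:

# edit code and roxygen comments, then, from the package root:
document()   # refresh the man/ pages
load_all()   # reload the code into the current session
# test interactively, and repeat as needed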

I’ve updated the README.Rmd file, how do I apply the changes to the .md and .html files?

These three files are especially relevant if you host your package development repository on GitHub, as the markdown files are already optimized to look pretty under GitHub's flavor of markdown. Always modify the R markdown file (the one with the .Rmd extension) and then use the rmarkdown package to convert it to .md and .html:

> render("README.Rmd)

processing file: README.Rmd
  |........                                                         |  12%
  ordinary text without R code

  |................                                                 |  25%
label: unnamed-chunk-1 (with options) 
List of 1
 $ echo: logi FALSE

  |........................                                         |  38%
  ordinary text without R code

  |................................                                 |  50%
label: unnamed-chunk-2 (with options) 
List of 1
 $ eval: logi FALSE

  |.........................................                        |  62%
  ordinary text without R code

  |.................................................                |  75%
label: unnamed-chunk-3 (with options) 
List of 1
 $ echo: logi FALSE

  |.........................................................        |  88%
  ordinary text without R code

  |.................................................................| 100%
label: unnamed-chunk-4 (with options) 
List of 2
 $ eval   : logi TRUE
 $ warning: logi FALSE


output file: README.knit.md

/usr/bin/pandoc +RTS -K512m -RTS README.utf8.md --to markdown_github-ascii_identifiers --from markdown+autolink_bare_uris+tex_math_single_backslash --output README.md --standalone --template /.../R/x86_64-pc-linux-gnu-library/3.4/rmarkdown/rmarkdown/templates/github_document/resources/default.md
...
Preview created: README.html

Output created: README.md

The last two lines of the output above tell us that everything was successful: the .md and .html files were created.

First paper in 2018!

Today I'm pleased to announce that our manuscript "Fishes of the Cusiana River (Meta River basin, Colombia), with an identification key to its species" (Urbano-Bonilla et al., 2018) was finally published after a long road to completion. The main goal of the paper is to document the freshwater fish fauna of the Cusiana River sub-basin, a tributary of the Meta basin in the Colombian Orinoco drainage. Checklists tend to be mere listings of scientific names documenting the presence of species in an area of interest; as such, they are valuable as large-scale record collections but of limited use as sources of data for further research in fields such as biogeography, climate change, or macroecology. We devised this project instead as a proposal for a novel style of checklist study, in which additional information, diversity estimation, and identification keys were prepared for the region of study.

First of all, our checklist differs from others in its attitude towards open data. We accomplished this by providing open access to the raw data (i.e., the specific museum specimens that document the presence of a given species in our area of study), including scientific name, coordinates, abundance, elevation, and other relevant aspects of the biological records. We uploaded our dataset to the SiB Colombia data repository (http://doi.org/10.15472/er3svl), which allows anyone to replicate all of our claims exactly. The data can also be used for future research, even for uses that we did not envision in the first place! This is the magic of open access to data and reproducible research.

Second, we wanted to provide an identification key to the species of freshwater fishes that we documented for this drainage as an aid for the many biologists and other professionals who face the problem of identifying fish samples from this area. I should note that the Colombian Orinoco drainage is an area with very active extractive industries such as mining and especially oil; as a consequence, environmental impact studies are routinely carried out during several phases of oil extraction, as required by Colombian environmental legislation. Given the high diversity of this area, an identification key can improve the quality of such studies, resulting in better actions for environmentally responsible industrial activities. This may sound idealistic, and I'd really like to believe that our tools will have the expected impact; but even if environmental impact studies remain of generally poor quality, at least the lack of taxonomic tools will no longer be an excuse.

Third, we presented a gross estimate of species richness for the area. Even though we used methods developed for estimating species richness under standardized sampling, and some model assumptions are surely not met (e.g., equal sampling effort across units of study), this approach has been used for richness estimation in other non-standard scenarios, such as continental-scale estimates based on literature data (Reis et al., 2016). Uneven sampling tends to overestimate rarity and therefore inflate the uncertainty around estimated richness (e.g., by inflating the number of species considered rare merely because of poor sampling); but provided that there is no consistent pattern of sampling bias, our estimates should be adequate in general. In addition, we used an integrative approach based on Hill numbers that allows rarefaction and extrapolation to be carried out under the same inferential model (Chao et al., 2014). Armed with this, we can form an idea of the total species richness in our study area and compare it to the estimated richness of other areas; this is far better than comparing raw observed richness when assessing the conservation potential of river basins based on their fish faunas.

One of the most striking results is a strong elevational sampling bias, with sampling effort inversely proportional to elevation. This is in fact a general feature of our knowledge of river basins that drain the Andes. Since species richness is also inversely proportional to elevation, we have a poor characterization of the species-poor portion of the basin. This has little effect on overall richness estimation, but it raises important concerns about our ability to develop management strategies for the Andean portions of the basin, which are in general more fragile than the lowlands. It calls the attention of the ichthyological community to the need to better study the Andean portions of these basins, as we currently know very little about their species-poor yet highly endemic faunas.

We hope to call attention to the several additional aspects that checklists can include in order to serve better as biodiversity documentation efforts, and look forward to seeing other regional checklists that go beyond species lists and explore other aspects of biodiversity. The potential for novel and interesting research with a broader impact in conservation is large.

Literature cited

Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K., & Ellison, A. M. (2014). Rarefaction and extrapolation with Hill numbers: A framework for sampling and estimation in species diversity studies. Ecological Monographs, 84(1), 45–67. http://doi.org/10.1890/13-0133.1

Reis, R. E., Albert, J. S., Di Dario, F., Mincarone, M. M., Petry, P., & Rocha, L. A. (2016). Fish biodiversity and conservation in South America. Journal of Fish Biology, 89, 12–47. http://doi.org/10.1111/jfb.13016

Urbano-Bonilla, A., Ballen, G. A., Herrera-R, G. A., Zamudio, J., Herrera-Collazos, E. E., DoNascimiento, C., … Maldonado-Ocampo, J. A. (2018). Fishes of the Cusiana River (Meta River basin, Colombia), with an identification key to its species. ZooKeys, 733, 65–97. http://doi.org/10.3897/zookeys.733.20159

Short introduction to writing the thesis in LaTeX with Emacs

This post aims at a smaller audience than previous ones: it is a short tutorial on generating a thesis with \LaTeX using Emacs. It assumes a document with a multi-file structure, where chapters live in individual .tex files and a master document includes the content from the other files. Before we begin, some Emacs idiosyncrasies:

  • C means to press the Control key; and C-f means to press the Control key along with the ‘f’ key (without releasing Control until you press ‘f’).
  • M means to press the Alt (once known as Meta, hence M) key; M-a consequently will be to press Alt and the ‘a’ key at the same time.
  • Uppercase letters are generated in the usual way with the Shift key, so that C-A means to press the Control key AND the Shift key AND the 'a' key; when released they will run the desired command.
  • C-M-w will mean the good ol’ MS Windows key combination Ctrl+Alt+w, for instance.
  • A command such as C-c C-c can be generated by keeping Control pressed and pressing 'c' twice.

Ever considered writing your thesis in \LaTeX? There is actually a whole bunch of resources, and you can always check whether your university supports/encourages the use of \LaTeX. As a starting point, take a look at these templates from the IME @ USP, the Universidad Nacional de Colombia LaTeX template, or the template from the University of Bristol hosted at Overleaf. I'm also preparing a thesis template for the Museu de Zoologia that I hope to launch on GitHub soon.

An important aspect to keep in mind is that working with and without a bibliography are very different things: setting up the bibliography makes the document considerably more complex, but once it is in place, typesetting of references and in-text citations is automatic and essentially error-free.

On the main file

Almost all of the thesis templates in the wild have this multi-file structure, and there is a good reason for it: LaTeX documents tend to grow in size and complexity pretty quickly, so it's better to keep things separate during document preparation. As a consequence, we must centralize compilation somehow so that we don't have to compile and paste together a bunch of independent files. That is where the master document comes in. This file contains the preamble, that is, the definition of the document type (with the documentclass command), all the package declarations, new commands, and user-defined code that does not come directly from the packages. It may look something like this:

\documentclass[12pt,twoside,a4paper]{book}

% ---------------------------------------------------------------------------- %
% Packages
\usepackage[T1]{fontenc}
\usepackage{amsmath}                    % the AMS equations package
\usepackage[USenglish]{babel}
\usepackage[utf8]{inputenc}
\usepackage[pdftex]{graphicx}           % we use pdf/png files as figures
\usepackage{setspace}                   % flexible line spacing
\usepackage{indentfirst}                % indent the first paragraph
\usepackage{makeidx}                    % index
\usepackage[nottoc]{tocbibind}          % add the bibliography/index/contents to the Table of Contents
\usepackage{courier}                    % use Adobe Courier instead of Computer Modern Typewriter
\usepackage{type1cm}                    % truly scalable fonts
\usepackage{listings}                   % for formatting source code (e.g., in Java)
\usepackage{titletoc}
\usepackage{longtable}                  % long tables and across-page table layout
\usepackage{array}
\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}} % create a new command for a centered width-defined column in a longtable environment

\usepackage{enumitem} % nested lists with more control on labels
%\usepackage[bf,small,compact]{titlesec} % section headings: smaller and more compact
\usepackage[fixlanguage]{babelbib}
\usepackage[font=small,format=plain,labelfont=bf,up,textfont=it,up]{caption}
\usepackage[usenames,svgnames,dvipsnames]{xcolor}
\usepackage{tikz}                       % drawing stuff such as the bibliographic card
\usepackage[a4paper,top=2.54cm,bottom=2.0cm,left=2.0cm,right=2.54cm]{geometry} % margins
%\usepackage[pdftex,plainpages=false,pdfpagelabels,pagebackref,colorlinks=true,citecolor=black,linkcolor=black,urlcolor=black,filecolor=black,bookmarksopen=true]{hyperref} % links in black
\usepackage[pdftex,plainpages=false,pdfpagelabels,pagebackref,colorlinks=true,citecolor=DarkGreen,linkcolor=NavyBlue,urlcolor=DarkRed,filecolor=green,bookmarksopen=true,breaklinks=true]{hyperref} % colored links
\usepackage[all]{hypcap}                    % fixes the problem with hyperref and chapters
\usepackage[round,sort,nonamebreak]{natbib} % textual citations (plainnat-ime.bst)
%\usepackage{breakcites}
%\bibpunct{(}{)}{;}{a}{\hspace{-0.7ex},}{,} % citation style; see some examples at http://merkel.zoneo.net/Latex/natbib.php ORIGINAL. CAUSES A SMALL SPACE BETWEEN AUTHORS AND THE COMMA SEPARATING THESE FROM YEARS, AWFUL.
\bibpunct{(}{)}{;}{a}{,}{,} % citation style; see some examples at http://merkel.zoneo.net/Latex/natbib.php EDITED BY GAB. THE NICE COMMA AFTER AUTHORS AS EXPECTED.

\fontsize{60}{62}\usefont{OT1}{cmr}{m}{n}{\selectfont}

\let\proglang=\textsf

Above is an excerpt from my own thesis main file, heavily modified from the IME template. No other file in our setup has a preamble, which means that compilation must be centralized in this file. Assuming we don't have a bibliography (yet), it suffices to open this file and compile it with C-c C-c, then pick LaTeX from the list in Emacs' minibuffer (the tiny line below the window that spits out some text when triggered by keystrokes such as our compilation C-c C-c). Once there, we tell Emacs that we want to compile by moving up and down with the arrow keys until we find the LaTeX entry (see the awful white arrow I included) and then pressing Enter. This should compile the whole thesis and produce a nice PDF with the end product.

[Figure: choosing LaTeX from the minibuffer list (compilation.png)]

On chapter files

Why does Emacs compile the whole thesis even though the text of the other chapters does not reside in the main file? Your main document has commands called \input or \include{}, generally located toward the end of the main file. These tell the main document to visit the independent files and compile them along with the whole thing:

% ---------------------------------------------------------------------------- %
% Chapters, all ending with .tex for the \input items
\mainmatter

% header for the pages of all chapters
\fancyhead[RE,LO]{\thesection}

\singlespacing              % single spacing
%\onehalfspacing            % one-and-a-half spacing

\input chapIntro            % Introduction to the whole thesis, called chapIntro.tex
\input chapTerminology      % Chapter on Spine terminology, called chapTerminology.tex
\input chapFossilFishes     % Chapter on the fossil fishes from the Ware and Sincelejo Fms. called chapFossilFishes.tex
\input chapFour             % Chapter four called chapFour.tex

% header for the appendices
\renewcommand{\chaptermark}[1]{\markboth{\MakeUppercase{\appendixname\ \thechapter}} {\MakeUppercase{#1}} }
\fancyhead[RE,LO]{}

\appendix

\include{appTerminology} % appendix for the chapter on terminology
\include{appMaterialExamined}      % material examined
\include{appCode}        % computer code for something interesting

Now suppose you are ready to put your hands on the first chapter of your thesis. In general, these files lack a preamble, so we should not see any usepackage or renewcommand or the like in them; instead, we usually start with \chapter{Name of my thesis chapter} and then commands for individual sections such as \section{Introduction} or \section{Materials and Methods}. As an example, this is an excerpt of one of my chapter files:

%% ------------------------------------------------------------------------- %%
\chapter{A Standardized Terminology of Spines in Neotropical Siluriformes}
\label{chap:chapTerminology}

\section{Abstract}
\label{sec:abstract}
Some text in the abstract

%% ------------------------------------------------------------------------- %%
\section{Introduction}
\label{sec:introduction}

The order Siluriformes is one of the most important components of Neotropical freshwaters with more than XXXX species currently described and bla bla bla...

Given the confusion in anatomical terms historically applied to spines and its ornaments, herein I propose a new standard terminological system to avoid future confusions. Each term is accompanied by its conditions as coded from the material examined, and an attempt at synonymy along the numerous references examined. A quantitative approach at standardization is herein proposed for picking the optimal system from among the already proposed terms...

Please note the two commands shown above for defining the title of the chapter and the section headers (as in any research article). Please note too that there is no preamble, and therefore these files are NOT TO BE COMPILED INDIVIDUALLY. As a test, run C-c C-c LaTeX on any of them and you will end up with an error, because compilation must be carried out on the main file. Such an error is shown below; note what appears in the minibuffer.

[Figure: compilation error on a chapter file, as shown in the minibuffer (errorCompilChap)]

Take-home message: Don't compile (C-c C-c) on chapter files; always do it on the main text file.

What if we have references? One bibliography to rule them all

Several LaTeX packages deal with bibliographies. The one I use is BibTeX, a quite old system with some disgusting behaviors I will mention below. However, it gets the work done, and it is what I've learned to use so far.

In order to use BibTeX we need a .bib file, that is, a file that houses our references in a specific format. I generate such files with Mendeley Desktop and edit them with a Bash script that italicizes scientific names in paper titles, something that Mendeley is quite bad at. Once we have such a file, it suffices to declare the type of bibliography we want along with the bibliographic style (e.g., APA). Note that these are managed by the natbib package in the preamble (see above), so if we called the package there, we are ready to add the following lines wherever we find it appropriate to place the bibliography:

\bibliographystyle{palelec} % using the citation style of palaeontologica electronica
\bibliography{/home/user/Documents/Thesis/Bibtex/PhDThesis}  % the path to the '.bib' file

The behavior of the compilation depends on whether we have a single bibliography for the whole thesis or one bibliography per chapter. Let's assume first that we want a single general bibliography; in that case we first compile the bibliography with C-c C-c, looking for BibTeX instead of LaTeX with the up and down arrows:

[Figure: choosing BibTeX from the minibuffer list (bibtex)]

For reasons I honestly don't know, you must compile once with BibTeX and then twice with LaTeX (C-c C-c and looking for LaTeX with the arrow keys). You may need to repeat these three steps if references were missing, until Emacs tells you that the document was successfully formatted, with the number of pages:

[Figure: successful compilation message with page count (success)]

If so, you will end up with a nice bibliography at the end of the PDF (or wherever you put it):

[Figure: the formatted bibliography in the resulting PDF (biblio)]

What can go wrong? Actually quite a few things can prevent you from compiling the references adequately. In the first place, make sure that the reference you are citing is in the .bib file; this might require closing and re-opening Mendeley in order to re-build the database. Then repeat the compilation steps; if it still fails, go to the directory where the .tex files are located and remove every associated auxiliary file: of the files named mainText.*, keep only those ending in .pdf and .tex:

[Figure: auxiliary files in the document directory (files.png)]

One bibliography per chapter

This is where things get a bit more complicated. Here you need to compile the bibliography on the chapter file instead of the main file (contrary to what I told you before) and then compile the text on the main document. If you need to do it again, keep in mind that the bibliography must be compiled in the chapter file. If something goes wrong, remove the auxiliary files associated with the chapter file (but not the chapter file itself) and try again to compile with BibTeX and LaTeX as indicated. In order to compile one bibliography per chapter, your chapter file needs the following lines at the end:

% Note: Compile with bibtex HERE, not in the main file
\bibliographystyle{apalike2} % Pick the reference style for the chapter bib here
\bibliography{/home/Thesis/Bibtex/refs.bib} % point to the .bib file

Compiling with BibTeX on the chapter file…
[Figure: bibtexOnChapter]
…will produce…
[Figure: bibtexOnChapterOk]

If the references cannot be compiled, try to see what went wrong with C-c C-l and read the error messages carefully. If refreshing the .bib file does not help, note that BibTeX generates some auxiliary files where the references get formatted, so you may need to remove them. If you don't know specifically where the error is, keep in mind that the only essential file you should always keep is the .tex file; all the other files named the same as the chapter file but ending with other extensions are regenerated each time you compile, so technically you can remove them all and compile again. Just to be sure, do the following:

  1. Remove all auxiliary files such as .toc, .aux, .bbl, etc.
  2. Compile the references with C-c C-c and pick BibTex from the list on the chapter file
  3. Go back to the main text file and compile it with C-c C-c and pick LaTeX from the list
  4. Repeat the compilation in step 3.
  5. See if a pdf with the right references was generated; if not, repeat steps 2 through 4.

Common compilation errors in Beamer and possible solutions (part 1)

This blog post is the first of a series documenting the most common errors I've found when preparing presentation slides in beamer while using the knitr package to compile both \LaTeX and R code in the same document, along with the explanations and solutions I found for them. Most of these, of course, are based on a huge pile of Stack Overflow posts, and most were found about a year or so ago when preparing the slides for my quals; unfortunately, I did not save any of these sources back then. As of the date of this post, I will link to solutions found elsewhere, which are frequently used partially or in combination with other sources/tests. As a final note, some of the problems/solutions only apply when authoring slides with R code evaluation, so you will not likely run into them unless you use knitr. Please note that the error messages come from Emacs+ESS, so I'm not sure whether you will find them spelled exactly the same in other tools such as RStudio; I'm not even sure that you can compile so compactly in tools other than Emacs. I will try to reproduce some of the errors as code so that you can compare them with your own. That said, it suffices to point out that Emacs rules!

Caveat: I am testing the \LaTeX code with Emacs+ESS, compiling with the keystrokes M-n r, then M-n P (uppercase), and RET in the buffer of the .Rnw file in order to produce the PDF slides. I haven't tried other LaTeX environments but would love to hear about alternative error messages for the same cases highlighted herein.

'Missing $ inserted'

Most likely you used a character or symbol reserved for math mode (e.g., the underscore _). These characters need to be escaped with \ or replaced by their respective reserved word, or even enclosed in an inline math "phrase", for instance when using Greek letters:

Good: $\alpha$-diversity; Bad: \alpha-diversity
Good: filename\_without\_spaces; Bad: filename_without_spaces

Example:

\documentclass[svgnames,mathserif,serif]{beamer}

\title{Awesome beamer presentation}
%\subtitle{}
\author{Gustavo A. Ballen, D.Sc.(c)}
\institute{University of Sao Paulo \\ 
           Ichthyology \\
           \texttt{myEmail@usp.email.com}}
\date{\today}

\begin{document}

\frame{\titlepage}

\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
    \item First item with good use\_of\_subscripts
    \item Second item with bad use_of_subscripts 
  \end{itemize}
\end{frame}

\end{document}

The code above will produce the error:

./mathSymbol.tex:71: Missing $ inserted.
<inserted text> 
                $
l.71 \end{frame}
                
? 

./myFile.tex:71: Emergency stop.
<inserted text> 
                $
l.71 \end{frame}
                
./myFile.tex:71:  ==> Fatal error occurred, no output PDF file produced!
Transcript written on myFile.log.
/usr/bin/texi2dvi: pdflatex exited with bad status, quitting.

Please also note that the error message refers to the .tex file, with the line number of the problem, not to the .Rnw file. This error message does not actually tell us that the problem is the underscore, yet it gives a clue about math mode since a $ is inserted, and $ is used to open and close inline math text (e.g., formulae).

Whenever this error happens, check whether you are using uncommon characters in your text that might be expected to play a reserved role in math mode. For instance:

% Let's assume you have exactly the same code 
% before this point as lines 1-14
\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
    \item First item with good use of $\alpha$-diversity in-line math mode
    \item Second item with bad \alpha-diversity in-line math mode
  \end{itemize}
\end{frame}

will produce the same error as the underscore case, but this time associated with \alpha, which is the command for typesetting the Greek letter alpha. Enclosing the \alpha command between $ delimiters solves the problem.

Searching for text content into GNU/Linux files (and possibly OSX too)

Machines running GNU/Linux are very powerful tools once we are used to the command line. As an example, let us imagine that we are looking for a file named importantScript.py, a text file containing Python code whose location we forgot but whose name we are sure about. The standard file search engine, catfish, is the tool for finding files by filename and path on the hard drive. It is similar to the search engines in Windows and OSX, possibly with the difference that Finder in OSX can apparently also look for keywords inside files (the latter according to my wife; I honestly stopped using OSX long, long ago).

But what about keywords inside files? Suppose we were working on a project with a lot of files written in \LaTeX with the knitr package, so that we could embed R code to execute during document compilation. In just one of a whole bunch of text files we used a special package but forgot how to use it again (it was useful, but we have only needed it once so far), and the internet went down, so we cannot rely on good ol' Google to look up the usage. For reasons that don't matter now (probably ignorance) we also overlooked the local documentation. Are we lost? Probably not.

Let us assume we remember a keyword, maybe a package name or another sort of keyword, for instance a macro (\mymacro in \LaTeX). We can use the command grep with the -r option in order to visit all of the files and subdirectories in the current directory looking for the keyword(s) of interest. First we open an instance of the command line (e.g., bash) and navigate to the directory from which we want to search. Explicit or relative paths are also possible, but I leave those to the reader to explore. After calling grep, output may appear with file names followed by a colon (:) and some text; this means that the command found our keywords in some of the files in the search path. If, on the other hand, the terminal simply gives us back the prompt ($), it failed to find the keywords inside any of the files and files-within-subdirectories of our current path. For instance:

user@machine:~$ cd ~/Documents/millionsOfScripts
# now let's see its content
# by the way, this directory is completely hypothetical!
user@machine:~$ ls
block  class  devices   fs          kernel  power
bus    dev    firmware  hypervisor  module
# recursively search the files in each of the directories shown above for
# the word 'subfigure'
user@machine:~$ grep -r subfigure
module/project1/scripts4/toCompile/blocks.Rnw:\begin{subfigure}

The output above indicates that our recursive search for the term 'subfigure' found a single file, blocks.Rnw, in the subdirectory module/project1/scripts4/toCompile, containing a line with the text \begin{subfigure}. We can then visit that file, open it, and see how we used that tool in the past when working with \LaTeX. Pretty useful when we have only a bare idea of what we are looking for.

I found it useful for exactly that case: I did not remember how to use the subfigure \LaTeX package and did not want to search Google; I just went to a directory where I had some indication that files using the package might live (e.g., my folder of beamer presentations, or my PhD folder with some of my scholarship report files, all of them in \LaTeX) and let grep look for them. I hope you find it useful and fun; it could save your life some day (as with most bash commands, so learn to love the command line!).

PS: grep can take very, very long to execute if your directory contains a lot of files and subdirectories, and if they are large.

New species (to science, sometimes actually)

This blog post is about the real meaning of an expression commonly found in technical biodiversity jargon that tends to be misunderstood by people outside the field, commonly journalists and media consumers: the concept of a new species.

Despite its literal meaning, what scientists mean by "new" is not what the public expects. I've found myself in the situation of being criticized by media consumers (people outside biology) for claiming that a species is "new". I remember a specific case, when Farlowella yarigui was described (Ballen & Mojica, 2014). This work received a lot of media coverage thanks to the interest of journalists from Unimedios in my work with fishes collected during one of the courses at the Universidad Nacional when I was still an undergrad student. For some reason, the news post was "recycled" by some other newspapers in their environment sections (e.g., El Espectador) and posted on their social media sites.

[Figure 1: Farlowella yarigui, a species from the Magdalena basin in Colombia.]

Some of the readers claimed that the species could by no means be new, since they had seen that fish before, sometimes since they were young; a few even implied that scientists lie to the public by saying the species is new! I did not take it as an attack, because I acknowledge the ambiguity of saying that a species is "new"; in fact, at least three different cases fall within the concept of "new to science", as opposed to "new to everyone in the world":

  • A species can be mistakenly called by a scientific name when it in fact is not that species. This happens when biologists fail to recognize that the scientific name applied to a given species is being used erroneously, usually because two species are being confused under the same name. This is analogous to lumping the music of Nirvana and Death together as "metal": we would be erroneously calling Nirvana "metal" when it is in fact grunge. A good example is calling the tiger catfish of the Magdalena River basin in Colombia Pseudoplatystoma fasciatum when it is not that species but another one, which was given the name Pseudoplatystoma magdaleniatum after the differences between the two species were recognized (Buitrago-Suarez & Burr, 2007).

  • A species can be known to be around but still lack a scientific name. In this case, scientists recognize that it is not any of the species already properly named, but further work is needed before a different name can be applied to it. This happens when the differences among species of a group are difficult to ascertain, or when more museum specimens are needed to be confident that we are not naming a species that already has a name. This is similar to astronomers naming a planet we had already seen but which did not yet have a name.

  • A species can be completely unknown to everyone and, consequently, lack a proper scientific name. This happens with rare, cryptic, or difficult-to-find living things, such as fishes from deep waters that even fishermen usually cannot catch; these fishes can live at incredible depths, living and dying without the slightest chance of reaching the surface. In these cases, neither the public nor the scientific community has been in contact with such organisms before. They are, in a sense, really new species.

In all three cases the technical name in taxonomy (the biological field that aims at documenting living things and naming them scientifically) is "new species", sometimes even species nova in Latin. We need to understand that taxonomic rules were formalized during the 18th century through the works of Carolus Linnaeus, and at that time Latin was the mandatory language of the natural sciences. As a consequence, even today we use the translation of species nova into modern languages such as English, the current standard language of science.

From time to time scientists decide to avoid this ambiguity by stating that the species is not a new one but an undescribed one, or even avoid the implication altogether by saying that a scientific paper proposes a scientific name for a given species. As an example I can think of Grant et al.'s 2007 paper describing Allobates niputidea, a frog species from Colombia, entitled in fact "A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia".

That said, a scientific name is a special name applied under certain clear rules in order to avoid ambiguity among scientists, and whose meaning cannot change among languages, therefore being stable. The advantage of such a system is that we avoid calling the same thing by several names (sometimes even common names), and that once we use a scientific name our peers know exactly which kind of organism we are referring to. It is in some sense like using our given-and-family name combination (e.g., John Doe) to establish our uniqueness among people sharing our given or family name (e.g., John Doe is different from both John Jameson and Evelyn Doe); however, even such a combination is subject to synonyms, something that biological nomenclature tries to avoid through its rules.

This ambiguity suggests at least two possible solutions: to avoid "new species" in article titles in favor of "undescribed species", or to undertake the titanic labour of explaining to the public what we mean by a new species (bearing in mind that the people outside the scientific community outnumber those inside it by orders of magnitude). This blog post is an effort in the latter direction, as the former usage is so widespread in the scientific literature that it might not be practical to abandon the term altogether. A second benefit of the latter alternative is that it forces us scientists to interact with people outside our field, helping make science accessible to the public.

References

Ballen, G. A., & Mojica, J. I. (2014). A new trans-Andean Stick Catfish of the genus Farlowella Eigenmann & Eigenmann, 1889 (Siluriformes: Loricariidae) with the first record of the genus for the río Magdalena Basin in Colombia. Zootaxa, 3765(2), 134-142.

Buitrago-Suarez, U. A., & Burr, B. M. (2007). Taxonomy of the catfish genus Pseudoplatystoma Bleeker (Siluriformes: Pimelodidae) with recognition of eight species. Zootaxa, 1512(1), 1-38.

Grant, T., Acosta, A., & Rada, M. (2007). A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia. Copeia, 2007(4), 844-854.