Assorted tools for interacting with servers

Personal list of commands to remember when running analyses on servers. This post is expected to grow from time to time as new commands/command-combinations are found to be useful.

# login with ssh
ssh -l user ip # e.g., ssh -l myUser 186.333.444.111

# create a virtual screen
screen -S screenName # use something identifying the analysis in screenName

# back to virtual screen called "screenName"
screen -r screenName

# list screens
screen -ls

# detach virtual screen keeping the analysis
Ctrl + a, then d

# kill the job running in the current screen (note: this interrupts the job, it does not detach the screen)
Ctrl + c

# copy files through ssh to server
# supports the -r recursive flag for several files/directories
scp user@ip:/path/to/files path/to/local/directory # from directory in user at ip to local directory 
scp path/to/local/file user@ip:/destination/directory
scp -r path/to/local/directory user@ip:/destination/directory

# monitor processes while they are running based on the filetype and modification time
find . -name "*.log" -ls | grep "date-right-now" # e.g., "Apr  6" or "Apr 21", note the extra space before single-digit days

Why use the mean of the quantiles as initial values in ‘optim’?

This post is actually one of the vignettes that accompany the bayestools package, and I’m sharing it here because vignettes do not seem to enjoy the same reach as blog posts.

One issue with the use of the optim function for parameter approximation from a set of percentiles and quantiles is that it requires initial values for its heuristic search. If these initial values are far from the unknown real parameter value, the function has serious convergence problems and may produce results that are simply wrong. In pre-release versions of findParams the initial value was a vector of 1s with length equal to the number of parameters to estimate (e.g., c(1, 1) when estimating mean and sd in pnorm), but this produced plainly wrong results when, for instance, the real mean was 10 or larger.

With a little help from simulation we can show that the best initial guess is in fact the mean of the quantiles (which coincides with their median for the symmetric quantile sets used here).

The following code generates a large number of parameter estimates from several trials using different initial values. With a bit of help from the law of large numbers, a decent estimate can be found. Here, q and p are quantiles and percentiles under a given, known distribution; an anonymous function in which sapply varies the initial values extracts the $par element of the optim call, and the density plot shows that overall the median estimate approaches the correct value. The values of q come from qDIST given the probabilities of interest (generally 0.025, 0.5, and 0.975); for instance, qbeta(p = c(0.025, 0.5, 0.975), shape1 = 10, shape2 = 1) for the example below:

# X = seq... is the set of values to try, from min(q) to max(q). 
parameters <- sapply(X = seq(0.6915029, 0.9974714, length.out = 1000),
                     FUN = function(x) {
                         findParamsPrototype(q = c(0.6915029, 0.9330330, 0.9974714),
                                    p = c(0.025, 0.5, 0.975),
                                    densit = "pbeta",
                                    params = c("shape1", "shape2"),
                                    initVals = c(x, x))$par
                     },
                     simplify = TRUE)

plot(density(t(parameters)[, 1]), main = "Density for shape1", xlab = "Shape1", ylab = "Density")
abline(v = median(parameters[1, ]), col = "red")

plot(density(t(parameters)[, 2]), main = "Density for shape2", xlab = "Shape2", ylab = "Density")
abline(v = median(parameters[2, ]), col = "red")

[Figures initval1 and initval2: density plots for the shape1 and shape2 estimates, with the median marked in red]

The law of large numbers leads us to expect such a result, but what if the specific initial value matters? Another simulation, plotting the parameter estimate as a function of the initial value, can be prepared with, say, a random variable X ~ N(\mu = 10, \sigma = 1). Such a large mean is expected to cause problems with initial values, since in these simulations 10 is huge compared to 1. Initial values were simulated between 0.001 and 10 because \sigma > 0, so zero and negative values would break the code.

# check that the quantiles are right:
qnorm(c(0.025, 0.5, 0.975), mean = 10, sd = 1)
## 8.040036 10.000000 11.959964

# simulate the parameters
simInitVals <- seq(0.001, 10, length.out = 10000)
parameters2 <- sapply(X = simInitVals,
                     FUN = function(x) {
                         findParamsPrototype(q = c(8.040036, 10.000000, 11.959964),
                                    p = c(0.025, 0.5, 0.975),
                                    densit = "pnorm",
                                    params = c("mean", "sd"),
                                    initVals = c(x, x))$par
                     },
                     simplify = TRUE)

# plot the results
plot(y = parameters2[1,], x = simInitVals, main = "Estimates for the mean", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 10, col = "red")
plot(y = parameters2[2,], x = simInitVals, main = "Estimates for the st.dev.", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 1, col = "red")

[Figures initval3 and initval4: estimates of the mean and standard deviation as a function of the simulated initial values, true values in red]

Here we see a very interesting result: initial values from near zero up to a bit above 1 cause very odd and unreliable estimates of each parameter, while larger values, closer to the real parameter values, invariably provide reliable estimates. Please note that the red line is the true parameter value that we started with above. But what happens in the neighborhood of the mean of the quantiles?

meanNeighbors <- which(simInitVals > (mean(simInitVals) - 0.1) & simInitVals < (mean(simInitVals) + 0.1))
plot(y = parameters2[1,][meanNeighbors], x = simInitVals[meanNeighbors], main = "Neighbors of mean(quantiles)", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 10, col = "red")
plot(y = parameters2[2,][meanNeighbors], x = simInitVals[meanNeighbors], main = "Neighbors of mean(quantiles)", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 1, col = "red")

[Figures initval5 and initval6: estimates in the neighborhood of mean(quantiles), true values in red]

Now let’s visualize it as densities:

plot(density(parameters2[1,][meanNeighbors]), main = "Neighbors of mean(quantiles)", ylab = "Parameter estimate")
abline(v = 10, col = "red")
plot(density(parameters2[2,][meanNeighbors]), main = "Neighbors of mean(quantiles)", ylab = "Parameter estimate")
abline(v = 1, col = "red")

[Figures initval7 and initval8: densities of the estimates in the neighborhood of mean(quantiles), true values in red]

Here we see that values around the mean of the quantiles used as initial values behave with the regular properties of the law of large numbers (the true parameter value is the red line, as above). Therefore, it is advisable to pick the mean of the quantiles as the default initial value in the context of the optim function.
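
In practice this simply means using the mean of the observed quantiles as the starting value for every parameter; a minimal sketch (an illustration, not part of the package):

# use the mean of the quantiles as the starting value for every parameter to estimate
q <- c(8.040036, 10.000000, 11.959964)
initVals <- rep(mean(q), 2) # c(10, 10) for a two-parameter PDF such as pnorm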

Why did the first example behave nicely? Results not shown (just change the values in seq(from = 0, to = 1)) indicate that convergence is not a concern when estimating the parameters of the beta distribution, and there is a reason for it: the beta PDF is defined over [0, 1], so it makes no sense at all to try values outside this interval:

Beta(\alpha,\beta):\,\, P(x|\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}

where B is the beta function

B(\alpha,\beta)=\int_{0}^{1}t^{\alpha-1}(1-t)^{\beta-1}dt
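
As a quick sanity check (purely illustrative), the formula above is exactly what dbeta evaluates in R:

# evaluate the Beta(10, 1) density at x = 0.7 by hand and with dbeta
x <- 0.7; a <- 10; b <- 1
x^(a - 1) * (1 - x)^(b - 1) / beta(a, b) # about 0.4035
dbeta(x, shape1 = a, shape2 = b)         # same value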

With 0 and 1 not that far from each other, any initial value in this interval behaves just like our simulation of the neighborhood of the mean of the quantiles in the estimation of the normal PDF parameters.

First release of the ‘bayestools’ R package

After some time developing functions for my own research on using fossil information as a source of prior information for divergence time estimation studies (in prep.), I decided to release the first version of an R package with functions that I have found useful for pre- and post-processing of data in evolutionary analyses using Bayesian methods. It was first released to its development repository on GitHub (https://github.com/gaballench/bayestools/), and I expect to implement further tools, so the package should be understood as a release with a minimal set of functions.

I expect to continue active development and eventually submit it to CRAN, but until then you can install the development version with the devtools function install_github:

# install devtools if needed
install.packages("devtools")

# load devtools to access the install_github function
library(devtools)

# install bayestools from the github repository
install_github(repo = "gaballench/bayestools")

What can the package do?

There are two main sets of functions that bayestools provides: pre-processing and post-processing functions. In the first set we have functions that allow us to specify priors and to plot probability density functions from Beast2 parameters in order to check their correctness visually. In the second group we have a tool for measuring interdependence between empirical densities, in addition to some ways to visualize it.

The features in bayestools belong at the core of divergence time estimation and the incorporation of paleontological and geological information into the estimation of the time component of phylogenies and diversification analyses. The tools in the package help implement information from these fields in Bayesian analyses, mainly through the specification of priors, the most conspicuous feature of Bayesian inference. However, prior specification is a very difficult task, there are few if any rules to follow, and there is a lot of misunderstanding along with clearly unjustified practices.

Priors

Prior specification depends strongly on what we are trying to use in order to calibrate a given phylogeny. For instance, node calibration and tip calibration work in different ways, and prior specification will not only depend on this but also on the nature of the information we are trying to use as priors.

In the most common case, we have a fossil taxon (or occurrence) that could inform us about the age of a given node (node calibration). However, we lack exact information on two things: when the organism lived (measured without error), and when a given diversification event took place (i.e., the exact age of a node). What we have is a fossil whose age has been inferred or measured with error or uncertainty. It is this information, the uncertainty, that we expect to model through priors (or at least what should be done in real life). A prior is in itself a probability density function (PDF hereafter), that is, a mathematical function describing the probability that a variable takes a value inside a given interval. More or less precise statements can be made with the aid of PDFs, such as that it is more likely that a variable X takes a value between a and b than between c and d, or what the expected value of the variable is.
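
As an illustration (taking a standard normal variable purely as an example), such statements can be computed directly in R from the corresponding cumulative distribution function:

# probability that X falls between -1 and 1 versus between 1 and 2
pnorm(1) - pnorm(-1) # about 0.68
pnorm(2) - pnorm(1)  # about 0.14, so the first interval is more likely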

Now, what we have is age information such as the following instances:

  • The fossil comes from a volcanic layer that has been dated with radiometric techniques with a specific value \pm uncertainty, for instance, 40.6 \pm 0.5 Ma, where 0.5 is the standard deviation \sigma.

  • The fossil comes from a sedimentary unit that is said to be of Miocene age (5.33 to 23.03 Ma).

  • The fossil comes from a layer bracketed by two levels with volcanic ash that have been dated to 10.2 \pm 0.2 and 12.4 \pm 0.4 Ma, respectively. The fossil itself therefore has an age determined by interpolation between the layers that carry the actual age information.

How can we convert this information into legitimate priors?
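
For the first case the translation is almost direct: the age and its one-sigma uncertainty define a normal prior, and we can check which quantiles such a prior implies (an illustration only, not a package function):

# a radiometric date of 40.6 +/- 0.5 Ma modeled as Normal(mean = 40.6, sd = 0.5)
qnorm(c(0.025, 0.5, 0.975), mean = 40.6, sd = 0.5)
## 39.62002 40.60000 41.57998

The remaining cases are less direct, and that is where findParams comes in.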

The findParams function

Given that we have a prior on the age of a fossil of 1–10 Ma and that we want to model it with a lognormal distribution, find the parameters of the PDF that best reflect the uncertainty in question (i.e., the parameters for which the observed quantiles are 1, 5.5, and 10, assuming that we want the midpoint to reflect the mean of the PDF):

bayestools::findParams(q = c(1, 5.5, 10),
          p = c(0.025,  0.50, 0.975),
          output = "complete",
          pdfunction = "plnorm",
          params = c("meanlog", "sdlog"))
## $par
## [1] 1.704744 0.305104
## 
## $value
## [1] 0.0006250003
## 
## $counts
## function gradient 
##      101       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Now suppose we have a published study that specified a lognormal prior (with or without justification; that’s another question) and we want to actually plot it outside of BEAUti, for instance in order to assess the sensitivity of the posterior to the prior. How can we plot it in R and use its data as any other statistical density?

The lognormalBeast function

Generate a matrix for the lognormal density with mean 1 and standard deviation 1, with mean in real space, and spanning values in x from 0 to 10, and then plot it:

lnvals <- bayestools::lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, from = 0, to = 10)
plot(lnvals, type = "l", lwd = 3)

[Figure lognormalBeast: the lognormal density generated with lognormalBeast]

Sensitivity

One hot topic in model and tool comparisons is the sensitivity of the posterior to the prior, that is, how much our results could be determined by the prior. For instance, if the posterior is totally dependent on the prior, the likelihood (or the data) is not providing information, and this is very risky because it opens the potential to manipulate the results of Bayesian analyses.

The measureSensit function

Measure and plot the sensitivity between two partially overlapping distributions:

set.seed(1985)
colors <- c("red", "blue", "lightgray")
below <- bayestools::measureSensit(d1 = rnorm(1000000, mean = 3, 1),
                       d2 = rnorm(1000000, mean = 0, 1),
                       main = "Partial dependence",
                       colors = colors)
legend(x = "topright", legend = round(below, digits = 2))

[Figure measureSensit: the two densities and their overlap region]

The number in the legend indicates an overlap of 0.13, and as such, a measure of interdependence between the densities (Ballen in prep.).

Geochronology

In the third case of possible age estimates for our calibrations, can we clump together a number of age estimates from several samples in order to build a general uncertainty for the interval?

Do the age estimates for the boundaries of the Honda Group (i.e., samples at meters 56.4 and 675.0) conform to the isochron hypothesis?

data(laventa)
hondaIndex <- which(laventa$elevation == 56.4 | laventa$elevation == 675.0) 
bayestools::mswd.test(age = laventa$age[hondaIndex], sd = laventa$one_sigma[hondaIndex])

The p-value is smaller than the nominal alpha of 0.05, so we can reject the null hypothesis of isochron conditions.

Do the age estimates for the samples JG-R 88-2 and JG-R 89-2 conform to the isochron hypothesis?

twoLevelsIndex <- which(laventa$sample == "JG-R 89-2" | laventa$sample == "JG-R 88-2")
dataset <- laventa[twoLevelsIndex, ]
# Remove the values 21 and 23 because of their abnormally large standard deviations
bayestools::mswd.test(age = dataset$age[c(-21, -23)], sd = dataset$one_sigma[c(-21, -23)])

The p-value is larger than the nominal alpha of 0.05, so we cannot reject the null hypothesis of isochron conditions.

R package development in Emacs with ESS

This post is somewhat a self-reference for my future self, as I only develop packages from time to time and tend to forget things easily; it is also a quick reference source, as several of these topics are covered in varying detail in Hadley Wickham’s “R Packages: Organize, Test, Document and Share Your Code”. The following questions are answered with code below and comments wherever appropriate. The two most important tools we will need are rmarkdown and devtools. I assume that we already have important dependencies installed, such as knitr or pandoc, so their installation will not be covered here (Google will surely point you to the proper answer).

Why do I do this if Wickham’s book is already a reference? Because I don’t like/use RStudio (reasons will not be discussed here) but instead use Emacs+ESS, and information on the workflow under that beautiful platform in the specific context of R package development is scarce. In order to mimic RStudio’s behavior, the secret is simply to run devtools and rmarkdown functions in R’s command line; that’s it.

Before starting, install and load the packages of interest:

# install the packages of interest
install.packages("rmarkdown")
install.packages("devtools")

# once installed, load the packages
library(rmarkdown)
library(devtools)

How do I create a package from scratch?

The code below will create the layout of a package with the minimum you will need to care about when producing a very simple package. This includes the documentation and code directories, and the description files that you will need to fill in with your own information.

create("myPackageName")

I have already coded a function, how do I include it in my package?

Simply copy it into the R/ folder in a file with extension .R so that R knows it contains code.
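
For instance (assuming the function lives in a file called addTwo.R, as in the example further below), copying it from the parent directory of the package could be as simple as:

# copy an existing script into the package's R/ directory
file.copy(from = "addTwo.R", to = "myPackageName/R/addTwo.R")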

How do I document my function?

Take a look at the help file for any function and see what information you usually find there: A description of what the function does, which arguments it has, and what kinds of values are allowed in each arguments are a good place to start. Take a look at Karl Broman’s material on writing documentation for further details and tag options.

Basically, documentation with roxygen2 lives as text inside the script that contains the function. It uses special comments (#') that are skipped when R loads the code but are used by roxygen2 to build the documentation files. The advantage is that in the very same file you put your code, describe it, and document its usage. roxygen2 uses special tags starting with @ to let the user fill in the several sections of a documentation file. A simple example would be:

#' A function for adding up two numbers
#'
#' @usage addTwo(x, y)
#'
#' @param x The first number to add
#'
#' @param y The second number to add
#'
#' @return A numeric vector with the value of the sum x + y
#'
#' @examples
#' # Adding up two positive numbers
#' addTwo(x = 4, y = 6)
#' # Adding up one positive and one negative numbers
#' addTwo(x = -2, y = 7)
#'
addTwo <- function(x, y) {
    output <- x + y
    return(output)
}

The code above contains our function addTwo along with a bunch of lines starting with #'; these compose the documentation. I would save this in a file named after the function: addTwo.R.

I managed to document my functions, how do I build the documentation?

Open a console with the working directory pointing to the package directory, maybe navigating to such folder with Dired (C-x d) and then opening an R session there (M-x R). After loading devtools you can use the function document to build the documentation files:

# check that we are where we need to
> getwd()
"myPackage"
> document()
Updating myPackage documentation
Loading myPackage

The function document will produce a file with extension .Rd in the man/ directory, corresponding to the documentation of our function. We cannot access the documentation as with regular functions because the package is not formally installed, so the documentation files cannot be found by R. Instead, we can preview them in the command line by typing ? and pressing Enter; this will prompt a + sign for completing the command, and then we can write the function name. It would look like the following when called from the command line:

> ?
+ addTwo

addTwo                    package:myPackage                  R Documentation

A function for adding up two numbers


Usage:

     addTwo(x, y)

Arguments:

       x: The first number to add

       y: The second number to add

Value:

     A numeric vector with the value of the sum x + y

Examples:

     # Adding up two positive numbers
     addTwo(x = 4, y = 6)
     # Adding up one positive and one negative numbers
     addTwo(x = -2, y = 7)

I’ve already documented all of my functions; how can I build a PDF with all the documentation, as in well-known packages?

Open a shell (e.g., M-x eshell), and run:

# If eshell is inside the package directory...
R CMD Rd2pdf ../myPackage
# If eshell is just above the package directory...
R CMD Rd2pdf myPackage

Hmm ... looks like a package
Converting Rd files to LaTeX 
Creating pdf output from LaTeX ...

This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
...
Output written on Rd2.pdf (8 pages, 106389 bytes).
Transcript written on Rd2.log.
Saving output to ‘myPackage.pdf’ ...
Done

This will generate a nicely formatted PDF with all your documentation in one place.

I’ve changed some code, how do I see the result interactively?

You need to re-load your package in order to have access to these changes, and you need to be in the package’s main directory in order to use the function load_all. Do not forget to save all changes in all code files before re-loading:

# check that we are where we need to
> getwd()
"myPackage"
> load_all()
Loading myPackage

It is a good idea to re-build the documentation just to be sure that everything is up to date (see above).
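
A typical edit-reload cycle from the package root then looks like this (a minimal sketch):

# after editing code or roxygen comments under R/
> document()   # rebuild the man/ pages
> load_all()   # re-load the package code into the current session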

I’ve updated the README.Rmd file, how do I apply the changes to the .md and .html files?

These three files are of special relevance if you are hosting your package development repository on GitHub, as the markdown files are already optimized for looking good under GitHub’s flavour of markdown. Always modify the R Markdown file (the one with the .Rmd extension) and then use the rmarkdown package to convert it to .md and .html:

> render("README.Rmd)

processing file: README.Rmd
  |........                                                         |  12%
  ordinary text without R code

  |................                                                 |  25%
label: unnamed-chunk-1 (with options) 
List of 1
 $ echo: logi FALSE

  |........................                                         |  38%
  ordinary text without R code

  |................................                                 |  50%
label: unnamed-chunk-2 (with options) 
List of 1
 $ eval: logi FALSE

  |.........................................                        |  62%
  ordinary text without R code

  |.................................................                |  75%
label: unnamed-chunk-3 (with options) 
List of 1
 $ echo: logi FALSE

  |.........................................................        |  88%
  ordinary text without R code

  |.................................................................| 100%
label: unnamed-chunk-4 (with options) 
List of 2
 $ eval   : logi TRUE
 $ warning: logi FALSE


output file: README.knit.md

/usr/bin/pandoc +RTS -K512m -RTS README.utf8.md --to markdown_github-ascii_identifiers --from markdown+autolink_bare_uris+tex_math_single_backslash --output README.md --standalone --template /.../R/x86_64-pc-linux-gnu-library/3.4/rmarkdown/rmarkdown/templates/github_document/resources/default.md
...
Preview created: README.html

Output created: README.md

The last two lines of the output above tell us that the .md and .html files were created, indicating that everything was successful.

First paper in 2018!

Today I’m pleased to announce that our manuscript “Fishes of the Cusiana River (Meta River basin, Colombia), with an identification key to its species” (Urbano-Bonilla et al., 2018) was finally published after a long road to completion. The main goal of the paper is to document the freshwater fish fauna of the Cusiana River sub-basin, a tributary of the Meta basin in the Colombian Orinoco drainage. Checklists tend to be mere listings of scientific names documenting the presence of species in an area of interest; as such they are of limited value beyond large-scale record compilations and of little use as sources of data for further research such as studies in biogeography, climate change, or macroecology. We devised this project instead as a proposal for a novel style of checklist study in which additional information, diversity estimation, and identification keys are prepared for the region of study.

First of all, our checklist differs from others in its attitude towards open data. We accomplished this by providing open access to the raw data (i.e., the specific museum specimens that document the presence of a given species in our area of study), including scientific names, coordinates, abundance, elevation, and any other relevant aspect of the biological records. We uploaded our dataset to the SiB Colombia data repository (http://doi.org/10.15472/er3svl), which allows anyone to replicate all of our claims exactly. It can also be used for future research, even for uses that we did not envision in the first place! This is the magic of open access to data and reproducible research.

Second, we wanted to provide an identification key to the species of freshwater fishes documented for this drainage as an aid to the many biologists and other professionals who face the problem of identifying fish samples from this area. I should note that the Colombian Orinoco drainage is an area with very active extractive industries such as mining and especially oil; as a consequence, environmental impact studies are routinely carried out during several phases of oil extraction, as required by Colombian environmental legislation. Given the high diversity of this area, an identification key can improve the quality of such studies, resulting in better actions for environmentally responsible industrial activities. This may sound like a dream, and I’d really like to believe that our tools will have the expected impact, but even if environmental impact studies are in general of very poor quality, at least the lack of taxonomic tools will no longer be an excuse.

Third, we presented a gross estimate of species richness for the area. Even though we used methods developed for estimating species richness under standardized sampling, and surely some model assumptions are not met (e.g., the same sampling effort across units of study), this approach has been used for richness estimation in other non-standard scenarios, such as continental-scale species richness estimates based on literature data (Reis et al., 2016). Uneven sampling has the effect of overestimating rarity and therefore inflating uncertainty around estimated richness (e.g., by inflating the number of species considered rare because of poor sampling), but provided that there is no consistent pattern of sampling bias, our estimates are adequate in general. In addition, we used an integrative approach based on Hill numbers that allows rarefaction and extrapolation to be carried out under the same inferential model (Chao et al., 2014). Armed with this, we can have an idea of the total species richness in our study area and compare it to the estimated richness of other areas; this is far better than comparing raw observed richness when assessing the conservation potential of river basins based on their fish faunas.

One of the most striking results is that we show a strong elevational sampling bias, with sampling effort inversely proportional to elevation. This is in fact a general aspect of our knowledge of river basins that drain the Andes. Since species richness is also inversely proportional to elevation, we have a poor characterization of the species-poor portion of the basin. This has little effect on overall richness estimation, but it raises important concerns about our ability to develop management strategies for the Andean portions of the basin, which are in general more fragile than the lowlands. This calls the attention of the ichthyological community to the need to better study the Andean portions of these basins, as we currently know very little about their species-poor yet highly endemic fauna.

We hope to call attention to the several additional aspects that checklists can include in order to serve better as biodiversity documentation efforts, and look forward to seeing other regional checklists that go beyond species lists and explore other aspects of biodiversity. The potential for novel and interesting research with a broader impact in conservation is large.

Literature cited

Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K., & Ellison, A. M. (2014). Rarefaction and extrapolation with Hill numbers: A framework for sampling and estimation in species diversity studies. Ecological Monographs, 84(1), 45–67. http://doi.org/10.1890/13-0133.1

Reis, R. E., Albert, J. S., Di Dario, F., Mincarone, M. M., Petry, P. and Rocha, L. A. (2016), Fish biodiversity and conservation in South America. J Fish Biol, 89: 12–47. doi:10.1111/jfb.13016

Urbano-Bonilla, A., Ballen, G. A., Herrera-R, G. A., Zamudio, J., Herrera-Collazos, E. E., DoNascimiento, C., … Maldonado-Ocampo, J. A. (2018). Fishes of the Cusiana River (Meta River basin, Colombia), with an identification key to its species. ZooKeys, 733, 65–97. http://doi.org/10.3897/zookeys.733.20159

Short introduction to writing the thesis in LaTeX with Emacs

This post aims at a smaller audience than previous ones: it is a short tutorial for generating a thesis with \LaTeX using Emacs. It assumes that you have a document with a multi-file structure where chapters are separated into individual .tex files, so that a master document includes the content from the other files. Before we begin, some Emacs idiosyncrasies:

  • C means to press the Control key; and C-f means to press the Control key along with the ‘f’ key (without releasing Control until you press ‘f’).
  • M means to press the Alt (once known as Meta, hence M) key; M-a consequently will be to press Alt and the ‘a’ key at the same time.
  • Uppercase letters are generated in the usual way with the Shift key, so that C-A means to press the Control key AND the Shift AND the ‘a’ keys; when released they will run the wanted commands.
  • C-M-w will mean the good ol’ MS Windows key combination Ctrl+Alt+w, for instance.
  • A command such as C-c C-c can be generated by keeping Control pressed and pressing ‘c’ twice.

Ever considered writing your thesis in \LaTeX? There is actually a whole bunch of resources, and you can always check whether your university supports/encourages the use of \LaTeX. As a starting point, take a look at these templates from the IME @ USP, the Universidad Nacional de Colombia LaTeX template, or maybe the template from the University of Bristol housed at Overleaf. I’m also preparing a thesis template for the Museu de Zoologia that I hope to launch on GitHub soon.

An important aspect to keep in mind is that working with a document that has a bibliography is very different from working with one that does not: the former tends to become a lot more complex because the references and in-text citations must be typeset. On the other hand, that process is automatic and mostly error-free.

On the main file

Almost all of the thesis templates in the wild have this multi-file structure, and there is a good reason for it: LaTeX documents tend to grow in size and complexity pretty quickly, so it is better to keep things separate during document preparation. As a consequence, we must centralize compilation somehow so that we don’t have to compile and paste together a bunch of independent files. That is where the master document comes in. This file has the preamble, that is, the document class definition (the documentclass command), all the package declarations, new commands, and user-defined code that does not come directly from the packages. It may look something like this:

\documentclass[12pt,twoside,a4paper]{book}

% ---------------------------------------------------------------------------- %
% Packages
\usepackage[T1]{fontenc}
\usepackage{amsmath}                    % the AMS equations package
\usepackage[USenglish]{babel}
\usepackage[utf8]{inputenc}
\usepackage[pdftex]{graphicx}           % we use pdf/png files as figures
\usepackage{setspace}                   % flexible line spacing
\usepackage{indentfirst}                % indent the first paragraph
\usepackage{makeidx}                    % index
\usepackage[nottoc]{tocbibind}          % add the bibliography/index/contents to the Table of Contents
\usepackage{courier}                    % use Adobe Courier instead of Computer Modern Typewriter
\usepackage{type1cm}                    % truly scalable fonts
\usepackage{listings}                   % for formatting source code (e.g., in Java)
\usepackage{titletoc}
\usepackage{longtable}                  % long tables and across-page table layout
\usepackage{array}
\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}} % create a new command for a centered width-defined column in a longtable environment

\usepackage{enumitem} % nested lists with more control on labels
%\usepackage[bf,small,compact]{titlesec} % section title headers: smaller and more compact
\usepackage[fixlanguage]{babelbib}
\usepackage[font=small,format=plain,labelfont=bf,up,textfont=it,up]{caption}
\usepackage[usenames,svgnames,dvipsnames]{xcolor}
\usepackage{tikz}                       % drawing stuff such as the bibliographic card
\usepackage[a4paper,top=2.54cm,bottom=2.0cm,left=2.0cm,right=2.54cm]{geometry} % margins
%\usepackage[pdftex,plainpages=false,pdfpagelabels,pagebackref,colorlinks=true,citecolor=black,linkcolor=black,urlcolor=black,filecolor=black,bookmarksopen=true]{hyperref} % links in black
\usepackage[pdftex,plainpages=false,pdfpagelabels,pagebackref,colorlinks=true,citecolor=DarkGreen,linkcolor=NavyBlue,urlcolor=DarkRed,filecolor=green,bookmarksopen=true,breaklinks=true]{hyperref} % colored links
\usepackage[all]{hypcap}                    % fixes the problem with hyperref and chapters
\usepackage[round,sort,nonamebreak]{natbib} % textual bibliographic citation (plainnat-ime.bst)
%\usepackage{breakcites}
%\bibpunct{(}{)}{;}{a}{\hspace{-0.7ex},}{,} % citation style. See some examples at http://merkel.zoneo.net/Latex/natbib.php ORIGINAL. CAUSES A SMALL SPACE BETWEEN AUTHORS AND THE COMMA SEPARATING THESE FROM YEARS, AWFUL.
\bibpunct{(}{)}{;}{a}{,}{,} % citation style. See some examples at http://merkel.zoneo.net/Latex/natbib.php EDITED BY GAB. THE NICE COMMA AFTER AUTHORS AS EXPECTED.

\fontsize{60}{62}\usefont{OT1}{cmr}{m}{n}{\selectfont}

\let\proglang=\textsf

Above we have an excerpt from my own thesis main file, heavily modified from the IME template. No other file in our setting will have a preamble, and this means that compilation must be centralized in this file. Supposing that we don’t have a bibliography (yet), it suffices to open this file and compile it with C-c C-c and then pick LaTeX from the list in Emacs’ minibuffer (the tiny line below the window that spits out some text when triggered with keystrokes such as our compilation C-c C-c). Once there we must tell Emacs that we want to compile by moving up and down with the arrows until we find the LaTeX entry (see the awful white arrow I included) and then press Enter. This should compile the whole thesis and produce a nice PDF with the “end” product.

[Figure compilation.png: picking LaTeX from the compilation list in the minibuffer]

On chapter files

Why does Emacs compile the whole thesis even when the text of the other chapters does not reside in the main file? Your main document should have commands called \input or \include{}, generally located toward the end of the main file. These tell the main document to visit the independent files and compile them along with the whole thing:

% ---------------------------------------------------------------------------- %
% Chapters, all ending with .tex for the \input items
\mainmatter

% header for the pages of every chapter
\fancyhead[RE,LO]{\thesection}

\singlespacing              % single spacing
%\onehalfspacing            % one-and-a-half spacing

\input chapIntro            % Introduction to the whole thesis, called chapIntro.tex
\input chapTerminology      % Chapter on Spine terminology, called chapTerminology.tex
\input chapFossilFishes     % Chapter on the fossil fishes from the Ware and Sincelejo Fms. called chapFossilFishes.tex
\input chapFour             % Chapter four called chapFour.tex

% header for the appendices
\renewcommand{\chaptermark}[1]{\markboth{\MakeUppercase{\appendixname\ \thechapter}} {\MakeUppercase{#1}} }
\fancyhead[RE,LO]{}

\appendix

\include{appTerminology} % appendix for the chapter on terminology
\include{appMaterialExamined}      % material examined
\include{appCode}                  % computer code for something interesting

Now suppose you are ready to put your hands on the first chapter of your thesis. In general, these files will lack a preamble, so we should not see any usepackage or renewcommand or the like in them; instead, we usually start with \chapter{Name of my thesis chapter} followed by commands for individual sections such as \section{Introduction} or \section{Materials and Methods}. As an example, this is an excerpt of one of my chapter files:

%% ------------------------------------------------------------------------- %%
\chapter{A Standardized Terminology of Spines in Neotropical Siluriformes}
\label{chap:chapTerminology}

\section{Abstract}
\label{sec:abstract}
Some text in the abstract

%% ------------------------------------------------------------------------- %%
\section{Introduction}
\label{sec:introduction}

The order Siluriformes is one of the most important components of Neotropical freshwaters with more than XXXX species currently described and bla bla bla...

Given the confusion in anatomical terms historically applied to spines and its ornaments, herein I propose a new standard terminological system to avoid future confusions. Each term is accompanied by its conditions as coded from the material examined, and an attempt at synonymy along the numerous references examined. A quantitative approach at standardization is herein proposed for picking the optimal system from among the already proposed terms...

Please note that the highlighted lines show these two commands for defining the title of the chapter and the section headers (as in any research article). Please note too that we lack a preamble and therefore these files are NOT TO BE COMPILED INDIVIDUALLY. As a test, try to run C-c C-c LaTeX on any of these and you will end up with an error, because compilation must be carried out on the main file. Such an error is shown below, emphasizing what appears in the minibuffer.

[Figure errorCompilChap: the compilation error as shown in the minibuffer]

Take-home message: don’t compile (C-c C-c) on chapter files; always do it on the main text file.

What if we have references? One bibliography to rule them all

There are several LaTeX packages that deal with bibliographies. The one I use is BibTeX, a quite old system with some disgusting behaviors I will mention below. However, it gets the job done, and it is what I have learnt to use so far.

In order to use BibTeX we need a .bib file, that is, a file that houses our references in a specific format. I generate such files with Mendeley Desktop and edit them with a Bash script that italicizes scientific names in paper titles, something that Mendeley is quite bad at. Once we have such a file, it suffices to declare the type of bibliography we want along with the bibliographic style (e.g., APA). Note that these are managed by the natbib package in the preamble (see above), so if we called the package in the preamble, we are ready to define the following lines wherever we find it appropriate to place the bibliography:

\bibliographystyle{palelec} % using the citation style of palaeontologica electronica
\bibliography{/home/user/Documents/Thesis/Bibtex/PhDThesis}  % the path to the '.bib' file
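
For reference, the .bib file itself is just plain text with one entry per reference; a hypothetical entry (the key, names, and values below are made up for illustration) looks like this:

@article{author2018example,
  author  = {Author, Some},
  title   = {An example article title},
  journal = {Some Journal of Examples},
  year    = {2018},
  volume  = {1},
  pages   = {1--10}
}

The entry key (author2018example) is what you then pass to \citet{} or \citep{} in the text.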

The behavior of the compilation depends on whether we have a single bibliography for the whole thesis or one bibliography per chapter. Let’s assume first that we want a single general bibliography; in that case we first compile the bibliography with C-c C-c and then look for BibTeX instead of LaTeX with the up and down arrows:

[Figure bibtex: picking BibTeX from the compilation list in the minibuffer]

For reasons that I honestly don’t know, you must compile first once with BibTeX and then twice with LaTeX (C-c C-c and looking for LaTeX with the arrow keys). You may need to repeat these three steps if references were missing, or until Emacs tells you that the document was successfully formatted, with the number of pages:

[Figure success: message indicating the document was successfully formatted, with the number of pages]

If so, you will end up with a nice bibliography at the end of the PDF (or wherever you put it):

[Figure biblio: the bibliography as typeset at the end of the PDF]

What can go wrong? Actually quite a few things can prevent you from compiling the references adequately. In the first place, make sure that the reference you are using is in the .bib file; this might require you to close and reopen Mendeley in order to re-build the database. Then try to repeat the compilation steps, and if that fails, go to the directory where the .tex files are located and remove every associated file except the .tex and .pdf files; of those named mainText.*, only keep the ones ending in .pdf and .tex:

[Figure files.png: the files in the working directory; keep only the .tex and .pdf files]

One bibliography per chapter

This is where things can get a bit more complicated. Here, you will need to compile the bibliography on the chapter file instead of the main file (contrary to what I already told you to do) and then compile the text on the main document. If you need to do it again, keep in mind that the bibliography will need to be compiled in the chapter file. If something goes wrong, you will need to remove the auxiliary files associated with the chapter file (but not the chapter file itself) and try again to compile with BibTeX and LaTeX as indicated. In order to compile one bibliography per chapter, your chapter file will need to have the following lines at the end:

% Note: Compile with bibtex HERE, not in the main file
\bibliographystyle{apalike2} % Pick the reference style for the chapter bib here
\bibliography{/home/Thesis/Bibtex/refs.bib} % point to the .bib file

Compiling with BibTeX on the chapter file
[Figure bibtexOnChapter]
…will produce…
[Figure bibtexOnChapterOk]

If the references cannot be compiled, try to see what went wrong with C-c C-l and read the error messages carefully. If refreshing the .bib file does not help, please note that BibTeX generates some auxiliary files where these references get formatted, so you may need to remove them. If you don’t know specifically where the error is, keep in mind that the only essential file you should always keep is the .tex file; all other files named the same as the chapter file but ending in other extensions are generated each time you compile, so technically you can remove them all and compile again. Just to be sure, do the following:

  1. Remove all auxiliary files such as .toc, .aux, .bbl, etc.
  2. Compile the references with C-c C-c and pick BibTeX from the list on the chapter file
  3. Go back to the main text file and compile it with C-c C-c and pick LaTeX from the list
  4. Repeat the compilation in step 3.
  5. See if a pdf with the right references was generated; if not, repeat steps 2 through 4.
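
As an illustration of step 1, assuming the chapter file is called chapTerminology.tex (use your own file name), the cleanup from a shell could look like:

# remove the auxiliary files for the chapter, keeping the .tex source and the .pdf
rm chapTerminology.aux chapTerminology.bbl chapTerminology.blg chapTerminology.log chapTerminology.toc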

Common compilation errors in Beamer and possible solutions (part 1)

This blog post is the first of a series that aims at documenting the most common errors I have found when preparing presentation slides in beamer while using the knitr package to compile both \LaTeX and R code in the same document, together with the explanations and solutions found for them. Most of these, of course, are based on a huge pile of Stack Overflow threads, and most were found about a year or so ago when preparing the slides for my quals; unfortunately, I did not save any of those sources. Where possible as of the date of preparation of this post, I will link to solutions found elsewhere, which frequently are used partially or in addition to other sources/tests. As a final note, some of the problems/solutions apply only when authoring slides with R code evaluation, so you will not likely encounter them unless you are using knitr. Please note that the error messages come from Emacs+ESS, so I’m not sure whether you will find them spelled exactly the same in other tools such as RStudio; I’m not even sure that you can compile so compactly in tools other than Emacs. I will try to reproduce some of them as code examples so that you can compare them to your own code. That said, it just suffices to point out that Emacs rules!

Caveat: I am testing the \LaTeX code with Emacs+ESS, compiling with the keystrokes M-n r and then M-n P (uppercase) and RET in the buffer of the .Rnw file in order to produce the PDF slides. I haven’t tried other LaTeX environments but would love to hear about alternative error messages for the same cases highlighted herein.

'Missing $ inserted'

Most likely you used a character or symbol reserved for math mode (e.g., the underscore _). Please note that these characters need to be escaped with \ (or replaced by their respective reserved word), or enclosed in an inline math expression, for instance when using Greek letters:

Good: $\alpha$-diversity; Bad: \alpha-diversity
Good: filename\_without\_spaces; Bad: filename_without_spaces

Example:

\documentclass[svgnames,mathserif,serif]{beamer}

\title{Awesome beamer presentation}
%\subtitle{}
\author{Gustavo A. Ballen, D.Sc.(c)}
\institute{University of Sao Paulo \\ 
           Ichthyology \\
           \texttt{myEmail@usp.email.com}}
\date{\today}

\begin{document}

\frame{\titlepage}

\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
    \item First item with good use\_of\_subscripts
    \item Second item with bad use_of_subscripts 
  \end{itemize}
\end{frame}

\end{document}

The code above will produce the error:

./myFile.tex:71: Missing $ inserted.
<inserted text> 
                $
l.71 \end{frame}
                
? 

./myFile.tex:71: Emergency stop.
<inserted text> 
                $
l.71 \end{frame}
                
./myFile.tex:71:  ==> Fatal error occurred, no output PDF file produced!
Transcript written on myFile.log.
/usr/bin/texi2dvi: pdflatex exited with bad status, quitting.

Please also note that the error message refers to the .tex file, with the line number of the problem, not to the .Rnw file. This error message does not actually tell us that the problem is the underscore, yet it gives a clue about math mode since a $ is inserted, and that character is used to open and close in-line math text (e.g., formulae).

Whenever this error happens, check whether you are using uncommon characters in your text that might be expected to play a reserved role in math mode. For instance:

% Let's assume you have exactly the same code 
% before this point as lines 1-14
\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
    \item First item with good use of $\alpha$-diversity in-line math mode
    \item Second item with bad \alpha-diversity in-line math mode
  \end{itemize}
\end{frame}

will produce the same error as the underscore case, but this time associated with \alpha, which is a command for producing the Greek letter \alpha. Enclosing the \alpha command between $ delimiters solves the problem.
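
For instance, the second item in the frame above would become:

\item Second item with good $\alpha$-diversity in-line math mode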