## Assorted tools for interacting with servers

Personal list of commands to remember when running analyses on servers. This post is expected to grow from time to time as new commands and command combinations turn out to be useful.

```sh
# log in with ssh
ssh -l user ip # e.g., ssh -l myUser 186.333.444.111 (equivalent to ssh user@ip)

# create a virtual screen
screen -S screenName # use something identifying the analysis in screenName

# go back to the virtual screen called "screenName"
screen -r screenName

# list screens
screen -ls

# detach the virtual screen, keeping the analysis running:
# press Ctrl + a, then d

# kill the job running inside the screen:
# press Ctrl + c (this stops the analysis but leaves the screen open;
# type exit afterwards to close the screen itself)

# copy files through ssh to/from a server;
# supports the -r flag for copying directories recursively
scp user@ip:/path/to/files path/to/local/directory # from the remote machine to a local directory
scp path/to/local/file user@ip:/destination/directory
scp -r path/to/local/directory user@ip:/destination/directory

# monitor processes while they are running, based on file type and modification time
# (note the quotes around *.log, which keep the shell from expanding the pattern)
find . -name "*.log" -ls | grep "date-right-now" # e.g., "Apr  06" or "Apr 21", note the double space in the former
```
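As an alternative to grepping a date string, and assuming GNU find, the `-mmin` and `-mtime` tests select files by modification time directly; a minimal sketch, not part of the original recipe:

```sh
# list .log files modified in the last 60 minutes
find . -name "*.log" -mmin -60

# list .log files modified in the last 2 days
find . -name "*.log" -mtime -2
```

This avoids worrying about the exact date format that `-ls` prints.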


## Common compilation errors in Beamer and possible solutions (part 1)

This blog post is the first in a series that aims to document the most common errors I’ve found when preparing presentation slides in beamer while using the knitr package to compile both $\LaTeX$ and R code in the same document, together with the explanations and solutions I found for them. Most of these, of course, came from a huge pile of Stack Overflow answers, and most were found about a year or so ago when preparing the slides for my quals; unfortunately, I did not save any of those sources. As of the date of preparation of this post, I will link to the solutions found elsewhere, which are frequently used partially or in combination with other sources and tests. As a final note, some of the problems and solutions only apply when authoring slides with R code evaluation, so you will not likely run into them unless you use knitr. Please note that the error messages come from Emacs+ESS, so I’m not sure you will find them spelled exactly the same in other tools such as RStudio; I’m not even sure you can compile so compactly in tools other than Emacs. I will try to reproduce some examples of them as code so that you can compare them to your own. That said, it suffices to point out that Emacs rules!

Caveat: I am testing the $\LaTeX$ code with Emacs+ESS, compiling with the keystrokes M-n r and then M-n P (uppercase) and RET in the buffer of the .Rnw file in order to produce the PDF slides. I haven’t tried other LaTeX environments but would love to hear about alternative error messages for the same cases highlighted herein.

## 'Missing $ inserted'

Most likely you used a character or symbol reserved for math mode (e.g., the underscore _). Please note that these characters need to be escaped with \ (or replaced by their respective reserved word), or enclosed in an in-line math “phrase”, for instance when using Greek letters:

Good: $\alpha$-diversity; Bad: \alpha-diversity
Good: filename\_without\_spaces; Bad: filename_without_spaces

Example:

```latex
\documentclass[svgnames,mathserif,serif]{beamer}
\title{Awesome beamer presentation}
%\subtitle{}
\author{Gustavo A. Ballen, D.Sc.(c)}
\institute{University of Sao Paulo \\
  Ichthyology \\
  \texttt{myEmail@usp.email.com}}
\date{\today}

\begin{document}

\frame{\titlepage}

\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
  \item First item with good use\_of\_subscripts
  \item Second item with bad use_of_subscripts
  \end{itemize}
\end{frame}

\end{document}
```

The code above will produce the error:

```
./myFile.tex:71: Missing $ inserted.
<inserted text>
                $
l.71 \end{frame}
?
./myFile.tex:71: Emergency stop.
<inserted text>
                $
l.71 \end{frame}

./myFile.tex:71:  ==> Fatal error occurred, no output PDF file produced!
Transcript written on myFile.log.
/usr/bin/texi2dvi: pdflatex exited with bad status, quitting.
```


Please also note that the error message refers to the .tex file, with the line number of the problem in that file, not in the .Rnw file. Actually, this error message does not tell us that the problem is with the underscore, but it gives a clue about math mode, as a $ is inserted, and that character is used to open and close in-line math text (e.g., formulae). Whenever this error happens, check whether you are using uncommon characters in your text that might play a reserved role in math mode. For instance:

```latex
% Let's assume you have exactly the same code
% before this point as lines 1-14
\begin{frame}
  \frametitle{My Slide}
  \begin{itemize}
  \item First item with good use of $\alpha$-diversity in-line math mode
  \item Second item with bad \alpha-diversity in-line math mode
  \end{itemize}
\end{frame}
```

will produce the same error as the underscore case, but this time associated with \alpha, which is a command for inputting the Greek letter $\alpha$. Enclosing the \alpha command between $ operators will solve the problem.
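A quick way to hunt for the most usual suspect before compiling is to grep the source for underscores that are not escaped; a rough sketch (file names and content here are made up for the demo, and the pattern will also flag legitimate uses, e.g., inside math mode or verbatim text):

```sh
# two sample lines: one escaped underscore (fine), one raw (problematic)
printf 'good use\\_of\nbad use_of\n' > demoSlides.Rnw

# flag lines containing an underscore not preceded by a backslash
grep -n '[^\\]_' demoSlides.Rnw
# -> 2:bad use_of
```

Run it over your own .Rnw file and inspect the reported lines by hand.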

## Searching for text content into GNU/Linux files (and possibly OSX too)

Machines running GNU/Linux are very powerful tools once we get used to the command line. As an example, let us imagine that we are looking for a file named importantScript.py, a text file containing Python code whose location we forgot but whose name we are sure about. A standard file search engine such as catfish is the tool for looking for files by filename and path on the hard drive. It is similar to the search engines in Windows and OSX, possibly with the difference that Finder in OSX apparently can also look for keywords inside files (the latter property of Finder is according to my wife; I honestly stopped using OSX long, long ago).

But how about keywords inside files? For instance, suppose we were working on a project with a lot of files written in $\LaTeX$ with the knitr package, so that we can embed R code to execute during document compilation. In just one of a whole bunch of text files we used a special package but forgot how to use it again (it was useful, but we have needed it only once so far), and the internet went down, so we cannot rely on good ol’ Google to look up the usage. For reasons that don’t matter for now (probably ignorance) we also overlooked the local documentation. Are we lost? Probably not.

Let us assume we remember a keyword, maybe a package name, or another sort of keyword, for instance a macro (\mymacro in $\LaTeX$). We can use the command grep with the -r option in order to visit all of the files and subdirectories under the current directory looking for the keyword(s) of interest. First we open an instance of the command line (e.g., bash) and navigate to the directory from which we want to search. Absolute or relative paths are also possible, but I leave them to the reader for exploration. After calling grep, some output may appear indicating file names followed by a colon (:) and some text; this means that the command found our keywords in some of the files in the search path. On the other hand, if the terminal gives us control again by simply showing the prompt ($), it failed to find the keywords inside any of the files and files-within-subdirectories of our current path. For instance:

```sh
user@machine:~$ cd ~/Documents/millionsOfScripts
# now let's see its content
# by the way, this directory is completely hypothetical!
user@machine:~/Documents/millionsOfScripts$ ls
block  class  devices  fs  kernel  power  bus  dev  firmware  hypervisor  module
# search recursively the files in each of the directories shown above for
# the word 'subfigure'
user@machine:~/Documents/millionsOfScripts$ grep -r subfigure
module/project1/scripts4/toCompile/blocks.Rnw:\begin{subfigure}
```


The code above indicates that our recursive search for the term ‘subfigure’ found a single file, called blocks.Rnw, in the subdirectory module/project1/scripts4/toCompile, and that this file contains a line with the text \begin{subfigure}. We can then visit that file, open it, and see how we used that tool in the past when working with $\LaTeX$. Pretty useful when we have just a bare idea of what we are looking for.

I found it useful for exactly that case, when I did not remember how to use the subfigure $\LaTeX$ package and did not want to search on Google; I then just went to a directory where I had some indication that files using the package could be located (e.g., my folder with beamer presentations, or my PhD folder with some of my scholarship report files, all of them in $\LaTeX$) and let grep look for them. I hope you find it useful and fun; it could save your life some day (as could most bash commands, so learn to love the command line!).

PS: grep can take very, very long to execute if your directory contains many files and subdirectories, especially large ones.
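When a directory tree is large, restricting the search helps a lot; assuming GNU grep, the --include option limits which files are visited and -n adds line numbers. A small self-contained sketch (the demo tree and file names below are made up):

```sh
# build a tiny demo tree
mkdir -p demo/scripts
printf '\\begin{subfigure}\n' > demo/scripts/blocks.Rnw
printf 'unrelated\n' > demo/notes.txt

# search only .Rnw files, printing file name and line number
grep -rn --include='*.Rnw' subfigure demo
# -> demo/scripts/blocks.Rnw:1:\begin{subfigure}
```

Because only files matching the glob are opened, the search skips everything else (PDFs, images, data files) and finishes much faster.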

## New species (to science, sometimes actually)

This blog post is about the real meaning of an expression commonly found in technical biodiversity jargon that tends to be misunderstood by people outside the field, commonly journalists and media consumers: the concept of a new species.

Despite its literal meaning, what scientists mean by “new” is not what the public expects. I’ve found myself in the situation of being criticized by media consumers (people outside biology) for claiming that a species is “new”. I remember a specific case, when Farlowella yarigui was described (Ballen & Mojica, 2014). This work received a lot of media coverage thanks to the interest of journalists from Unimedios in my work with fishes collected during one of the courses at the Universidad Nacional when I was still an undergrad student. For some reason, that news post was “recycled” by some other newspapers in their environment sections (e.g., El Espectador) and posted on their social media sites.

Some of the readers claimed that the species could by no means be new, since they had seen that fish before, sometimes since they were young, and in a few cases even implied that scientists lie to the public by telling them the species is new! I did not take it as an attack, because I acknowledge the ambiguity of saying that a species is “new”; in fact, at least three different cases fall within the concept of “new” to science, as opposed to “new” to everyone in the world:

• A species can be mistakenly called by a scientific name when it in fact is not that species. This happens when biologists fail to recognize that the scientific name applied to a given species is being used erroneously, usually because two species are confused under the same name. This is analogous to calling the music of both Nirvana and Death “metal”: we would be erroneously calling Nirvana ‘metal’ when it is in fact grunge. A good example is calling the tiger catfish of the Magdalena river basin in Colombia Pseudoplatystoma fasciatum when it is not that species but another one, which was given the name Pseudoplatystoma magdaleniatum after the differences between the two species were recognized (Buitrago-Suarez & Burr, 2007).

• A species can be known to be around but lack a scientific name to date. In this case, scientists recognize that it is not any of the species already properly named, but further work may still be needed in order to apply a different name to it. This happens when the differences among species of a group are difficult to ascertain, or when more specimens in museums are needed in order to be confident that we are not naming a species that already has a name. This is similar to finding something striking yet unfamiliar, such as when astronomers name a planet we had already seen but that did not yet have a name.

• A species can be completely unknown to everyone and, consequently, lack a proper scientific name. This happens with rare, cryptic, or difficult-to-find living things, such as fishes from deep waters that even fishermen usually cannot catch; these fishes can spend their whole existence at incredible depths, living and dying without a minimal chance of reaching the surface. In these cases, neither the public nor the scientific community has been in contact with such organisms before. They are, in a sense, really new species.

In all three cases the technical name in taxonomy (the biological field that aims at documenting living things and naming them scientifically) is “new species”, sometimes even species nova in Latin. We need to understand that taxonomic rules were formally started during the 18th century by the works of Carolus Linnaeus, and at that time Latin was the mandatory language of the natural sciences. As a consequence, even today we call them by the translation of species nova into modern languages such as English, the current standard language of science.

From time to time, scientists decide to avoid such ambiguity by stating that the species is not a new one but an undescribed one, or even avoid such implications altogether by saying that a scientific paper proposes a scientific name for a given species. As an example I can think of Grant et al.’s 2007 paper describing Allobates niputidea, a frog species from Colombia, in fact entitled “A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia”.

That said, a scientific name is actually a special name that we apply under certain clear rules in order to avoid ambiguity among scientists, and whose meaning does not change among languages, therefore being stable. The advantage of using such a system is that we avoid calling the same thing by several names (sometimes even common names), and that once we use a scientific name our peers know exactly which kind of organism we are referring to. It is in some sense like using our given-and-family name combination (e.g., John Doe) to ascertain our own uniqueness among people sharing our given or family name (e.g., John Doe is different from both John Jameson and Evelyn Doe); however, even such a combination is subject to synonyms, something that biological nomenclature tries to avoid through its rules.

This ambiguity admits at least two possible solutions: to avoid using “new species” in article titles in favor of “undescribed species”, or to undertake the titanic labour (given that the people outside the scientific community outnumber those inside it by orders of magnitude) of explaining to the public what we mean by a new species. This blog post is an effort in the latter direction, as the former term is so widespread in the scientific literature that it might not be practical to abandon it at all. A second benefit of the latter alternative is that it forces us scientists to interact with people outside our field, promoting making science accessible to the public.

References

Ballen, G. A., & Mojica, J. I. (2014). A new trans-Andean Stick Catfish of the genus Farlowella Eigenmann & Eigenmann, 1889 (Siluriformes: Loricariidae) with the first record of the genus for the río Magdalena Basin in Colombia. Zootaxa, 3765(2), 134-142.

Buitrago-Suarez, U. A., & Burr, B. M. (2007). Taxonomy of the catfish genus Pseudoplatystoma Bleeker (Siluriformes: Pimelodidae) with recognition of eight species. Zootaxa, 1512(1), 1-38.

Grant, T., Acosta, A., & Rada, M. (2007). A name for the species of Allobates (Anura: Dendrobatoidea: Aromobatidae) from the Magdalena Valley of Colombia. Copeia, 2007(4), 844-854.

## Checking .xml Beast2 files for errors

Note: This entry assumes that you are using a Unix-like operating system, ideally GNU/Linux or alternatively OSX. Not tested on BSD or others.

I’ve been using Beast2 for a while as the Bayesian platform for a study on the effect of calibration priors, using a Neotropical fish group as study model because of its reasonable fossil record and the availability of DNA sequences. The study requires at least 18 different analyses to be run, with runtimes varying from one to four days in order to reach convergence1, so it can be described as computationally exhaustive. As I had access to a server with a good number of processors, I could “parallelize” the analysis by running several instances of Beast2 at the same time, each with several threads. However, even with this approach I had to limit the number of analyses to batches of four to five at a time so as not to use all of the resources, because others were also running their own analyses on the server.

From time to time I noticed that a given analysis had failed without producing any useful signal that I could use for spotting the problem in time, so I had to wait until the end of the batch only to discover that a given analysis did not even run at all because of problems in the definition of some priors. Until then I didn’t know about the -validate option of Beast2, which checks that most of the xml file definition is OK and that the analysis will therefore run. Plainly, this option will attempt to run a given xml file and either report any error or exit without completing the analysis. Its usage is as follows:

```sh
beast -validate myFile.xml
```


where myFile.xml is the name of the xml file to be tested. However, what if we wanted to check a bunch of files automatically? Here we can use the command-line options of Beast2 along with bash control structures in order to iterate over each file matching some condition (e.g., the file extension or the file name) and then check whether it is a valid xml file that Beast2 will run without problems. For simplicity, this approach assumes that you have all of your xml files in the same directory (alternative versions could avoid this requirement and traverse directory structures recursively, that is, navigate sub-directories).

We will use six main software pieces: for, ls, beast, 2>&1, |, and grep:

• for is an iterative control structure already available in any unix-based system (e.g., GNU/Linux or OSX operative systems).

• ls will show a list of file names composed of any text ending in .xml (*.xml reads as any text followed by a dot and then x, m, and l at the end of the name).

• beast is the program that we will use for bayesian inference of divergence time (not covered in this post though).

• 2>&1 redirects the standard error stream (where beast writes its messages) into the standard output, so that this text can be passed as input to another command.

• | is the “pipe” operator; it will redirect the result of the commands on the left of the operator as input to the commands on the right. It reads: “evaluate the code to the left of |, take the result, and feed it to the commands to the right of |”. In our case, it will take the text redirected by 2>&1 and use it as the input of grep (as its last argument, actually).

• grep is a powerful command-line tool for searching for either simple text or regular expressions (not covered in this post either); in this case it searches the content redirected by 2>&1 from beast -validate.
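The redirection trick above can be tried with any command that writes to its error stream; here is a minimal sketch, independent of Beast2, using ls on a nonexistent file (LC_ALL=C forces the English error message):

```sh
# without 2>&1 the error message would bypass the pipe and go
# straight to the terminal; with it, grep receives the message
LC_ALL=C ls /no/such/file 2>&1 | grep -o "No such file"
# -> No such file
```

Dropping the 2>&1 makes grep see empty input, while the error text still appears on screen, which is exactly why we need the redirection before the pipe.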

Basically our code will pick one xml file at a time in a for loop, test with beast -validate whether it is a correct xml file, and then redirect the text on the console as the input for grep, which looks for specific text indicating that something is wrong with our files. The code is then:

```sh
# one-liner
for i in `ls *.xml`; do beast -validate $i 2>&1 | grep "Error"; done

# several-lines version
for i in `ls *.xml`
do
    beast -validate $i 2>&1 | grep "Error"
done
```


The expected output in the presence of corrupted xml files is text containing the word Error printed to the console, indicating that a given xml file contains errors. Otherwise, our code will finish silently, indicating that we can run our analyses. This quality-control step is crucial for cases where you are submitting processes to a cluster or a server and need to wait until completion of the analyses.
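One limitation of the loop above is that grep prints the error text but not which file produced it. A possible extension wraps the check in a shell function (check_xml_files is a hypothetical name, not part of Beast2) whose validator command is a parameter, so the same loop works for beast -validate or any other checker:

```sh
# hypothetical helper: run a validator over every .xml file in the
# current directory and name the files whose output mentions "Error"
check_xml_files() {
    validator="$1"   # e.g., "beast -validate"
    for f in *.xml; do
        if $validator "$f" 2>&1 | grep -q "Error"; then
            echo "problem in: $f"
        fi
    done
}

# usage, assuming beast is on the PATH:
# check_xml_files "beast -validate"
```

grep -q suppresses the matched text and only sets the exit status, which is what the if tests; the echo then reports the offending file name.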

1. Assessed primarily from the ESS value, but also by comparing different runs of the same analysis and by examining the behavior of the estimation along the analysis so that no pattern is detected.