Assorted tools for interacting with servers

Personal list of commands to remember when running analyses on servers. This post is expected to grow from time to time as new commands/command-combinations are found to be useful.

# login with ssh
ssh -l user ip # e.g., myUser 186.333.444.111

# create a virtual screen
screen -S screenName # use something identifying the analysis in screenName

# back to virtual screen called "screenName"
screen -r screenName

# list screens
screen -ls

# detach the virtual screen, keeping the analysis running
Ctrl + a, then d

# kill the job running inside the current screen (the screen itself remains)
Ctrl + c

# copy files through ssh to server
# supports the -r flag for copying several files/directories recursively
scp user@ip:/path/to/files path/to/local/directory # from directory in user at ip to local directory
scp path/to/local/file user@ip:/destination/directory
scp -r path/to/local/directory user@ip:/destination/directory

# monitor processes while they are running based on the filetype and modification time
find . -name "*.log" -ls | grep "date-right-now" # e.g., "Apr  6" or "Apr 21"; note the extra space before single-digit days


Why to use the mean of the quantiles as initial values in ‘optim’?

This post is actually one of the vignettes that accompany the bayestools package; I'm sharing it here because vignettes do not seem to enjoy the same reach as blog posts.

One issue with using the optim function to approximate parameters from a set of percentiles and quantiles is that it requires initial values for its heuristic search. If these initial values are far from the unknown real parameter value, the function has serious problems with convergence and may produce results that are simply wrong. In pre-release versions of findParams the initial value was a vector of 1s, with one element per parameter to estimate (e.g., c(1, 1) when estimating mean and sd in pnorm), but this produced plainly wrong results when, for instance, the real mean was 10 or larger.
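To make the problem concrete, here is a minimal sketch (not from the original vignette) that reproduces the issue directly with optim, assuming an objective of squared differences between the CDF evaluated at the quantiles and the target probabilities, which is one common way to set up this kind of fit (the objective inside findParams may differ):

# target quantiles for X ~ N(10, 1) at the usual probabilities
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p, mean = 10, sd = 1)

# assumed objective: squared error between pnorm(q, ...) and the target probabilities
objective <- function(pars) {
    if (pars[2] <= 0) return(1e10)  # penalize impossible (non-positive) standard deviations
    sum((pnorm(q, mean = pars[1], sd = pars[2]) - p)^2)
}

optim(par = c(1, 1), fn = objective)$par   # far from the truth: the search can stall in a flat region
optim(par = c(10, 5), fn = objective)$par  # starting near the truth: converges close to mean = 10, sd = 1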

With a little help from simulation we can show that the best initial guess is in fact the mean of the quantiles.

The following code generates many parameter estimates from several trials using different initial values. With a bit of large-number reasoning, a decent estimate can be found. Here, q and p are quantiles and percentiles under a given, known distribution; an anonymous function lets sapply vary the initial values and extract the $par element of the optim call, and the density plot shows that overall the median of the estimates approaches the correct value. The values of q come from qDIST given the probabilities of interest (generally 0.025, 0.5, and 0.975), for instance qbeta(p = c(0.025, 0.5, 0.975), shape1 = 10, shape2 = 1) for the example below. The X = seq(...) argument is the set of initial values to try, from min(q) to max(q).

parameters <- sapply(X = seq(0.6915029, 0.9974714, length.out = 1000),
                     FUN = function(x) {
                         findParamsPrototype(q = c(0.6915029, 0.9330330, 0.9974714),
                                             p = c(0.025, 0.5, 0.975),
                                             densit = "pbeta",
                                             params = c("shape1", "shape2"),
                                             initVals = c(x, x))$par
                     },
                     simplify = TRUE)

plot(density(t(parameters)[, 1]), main = "Density for shape1", xlab = "Shape1", ylab = "Density")
abline(v = median(parameters[1, ]), col = "red")

plot(density(t(parameters)[, 2]), main = "Density for shape2", xlab = "Shape2", ylab = "Density")
abline(v = median(parameters[2, ]), col = "red")


Large-number reasoning allows us to expect such a result, but what if the specific initial value matters? Another simulation, plotting the parameter estimate as a function of the initial value, can be prepared with, say, a random variable $X \sim N(\mu = 10, \sigma = 1)$. Such a large mean is expected to cause problems with initial values, since in these simulations 10 is huge compared to 1. Initial values were simulated to take values between 0.001 and 10, because $\sigma > 0$ and zero or negative values would break the code.

# check that the quantiles are right:
qnorm(c(0.025, 0.5, 0.975), mean = 10, sd = 1)
## 8.040036 10.000000 11.959964

# simulate the parameters
simInitVals <- seq(0.001, 10, length.out = 10000)
parameters2 <- sapply(X = simInitVals,
                      FUN = function(x) {
                          findParamsPrototype(q = c(8.040036, 10.000000, 11.959964),
                                              p = c(0.025, 0.5, 0.975),
                                              densit = "pnorm",
                                              params = c("mean", "sd"),
                                              initVals = c(x, x))$par
                      },
                      simplify = TRUE)

# plot the results
plot(y = parameters2[1, ], x = simInitVals, main = "Estimates for the mean",
     xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 10, col = "red")

plot(y = parameters2[2, ], x = simInitVals, main = "Estimates for the st.dev.",
     xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 1, col = "red")

Here we see a very interesting result: initial values from near zero up to a bit above 1 produce very odd and unreliable estimates of each parameter, while larger values, closer to the real parameter values, invariably give reliable estimates. Please note that the red line marks the true parameter value that we started with above. But what happens in the neighborhood of the mean of the quantiles?

meanNeighbors <- which(simInitVals > (mean(simInitVals) - 0.1) & simInitVals < (mean(simInitVals) + 0.1))

plot(y = parameters2[1, ][meanNeighbors], x = simInitVals[meanNeighbors],
     main = "Neighbors of mean(quantiles)",
     xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 10, col = "red")

plot(y = parameters2[2, ][meanNeighbors], x = simInitVals[meanNeighbors],
     main = "Neighbors of mean(quantiles)",
     xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 1, col = "red")

Now let's visualize it as densities:

plot(density(parameters2[1, ][meanNeighbors]), main = "Neighbors of mean(quantiles)", ylab = "Parameter estimate")
abline(v = 10, col = "red")

plot(density(parameters2[2, ][meanNeighbors]), main = "Neighbors of mean(quantiles)", ylab = "Parameter estimate")
abline(v = 1, col = "red")

Here we see that values around the mean of the quantiles, used as initial values, behave with the regular properties of large-number theory (true parameter value shown as the red line, as above). Therefore, it is advisable to pick the mean of the quantiles as the default initial value when working with optim.

Why did the first example behave nicely? Results not shown (just change the values in seq(from = 0, to = 1)) indicate that convergence is not a concern when estimating the parameters of the beta distribution, and there is a reason for it. The beta PDF is defined over $[0, 1]$, so it makes no sense at all to try values outside this interval:

$Beta(\alpha,\beta):\,\, P(x|\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$

where $B$ is the beta function

$B(\alpha,\beta)=\int_{0}^{1}t^{\alpha-1}(1-t)^{\beta-1}dt$

Since 0 and 1 are not that far apart, these initial values behave just as our simulation of the neighborhood of the mean of the quantiles did in the estimation of the normal PDF parameters.


First release of the ‘bayestools’ R package

After some time developing functions for my own research on using fossil information as a source of prior information for divergence time estimation studies (in prep.), I decided to release the first version of an R package with functions that I have found useful for pre- and post-processing of data in evolutionary analyses using Bayesian methods. It has first been released to its development repository on GitHub (https://github.com/gaballench/bayestools/) and I expect to implement further tools, so the package should be understood as a release with a minimal set of functions.
I expect to continue active development and eventually submit it to CRAN, but until then you can install the development version with devtools' install_github function:

# install devtools if needed
install.packages("devtools")
# load devtools to access the install_github function
library(devtools)
# install bayestools from the github repository
install_github(repo = "gaballench/bayestools")

What can the package do?

There are two main sets of functions in bayestools: pre-processing and post-processing functions. In the first set we have functions that allow us to specify priors and to plot probability density functions from Beast2 parameters in order to check their correctness visually. In the second group we have a tool for measuring interdependence between empirical densities, along with some ways to visualize it.

The main place in evolutionary biology where the features of bayestools belong is the core of divergence time estimation and the incorporation of paleontological and geological information into the estimation of the time component of phylogenies and diversification analyses. The tools in the package help implement information from these fields in Bayesian analyses, mainly through the specification of priors, the most conspicuous feature of Bayesian inference. However, prior specification is a very difficult task with few if any rules to follow, along with a lot of misunderstanding and some clearly unjustified practices.

Priors

Prior specification depends strongly on what we are trying to use in order to calibrate a given phylogeny. For instance, node calibration and tip calibration work in different ways, and prior specification will depend not only on this but also on the nature of the information we are trying to use as priors. In the most common case, we have a fossil taxon (or occurrence) that could inform us about the age of a given node (node calibration). However, we lack exact information on two things: when the organism lived (measured without error), and when a given diversification event took place (i.e., the exact age of a node). What we have is a fossil whose age has been inferred or measured with error or uncertainty. It is this information, the uncertainty, that we expect to model through priors (or at least that is what should be done in real life).

A prior is in itself a probability density function (PDF hereafter), that is, a mathematical function describing the probability that a variable takes a given value inside a given interval. More or less precise statements can be made with the aid of PDFs, such as that it is more likely for a variable X to take a value between a and b than between c and d, or what the expected value of the variable of interest is. Now, what we have is age information such as the following instances:

• The fossil comes from a volcanic layer that has been dated with radiometric techniques with a specific value $\pm$ uncertainty, for instance 40.6 $\pm$ 0.5 Ma, where 0.5 is the standard deviation $\sigma$.
• The fossil comes from a sedimentary unit that is said to be of Miocene age (5.33 to 23.03 Ma).
• The fossil comes from a layer bracketed by two levels with volcanic ash that have been dated as 10.2 $\pm$ 0.2 and 12.4 $\pm$ 0.4 Ma, respectively. So the fossil itself has an age determined by interpolation between the layers that carry the actual age information.

How can we convert this information into legitimate priors?
The findParams function

Suppose our prior on the age of a fossil is 1 to 10 Ma and we want to model it with a lognormal distribution. We need to find the parameters of the PDF that best reflect the uncertainty in question (i.e., the parameters for which the observed quantiles are 1, 5.5, and 10, assuming that we want the midpoint to reflect the mean of the PDF):

bayestools::findParams(q = c(1, 5.5, 10),
                       p = c(0.025, 0.50, 0.975),
                       output = "complete",
                       pdfunction = "plnorm",
                       params = c("meanlog", "sdlog"))
## $par
## [1] 1.704744 0.305104
##
## $value
## [1] 0.0006250003
##
## $counts
##      101       NA
##
## $convergence
## [1] 0
##
## $message
## NULL
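As a quick visual check (not part of the original vignette), the fitted density can be drawn with base R using the parameters returned in $par:

# plot the fitted lognormal density for the estimated meanlog and sdlog
curve(dlnorm(x, meanlog = 1.704744, sdlog = 0.305104), from = 0, to = 15,
      xlab = "Age (Ma)", ylab = "Density")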


Now suppose we have a published study that specified a lognormal prior (with or without justification, that's another question) and we want to actually plot it outside of BEAUti, for instance in order to assess the sensitivity of the posterior to the prior. How can we plot it in R and use it like any other statistical density?

The lognormalBeast function

Generate a matrix for the lognormal density with mean 1 and standard deviation 1, with mean in real space, and spanning values in x from 0 to 10, and then plot it:

lnvals <- bayestools::lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, from = 0, to = 10)
plot(lnvals, type = "l", lwd = 3)
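For reference, and assuming that meanInRealSpace = TRUE means the real-space mean of the distribution equals M (so that meanlog = log(M) - S^2/2), the same curve should be reproducible with base R's dlnorm; treat this as a sketch of the correspondence rather than a guaranteed property of lognormalBeast:

# overlay the equivalent base-R lognormal, assuming meanlog = log(M) - S^2/2
x <- seq(0, 10, length.out = 500)
lines(x, dlnorm(x, meanlog = log(1) - 1^2 / 2, sdlog = 1), col = "red", lty = 2)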


Sensitivity

One hot topic in model and tool comparisons is the sensitivity of the posterior to the prior, that is, how much our results could be determined by the prior. For instance, if the posterior is totally dependent on the prior, the likelihood (or the data) is not providing information, and this is very risky, as it opens the potential to manipulate the results of Bayesian analyses.

The measureSensit function

Measure and plot the sensitivity between two distributions partially overlapping:

set.seed(1985)
colors <- c("red", "blue", "lightgray")
below <- bayestools::measureSensit(d1 = rnorm(1000000, mean = 3, 1),
                                   d2 = rnorm(1000000, mean = 0, 1),
                                   main = "Partial dependence",
                                   colors = colors)
legend(x = "topright", legend = round(below, digits = 2))


The number in the legend indicates an overlap of 0.13, which serves as a measure of interdependence between the densities (Ballen, in prep.).
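For intuition, this kind of overlap can be approximated by hand as the area under the pointwise minimum of the two densities; the following is only a sketch of that idea, using the theoretical normal densities, and may not match measureSensit's exact implementation:

# approximate the overlap as the integral of the pointwise minimum of the two densities
x <- seq(-6, 9, length.out = 2000)
d1 <- dnorm(x, mean = 3, sd = 1)
d2 <- dnorm(x, mean = 0, sd = 1)
sum(pmin(d1, d2)) * (x[2] - x[1])  # simple Riemann sum; roughly 0.13 for these two normals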

Geochronology

In the third case of possible age estimates for our calibrations, can we clump together a number of age estimates from several samples in order to build a general uncertainty for the interval?

Do the age estimates for the boundaries of the Honda Group (i.e., samples at meters 56.4 and 675.0) conform to the isochron hypothesis?

data(laventa)
hondaIndex <- which(laventa$elevation == 56.4 | laventa$elevation == 675.0)
bayestools::mswd.test(age = laventa$age[hondaIndex], sd = laventa$one_sigma[hondaIndex])


The p-value is smaller than the nominal alpha of 0.05, so we can reject the null hypothesis of isochron conditions.

Do the age estimates for the samples JG-R 88-2 and JG-R 89-2 conform to the isochron hypothesis?

twoLevelsIndex <- which(laventa$sample == "JG-R 89-2" | laventa$sample == "JG-R 88-2")
dataset <- laventa[twoLevelsIndex, ]
# Remove the values 21 and 23 because of their abnormally large standard deviations
bayestools::mswd.test(age = dataset$age[c(-21, -23)], sd = dataset$one_sigma[c(-21, -23)])


The p-value is larger than the nominal alpha of 0.05, so we cannot reject the null hypothesis of isochron conditions.
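For intuition, here is a rough sketch of the classical MSWD (mean square weighted deviation) statistic that underlies these tests; the helper mswdSketch below is only an illustration, and the exact implementation in bayestools::mswd.test may differ in details:

# MSWD: weighted scatter of the ages about their error-weighted mean,
# compared against a chi-square distribution with n - 1 degrees of freedom
mswdSketch <- function(age, sd) {
    w <- 1 / sd^2                      # inverse-variance weights
    wmean <- sum(w * age) / sum(w)     # error-weighted mean age
    n <- length(age)
    mswd <- sum(w * (age - wmean)^2) / (n - 1)
    p.value <- pchisq(mswd * (n - 1), df = n - 1, lower.tail = FALSE)
    list(mswd = mswd, p.value = p.value)
}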

R package development in Emacs with ESS

This post is mostly a note to my future self, as I only develop packages from time to time and tend to forget things easily; it is also a quick reference, since several of these topics are covered in varying detail in Hadley Wickham's "R Packages: Organize, Test, Document, and Share Your Code". The questions below are answered with code and comments wherever appropriate. The two most important tools we will need are rmarkdown and devtools. I assume that important dependencies such as knitr or pandoc are already installed, so their installation will not be covered here (Google will surely point you to the proper answer).

Why do I write this if Wickham's book is already a reference? Because I don't like/use RStudio (reasons will not be discussed here) but instead use Emacs+ESS, and information on the workflow under that beautiful platform in the specific context of R package development is scarce. In order to mimic RStudio's behavior, the secret is simply to run the devtools and rmarkdown functions from R's command line; that's it.

Before starting, install and load the packages of interest:

# install the packages of interest
install.packages("rmarkdown")
install.packages("devtools")

# once installed, load the packages
library(rmarkdown)
library(devtools)


How do I create a package from scratch?

The code below will create the layout of a package with the minimum you will need to care about when producing a very simple package. This includes the documentation and code directories, and the description files that you will need to fill in with your own information.

create("myPackageName")


I have already coded a function, how do I include it in my package?

Simply copy it into the R/ folder in a file with extension .R so that R knows it contains code.

How do I document my function?

Take a look at the help file for any function and see what information you usually find there: a description of what the function does, which arguments it has, and what kinds of values are allowed for each argument are a good place to start. Take a look at Karl Broman's material on writing documentation for further details and tag options.

Basically, documentation with roxygen2 lives as text inside the script that contains the function. It uses special comments (#') that are skipped when R loads the code but are read by roxygen2 in order to build the documentation files. The advantage of this is that the very same file holds your code, its description, and its usage. roxygen2 uses special tags starting with @ that let you fill in the several sections of a documentation file. A simple example would be:

#' A function for adding up two numbers
#'
#' @param x The first number to add
#'
#' @param y The second number to add
#'
#' @return A numeric vector with the value of the sum x + y
#'
#' @examples
#' # Adding up two positive numbers
#' addTwo(x = 4, y = 6)
#' # Adding up one positive and one negative number
#' addTwo(x = -2, y = 7)
#'
addTwo <- function(x, y) {
    output <- x + y
    return(output)
}


The code above contains our function addTwo along with a number of lines starting with #'; these make up the documentation. I would save this in a file named after the function: addTwo.R.

I managed to document my functions, how do I build the documentation?

Open a console with the working directory pointing to the package directory, maybe navigating to such folder with Dired (C-x d) and then opening an R session there (M-x R). After loading devtools you can use the function document to build the documentation files:

# check that we are where we need to be
> getwd()
"myPackage"
> document()
Updating myPackage documentation


The function document will produce a file with extension .Rd in the man/ directory, corresponding to the documentation of our function. We cannot access this documentation the way we do for regular installed packages, because the package is not formally installed, so the documentation files cannot be found by R in the usual way. Instead, we can preview them from the command line by typing ? and hitting Enter; this prompts a + sign for completing the command, and then we can write the function name. It looks like the following when called from the command line:

> ?

A function for adding up two numbers

Usage:

Arguments:

x: The first number to add

y: The second number to add

Value:

A numeric vector with the value of the sum x + y

Examples:

# Adding up two positive numbers
addTwo(x = 4, y = 6)
# Adding up one positive and one negative number
addTwo(x = -2, y = 7)


I’ve already documented all of my functions, how can I build a pdf with all the documentation as in famous packages?

Open a shell (e.g., M-x eshell), and run:

# If eshell is inside the package directory...
R CMD Rd2pdf ../myPackage
# If eshell is just above the package directory...
R CMD Rd2pdf myPackage

Hmm ... looks like a package
Converting Rd files to LaTeX
Creating pdf output from LaTeX ...

This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
...
Output written on Rd2.pdf (8 pages, 106389 bytes).
Transcript written on Rd2.log.
Saving output to ‘myPackage.pdf’ ...
Done


This will generate a nicely formatted PDF with all your documentation in one place.

I’ve changed some code, how do I see the result interactively?

You need to re-load your package in order to have access to these changes, and you need to be in the package's main directory in order to use the function load_all. Do not forget to save all changes in all code files before re-loading:

# check that we are where we need to be
> getwd()
"myPackage"
> load_all()
Loading myPackage


It is good idea to re-build the documentation just to be sure that everything is up to date (see above).

I’ve updated the README.Rmd file, how do I apply the changes to the .md and .html files?

These three files are of special relevance if you are hosting your package development repository on GitHub, as the markdown files are already optimized to look pretty under GitHub's flavour of markdown. Always modify the R Markdown file (the one with the .Rmd extension) and then use the package rmarkdown to convert it to .md and .html:

> render("README.Rmd")

|........                                                         |  12%
ordinary text without R code

|................                                                 |  25%
label: unnamed-chunk-1 (with options)
List of 1
 $ echo: logi FALSE

|........................                                         |  38%
ordinary text without R code

|................................                                 |  50%
label: unnamed-chunk-2 (with options)
List of 1
 $ eval: logi FALSE

|.........................................                        |  62%
ordinary text without R code

|.................................................                |  75%
label: unnamed-chunk-3 (with options)
List of 1
 $ echo: logi FALSE

|.........................................................        |  88%
ordinary text without R code

|.................................................................| 100%
label: unnamed-chunk-4 (with options)
List of 2
 $ eval   : logi TRUE

If compilation fails with a LaTeX error such as "Missing $ inserted", most likely you used a character or symbol reserved for math mode (e.g., the underscore _). Please note that these characters need to be escaped with \ or replaced by their respective reserved command, or even enclosed in an inline math phrase, for instance when using Greek letters:

Good: $\alpha$-diversity; Bad: \alpha-diversity
Good: filename\_without\_spaces; Bad: filename_without_spaces

Example:

\documentclass[svgnames,mathserif,serif]{beamer}

\title{Awesome beamer presentation}
%\subtitle{}
\author{Gustavo A. Ballen, D.Sc.(c)}
\institute{University of Sao Paulo \\
Ichthyology \\
\texttt{myEmail@usp.email.com}}
\date{\today}

\begin{document}

\frame{\titlepage}

\begin{frame}
\frametitle{My Slide}
\begin{itemize}
\item First item with good use\_of\_subscripts
\item Second item with bad use_of_subscripts
\end{itemize}
\end{frame}

\end{document}


The code above will produce the error:

./mathSymbol.tex:71: Missing $ inserted.
<inserted text>
                $
l.71 \end{frame}

?

./myFile.tex:71: Emergency stop.
<inserted text>
                $
l.71 \end{frame}

./myFile.tex:71: ==> Fatal error occurred, no output PDF file produced!
Transcript written on myFile.log.
/usr/bin/texi2dvi: pdflatex exited with bad status, quitting.


Please also note that the error message refers to the .tex file (with the line number of the problem in that file), not to the .Rnw file. This error message does not actually tell us that the problem is the underscore, but it does give a clue about math mode, since a $ is being inserted, and $ is what opens and closes in-line math text (e.g., formulae).

Whenever this error happens, check whether you are using uncommon characters in your text that might be expected to play a reserved role in math mode. For instance:

% Let's assume you have exactly the same code
% before this point as lines 1-14
\begin{frame}
\frametitle{My Slide}
\begin{itemize}
\item First item with good use of $\alpha$-diversity in-line math mode
\item Second item with bad \alpha-diversity in-line math mode
\end{itemize}
\end{frame}


will produce the same error as the underscore case, but this time associated with \alpha, which is a command for typesetting the Greek letter $\alpha$. Enclosing the \alpha command in $ delimiters solves the problem.