Merry Christmas!

It's Christmas time, and I decided to get into the mood by making a Christmas tree representing my current interests. Back in 2015 I constructed a Christmas tree based on the shape of an arbitrary cladogram and then added color and shapes so that it resembled a Christmas tree. Below I describe how I did it, along with the final output. Then I describe the process for constructing another Christmas tree, this time using a random-walk stochastic process and some automatic additions to a plot of several realizations of that process. As with the cladistic Christmas tree, I present the output along with the script for constructing it. These two graphics represent my shift from a hard-core parsimony guy to a fanatic of all things quantitative.

Cladistic, manual version

The first thing is to create a taxon-only matrix in Mesquite, and then to manually create the relationships and transpose the daughter branches of each node. The output looks something like this:

#NEXUS
[written Wed Dec 23 14:54:24 BRST 2015 by Mesquite  version 3.04 (build 725) at ware/127.0.1.1]

BEGIN TAXA;
	TITLE Taxa;
	DIMENSIONS NTAX=20;
	TAXLABELS
		taxon_1 taxon_2 taxon_3 taxon_4 taxon_5 taxon_6 taxon_7 taxon_8 taxon_9 taxon_10 taxon_11 taxon_12 taxon_13 taxon_14 taxon_15 taxon_16 taxon_17 taxon_18 taxon_19 taxon_20
	;

END;

BEGIN TREES;
	Title Trees;
	LINK Taxa = Taxa;
	TRANSLATE
[0] 		1 taxon_1,
[1] 		2 taxon_2,
[2] 		3 taxon_3,
[3] 		4 taxon_4,
[4] 		5 taxon_5,
[5] 		6 taxon_6,
[6] 		7 taxon_7,
[7] 		8 taxon_8,
[8] 		9 taxon_9,
[9] 		10 taxon_10,
[10] 		11 taxon_11,
[11] 		12 taxon_12,
[12] 		13 taxon_13,
[13] 		14 taxon_14,
[14] 		15 taxon_15,
[15] 		16 taxon_16,
[16] 		17 taxon_17,
[17] 		18 taxon_18,
[18] 		19 taxon_19,
[19] 		20 taxon_20;
	TREE tree1 = (20,(18,((1,((3,((15,((4,((5,((7,((11,((9,10),12)),8)),6)),13)),14)),16)),2)),17)),19)[% ] [% ] [%  setBetweenBits = selected ];

END;

Then I manually edited it in FigTree, mostly making the branches thicker and fixing the fonts, resulting in the following file:

#NEXUS
begin taxa;
	dimensions ntax=20;
	taxlabels
	taxon_1
	taxon_10
	taxon_11
	taxon_12
	taxon_13
	taxon_14
	taxon_15
	taxon_16
	taxon_17
	taxon_18
	taxon_19
	taxon_2
	taxon_20
	taxon_3
	taxon_4
	taxon_5
	taxon_6
	taxon_7
	taxon_8
	taxon_9
;
end;

begin trees;
	tree tree_1 = [&R] (taxon_20:1.0,(taxon_18:1.0,((taxon_1:1.0,((taxon_3:1.0,((taxon_15:1.0,((taxon_4:1.0,((taxon_5:1.0,((taxon_7:1.0,((taxon_11:1.0,((taxon_9:1.0,taxon_10:1.0):1.0,taxon_12:1.0):1.0):1.0,taxon_8:1.0):1.0):1.0,taxon_6:1.0):1.0):1.0,taxon_13:1.0):1.0):1.0,taxon_14:1.0):1.0):1.0,taxon_16:1.0):1.0):1.0,taxon_2:1.0):1.0):1.0,taxon_17:1.0):1.0):1.0,taxon_19:1.0);
end;

begin figtree;
	set appearance.backgroundColorAttribute="Default";
	set appearance.backgroundColour=#ffffff;
	set appearance.branchColorAttribute="User selection";
	set appearance.branchColorGradient=false;
	set appearance.branchLineWidth=7.0;
	set appearance.branchMinLineWidth=0.0;
	set appearance.branchWidthAttribute="Fixed";
	set appearance.foregroundColour=#000000;
	set appearance.hilightingGradient=false;
	set appearance.selectionColour=#2d3680;
	set branchLabels.colorAttribute="User selection";
	set branchLabels.displayAttribute="Branch times";
	set branchLabels.fontName="Abyssinica SIL";
	set branchLabels.fontSize=8;
	set branchLabels.fontStyle=0;
	set branchLabels.isShown=false;
	set branchLabels.significantDigits=4;
	set layout.expansion=642;
	set layout.layoutType="RECTILINEAR";
	set layout.zoom=0;
	set legend.attribute=null;
	set legend.fontSize=10.0;
	set legend.isShown=false;
	set legend.significantDigits=4;
	set nodeBars.barWidth=4.0;
	set nodeBars.displayAttribute=null;
	set nodeBars.isShown=false;
	set nodeLabels.colorAttribute="User selection";
	set nodeLabels.displayAttribute="Node ages";
	set nodeLabels.fontName="Abyssinica SIL";
	set nodeLabels.fontSize=8;
	set nodeLabels.fontStyle=0;
	set nodeLabels.isShown=false;
	set nodeLabels.significantDigits=4;
	set nodeShape.colourAttribute=null;
	set nodeShape.isShown=false;
	set nodeShape.minSize=10.0;
	set nodeShape.scaleType=Width;
	set nodeShape.shapeType=Circle;
	set nodeShape.size=4.0;
	set nodeShape.sizeAttribute=null;
	set polarLayout.alignTipLabels=false;
	set polarLayout.angularRange=0;
	set polarLayout.rootAngle=0;
	set polarLayout.rootLength=100;
	set polarLayout.showRoot=true;
	set radialLayout.spread=0.0;
	set rectilinearLayout.alignTipLabels=false;
	set rectilinearLayout.curvature=0;
	set rectilinearLayout.rootLength=100;
	set scale.offsetAge=0.0;
	set scale.rootAge=1.0;
	set scale.scaleFactor=1.0;
	set scale.scaleRoot=false;
	set scaleAxis.automaticScale=true;
	set scaleAxis.fontSize=8.0;
	set scaleAxis.isShown=false;
	set scaleAxis.lineWidth=1.0;
	set scaleAxis.majorTicks=1.0;
	set scaleAxis.origin=0.0;
	set scaleAxis.reverseAxis=false;
	set scaleAxis.showGrid=true;
	set scaleBar.automaticScale=true;
	set scaleBar.fontSize=10.0;
	set scaleBar.isShown=true;
	set scaleBar.lineWidth=1.0;
	set scaleBar.scaleRange=0.0;
	set tipLabels.colorAttribute="User selection";
	set tipLabels.displayAttribute="Names";
	set tipLabels.fontName="Abyssinica SIL";
	set tipLabels.fontSize=16;
	set tipLabels.fontStyle=2;
	set tipLabels.isShown=true;
	set tipLabels.significantDigits=4;
	set trees.order=false;
	set trees.orderType="increasing";
	set trees.rooting=false;
	set trees.rootingType="User Selection";
	set trees.transform=false;
	set trees.transformType="cladogram";
end;

We can save the tree above in SVG vector format so that we can do a bit of editing in Inkscape. I ended up with this:

[Figure christmas_phy.newick_white: the final cladistic Christmas tree]

The process is fairly manual, but it pleased both my contacts on social networks and myself, and it was fun to tweak things so that an evolutionary construct based on graph theory could mimic a Christmas tree. It is noteworthy that everything was carried out with open-source programs, so you can try it yourself without paying for any software 🙂

Stochastic, automated Christmas tree

Stochastic processes are models that describe phenomena incorporating uncertainty across time, that is, phenomena where neither the final result nor the trajectory of a sequence of events can be determined exactly; they can nevertheless be characterized so that we can build expectations about them. A great deal of real-world phenomena can be described using stochastic processes, from time series analysis (including financial applications, medical imaging, and world-scale phenomena such as global warming) to models of evolution (e.g., Brownian motion, the birth-death process).

One of these processes is often called a random walk: for a given variable, the values it takes through time are allowed to vary (that is, to show uncertainty) between certain admissible values, so that the change in value from step 1 to step 2 cannot be predicted exactly and will differ if we repeat the realization of the events. Mathematically we say that, for a time step t and its previous step t - 1 of a variable x, the value of x at time t depends on its value at t - 1 plus or minus an uncertain quantity w_t that applies to the current step:

x_{t} = x_{t - 1} + w_{t}

Oftentimes we use the normal distribution to describe w_t, centered at \mu = 0 and with variance \sigma^2 taking smaller or larger values depending on whether we want the process to vary a little or a lot; the larger the variance, the larger the amounts that w_t is likely to add to or subtract from the previous value.

w_t \sim N(\mu, \sigma^2)
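
As a minimal illustration of the recursion (separate from the full tree script below, and with arbitrary settings of my own), a single realization can be simulated in R as:

set.seed(42)
n <- 500                           # number of time steps
w <- rnorm(n, mean = 0, sd = 0.7)  # the increments w_t
x <- c(0, cumsum(w))               # x_t = x_{t-1} + w_t, starting at 0
plot(x, type = "l", xlab = "t", ylab = expression(x[t]))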

If we repeat such a stochastic process many times we obtain a whole bunch of possible trajectories that, once plotted together, produce the shape of a pine tree with its tip at the origin.

Using Unicode characters we can add a colored star to the plot area with the points function, and choose colors with finer control using the rgb function, which combines amounts of red, green, and blue along with a transparency value alpha. We can use this source for picking a color visually and then look up the corresponding amounts of red, green, and blue.

# set seed for repeatable results
set.seed(1985)
# initialize the plot with the tip star
plot(x = 0, y = 0, xlim = c(-60, 60), ylim = c(-500, 0),
     xlab = NA, ylab = NA, pch = -9733, cex = 3,
     col = rgb(red = 255/255, green = 191/255, blue = 0/255))
# set initial values
x <- 0
y <- x
# design a data frame for saving the values from each realization
allPoints <- data.frame(x, y, stringsAsFactors = FALSE)
# iterate over realizations and numbers of steps for producing
# 1000 different trajectories, each with 500 steps
for (j in 1:1000) {
    x <- 0
    y <- x
    for (i in 1:500) {
        # x to the left of the arrow is x_t and the one to the right is x_{t-1}
        # plus or minus an amount drawn at random from a normal(0,0.7)
        x <- x + rnorm(1, sd = 0.7)
        # append the value to the series
        y <- c(y, x)
    }
    # plot the line with the values saved before
    lines(x = y, y = -seq_along(y),
          col = rgb(green = 128/255,
                    red = 22/255,
                    blue = 20/255,
                    alpha = 0.3))
    # save the data to the data frame
    allPoints <- rbind(allPoints, data.frame(x = y, y = -seq_along(y)))
}
# plot the star again
points(x = 0, y = 0, pch = -9733, cex = 3,
       col = rgb(red = 255/255, green = 191/255, blue = 0/255))
# red balls
# red balls, y axis
redY <- seq(from = 0, to = 500, by = 80)[-1]
# sequence for calculating the values of each x component for a given y in red balls
for (i in redY) {
    redX <- c(seq(from = min(allPoints[allPoints$y == -i, "x"]),
                  to = max(allPoints[allPoints$y == -i, "x"]),
                  by = 20),
              max(allPoints[allPoints$y == -i, "x"]))
    points(x = redX, y = rep(x = -i, times = length(redX)),
           pch = 21,  bg = "red", cex = 2)
}
# yellow balls
# yellow balls, y axis
yellowY <- (seq(from = 0, to = 500, by = 80) + 40)[-length(seq(from = 0,
                                                               to = 500,
                                                               by = 80) + 40)]
# sequence for calculating the values of each x component for a given y in yellow balls
for (i in yellowY) {
    yellowX <- c(seq(from = min(allPoints[allPoints$y == -i, "x"]),
                     to = max(allPoints[allPoints$y == -i, "x"]),
                     by = 20), max(allPoints[allPoints$y == -i, "x"]))
    points(x = yellowX, y = rep(x = -i, times = length(yellowX)),
           pch = 21, bg = "yellow", cex = 2)
}

[Figure unnamed-chunk-3-1: the stochastic Christmas tree produced by the script above]

A couple of tricks were needed in order to produce the orientation of the trajectories and the position of the red and yellow balls: the roles of the variables were swapped, with the random-walk values plotted on the x axis and the negated step index on the y axis, so that the trajectories hang downward from the origin. Also, the coordinates of the balls were calculated automatically, so they may look a bit odd toward the right, since I did not find a way to automatically calculate coordinates and spacing among points so that the first and last balls always fall exactly on the smallest and largest values. It is probably possible, but I got tired of thinking about alternatives; in the end, a script that builds the whole thing seems fair enough.

That’s all for now, happy holidays to everybody!


The tntphylo-mode Emacs package

Since approximately mid-year I have been developing an Emacs package for interactive work with RevBayes, a program for Bayesian phylogenetic inference. Since then the project has grown steadily and is now in better shape, with a cleaner setup and execution, and is about to be submitted to MELPA. However, this post is not about that package; it concerns a new one, more or less forked from rev-mode, that allows interactive work with TNT, a program for phylogenetic analysis using parsimony.

[Screenshot_2018-11-12_08-58-23: an arbitrary script with `tntphylo-mode` active]

This first release 'just works' in the sense that some features are still unavailable; what the package can do so far is start a TNT process whenever a text-redirection command is issued and no process exists yet, and then redirect text from the active buffer (e.g., a TNT script) to the buffer running the process.

[Screenshot_2018-11-12_08-59-21: sending the first three commands in the script to the TNT process with `C-c C-n`]

Font locking is still derived from sh-mode, so most of the syntax highlighting is not as nice as it would be if it derived from c-mode; on the other hand, in its current state the package inherits the text-redirection capabilities of sh-mode, which are absent from c-mode (with good reason, since C is a compiled language where you can't carry out interactive work, as opposed to interpreted languages). Long story short, we need the text redirection from sh-mode and the syntax font locking from c-mode, and I'm still trying to figure out how to accomplish such a hybrid.

How to use it

  1. Get tntphylo-mode from https://github.com/gaballench/tntphylo-mode.
  2. Add the following to your .emacs:
(add-to-list 'load-path "path/to/tntphylo-mode")
(require 'tntphylo-mode)
  3. Open a TNT script.
  4. Call the function tntphylo-mode. It will load the package in the active buffer.
  5. Step onto a line with a TNT command (e.g., mxram) and press C-c C-n to redirect just that line. It will send the text to the TNT process, which will execute it and print output to the TNT buffer.
  6. Rinse and repeat.

Final comments

You might be wondering about the odd "tntphylo" part of the name instead of just "tnt", as the program is called; it happens that there is already an Emacs package called tnt, and I didn't want to mess things up in the remote case that somebody needs that package. Please note that it is customary in Emacs to start function names with the package name (e.g., tntphylo-next-code-line), so it would be confusing whether a given function starting with tnt- belongs to this package or to the earlier, unrelated tnt.el package.

A few other functions are planned for this package, among them inspection and maybe visualization of the objects produced during the interactive session, autocompletion of commands, and interactive help.

Please note that TNT is NOT open source (unfortunately) but rather freeware with a license agreement. You can call the TNT process only after agreeing to that license the first time you launch the program. In order for the package to work properly, your copy of TNT must be in your PATH, that is, in a directory where your operating system looks for executables.

Emacs as an Operating System

Please note that in the present document, "operating system" does not refer to the full software stack that controls the hardware, boots the system, and runs tasks for the user, but only to the last of these: running tasks for the user.

Introduction

In my early years of immersion into computing I had the advantage of access to a friend who was really good at it and ended up doing a PhD in bioinformatics. At the time I had barely made a first attempt at installing Ubuntu on a desktop computer shared with other members of my family (failing miserably and losing all of the data saved on it in the process). Talking to this knowledgeable friend, as well as reading about good editors for using R (which was not as popular at the time in my university as it is today), I heard about this weird editor with an African ruminant in its logo, called Emacs. The first thing I heard, probably from some free online course or such, was that it was a very good R editor worth giving a try; the second came from my friend, who described the editor as basically "an operating system that was also useful as a coding editor, but so hard to learn that it was basically pointless to even give it a try!". I decided to try it anyway, only to find that I could not understand even its most basic commands, nor create or open a file. I dropped it immediately and ended up using R's built-in script editor, concluding that Emacs was beyond my reach and that I did not want to try harder.

I continued learning, next focusing on GNU/Linux and picking up R along the way, writing code in RStudio and concluding that it was everything I needed for producing code. As time passed (several years, actually), I started noticing some weird behaviors on RStudio's side (especially concerning graphics) that made me look for another code editor. Then I tried Sublime Text and stuck with it for a while, only to notice some other nuisances, though nothing serious enough to make me look for yet another editor. At the time I was starting my PhD and had some "free" time, so I started learning other "useful" tools such as LaTeX. I then decided to give Emacs another try, as I was by then more familiar with coding and computing in general. This time I stuck with it, quickly learning the small details that make its learning curve somewhat steep at the beginning but that were fun to learn; the very same thing happened with LaTeX, so that with time Emacs became my main editor. As a side note, I did not touch a computer running Windows again for years, to the point that I even forgot where to find things on it. GNU/Linux with Emacs was just wonderland: I was happy and had no reason to change anything. However, what if I had to interact with a machine running Windows?

I do have to use such machines sometimes, outside both my lab and my home, where all machines run GNU/Linux. I also found myself using Windows machines while doing an internship abroad in a lab that did not have GNU/Linux available; once I urgently needed to compile a LaTeX document and had to ask my sister to do it for me at home. What if I could just carry a mini-laptop with all my beloved GNU/Linux tools ready to use, so that situations like that could not happen again? Something like a tablet with laptop capabilities, able to run code and compile documents, yet portable enough to bring with me all the time. I already knew that USB sticks are useful as boot devices for system administration or for installing GNU/Linux, so in theory a small device such as a USB stick could behave as a full-featured system capable of turning a sad machine suffering with Windows into a powerful computing beast running GNU/Linux. Would it be possible to have something similar for emergency situations?

I recalled the conversation with my old friend about Emacs and, after some thought, concluded that it already was a sub-operating system on my computers, where I spent most of my time when logged in; but my experience was mostly restricted to GNU/Linux, and I had no clue whether and how it ran on Windows. Luckily I discovered that installing Emacs on Windows amounts to unzipping a directory into the desired location, and it happened to run from a USB stick too. It was possible to carry the Windows version of Emacs on the USB stick and even to install things into it! An extension then seemed natural: provided that a program can be executed from the USB stick without needing any component outside the device, is it possible to make Emacs interact with such programs? To what extent? Could I use portable versions of things such as a LaTeX distribution, R, git, and pandoc from a USB stick plugged into any Windows computer? The short answer is: yes (so far), and below I describe step by step how I managed to do it. Roughly half of the process was carried out in a virtual machine with Windows 8.1 using VirtualBox on GNU/Linux, and the other half on a physical machine running Windows 10; there was no perceivable difference in execution. Please note that I "name" files in the code blocks with an underline of equal signs (====), so that it is clear that the text below the underline belongs in the named file.

Setting the scene

The first thing we need to do is to format the USB stick with a file system compatible with Windows. The stick I used was an old one with a couple of years of use and 16 GB of capacity. I used the NTFS filesystem (following this source) and a single partition. Whether you format it on Windows or on GNU/Linux, the result should be the same, or at least I could not find any difference, having used both approaches. As a result you should have a completely blank device mountable on Windows machines. Now we need to get some programs and decide how we will organize the files on the device.

Cluttering can easily happen when a bunch of programs are installed into the same directory on any device, so I created a special directory called miniOS (after mini operating system) for this purpose. There we will put all the programs that compose our "operating system." The next step is to download the first and most important program, Emacs, and set some things up in it. Please note that I will refer to the root of the USB stick as \ or \\ (or /, or even //) throughout the text. The other important directory will be user, where Emacs will install the things that usually go into .emacs.d, because Windows doesn't like dots in file and directory names.

Installing emacs and setting things up for real portability

Emacs can be found here for any platform; we need to download the Windows version and then unzip its contents into miniOS. I chose version 25 for this.

Some notes on portability

There is an issue when running Emacs from a USB stick: every time you plug it into a machine the device can potentially be assigned a different drive letter, even though most of the time it ends up as E:\. As a consequence, our "root" and other paths cannot be used if we hard-code them as E and the stick happens to be mounted as F or whatever. Inspired by this thread on StackExchange, I found that creating a DOS script that calls Emacs with a specific init file overcomes this issue, since Emacs reads the location from which it was called at startup; seizing on this, we can create "relative" paths to our different directories, especially those holding the executable files.

Starting up Emacs

There is a special directory inside the Emacs directory that was unzipped: bin. It contains the executables for running Emacs under several situations. Keep it in mind when preparing the scripts.

First, create the script that sets the basic HOME and PATH variables (don't know what these weird guys are? Take a look here) and then calls Emacs through the binary we should be using, runemacs.exe. Why this file, if there are other executables in the same directory, one of them even called just emacs.exe? See the reasons here. Our startup script will be called Emacs.cmd, and we will double-click it every time we want to open Emacs. Its content will be:

\Emacs.cmd
==========
set "HOME=%~dp0"
set PATH="%PATH%;\miniOS\emacs\bin"
miniOS\emacs\bin\runemacs.exe   -q   --load \miniOS\emacs\init.el %*

The first two lines set the HOME environment variable and the PATH variable, which tell Emacs where its starting execution point is and where to find programs for execution, respectively. The third line runs Emacs with the -q option, which makes it start faster (at the cost of not running some startup code; see here again). The first line is where the magic happens: it sets the path to the USB stick whatever the drive letter of the device happens to be (E, F, G, whatever). This script makes our Emacs instance fully portable. Please note that each directory added to the PATH must be separated by a semicolon (;).

Next thing to do is to create an init script so that Emacs Lisp code is run at startup. Important code that should always be run at startup goes here:

\miniOS\emacs\init.el
=====================
(setenv "HOME"
        (expand-file-name (concat invocation-directory "../../..")))
(setq user-emacs-directory
      (expand-file-name (concat (getenv "HOME") "user/")))
(setq abbrev-file-name
      (locate-user-emacs-file "abbrev_defs"))
(setq auto-save-list-file-prefix
      (locate-user-emacs-file "auto-save-list/.saves-"))
(cd "~")
(load "~/_emacs")

The third important file is the so-called .emacs file (or dot-emacs), which we will instead call _emacs and place in \ so that Windows does not complain about a file name starting with a dot:

\_emacs
=======
;; see if .emacs is loading
(message ".emacs loading")

This small elisp code tells us whether Emacs is loading the dot-emacs file properly; if it is, we should see the text ".emacs loading" in the minibuffer. Next we need to tell Emacs to also use the MELPA repository for installing packages, with the following code in _emacs:

\_emacs
=======
;; setting up melpa
(require 'package)
(let* ((no-ssl (and (memq system-type '(windows-nt ms-dos))
                    (not (gnutls-available-p))))
       (proto (if no-ssl "http" "https")))
  ;; Comment/uncomment these two lines to enable/disable MELPA and MELPA Stable as desired
  (add-to-list 'package-archives (cons "melpa" (concat proto "://melpa.org/packages/")) t)
  ;;(add-to-list 'package-archives (cons "melpa-stable" (concat proto "://stable.melpa.org/packages/")) t)
  (when (< emacs-major-version 24)
    ;; For important compatibility libraries like cl-lib
    (add-to-list 'package-archives (cons "gnu" (concat proto "://elpa.gnu.org/packages/")))))
(package-initialize)

Installing git

Git has a portable version, the "thumbdrive edition" under the Windows release, which will install everything in the specified location. Just download it into \miniOS and let it install everything in the default folder called PortableGit.

Interacting with git: The magit package

We will use magit in order to interact with git; install it with M-x package-install and then magit. After many attempts I could not make magit find the executable, and ended up setting the magit-git-executable variable in _emacs:

\_emacs
==========
;; set magit git executable from portable git
(setq magit-git-executable "\\miniOS\\PortableGit\\mingw64\\bin\\git")

We can test that everything works properly by cloning any repo from GitHub, which will simultaneously tell us whether we can access the internet from Emacs: M-x magit-clone and then type the repo address: http://github.com/user/repository.git. Please note that secure protocols can sometimes give problems, so it may be better to type http rather than https.

Installing LaTeX

The recommended approach for installing a LaTeX distribution turned out to be MiKTeX. This distribution has a portable version that we will use for our case. Please note that rmarkdown will have problems interacting with MiKTeX (see here) unless you set the "Install missing packages" option to "Yes" during installation. After installing it into \miniOS\miktex-portable, we modify the PATH in Emacs.cmd:

\Emacs.cmd
==========
set PATH="%PATH%;\miniOS\emacs\bin;\miniOS\R-3.4.4\bin\x64;\miniOS\miktex-portable\texmfs\install\miktex\bin"

I also did it in init.el, because Emacs was struggling with the Windows backslashes and only found the executables when Unix slashes were used:

\miniOS\emacs\init.el
=====================
(setenv "PATH" (concat
		(getenv "PATH")
		";"
		"/miniOS/miktex-portable/texmfs/install/miktex/bin"))

I actually got the inspiration from seeing that MiKTeX itself uses both kinds of slash in order to avoid problems depending on the platform.

Interacting with LaTeX: The AUCTeX package

The AUCTeX package is another marvel of Emacs that offers a full-featured system for compiling LaTeX documents from inside the editor. It is what allows us to use Emacs as our LaTeX platform, provided that it can find the executables for compiling the documents we use for authoring papers, theses, and even presentations. We can install it with M-x package-install and then auctex. If the PATH was set properly, AUCTeX should have no problems finding the LaTeX compilers (e.g., pdflatex or bibtex). We can test that everything is in place by creating a test LaTeX file and compiling it with C-c C-c, choosing LaTeX from the list of compilers:

\documentclass{article}
\begin{document}
Hello world!
\end{document}

If everything is OK we will have compiled our first LaTeX document to PDF on the USB stick.

Some games

Sudoku

Install the sudoku package as already indicated for other Emacs packages.

Go

We can use the GNU Go engine for this wonderful game. Install gnugo as we did with sudoku, and also install the program itself from here into miniOS. We need to add it to the path both in Emacs.cmd and init.el, as well as set the variable gnugo-program to point to the executable in _emacs. Maybe just one of these three would suffice, but I wanted to be safe:

\Emacs.cmd
==========
set PATH="%PATH%;\miniOS\emacs\bin;\miniOS\R-3.4.4\bin\x64;\miniOS\miktex-portable\texmfs\install\miktex\bin;\miniOS\gnugo-3.8"
\miniOS\emacs\init.el
=====================
(setenv "PATH" (concat
		(getenv "PATH")
		";"
		"/miniOS/miktex-portable/texmfs/install/miktex/bin"
                ";"
		"/miniOS/gnugo-3.8"))
\_emacs
=======
;; set gnugo executable explicit path
(setq gnugo-program "\\miniOS\\gnugo-3.8\\gnugo")

Portable Pandoc

Pandoc is a program for converting text files between a wide array of formats, and it is the workhorse of packages such as knitr and rmarkdown, so if we want to author markdown and/or rmarkdown files we must make sure it can be used from the USB stick. We can install it from here. Please note that we want the .zip compressed file, which acts as a portable version, and not the .msi installer.

Then we need to add pandoc to the PATH in init.el as already mentioned:

\miniOS\emacs\init.el
=======
(setenv "PATH" (concat
		(getenv "PATH")
		";"
		"/miniOS/miktex-portable/texmfs/install/miktex/bin"
                ";"
		"/miniOS/gnugo-3.8"
		";"
		"/miniOS/pandoc-2.4-windows-x86_64"))

Please remember the warning from the MiKTeX discussion: pick the auto-install option while installing that package; otherwise you will have problems when running the render function of rmarkdown.

Authoring markdown

There is an Emacs package that I like to use for markdown, called markdown-mode. You can install it as already described for other packages. Another useful package is polymode, also available as already described. Finally, you may want to take a look at rmarkdown-goodies, my own Emacs package extending markdown support with some functionality that I was missing from more general packages (such as markdown-mode). I plan to describe that package in more detail in the future (as well as submit it to MELPA), but for the time being you can clone it from Emacs with magit-clone and http://github.com/gaballench/rmarkdown-goodies.git. As a comment, I prepared the present document with these three packages directly from Emacs.

The final chunk of software we will need in order to author markdown documents from Emacs is the rmarkdown R package. Open R with M-x R and then install it from CRAN with:

install.packages("rmarkdown", repos = "http://cran.us.r-project.org")

Please note the explicit use of the repos argument in install.packages. By using http instead of https we try to avoid problems with the secure protocol, and by setting the repository explicitly we avoid the pop-up window asking us to choose a mirror, which may not work from within Emacs!
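
Alternatively (an optional extra, not part of the original setup), the CRAN mirror can be set once per R session so that plain install.packages calls also skip the mirror chooser:

# set the default CRAN mirror for the session; no pop-up chooser will appear
options(repos = c(CRAN = "http://cran.us.r-project.org"))
install.packages("rmarkdown")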

Assorted additions

There are a number of features I like to have in Emacs, and all of them can be specified in _emacs. First install the packages multiple-cursors (which provides the mc/ commands, often abbreviated mc) and undo-tree as usual, and then add the following to the _emacs file:

\_emacs
=======
;; set save abbrevs to silent
(setq save-abbrevs 'silent)        ;; save abbrevs when files are saved

;; deal with temporary files
;; store all backup and autosave files in the tmp dir
(setq temporary-file-directory "\\miniOS\\tmp")
;(setq backup-directory-alist
;      `((".*" . ,temporary-file-directory)))
;(setq auto-save-file-name-transforms
;      `((".*" ,temporary-file-directory t)))

;; use count-words instead of count-words-region as it works on buffer
;; if no region is selected
(global-set-key (kbd "M-=") 'count-words)

;; multiple cursors
;; first install mc with `package-install' and also install libgnutls
;; see the following https://emacs.stackexchange.com/questions/27202/how-do-i-install-gnutls-for-emacs-25-1-on-windows
;; http://gnu.c3sl.ufpr.br/ftp/emacs/windows/emacs-25/emacs25-x86_64-deps.zip
;; extract in e:/miniOS/emacs so that binaries are properly included in its directory (i.e., emacs/bin)
(global-set-key (kbd "C-c m c") 'mc/edit-lines)

;;; undo tree mode
;;turn on everywhere
(global-undo-tree-mode 1)
;; make ctrl-z undo
(global-set-key (kbd "C-z") 'undo)
;; make ctrl-Z redo
(defalias 'redo 'undo-tree-redo)
(global-set-key (kbd "C-S-z") 'redo)

;; maximized at startup
(add-to-list 'initial-frame-alist '(fullscreen . maximized))

Please note that in order to use multiple cursors you will need to install the emacs25 dependencies by downloading the file linked in the comments above and unzipping it into \miniOS\emacs so that the binaries end up in the proper directory (i.e., emacs\bin).

Closing summary of scripts

All of the important scripts are transcribed below for further reference. The system runs adequately, and I hope to prepare a zipped file for sharing the current state of the portable USB software. Please drop me a note if you want it before I release it publicly.

\Emacs.cmd
==========
set "HOME=%~dp0"
set PATH="%PATH%;\miniOS\emacs\bin;\miniOS\R-3.4.4\bin\x64;\miniOS\miktex-portable\texmfs\install\miktex\bin;\miniOS\gnugo-3.8"
miniOS\emacs\bin\runemacs.exe   -q   --load \miniOS\emacs\init.el %*

\_emacs
=======
;; see if .emacs is loading
(message ".emacs loading")

;; setting up melpa
(require 'package)
(let* ((no-ssl (and (memq system-type '(windows-nt ms-dos))
                    (not (gnutls-available-p))))
       (proto (if no-ssl "http" "https")))
  ;; Comment/uncomment these two lines to enable/disable MELPA and MELPA Stable as desired
  (add-to-list 'package-archives (cons "melpa" (concat proto "://melpa.org/packages/")) t)
  ;;(add-to-list 'package-archives (cons "melpa-stable" (concat proto "://stable.melpa.org/packages/")) t)
  (when (< emacs-major-version 24)
    ;; For important compatibility libraries like cl-lib
    (add-to-list 'package-archives (cons "gnu" (concat proto "://elpa.gnu.org/packages/")))))
(package-initialize)

;; set save abbrevs to silent
(setq save-abbrevs 'silent)        ;; save abbrevs when files are saved

;; set magit git executable from portable git
(setq magit-git-executable "\\miniOS\\PortableGit\\mingw64\\bin\\git")

;; deal with temporary files
;; store all backup and autosave files in the tmp dir
(setq temporary-file-directory "\\miniOS\\tmp")
;(setq backup-directory-alist
;      `((".*" . ,temporary-file-directory)))
;(setq auto-save-file-name-transforms
;      `((".*" ,temporary-file-directory t)))

;; use count-words instead of count-words-region as it works on buffer
;; if no region is selected
(global-set-key (kbd "M-=") 'count-words)

;; multiple cursors
;; first install mc with `package-install' and also install libgnutls
;; see the following https://emacs.stackexchange.com/questions/27202/how-do-i-install-gnutls-for-emacs-25-1-on-windows
;; http://gnu.c3sl.ufpr.br/ftp/emacs/windows/emacs-25/emacs25-x86_64-deps.zip
;; extract in e:/miniOS/emacs so that binaries are properly included in its directory (i.e., emacs/bin)
(global-set-key (kbd "C-c m c") 'mc/edit-lines)

;;; undo tree mode
;;turn on everywhere
(global-undo-tree-mode 1)
;; make ctrl-z undo
(global-set-key (kbd "C-z") 'undo)
;; make ctrl-Z redo
(defalias 'redo 'undo-tree-redo)
(global-set-key (kbd "C-S-z") 'redo)

;; maximized at startup
(add-to-list 'initial-frame-alist '(fullscreen . maximized))

;; set gnugo executable explicit path
(setq gnugo-program "\\miniOS\\gnugo-3.8\\gnugo")

\miniOS\emacs\init.el
=====================
(setenv "HOME"
        (expand-file-name (concat invocation-directory "../../..")))
(setq user-emacs-directory
      (expand-file-name (concat (getenv "HOME") "user/")))
(setq abbrev-file-name
      (locate-user-emacs-file "abbrev_defs"))
(setq auto-save-list-file-prefix
      (locate-user-emacs-file "auto-save-list/.saves-"))
(cd "~")
(setenv "PATH" (concat
		(getenv "PATH")
		";"
		"/miniOS/miktex-portable/texmfs/install/miktex/bin"
                ";"
		"/miniOS/gnugo-3.8"
		";"
		"/miniOS/pandoc-2.4-windows-x86_64"))
(load "~/_emacs")

First “release” of the rev-mode package

A week or so ago I attended a workshop on the RevBayes platform for graphical probabilistic models in evolutionary biology, held at the Museo Argentino de Ciencias Naturales. There is abundant literature documenting the theory behind the design of RevBayes, which can be thought of as MrBayes on steroids with a completely different design based on the theory of probabilistic graphical models; an appealing approach to model specification with a great deal of application in phylogenetics and related fields.

One major drawback is that it lacks an interpreter with code-redirection capabilities (mostly on MS Windows, where the batch mode does not support copy-paste); however, Dr. Höhna has just released RevStudio, an IDE for the Rev language with code redirection (for Windows and Mac), pretty much what RStudio is for the R language. Since its design targets Windows (and maybe macOS) users and it still lacks support for Linux, I started to develop IDE-like capabilities for the Rev interpreter on a GNU/Linux system.

As I already use Emacs, I decided it would be fairly simple to develop a major mode for the Rev language with all the basic features of an IDE, that is, an interpreter, code redirection, and (soon to be implemented) syntax highlighting. I basically constructed a Frankenstein from julia-mode (for the rb process) and essh (mostly for code redirection). With these tools at hand, a Rev IDE could be set up and run "nearly" smoothly.

Syntax highlighting (pending, actually), auto-completion (pending too), and command redirection are the features expected to work in this mode to provide a development environment for RevBayes.

At this moment it just runs, provided that you include the following in your .emacs file or wherever appropriate:

(add-to-list 'load-path "/path/to/rev-mode")
(require 'rev-mode)
(defun rev-mode-sh-hook ()
  (define-key sh-mode-map "\C-c\C-r" 'pipe-region-to-rev)
  (define-key sh-mode-map "\C-c\C-b" 'pipe-buffer-to-rev)
  (define-key sh-mode-map "\C-c\C-j" 'pipe-line-to-rev)
  (define-key sh-mode-map "\C-c\C-n" 'pipe-line-to-rev-and-step)
  (define-key sh-mode-map "\C-c\C-f" 'pipe-function-to-rev))
(add-hook 'sh-mode-hook 'rev-mode-sh-hook)
;; setup files ending in ".rev" to open in rev-mode
(add-to-list 'auto-mode-alist '("\\.rev\\'" . rev-mode))

How to use

First of all, make sure that RevBayes is already installed on your system and that it is in your path so that you can call it from the command line. At present rev-mode loads automatically, so first you need to open a .rev file (it will be recognized and put into rev-mode); this unfortunately also opens a terminal buffer that is unnecessary but harmless, as all the code redirection goes directly to the rb process. You then need to start a RevBayes process with M-x run-rev, which opens a command-line interpreter running the rb program. Afterwards, just use the keystroke combinations described above.

You can download rev-mode from https://github.com/gaballench/rev-mode

Troubleshooting installation of the mpi version of revbayes

I'll be attending a workshop in Buenos Aires next week on RevBayes, a program for Bayesian inference in evolutionary biology that looks very promising for divergence time estimation, an area in which I have a particular interest. When preparing the software for the workshop, I noticed that the single-thread version compiled without problems. However, when trying to build the MPI version (multi-threaded, RevBayes on steroids) from release 1.0.7 (master branch), I ran into the error shown below (see the CMake error near the end of the output):

user@computer:~/programs/revbayes/projects/cmake$ ./build.sh -mpi true
/home/user/programs/revbayes/projects/cmake/build
Building boost libraries
you can turn this of with argument "-boost false"
Building Boost.Build engine with toolset gcc... tools/build/src/engine/bin.linuxx86_64/b2
Unicode/ICU support for Boost.Regex?... not found.
Generating Boost.Build configuration in project-config.jam...

Bootstrapping is done. To build, run:

    ./b2

To adjust configuration, edit 'project-config.jam'.
Further information:

   - Command line help:
     ./b2 --help

   - Getting started guide:
     http://www.boost.org/more/getting_started/unix-variants.html

   - Boost.Build documentation:
     http://www.boost.org/build/doc/html/index.html

Performing configuration checks

    - 32-bit                   : no  (cached)
    - 64-bit                   : yes (cached)
    - arm                      : no  (cached)
    - mips1                    : no  (cached)
    - power                    : no  (cached)
    - sparc                    : no  (cached)
    - x86                      : yes (cached)

Building the Boost C++ Libraries.

    - symlinks supported       : yes (cached)
    - gcc visibility           : yes (cached)
    - long double support      : yes (cached)
    - has_icu builds           : yes (cached)
    - lockfree boost::atomic_flag : yes (cached)

Component configuration:

    - atomic                   : building
    - chrono                   : building
    - container                : not building
    - context                  : not building
    - coroutine                : not building
    - coroutine2               : not building
    - date_time                : building
    - exception                : not building
    - filesystem               : building
    - graph                    : not building
    - graph_parallel           : not building
    - iostreams                : not building
    - locale                   : not building
    - log                      : not building
    - math                     : building
    - mpi                      : not building
    - program_options          : building
    - python                   : not building
    - random                   : not building
    - regex                    : building
    - serialization            : building
    - signals                  : building
    - system                   : building
    - test                     : not building
    - thread                   : building
    - timer                    : not building
    - type_erasure             : not building
    - wave                     : not building

...patience...
...patience...
...patience...
...found 3944 targets...

The Boost C++ Libraries were successfully built!

The following directory should be added to compiler include paths:

    /home/user/programs/revbayes/boost_1_60_0

The following directory should be added to linker library paths:

    /home/user/programs/revbayes/boost_1_60_0/stage/lib

CMake Error at /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:148 (message):
  Could NOT find MPI_C (missing: MPI_C_LIBRARIES MPI_C_INCLUDE_PATH)
Call Stack (most recent call first):
  /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:388 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.5/Modules/FindMPI.cmake:614 (find_package_handle_standard_args)
  CMakeLists.txt:26 (find_package)

-- Configuring incomplete, errors occurred!
See also "/home/user/programs/revbayes/projects/cmake/build/CMakeFiles/CMakeOutput.log".
See also "/home/user/programs/revbayes/projects/cmake/build/CMakeFiles/CMakeError.log".
CMake Error at /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:148 (message):
  Could NOT find MPI_C (missing: MPI_C_LIBRARIES MPI_C_INCLUDE_PATH)
Call Stack (most recent call first):
  /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:388 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.5/Modules/FindMPI.cmake:614 (find_package_handle_standard_args)
  CMakeLists.txt:26 (find_package)

-- Configuring incomplete, errors occurred!
See also "/home/user/programs/revbayes/projects/cmake/build/CMakeFiles/CMakeOutput.log".
See also "/home/user/programs/revbayes/projects/cmake/build/CMakeFiles/CMakeError.log".
Makefile:218: recipe for target 'cmake_check_build_system' failed
make: *** [cmake_check_build_system] Error 1

The error message looked somewhat obscure, and figuring that there was something wrong with the MPI library, I looked up how to install it on Ubuntu. I ended up installing everything labeled openmpi with apt-get. However, this did NOT solve my problem, as the build process was still unable to find the MPI component. Most of the day was spent trying unsuccessfully to switch library paths, edit bash profile files, change flags for the build.sh script, and a long list of other useless and desperate options.

Once at home I decided to give it another try and noticed that the build process makes reference to the 'boost' library and its 'mpi' component. The compilation of the MPI version of RevBayes actually complains about being unable to find MPI_C, and since the component configuration reports mpi as "not building" (see the component configuration in the first code block above), I ended up looking for something related to the boost library (actually called libboost) and mpi with apt-cache (indeed, apt-cache search libboost gives a list of components that matches those listed during the preparation for compiling):

user@computer:~$ apt-cache search libboost-mpi # look for the mpi component of the boost library
libboost-mpi-dev - C++ interface to the Message Passing Interface (MPI) (default version)
libboost-mpi-python-dev - C++ interface to the Message Passing Interface (MPI), Python Bindings (default version)
libboost-mpi-python1.58-dev - C++ interface to the Message Passing Interface (MPI), Python Bindings
libboost-mpi-python1.58.0 - C++ interface to the Message Passing Interface (MPI), Python Bindings
libboost-mpi1.58-dev - C++ interface to the Message Passing Interface (MPI)
libboost-mpi1.58.0 - C++ interface to the Message Passing Interface (MPI)

user@computer:~$ sudo apt-get install libboost-mpi-dev # install the first one in the results

Afterwards, we can try to compile RevBayes with the -mpi true flag again (the last lines of a successful compilation are shown):

user@computer:~/programs/revbayes/projects/cmake$ ./build.sh -mpi true
/home/user/programs/revbayes/projects/cmake/build
Building boost libraries
...
... # compilation takes some time and is quite verbose
...
[100%] Linking CXX static library librb-parser.a
[100%] Built target rb-parser
Scanning dependencies of target rb-mpi
[100%] Building CXX object CMakeFiles/rb-mpi.dir/home/balleng/programs/revbayes/src/revlanguage/main.cpp.o
[100%] Linking CXX executable ../rb-mpi
[100%] Built target rb-mpi

Then we can check whether the RevBayes MPI binary can be found, and run it with rb-mpi:

user@computer:~$ whereis rb-mpi
rb-mpi: /home/user/programs/revbayes/projects/cmake/rb-mpi
user@computer:~$ rb-mpi

RevBayes version (1.0.7)
Build from master (31efda) on Wed Apr 11 01:21:44 -03 2018

Visit the website www.RevBayes.com for more information about RevBayes.

RevBayes is free software released under the GPL license, version 3. Type 'license()' for details.

To quit RevBayes type 'quit()' or 'q()'.

>

With everything in order, I can start the workshop knowing that the power of multi-threaded tools can be used in my own analyses.

Why use the mean of the quantiles as the initial value in 'optim'?

This post is actually one of the vignettes that accompany the bayestools package; I'm sharing it here because vignettes do not seem to enjoy the same reach as blog posts.

One issue with using the optim function to approximate parameters from a set of percentiles and quantiles is that it requires initial values for its heuristic search. If these initial values are far from the unknown real parameter values, the function has serious convergence problems and may produce results that are simply wrong. In pre-release versions of findParams the initial value was a vector of 1s with one element per parameter to estimate (e.g., c(1, 1) when estimating mean and sd in pnorm), but this produced plainly wrong results when, for instance, the real mean was 10 or larger.
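
To make the setting concrete, here is a minimal sketch of the kind of optimization findParams performs; the least-squares objective on the cumulative probabilities is my assumption and may differ in detail from the package's actual implementation, and the search starts at the mean of the quantiles:

# quantiles of a N(10, 1) distribution and their cumulative probabilities
q <- c(8.040036, 10.000000, 11.959964)
p <- c(0.025, 0.5, 0.975)
# squared distance between the target probabilities and those implied by (mean, sd)
objective <- function(pars) {
    if (pars[2] <= 0) return(Inf)  # keep the standard deviation positive
    sum((pnorm(q, mean = pars[1], sd = pars[2]) - p)^2)
}
# start the heuristic search at the mean of the quantiles for both parameters;
# the result should approach the true values, mean = 10 and sd = 1
optim(par = c(mean(q), mean(q)), fn = objective)$par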

With a little help from simulation we can show that the best initial guess is in fact the mean of the quantiles.

The following code generates a lot of parameter estimates from several trials using different initial values. With a bit of large-number theory, a decent estimate can be found. Here, q and p are quantiles and percentiles under a given, known distribution; an anonymous function lets sapply vary the initial values and extract the $par element of the optim call, and the density plot shows that overall the median of the estimates approaches the correct value. The values of q come from qDIST given the probabilities of interest (generally 0.025, 0.5, and 0.975); for instance, qbeta(p = c(0.025, 0.5, 0.975), shape1 = 10, shape2 = 1) for the example below:

# X = seq... is the set of values to try, from min(q) to max(q). 
parameters <- sapply(X = seq(0.6915029, 0.9974714, length.out = 1000),
                     FUN = function(x) {
                         findParamsPrototype(q = c(0.6915029, 0.9330330, 0.9974714),
                                    p = c(0.025, 0.5, 0.975),
                                    densit = "pbeta",
                                    params = c("shape1", "shape2"),
                                    initVals = c(x, x))$par
                     },
                     simplify = TRUE)

plot(density(t(parameters)[, 1]), main = "Density for shape1", xlab = "Shape1", ylab = "Density")
abline(v = median(parameters[1, ]), col = "red")

plot(density(t(parameters)[, 2]), main = "Density for shape2", xlab = "Shape2", ylab = "Density")
abline(v = median(parameters[2, ]), col = "red")

[Figures initval1, initval2: densities of the estimates for shape1 and shape2]

Large-number theory allows us to expect such a result, but what if the specific initial value matters? Another simulation, plotting the parameter estimate as a function of the initial value, can be prepared with, say, a random variable X ~ N(\mu = 10, \sigma = 1). Such a large mean is expected to cause problems with initial values, since in these simulations 10 is huge compared to 1. Initial values were simulated between 0.001 and 10 because \sigma > 0, so zero and negative values would break the code.

# check that the quantiles are right:
qnorm(c(0.025, 0.5, 0.975), mean = 10, sd = 1)
## 8.040036 10.000000 11.959964

# simulate the parameters
simInitVals <- seq(0.001, 10, length.out = 10000)
parameters2 <- sapply(X = simInitVals,
                     FUN = function(x) {
                         findParamsPrototype(q = c(8.040036, 10.000000, 11.959964),
                                    p = c(0.025, 0.5, 0.975),
                                    densit = "pnorm",
                                    params = c("mean", "sd"),
                                    initVals = c(x, x))$par
                     },
                     simplify = TRUE)

# plot the results
plot(y = parameters2[1,], x = simInitVals, main = "Estimates for the mean", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 10, col = "red")
plot(y = parameters2[2,], x = simInitVals, main = "Estimates for the st.dev.", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 1, col = "red")

[Figures initval3, initval4: estimates for the mean and the standard deviation as a function of the initial value]

Here we see a very interesting result: initial values from near zero up to a bit above 1 produce very odd and unreliable estimates of each parameter, while larger values, closer to the real parameter values, invariably provide reliable estimates. Please note that the red line is the true parameter value that we started with above. But what happens in the neighborhood of the mean of the quantiles?

meanNeighbors <- which(simInitVals > (mean(simInitVals) - 0.1) & simInitVals < (mean(simInitVals) + 0.1))
plot(y = parameters2[1,][meanNeighbors], x = simInitVals[meanNeighbors], main = "Neighbors of mean(quantiles)", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 10, col = "red")
plot(y = parameters2[2,][meanNeighbors], x = simInitVals[meanNeighbors], main = "Neighbors of mean(quantiles)", xlab = "Simulated init. vals.", ylab = "Parameter estimate")
abline(h = 1, col = "red")

[Figures initval5, initval6: estimates in the neighborhood of mean(quantiles)]

Now let’s visualize it as densities:

plot(density(parameters2[1,][meanNeighbors]), main = "Neighbors of mean(quantiles)", ylab = "Parameter estimate")
abline(v = 10, col = "red")
plot(density(parameters2[2,][meanNeighbors]), main = "Neighbors of mean(quantiles)", ylab = "Parameter estimate")
abline(v = 1, col = "red")

[Figures initval7, initval8: densities of the estimates in the neighborhood of mean(quantiles)]

Here we see that initial values around the mean of the quantiles behave with the regular properties of large-number theory (the real parameter value is the red line, as above). Therefore, it is advisable to pick the mean of the quantiles as the default initial value in the context of the optim function.

Why did the first example behave nicely? Results not shown (just change the range of values used in seq) indicate that convergence is not a concern when estimating the parameters of the beta distribution, and there is a reason for it. The beta PDF is defined over [0,1], so it makes no sense at all to try values outside this interval:

Beta(\alpha,\beta):\,\, P(x|\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}

where B is the beta function

B(\alpha,\beta)=\int_{0}^{1}t^{\alpha-1}(1-t)^{\beta-1}dt

Since 0 and 1 are not that far away from each other, any initial value inside [0,1] behaves like our simulation of the neighborhood of the mean of the quantiles in the estimation of the normal PDF parameters.
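
As a quick check of this argument, the three target quantiles used in the beta example above are themselves confined to [0,1] and close to one another, so any admissible initial value already lies in the right neighborhood:

# the target quantiles of the Beta(10, 1) example all lie inside [0, 1]
qbeta(p = c(0.025, 0.5, 0.975), shape1 = 10, shape2 = 1)
## 0.6915029 0.9330330 0.9974714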

First release of the ‘bayestools’ R package

After some time developing functions for my own research on using fossil information as a source of prior information for divergence time estimation (in prep.), I decided to release the first version of an R package with functions that I have found useful for pre- and post-processing of data in evolutionary analyses using Bayesian methods. It has been released first to its development repository on GitHub (https://github.com/gaballench/bayestools/) and I expect to implement further tools, so this release should be understood as containing a minimal set of functions.

I expect to continue active development and eventually submit it to CRAN, but until then you can install the development version with devtools' install_github function:

# install devtools if needed
install.packages("devtools")

# load devtools to access the install_github function
library(devtools)

# install bayestools from the github repository
install_github(repo = "gaballench/bayestools")

What can the package do?

There are two main sets of functions that bayestools provides: pre-processing and post-processing functions. In the first set we have functions that allow us to specify priors and to plot probability density functions from Beast2 parameters in order to check their correctness visually. In the second group we have a tool for measuring interdependence between empirical densities, along with some ways to visualize it.

The main place in evolutionary biology where the features of bayestools belong is divergence time estimation and the incorporation of paleontological and geological information into the time component of phylogenies and diversification analyses. The tools in the package help implement information from these fields in Bayesian analyses, mainly through the specification of priors, the most conspicuous feature of Bayesian inference. However, prior specification is a very difficult task with few if any rules to follow, plenty of misunderstanding, and some clearly unjustified practices.

Priors

Prior specification depends strongly on what we are trying to use in order to calibrate a given phylogeny. For instance, node calibration and tip calibration work in different ways, and prior specification will depend not only on this but also on the nature of the information we are trying to use as priors.

In the most common case, we have a fossil taxon (or occurrence) that could inform us about the age of a given node (node calibration). However, we lack exact information on two things: when the organism lived (measured without error), and when a given diversification event took place (i.e., the exact age of a node). What we have is a fossil whose age has been inferred or measured with error or uncertainty. It is this uncertainty that we expect to model through priors (or at least what should be done in real life). A prior is in itself a probability density function (PDF hereafter), that is, a mathematical function describing the probability that a variable takes a value inside a given interval. More or less precise statements can be made with the aid of PDFs, such as that a variable X is more likely to take a value between a and b than between c and d, or what the expected value of the variable is.
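
For instance, with a hypothetical normal prior on an age (the numbers below are made up purely for illustration), such statements reduce to calls to the corresponding cumulative distribution function in R:

# hypothetical age prior: X ~ N(mean = 10 Ma, sd = 1 Ma)
# probability that the age lies between 9 and 11 Ma...
pnorm(11, mean = 10, sd = 1) - pnorm(9, mean = 10, sd = 1)   # ~0.68
# ...versus between 11 and 13 Ma
pnorm(13, mean = 10, sd = 1) - pnorm(11, mean = 10, sd = 1)  # ~0.16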

Now, what we have is age information such as the following instances:

  • The fossil comes from a volcanic layer that has been dated with radiometric techniques with a specific value \pm uncertainty, for instance, 40.6 \pm 0.5 Ma, where 0.5 is the standard deviation \sigma.

  • The fossil comes from a sedimentary unit that is said to be of Miocene age (5.33 to 23.03 Ma).

  • The fossil comes from a layer bracketed by two levels with volcanic ash that have been dated as 10.2 \pm 0.2 and 12.4 \pm 0.4 Ma, respectively. The age of the fossil itself is then determined by interpolation between the layers that carry the actual age information.

How can we convert this information into legitimate priors?

The findParams function

Given a prior on the age of a fossil of 1 – 10 Ma that we want to model with a lognormal distribution, find the parameters of the PDF that best reflect the uncertainty in question (i.e., the parameters for which the observed quantiles are 1, 5.5, and 10, assuming that we want the midpoint to reflect the median of the PDF):

bayestools::findParams(q = c(1, 5.5, 10),
          p = c(0.025,  0.50, 0.975),
          output = "complete",
          pdfunction = "plnorm",
          params = c("meanlog", "sdlog"))
## $par
## [1] 1.704744 0.305104
## 
## $value
## [1] 0.0006250003
## 
## $counts
## function gradient 
##      101       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
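
As a quick sanity check (assuming base R's plnorm/qlnorm parameterization, which is what the pdfunction argument above suggests), the fitted parameters can be mapped back to quantiles:

qlnorm(p = c(0.025, 0.5, 0.975), meanlog = 1.704744, sdlog = 0.305104)
# the median (5.5) and the upper quantile (10) are recovered almost exactly;
# a lognormal with this median and upper quantile cannot also push its lower
# 2.5% quantile down to 1, which is why $value above is small but not zero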

Now suppose we have a published study that specified a lognormal prior (with or without justification, that's another question) and we want to plot it outside of BEAUti, for instance in order to assess the sensitivity of the posterior to the prior. How can we plot it in R and use its values as any other statistical density?

The lognormalBeast function

Generate a matrix for the lognormal density with mean 1 and standard deviation 1, with mean in real space, and spanning values in x from 0 to 10, and then plot it:

lnvals <- bayestools::lognormalBeast(M = 1, S = 1, meanInRealSpace = TRUE, from = 0, to = 10)
plot(lnvals, type = "l", lwd = 3)

[Figure lognormalBeast: the lognormal density generated above]

Sensitivity

One hot topic in model and tool comparisons is the sensitivity of the posterior to the prior, that is, how much our results could be determined by the prior. For instance, if the posterior is totally dependent on the prior, the likelihood (or the data) is not providing information, and this is very risky, as there would be potential for manipulating the results of Bayesian analyses.

The measureSensit function

Measure and plot the sensitivity between two distributions partially overlapping:

set.seed(1985)
colors <- c("red", "blue", "lightgray")
below <- bayestools::measureSensit(d1 = rnorm(1000000, mean = 3, 1),
                       d2 = rnorm(1000000, mean = 0, 1),
                       main = "Partial dependence",
                       colors = colors)
legend(x = "topright", legend = round(below, digits = 2))

[Figure measureSensit: the two densities and their overlap]

The number in the legend indicates an overlap of 0.13 and, as such, a measure of interdependence between the densities (Ballen in prep.).
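
If the measure is the usual overlapping coefficient (the area shared by the two densities; an assumption on my part, not something stated by the package), the empirical value can be cross-checked against the closed-form result for two unit-variance normals whose means differ by 3:

# overlap of N(3, 1) and N(0, 1): 2 * pnorm(-|difference in means| / 2)
2 * pnorm(-3/2)
## 0.1336144

which is consistent with the 0.13 estimated from the simulated samples.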

Geochronology

In the third case of possible age estimates for our calibrations, can we clump together a number of age estimates from several samples in order to build a general uncertainty for the interval?

Do the age estimates for the boundaries of the Honda Group (i.e., samples at meters 56.4 and 675.0) conform to the isochron hypothesis?

data(laventa)
hondaIndex <- which(laventa$elevation == 56.4 | laventa$elevation == 675.0) 
bayestools::mswd.test(age = laventa$age[hondaIndex], sd = laventa$one_sigma[hondaIndex])

The p-value is smaller than the nominal alpha of 0.05, so we can reject the null hypothesis of isochron conditions.

Do the age estimates for the samples JG-R 88-2 and JG-R 89-2 conform to the isochron hypothesis?

twoLevelsIndex <- which(laventa$sample == "JG-R 89-2" | laventa$sample == "JG-R 88-2")
dataset <- laventa[twoLevelsIndex, ]
# Remove the values 21 and 23 because of their abnormally large standard deviations
bayestools::mswd.test(age = dataset$age[c(-21, -23)], sd = dataset$one_sigma[c(-21, -23)])

The p-value is larger than the nominal alpha of 0.05, so we cannot reject the null hypothesis of isochron conditions.