R is driving me loopy! – A brief blog on loops in R.

The etymology of R can be traced back to mean a few things… R could have originated from: “AARRRRR!!!” Damn you R I hate you R you and your apostrophes in the wrong place. Or “Aaaaaaa” I’m not sure how I achieved that result? But, I’m a little chuffed by my nifty bit of code work.

I’ve getting a few more Aaaaaa’s than ARARRARRRR’s of late so I thought it’d be good to pass on some of my wisdom. One great part of any computer language is the ability to batch tasks that would normally involve a brain numbing amount of mouse clicking of command prompt typing.

In R there are a number of ways to get a loop going.

The most basic loop can be achieved using for:

for (i in 1:10)
{ mat=matrix(c(i,i))

The above loop will produce a two column matrix that contains the values one to ten for increasing with each row.

If you wanted you could turn the above loop into a function. That’s right! You can easily create your own functions in R.

Matrix.Gen <- function(x)


for (i in 1:10)

{ mat=matrix(c(i,i))




Matrix.Gen ()

To run your own function you just type in Matrix.Gen () into your R command line. When creating a loop in R it is
important to make sure you understand how the loops are subset in your function. This is especially important when
you have more than one loop.There are two schools of thought when sub setting code one where you just make sure
you close a bracket once you open it.


Matrix.Gen <- function(x) {

for (i in 1:10) {


print(mat)  }


I personally don’t like this method as if you lose a bracket it can get really confusing! I prefer making sure that my brackets line up and you indent the code as you introduce a loop.


Matrix.Gen <- function(x)

        {  for (i in 1:10)
            {   mat=matrix(c(i,i))

If you set the open and close brackets in a line then they are a lot easier to track. I also like to place my brackets under the function or for loop so you can easily link the appropriate bracket to loop.

A good use for a loop in r is creating a function that runs through your model fitting or model selection. Using a loop to do this can speed up the process of manually entering a single model at a time. The code I will display below run an all model selection method for 249 individual species. This bit of code runs two loops, often referred to as a nested loop. A nested loop is a loop within a loop and this is where you need to start taking care with your bracketing, otherwise it can get really confusing.

Import data as csv files. There are three files that will be called into this loop. Environmental data, ##Species Data and the models I want to fit

Env_AusNZ360 = read.csv ("EnvPoints10m360lon.csv)
Oph_spp = read.csv ("Spp20_Oph_AusNZ.csv")
mods = read.csv("PPM_mod_select_lin_poly.csv")

This next bit of code creates a factor R can recognize based on my species data. I do this so I can get R to call in a single species at a time. These individual species will be called in the first loop (the j loop).

Oph_spp = Oph_spp[,c(1,2,3,4,5,6,7,9,11,12)]
WhichSpp = factor(Oph_spp$Spp, exclude = NA)
classes = levels(WhichSpp)
nClasses = length(classes)
fit.ppm <- function(x)
 { for (j in 1:nClasses)
     { print(classes[j])
       dat2add <- data.frame(who = Oph_spp[,1], Oph_spp[,2:3], depth = Oph_spp[,5], Oph_spp[,6:9])
         for (i in 1:78)
          { ft.ppm0.4 =  ppm(mods[i,2], dat2add[dat2add$who==classes[j],], Env_AusNZ360, 0.5)
            mods[i,2] <- mods[i,2]
            mods[i,3] <- classes[j]
            mods[i,4] <- ft.ppm0.4$ll
          write.table(mods, "C:\\Possion PP\\PPMresults_Spp_10m_05scale.txt",j, sep=".", quote=F)

So we’re going down a level ( and if you’ve seen inception you’ll get that link) To clarify: the  i loop is nested within the j loop so It’ll run 78 models for the number of species I have. I’ve then closed the i loop and asked it so save my model results to my mods csv file within the j loop. This will mean it’ll append the result of each species as it is solved by R.

So loops can be confusing, but they can greatly speed up your outputs if you get the basics right. Another way to increase the speed of doing calculations in R is to introduce parallelization.  Most modern computers have more than one core or processor which means instead of running a single loop like the above code you can set your computer to paralleling your code. The speed you can increase your loops is based on the number of cores you have. Most modern computers have 4-6, which means if you have a large data set or something that is slow, you can increase the output speed by 4-6 times! I know what you’re thinking that sounds complicated?… In fact that sounds way above my head?… You’d have to be a computer programmer or something like that to get parallelization working? We’ll luckily in R all the hard work has been done.

All you need is two software packages that you can download with ease. I use linux for parallelization, but there are windows versions available.

For linux the packages are:


For windows:


Then load packages:


This tells R how many cores to use, the more cores you use the faster the loop will run, but the harder your machine will have to work.
For the purposes of this example I’m going to set it at four.


For windows it’s slightly different:

cl <- makeCluster(4)

To get R to start paralleling all you need to do is convert your for loops to foreach loops:

Matrix.Gen <- function(x)
   { for (i in 1:10)
       { mat=matrix(c(i,i))

Will become:

Matrix.Gen <- function(x)
     { foreach(i=1:10) %dopar% 
         { mat=matrix(c(i,i))

Now the foreach call will parallelize your loop based on the number of cores you assign. This should now run 4 times faster!
The same can be done for the nested loop example above.

fit.ppm <- function(x)
   { foreach(j=1:nClasses) %dopar%
       { print(classes[j])
         dat2add <- data.frame(who = Oph_spp[,1], Oph_spp[,2:3], depth = Oph_spp[,5], Oph_spp[,6:9])
            foreach(i=1:78) %dopar%
               { ft.ppm0.4 =  ppm(mods[i,2], dat2add[dat2add$who==classes[j],], Env_AusNZ360, 0.5)
                 mods[i,2] <- mods[i,2]
                 mods[i,3] <- classes[j]
                 mods[i,4] <- ft.ppm0.4$ll
          write.table(mods, "C:\\Possion PP\\PPMresults_Spp_10m_05scale.txt",j, sep=".", quote=F)

I hope this has been useful. Looping and paralleling your code should speed up your analyses. And while your loop is running you can continue to work on other projects or enjoy a cup of tea and a Monte Carlo biscuit. Like I’m going to do now. Until next time. Skip.

Posted in Uncategorized | Tagged , , | 4 Comments

GLMs Blog Part 1.

A recent discussion on a paper I’m writing made me take a step back and ponder what do I really know about GLMs? After a few moments I realised expanding the acronym to Generalised Linear Model was about as far as I could get. So decided it was time to crack this nut.

So a GLM is made up of three key components:

  • The error structure
  • The linear predictor.
  • The link function.

A GLM allows models to incorporate a variety of different error distributions, this is really useful when dealing with data that isn’t normally distributed, such as count data (Poisson errors), presence/absence data (binomial errors) which is really useful depending on the data you have and the questions you are intending to ask.

GLM have a number of applications, but one common use is to make predictions on species distributions. GLMs relate an observed value to a predicted value. The observed values are essentially your raw data or say the environmental variables at site x, the predicted values are obtained by transforming the linear predictor. The linear predictor is the sum of the effects of one or more explanatory variables used in the model.

In the model I’ve been using I been mixing GLMs using finite mixture models, I won’t go into detail on mixing GLMs, but in my example the model is comprised of four GLMs, the below R code is a way at looking how well the models are fitting the data. I’ll be showing you how to explore an individual GLM.

An example of generalised linear model equation:

>  final.formula <- final.mod$formula

> y ~ t + tsq + s + ssq + o + osq + flux3 + flux3sq

In the above equation y (chance of observing this groups of species) is explained by the environmental covariates t = temperature, s= salinity, o = oxygen concentration and flux3 = particulate organic carbon drift to the sea floor (a measure of productivity) and there squared versions (for looking at non-linear relations of the data).

Note: you can use the poly function in R to fit polynomials to your data,

Eg: poly(t, 2) is the same at tsq above.

In essence the linear predictor is the observed values * the model coefficients. Here is an example of calculating the linear predictors across the sites used to fit the above model.

# This gives you the environmental variables for each site in my model

>  model.env.data <- model.matrix(final.formula,data=cbind(obs=1,env_data))

#  This gives the linear predictor across all these sites.

>   lpre <- model.env.data%*%final.mod$coef[1,]

To determine the fit of a given model, a GLM evaluates the linear predictor for each value of the response variable, then compares the predicted value with a transformed value of y. The transformation to be employed is specified in the link function

The link function can take a number of different forms depending on the type of error structure in the model. Classically there are four main link functions which can be used to transform y. If your data is normally distributed: identity link; poisson: log link; binomial: logit and gamma: reciprocal. In my example I’m using presence/absence data which means I have a binomial error distribution and hence I use a logit link function.

#Setting the link function

>  link.fun <- make.link(“logit”)

If you are using a function in R such as glm you can define the link function when you enter the model formula:

>   glm(y ~ t + tsq + s + ssq + o + osq + flux3 + flux3sq, family=binomial)

The size of each term in linear predictor indicates it contribution to the overall likelihood, big negative indicates low probability, big positive indicates high probability.

To get the overall log likelihood you transform the linear predictor using the link function as is shown below:

>   ll <- link.fun$linkinv(sum(lpre))

However often you want to have a look at how the environmental covariates (coefficients) are responding within the model. There are a number of ways to do this but one way that works quite well is to visualise the response. For instance if you wanted to see how temperature was responding you could look at what range of values temperature contribute positively or negatively to overall log likelihood.

Here is a little bit of code which plots the response of temperature in the above model.

>   tmp.temp <- seq(2,26,length=100)

>   t.plot <-  plot(tmp.temp,final.mod$coef[1,2]*tmp.temp+final.mod$coef[1,3]*tmp.temp^2,type=”l”,ylab=”Total contribution of Temperature to Likelihood in G1″,xlab=”Temperature”)

If everything worked correctly you should a figure such as the one below:

In this figure you can see the non-linear response of temperature and can see for temperatures greater than 20°C temperature is contributing to the overall likelihood of the model. You could potentially do this with all the environmental covariates and visualise their respective responses.

If you want to be a bit fancy you can also add in the intercept:

plot(tmp.temp,final.mod$coef[1,1]+final.mod$coef[1,2]*tmp.temp+final.mod$coef[1,3]*tmp.temp^2,type=”l”,ylab=”Total contribution of Temperature to Likelihood + Intercept”, xlab=”Temperature”)

Another potential use of the linear predictor is to check the contribution of environmental variables when using a model to make predictions across an extent. This problem first cropped up when I was getting extremely high probabilities at sites where it didn’t correlate with other the majority of high probabilities.

To assess the contributions of environmental variables at prediction sites, you can follow a similar pattern to the code above but instead of entering the environmental covariates used to fit the model you enter the environmental variables that you want to predict your model onto.

model.pred.data <- model.matrix(final.formula,data=cbind(obs=1,PrePoint))

link.fun <- make.link(“logit”)

lpre.pred <- model.pred.data%*%final.mod$coef[1,]

## log likelihood

ll <- link.fun$linkinv(lpre.pred)

This will give you the log likelihood for all the points in space. Once you have achieved this you can look at the data points that are producing abnormally high log likelihood values.

>  ll.sort <-  head(sort(ll, decreasing=TRUE))

An example in my model was site 20087 where the probability of was detection was abnormally high and not aggregated with the majority of high probability points. You can  manually look at the environmental covariates by entering the coefficient values from the model and environmental variables from site x (in this case 20087), but keeping all values except one (temperature in this case) at the mean.

Which should look something like this:

lpre.g1.20087.t <- 3522.88236315 + -0.23010015*27.21660 + 0.01096224*tsqmn + -200.04575957*smn + 2.84021577*ssqmn + -1.98680350*omn + 0.36839100*osqmn + -0.43793932*flux3mn + 0.01525628*flux3sqmn

Environmental variables at site 20087:

(Intercept)           t             tsq               s             ssq             o          osq        flux3     flux3sq

1.000            27.2166   740.743   34.421   1184.825    4.494   20.197    2.379    5.662

> lpre.g1.20087.t

[1] -5.747775

From this you can see that temperature isn’t contributing to the overall log likelihood at this site. After some detective work I found that salinity was the culprit. See below:

>  lpre.g1.20087.s <- 3522.88236315 + -0.23010015*tmn + 0.01096224*tsqmn + -200.04575957*34.4212 + 2.84021577*ssqmn + -1.98680350*omn + 0.36839100*osqmn + -0.43793932*flux3mn + 0.01525628*flux3sqmn

> lpre.g1.20087.s

[1] 79.23035

Going back to those nifty graphs above:

tmp.sal <- seq(34,36,length=100)

>  plot(tmp.sal,final.mod$coef[1,4]*tmp.sal+final.mod$coef[1,5]*tmp.sal^2, type=”l”,ylab=”Total contribution of Salinity to Likelihood in G1″,xlab=”Salinity”)

You can see that the values for salinity at 34.4 are contributing highly to the likelihood of the overall model and could account for the really high probability at this site. With some further investigation I found that the environmental covariate values sat outside the bounds of my model space. This is a problematic issue as you can’t estimate a relationship where you don’t have any data. So the best thing to do in this situation is to remove these points from my prediction space. Once I clipped the sites with environmental variables outside the sites used to fit my initial model things started looking good.

Well that’s my first GLM blog. I hope that it makes sense and doesn’t raise to many questions. I’d love some further advice on dealing with GLMs in R so feel free to post a comment or two. Cheers Skip.

Posted in Uncategorized | Tagged , | Leave a comment

PhD is underway.

So it’s early days, PhD is underway and I’m overrun with reading, R code and loving it. In fact I’m almost Imageas happy as the guy in that retro wetsuit about to catch a few barrels.

I’m going to keep this first post short, but I was sent a really cool video of a scaleworm (a family of polycheates I did my masters research on) living commensally with a keyhole limpet. The two make a great underwater Batman and Robin, with the scaleworm coming to the rescue, as a good sidekick should! Enjoy

Aside | Posted on by | 1 Comment