It’s been a while

In other news: Skip's research continues. I've got a lot of blogging to catch up on, and a few posts to come on upcoming adventures to the US and Canada.

Cool R tricks and tips with knitr, creating R packages to make your workflow more efficient, and a pretty handy little code-checking tool.

I found this little gem the other day; it made me laugh.

[Image: seashell]


Read and listen.

A spectrogram of music. The colour intensity is logarithmic!

I'm one of those people who loves to listen to music while working and reading. So here are a few papers I'm enjoying, along with some tracks to accompany them.

A classic paper (May, 1976) deserves a classic tune. There is some challenging maths, but I have this (unrealistic but awesome) image of Lord Robert May bopping to Blue Oyster Cult's '(Don't Fear) The Reaper' while doing differential equations.

I also like this paper by Olivero et al. (2012): a good methods paper that provides some interesting insights into defining biogeographic regions. Being a methods paper, you sometimes feel like you can hardly keep your head above water, but these tracks should help you swim.

A main part of my PhD is trying to describe drivers of deep-sea biodiversity at oceanic and global scales. This paper looks at temporal variability in ocean productivity (Behrenfeld et al., 2006). One, it's an interesting paper; two, temporal variability should be explored when looking at predictors of biodiversity. Here is a classic Thelonious Monk album, well known for its treacherous tempo changes. If you're not a jazz fan, maybe pop is more your thang.

And finally, Belanger et al. (2012) predict global biogeographic structure for shallow marine environments. It's an interesting paper that compares existing qualitative bioregionalisations to models built on correlations between species occurrence data and environmental predictors. Not surprisingly, temperature came out on top.

Until next time,

Skip.

Behrenfeld, M.J., O’Malley, R.T., Siegel, D.A., McClain, C.R., Sarmiento, J.L., Feldman, G.C., Milligan, A.J., Falkowski, P.G., Letelier, R.M. & Boss, E.S. (2006) Climate-driven trends in contemporary ocean productivity. Nature, 444, 752-755.

Belanger, C.L., Jablonski, D., Roy, K., Berke, S.K., Krug, A.Z. & Valentine, J.W. (2012) Global environmental predictors of benthic marine biogeographic structure. Proceedings of the National Academy of Sciences, 109, 14046-14051.

May, R.M. (1976) Simple mathematical models with very complicated dynamics. Nature, 261, 459-467.

Olivero, J., Márquez, A.L. & Real, R. (2012) Integrating Fuzzy Logic and Statistics to Improve the Reliable Delimitation of Biogeographic Regions and Transition Zones. Systematic Biology.


Ten “crack that paper” commandments

[Image: Biggie]

Ten commandments so you can publish during your PhD: an ode to Pia Lentini and The Notorious B.I.G. Over the weekend I was trying to write up my first PhD chapter as a paper when The Notorious B.I.G. got shuffled into my playlist. It's been 20 years since Biggie released 'Ready to Die'. This got me thinking about a talk a postdoc in the QAECO lab gave on publishing during your PhD. So I decided to procrastinate from the task at hand and write an ode to Pia and Biggie Smalls. I give you the:

Ten “Crack That Paper” Commandments (play this track and recite the commandments with me!)

I been in this game for a year, doing it for me love of animals
There’s rules so you don’t publish shit, I wrote me a manual
A step by step booklet for you to get
Your game on track, not your wig pushed back

Rule nombre uno: never start the show
without, a plan ready to unfold, modify as you go.
The calendar decreed planning specially
If your time's backed up, head down, bum up

Number two: never not let em (supervisors) know your next move
Don’t you know em bad boys remain in silence
Take it from your highness (uh-huh)
You need to squeeze them chaps, ask them for their tricks and tips.

Number three: do tasks mul-ti-ply
If your draft is receiving feedback, don’t put your feet up
Ramp your analysis up, or even do a quick graph
Start thinkin’ about your next chapter and get a step up

Number four: know you heard this before
Never deny your timeline supply (Be realistic with your timelines ~3 Drafts per paper)

Number five: Ignoring superiors' comments is a liability
I don’t care if they’re a nonce, snub remarks and your paper will bounce

Number six: that first draft edit, dead it
You think a lab-head writing your manuscript, shit forget it

Seven: this rule is so underrated
Pitch your paper and ideas to a journal list updated
Methods and “Nature” don’t mix like no fingers and an itch
The wrong pitch and you’re in serious shit

Number eight: never keep no hate in you
Them cats that review your paper get knocked back too

Number nine shoulda been number one to me
If you ain’t kickin’ goals and your paper is rejected by the journal police
If reviewers think your work's decent but it ain't glisten
Just remember chance plays a role, there is no need to be frettin’

Number ten: a strong word called consignment
Strictly for all men, not just the best men
If you ain't put in the hours they'll say "hell no"
Cause they gonna want that paper as pure as the driven snow

Follow these rules you’ll have mad papers to post up
If so, all the way to the top
Slug out your paper, watch your thesis stand up
Marker read your papers, they’ll say this is too good to pass up

Hopefully Pia will provide an updated link to her slides. But in the meantime, here are some useful links to look at when thinking about publishing your work: Joern Fischer's writing blog: http://writingajournalarticle.wordpress.com/
and Corey Bradshaw’s: http://conservationbytes.com/2012/10/22/how-to-write-a-scientific-paper/.

Cheers Skip.

And if you haven't heard the original, here it is! (Apologies to Biggie; hopefully he doesn't roll over in his grave.)


Conference Presentation at 13th International Deep-sea Biodiversity Symposium

I’m speaking next Monday the 3rd of December at the 13th International Deep-sea Biodiversity Symposium in Wellington, New Zealand.

I'll be speaking on beta-diversity in the deep sea. My talk will briefly describe the bathyal environment around Australia and New Zealand and the methodological approach I've used to create maps of beta-diversity from presence-only records, and I'll test the hypothesis that deep-sea biogeographic patterns resemble those seen in shallow marine environments.

If you can’t make it to Wellington and are keen to have a look at a few of my maps and slides they are available for your viewing pleasure.


R is driving me loopy! – A brief blog on loops in R.

The etymology of R can be traced back to a few things. It could have originated from "AARRRRR!!!" (damn you R, I hate you R, you and your apostrophes in the wrong place), or from "Aaaaaaa" (I'm not sure how I achieved that result, but I'm a little chuffed by my nifty bit of code work).

I've been getting a few more Aaaaaa's than AARRRRR's of late, so I thought it'd be good to pass on some of my wisdom. One great part of any computer language is the ability to batch tasks that would otherwise involve a brain-numbing amount of mouse clicking or command-prompt typing.

In R there are a number of ways to get a loop going.

The most basic loop can be achieved using for:

for (i in 1:10)
{ mat <- matrix(c(i, i))   # a 2 x 1 matrix with both entries equal to i
  print(mat)
}

On each pass, the above loop prints a two-row, one-column matrix whose entries both equal the loop index, so you see the values one to ten as the loop runs.
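For example, the first pass (i = 1) prints:

     [,1]
[1,]    1
[2,]    1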

If you wanted, you could turn the above loop into a function. That's right: you can easily create your own functions in R.

Matrix.Gen <- function(x)   # x is unused here; the loop bound is hardcoded
{
  for (i in 1:10)
  {
    mat <- matrix(c(i, i))
    print(mat)
  }
}

Matrix.Gen()

To run your own function you just type Matrix.Gen() at the R command line. When creating a loop in R it is important to understand how the loops are bracketed within your function. This is especially important when you have more than one loop. There are two schools of thought on laying out brackets: in the first, you just make sure you close a bracket once you open it.

Eg:

Matrix.Gen <- function(x) {
  for (i in 1:10) {
    mat <- matrix(c(i,i))
    print(mat)  }
}

I personally don't like this method, because if you lose a bracket it can get really confusing! I prefer lining my brackets up and indenting the code as I introduce each loop.

Eg:

Matrix.Gen <- function(x)

        {  for (i in 1:10)
            {   mat=matrix(c(i,i))
                print(mat)
             }
        }

If you set the opening and closing brackets in a line, they are a lot easier to track. I also like to place my brackets under the function or for statement so I can easily link each bracket to its loop.

A good use for a loop in R is creating a function that runs through your model fitting or model selection; using a loop speeds things up compared with manually entering one model at a time. The code below runs an all-models selection over 249 individual species. It uses two loops, one inside the other, often referred to as a nested loop. A nested loop is where you need to start taking care with your bracketing, otherwise it can get really confusing.
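Before the real thing, here's a nested loop in miniature, just to show the shape (the inner i loop runs to completion for every pass of the outer j loop):

for (j in 1:3)          # outer loop
    { for (i in 1:2)    # inner loop
        { print(c(j, i))   # prints six pairs: 1 1, 1 2, 2 1, 2 2, 3 1, 3 2
        }
    }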

Import the data as CSV files. Three files are called into this loop: the environmental data, the species data, and the models I want to fit.

Env_AusNZ360 <- read.csv("EnvPoints10m360lon.csv")   # environmental data
Oph_spp <- read.csv("Spp20_Oph_AusNZ.csv")           # species data
mods <- read.csv("PPM_mod_select_lin_poly.csv")      # the models to fit

This next bit of code creates a factor based on my species data so that R can call in one species at a time. These individual species are called in the first loop (the j loop).

Oph_spp <- Oph_spp[, c(1,2,3,4,5,6,7,9,11,12)]   # keep only the columns I need
WhichSpp <- factor(Oph_spp$Spp, exclude = NA)
classes <- levels(WhichSpp)
nClasses <- length(classes)

fit.ppm <- function(x)
 { for (j in 1:nClasses)        # outer loop: one species at a time
     { print(classes[j])
       dat2add <- data.frame(who = Oph_spp[,1], Oph_spp[,2:3], depth = Oph_spp[,5], Oph_spp[,6:9])
       for (i in 1:78)          # inner loop: the 78 candidate models
         { ft.ppm0.4 <- ppm(mods[i,2], dat2add[dat2add$who==classes[j],], Env_AusNZ360, 0.5)
           mods[i,3] <- classes[j]       # record which species this fit belongs to
           mods[i,4] <- ft.ppm0.4$ll     # and the model's log-likelihood
         }
       print(mods)
       write.table(mods, "C:\\Possion PP\\PPMresults_Spp_10m_05scale.txt",
                   append = (j > 1), sep = ",", quote = FALSE)   # append each species' results
     }
   print("Finished")
 }

So we're going down a level (and if you've seen Inception you'll get that reference). To clarify: the i loop is nested within the j loop, so it runs all 78 models for each of my species. I've then closed the i loop and, still inside the j loop, asked R to save the model results to a text file, appending each species' results as R finishes with it.

So loops can be confusing, but they can greatly speed up your outputs if you get the basics right. Another way to speed up calculations in R is to introduce parallelisation. Most modern computers have more than one core or processor, which means that instead of running a single loop like the code above, you can set your computer to run your code in parallel. The speed-up depends on the number of cores you have: most modern machines have four to six, so if you have a large data set or something slow, you can increase the output speed roughly four- to six-fold! I know what you're thinking: that sounds complicated... in fact, way above my head... you'd have to be a computer programmer or something to get parallelisation working? Well, luckily, in R all the hard work has been done.

All you need is two packages that you can download with ease. I use Linux for parallelisation, but there are Windows equivalents available.

For Linux the packages are:

install.packages("foreach")
install.packages("doMC")

For windows:

install.packages("foreach")
install.packages("doSNOW")

Then load the packages. Since I'm on Linux:

library("foreach")
library("doMC")   # on Windows, library("doSNOW") instead

This next call tells R how many cores to use: the more cores you use, the faster the loop will run, but the harder your machine will have to work.
For the purposes of this example I'm going to set it at four.

registerDoMC(cores = x)   # x = the number of cores to use
registerDoMC(cores = 4)   # here, four

For windows it’s slightly different:

cl <- makeCluster(4)   # create a cluster of four worker processes
registerDoSNOW(cl)     # register it as the foreach backend
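If you're not sure how many cores your machine has, the parallel package that ships with R will tell you:

library(parallel)
detectCores()   # the number of logical cores on this machine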

To get R to start parallelising, all you need to do is convert your for loops into foreach loops:

Matrix.Gen <- function(x)
   { for (i in 1:10)
       { mat=matrix(c(i,i))
         print(mat)
        }
   }

Will become:

Matrix.Gen <- function(x)
     { foreach(i=1:10) %dopar% 
         { mat=matrix(c(i,i))
           print(mat)
          }
      }

Now the foreach call will spread your loop across the cores you registered, so with four cores this should run up to four times faster (a little less in practice, because of overhead).
The same can be done for the nested loop example above, with two tweaks: only the outer loop gets %dopar%, since nested %dopar% calls typically fall back to running sequentially, and because each parallel worker gets its own copy of objects like mods, the results are returned from the loop and written out once at the end rather than inside it.

fit.ppm <- function(x)
   { results <- foreach(j = 1:nClasses, .combine = rbind) %dopar%
       { print(classes[j])   # note: output printed inside a worker may not reach your console
         dat2add <- data.frame(who = Oph_spp[,1], Oph_spp[,2:3], depth = Oph_spp[,5], Oph_spp[,6:9])
         for (i in 1:78)     # inner loop stays a plain for
           { ft.ppm0.4 <- ppm(mods[i,2], dat2add[dat2add$who==classes[j],], Env_AusNZ360, 0.5)
             mods[i,3] <- classes[j]
             mods[i,4] <- ft.ppm0.4$ll
           }
         mods                # each worker returns its filled-in copy of mods
       }
     write.table(results, "C:\\Possion PP\\PPMresults_Spp_10m_05scale.txt", sep = ",", quote = FALSE)
     print("Finished")
   }
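Unlike for, foreach returns the value of each iteration, either as a list or combined by whatever you pass to .combine (rbind in the function above). A minimal sketch, using the sequential %do% flavour:

library(foreach)
squares <- foreach(i = 1:10, .combine = c) %do% i^2   # collect each i^2 into a vector
squares
# [1]   1   4   9  16  25  36  49  64  81 100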

I hope this has been useful. Looping and parallelising your code should speed up your analyses, and while your loop is running you can continue to work on other projects, or enjoy a cup of tea and a Monte Carlo biscuit, like I'm going to do now. Until next time. Skip.


GLMs Blog Part 1.

A recent discussion about a paper I'm writing made me take a step back and ponder what I really know about GLMs. After a few moments I realised that expanding the acronym to Generalised Linear Model was about as far as I could get. So I decided it was time to crack this nut.

So a GLM is made up of three key components:

  • The error structure.
  • The linear predictor.
  • The link function.

A GLM allows models to incorporate a variety of different error distributions. This is really useful when dealing with data that isn't normally distributed, such as count data (Poisson errors) or presence/absence data (binomial errors); you choose the distribution to suit the data you have and the questions you intend to ask.
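As a quick sketch (the data frame dat and its columns here are hypothetical), the family argument is where you tell R's glm function which error distribution to assume:

glm(counts ~ depth, data = dat, family = poisson)      # count data
glm(presence ~ depth, data = dat, family = binomial)   # presence/absence data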

GLMs have a number of applications, but one common use is making predictions of species distributions. A GLM relates an observed value to a predicted value. The observed values are essentially your raw data, say the environmental variables at site x; the predicted values are obtained by transforming the linear predictor. The linear predictor is the sum of the effects of one or more explanatory variables used in the model.
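To make that concrete, here's a minimal sketch of the two-step logic for a logit-link GLM, with made-up coefficients and one covariate:

b0 <- -4; b1 <- 0.5; b2 <- -0.01       # made-up coefficient values
t <- 15                                # temperature at some site
eta <- b0 + b1 * t + b2 * t^2          # linear predictor: a weighted sum of covariates
p <- exp(eta) / (1 + exp(eta))         # inverse logit maps eta onto a probability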

In the model I've been using, I've been mixing GLMs using finite mixture models. I won't go into detail on mixing GLMs, but in my example the model comprises four GLMs. The R code below is a way of looking at how well the models fit the data; I'll show you how to explore an individual GLM.

An example of a generalised linear model formula:

>  final.formula <- final.mod$formula

> y ~ t + tsq + s + ssq + o + osq + flux3 + flux3sq

In the above formula, y (the chance of observing this group of species) is explained by the environmental covariates t = temperature, s = salinity, o = oxygen concentration and flux3 = particulate organic carbon drift to the sea floor (a measure of productivity), plus their squared versions (for looking at non-linear responses in the data).

Note: you can use the poly function in R to fit polynomials to your data. Eg: poly(t, 2, raw = TRUE) is the same as t + tsq above. (The default poly(t, 2) uses orthogonal polynomials, which give the same fit but different coefficient values.)
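A quick way to convince yourself of the equivalence, with made-up data:

t <- runif(50, 2, 26)                   # made-up temperatures
y <- rnorm(50)                          # made-up response
coef(lm(y ~ t + I(t^2)))                # hand-built squared term
coef(lm(y ~ poly(t, 2, raw = TRUE)))    # identical coefficient values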

In essence the linear predictor is the matrix of observed values multiplied by the model coefficients. Here is an example of calculating the linear predictor across the sites used to fit the above model.

# This gives you the environmental variables for each site in my model

>  model.env.data <- model.matrix(final.formula,data=cbind(obs=1,env_data))

#  This gives the linear predictor across all these sites.

>   lpre <- model.env.data%*%final.mod$coef[1,]

To determine the fit of a given model, a GLM evaluates the linear predictor for each value of the response variable, then compares the predicted value with the transformed value of y. The transformation to be employed is specified in the link function.

The link function can take a number of different forms depending on the error structure of the model. Classically there are four main link functions: identity for normally distributed data, log for Poisson, logit for binomial, and reciprocal for gamma. In my example I'm using presence/absence data, which means I have a binomial error distribution and hence I use a logit link function.

#Setting the link function

>  link.fun <- make.link("logit")

If you are using a function in R such as glm you can define the link function when you enter the model formula:

>   glm(y ~ t + tsq + s + ssq + o + osq + flux3 + flux3sq, family=binomial)

The size of each term in the linear predictor indicates its contribution to the overall prediction: a big negative value indicates low probability, a big positive value high probability.

To map the linear predictor back onto the probability scale, you transform it with the inverse link function, as shown below:

>   ll <- link.fun$linkinv(lpre)   # fitted probability at each site
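A couple of numbers make the logit link's behaviour obvious:

link.fun <- make.link("logit")
link.fun$linkinv(0)          # 0.5: a linear predictor of zero is a coin flip
link.fun$linkinv(c(-5, 5))   # ~0.007 and ~0.993: big negative / big positive values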

However, often you want to look at how the environmental covariates (via their coefficients) are behaving within the model. There are a number of ways to do this, but one that works quite well is to visualise the response. For instance, to see how temperature behaves, you could look at the range of values over which temperature contributes positively or negatively to the overall prediction.

Here is a little bit of code which plots the response of temperature in the above model.

>   tmp.temp <- seq(2,26,length=100)

>   t.plot <- plot(tmp.temp, final.mod$coef[1,2]*tmp.temp + final.mod$coef[1,3]*tmp.temp^2, type="l", ylab="Total contribution of Temperature to Likelihood in G1", xlab="Temperature")

If everything worked correctly you should see a figure such as the one below:

In this figure you can see the non-linear response of temperature: at temperatures greater than about 20°C, temperature contributes strongly to the model's prediction. You could do this with all the environmental covariates and visualise their respective responses.

If you want to be a bit fancy you can also add in the intercept:

plot(tmp.temp, final.mod$coef[1,1] + final.mod$coef[1,2]*tmp.temp + final.mod$coef[1,3]*tmp.temp^2, type="l", ylab="Total contribution of Temperature to Likelihood + Intercept", xlab="Temperature")

Another potential use of the linear predictor is to check the contribution of environmental variables when using a model to make predictions across an extent. This problem first cropped up for me when I was getting extremely high probabilities at sites that didn't cluster with the majority of high-probability sites.

To assess the contributions of environmental variables at prediction sites, follow a similar pattern to the code above, but instead of entering the environmental covariates used to fit the model, enter the environmental variables for the sites you want to predict onto.

model.pred.data <- model.matrix(final.formula,data=cbind(obs=1,PrePoint))

link.fun <- make.link("logit")

lpre.pred <- model.pred.data%*%final.mod$coef[1,]

## fitted probabilities (inverse link applied to the linear predictor)

ll <- link.fun$linkinv(lpre.pred)

This will give you the predicted probability for every point in space. Once you have these, you can look at the data points that are producing abnormally high values.

>  ll.sort <-  head(sort(ll, decreasing=TRUE))
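If you want the row indices of those sites rather than the values themselves, order() is handy:

head(order(ll, decreasing = TRUE))   # indices of the six highest probabilities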

An example in my model was site 20087, where the probability of detection was abnormally high and not aggregated with the majority of high-probability points. You can manually inspect the environmental covariates by entering the coefficient values from the model and the environmental variables from site x (in this case 20087), keeping all values except one (temperature here) at their means.

Which should look something like this:

lpre.g1.20087.t <- 3522.88236315 + -0.23010015*27.21660 + 0.01096224*tsqmn + -200.04575957*smn + 2.84021577*ssqmn + -1.98680350*omn + 0.36839100*osqmn + -0.43793932*flux3mn + 0.01525628*flux3sqmn

Environmental variables at site 20087:

(Intercept)        t       tsq        s        ssq      o     osq  flux3  flux3sq
      1.000  27.2166   740.743   34.421   1184.825  4.494  20.197  2.379    5.662

> lpre.g1.20087.t

[1] -5.747775

From this you can see that temperature isn't what's inflating the prediction at this site. After some detective work I found that salinity was the culprit. See below:

>  lpre.g1.20087.s <- 3522.88236315 + -0.23010015*tmn + 0.01096224*tsqmn + -200.04575957*34.4212 + 2.84021577*ssqmn + -1.98680350*omn + 0.36839100*osqmn + -0.43793932*flux3mn + 0.01525628*flux3sqmn

> lpre.g1.20087.s

[1] 79.23035

Going back to those nifty graphs above:

tmp.sal <- seq(34,36,length=100)

>  plot(tmp.sal,final.mod$coef[1,4]*tmp.sal+final.mod$coef[1,5]*tmp.sal^2, type=”l”,ylab=”Total contribution of Salinity to Likelihood in G1″,xlab=”Salinity”)

You can see that salinity values around 34.4 contribute hugely to the model's prediction, which could account for the really high probability at this site. With some further investigation I found that these environmental covariate values sat outside the bounds of my model space. This is problematic, as you can't estimate a relationship where you don't have any data, so the best thing to do in this situation is to remove these points from the prediction space. Once I clipped out the prediction sites whose environmental values fell outside the range of the data used to fit the model, things started looking good.
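A minimal sketch of that clipping step, assuming the prediction data PrePoint and the fitting data env_data both hold a salinity column s (the column name is hypothetical):

in.range <- PrePoint$s >= min(env_data$s) & PrePoint$s <= max(env_data$s)
PrePoint.clipped <- PrePoint[in.range, ]   # keep only sites inside the fitted range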

Well, that's my first GLM blog. I hope it makes sense and doesn't raise too many questions. I'd love some further advice on dealing with GLMs in R, so feel free to post a comment or two. Cheers, Skip.


PhD is underway.

So it's early days: the PhD is underway, I'm overrun with reading and R code, and I'm loving it. In fact I'm almost as happy as the guy in that retro wetsuit about to catch a few barrels.

I'm going to keep this first post short, but I was sent a really cool video of a scaleworm (a family of polychaetes I did my master's research on) living commensally with a keyhole limpet. The two make a great underwater Batman and Robin, with the scaleworm coming to the rescue as a good sidekick should! Enjoy.
