The Virtual Ecologist refers to the practice of simulating complete experimental designs, that is, the ecological process, the observation process and the statistical analysis. This simulation tool can be used to evaluate trade-offs among experimental design components when design complexity prevents the use of traditional power analysis. This post is the third module of the Virtual Ecologist blog series. The first module introduced the Virtual Ecologist method and the second module demonstrated its use for exploring occupancy model designs. This third module will demonstrate the use of the Virtual Ecologist method for performing a statistical power analysis for occupancy models. Specifically, we will explore how the number of sampling site and replicate samples per site trade-off and influence the statistical power of a sampling design.

## The Question

Imagine that you are an ecologist that wants to understand how the distribution of a given species relates to a given habitat feature (or other environmental factor or management regime, etc.). To investigate this relationship, you design an occupancy study where you will sample multiple sites multiple times to generate your replicated data. But you don’t know how many sites and how many replicates per site you will need to detect the relationship you are interested in. Ideally, the researcher defines the effect size they wish to detect and an acceptable probability to detect (i.e. maybe 80% statistical power) and then evaluates a range of designs to determine which design meets the statistical power criterion for that effect size. In practice, it can be difficult to choose an effect size and acceptable power as these things may be dependent on aspects of the management system that are difficult to define, such as biological significance or the cost of not detecting the effect. For this exercise we will simply set them and move on to demonstrate the mechanics of the power analysis.

## The Simulation

Our simulation is a modification of module #2 and consists of three functions. The first function (“datagen”) simulates the combination of two processes, (i) the ecological process and (ii) the sampling process.

datagen = function(beta1, beta2,alpha1,sites,reps){ hab = rnorm(sites,0,1) logit.psi = beta1 + beta2*hab logit.p = alpha1 psi = exp(logit.psi)/(1+exp(logit.psi)) p = exp(logit.p)/(1+exp(logit.p)) z = rbinom(sites,1,psi) # sampling of true occupancy state y <- matrix(NA, sites, reps) for(i in 1:sites) { y[i,] <- rbinom(reps, 1, z[i]*p) } # places data in formate for unmarked umf <- unmarkedFrameOccu(y = y, siteCovs=as.data.frame(hab)) umf }

The ecological process is simulated as a number of sites that are occupied at a specified rate (i.e. psi, the occurrence probability). The value of psi is simulated as a function of a linear model with a covariate that we refer to as ‘hab’ to represent an environmental variable such as habitat. The relationship between psi and hab is what we are trying to detect with our simulated occupancy experiment. The output of the first function is a simulated data set in the format needed by the R package unmarked to fit an occupancy model.

The second function (“analysis”) analyzes the simulated data sets generated with the first function (“datagen”). The function fits two occupancy models to the data….i.e. a null model that excludes the habitat covariate (fm1) and a model that includes the habitat covariate (fm2). Then it compares the models with AIC.

analysis = function(datum){ fm1 <- occu(~1 ~1, datum) fm2 <- occu(~1 ~hab, datum) converge1 = slot(fm1,"opt")$convergence converge2 = slot(fm2,"opt")$convergence aic1 = 2*-logLik(fm1)+2*2 aic2 = 2*-logLik(fm2)+2*3 ((aic1-aic2)>4)+0 }

We consider the habitat covariate relationship detected if two criteria are met, (i) the correct model (fm2) has a delta AIC of 0 and (ii) the delta AIC of the null model (fm1) is >4. The output of this function is a 1 if the habitat relationship was detected or 0 if it was not.

The third function (“runfun”) simply runs the first two functions in order. You will notice the use of the try function followed by a logical test embedded in runfun.

runfun=function(sites,reps,beta1,beta2,alpha1){ dat = datagen(beta1, beta2,alpha1,sites,reps) out = try(analysis(dat),T) output = if (inherits(out,"try-error")) { 0 } else { out } output }

The reason for the use of the try function is that the analysis function could fail, but we do not want the simulation to simply stop when this happens. We want it to record the analysis failure as a failure to detect the habitat effect and continue with the simulation. The reason why the analysis function could fail is that the datagen function could randomly produce a simulated data set that does not contain enough information to fit the data set as would happen is there were zero detections.

## Using the simulation to evaluate statistical power

To use the simulation, we must first define the input parameters. The parameters beta1 and beta2 are the intercept and habitat effect of the occurrence sub-model. These represent the “true” values for the study system. You would normally want to set bet1 to a value that approximates the true value of the study system. This value could be determined by an analysis of some preliminary data, borrowed from similar systems, or simply expert opinion. For our example we will arbitrarily set it to a value of one. Remember that this value is in logit space so a value of one translates to an

beta1 = 1 beta2 = -1 alpha1 = -1 sites = seq(11,110,length=10) reps = seq(2,20,by=2) iter = 200

average occurrence probability of about 0.73. For this exercise, we set the habitat effect to a value of -1, which models a negative relationship of occurrence to habitat depicted in the graph to the left. The input parameter alpha1 represents the average detection probability across sites. We set this value to -1, which translates to a mean detection probability of about 0.27. We now specify a vector of the number of sites (“sites”) and the number of replicate samples per site (“reps”). We will evaluate sampling designs of all possible combinations of these values. The parameter input, “iter” is the number of times you will simulate and analyze a data set generated from all possible combinations of the sites and reps vectors. Keep in mind that the sites and reps vectors contain 10 values each in our example. This represents 100 possible sampling designs for which a simulated data set will be generated and analyzed with two occupancy models. That represents one iteration of the simulation. Performing 1000 iterations could take some time. On my laptop it took about one hour. So start off with few iteration just to make sure the simulation is running properly and then do a full 500 to a 1000 runs when you have the time. The code for running all iterations of the simulation follows.

home = array(NA,dim=c(iter,length(sites),length(reps))) ptm = proc.time() for(i in 1:length(sites)){ for(j in 1:length(reps)){ home[,i,j] = replicate(iter,runfun(sites=sites[i],reps=reps[j],beta1,beta2, alpha1),simplify=T) }} endtime = proc.time()-ptm endtime[3]/60 results = colMeans(home,na.rm=T) windows(5,5) contour(sites,reps,results,ylab="Replicates",xlab="Sites")

## Interpretation

The interpretation of the output is fairly straight forward. Below is a graph of 1000 iterations. The isopleths are the mean of the output across all iterations for each design combination. Remember that the output of runfun was a logical test where a 1 indicated that the habitat effect was detected and a zero indicated that it was not. So, for each design combination (i.e. number of sites and number of replicates) the output will be a vector of ones and zeros that is the length of the number of specified iterations. The mean of this vector (i.e. the probability of getting a one) is analogous to the statistical power of the design to detect the specified effect.

So what can we conclude from these results? One useful way to look at this graph is to pick a particular statistical power and follow the isopleth to see all the designs that are expected to achieve it. For example, if we pick 80% statistical power (0.8 isopleth on graph), we can see that designs that achieve this range between 60 sites sampled with 20 replicates to 110 sites sampled with 5 replicates. In this way we can easily visualize the the site versus replicate design trade-off.

When evaluating monitoring designs, we are often starting off with a design that is already established and are asking the question of how we might make the design more suitable (i.e. higher statistical power? more cost effective?). For example what if we started with a design that employed 10 replicate samples at 60 sites. This design is expected to achieve a statistical power of ~0.7, however the goal of the monitoring program is to achieve a statistical power of 0.8. We may ask, “How can we increase the statistical power of our program to 0.8 for the least cost?” Well, we can either increase the design by adding an additional 10 replicates to each of the existing 60 sites or sample an additional 20 sites with 10 replicates. Either of these additions to the monitoring program are expected to achieve the goal of 0.8 statistical power….the questions now becomes about cost. In other words, what is the relative cost of adding an additional site versus adding an additional replicate per site. If this is known than the most cost effective way to improve the sampling design can be known.

I think that’s enough for now. Hope you enjoyed the module. All R code is below.

## Full simulation code:

library(unmarked) #------------------------------------------------------------------------------# # Function that makes data #------------------------------------------------------------------------------# datagen = function(beta1, beta2,alpha1,sites,reps){ hab = rnorm(sites,0,1) logit.psi = beta1 + beta2*hab logit.p = alpha1 psi = exp(logit.psi)/(1+exp(logit.psi)) p = exp(logit.p)/(1+exp(logit.p)) z = rbinom(sites,1,psi) # sampling of true occupancy state y <- matrix(NA, sites, reps) for(i in 1:sites) { y[i,] <- rbinom(reps, 1, z[i]*p) } # places data in formate for unmarked umf <- unmarkedFrameOccu(y = y, siteCovs=as.data.frame(hab)) umf } #------------------------------------------------------------------------------# # Function that analyzes data #------------------------------------------------------------------------------# analysis = function(datum){ fm1 <- occu(~1 ~1, datum) fm2 <- occu(~1 ~hab, datum) converge1 = slot(fm1,"opt")$convergence converge2 = slot(fm2,"opt")$convergence aic1 = 2*-logLik(fm1)+2*2 aic2 = 2*-logLik(fm2)+2*3 ((aic1-aic2)>4)+0 } #------------------------------------------------------------------------------# # Function that runs the datagen function and plugs it into the analysis fun #------------------------------------------------------------------------------# runfun=function(sites,reps,beta1,beta2,alpha1){ dat = datagen(beta1, beta2,alpha1,sites,reps) out = try(analysis(dat),T) output = if (inherits(out,"try-error")) { 0 } else { out } output } #------------------------------------------------------------------------------# # Simulation inputs #------------------------------------------------------------------------------# beta1 = 1 beta2 = -1 alpha1 = -1 iter = 1000 sites = seq(11,110,length=10) reps = seq(2,20,by=2) #------------------------------------------------------------------------------# # This part runs the simulation #------------------------------------------------------------------------------# home = array(NA,dim=c(iter,length(sites),length(reps))) ptm = proc.time() for(i in 1:length(sites)){ for(j in 1:length(reps)){ home[,i,j] = replicate(iter,runfun(sites=sites[i],reps=reps[j],beta1,beta2, alpha1),simplify=T) }} endtime = proc.time()-ptm endtime[3]/60 results = colMeans(home,na.rm=T) windows(5,5) contour(sites,reps,results,ylab="Replicates",xlab="Sites")

Thanks Dan for continuing to work on this!

Thanks Bill.

Things have gotten busy and I put it down for a while, but I would like to keep up with it. Its a fun side project!