The Virtual Ecologist method is a form of stochastic simulation where the process of collecting, analyzing, and interpreting data are simulated. This post is the second module of the Virtual Ecologist blog series. The first module was an introduction to using the method for evaluating experimental designs. This module will introduce occupancy models and how to use the Virtual Ecologist approach to explore the design trade-off between the number of sample sites and the number of replicate samples per site. Future posts will extend this to formal power analysis and design optimization.

## Occupancy Modeling

Occupancy models are used to describe species distributional patterns. The major virtue of these models is that they account for the detection process separately from the occurrence process, thereby producing unbiased descriptions of species distributions. The sampling design used for fitting occupancy models has two components. Firstly, there is a spatial component which consists of multiple locations or sites. The occupancy model serves to describe the pattern in species occurrence across these sites. If the sites are selected in some systematic way to represent a larger spatial region (e.g. randomly), the pattern in occurrence can be inferred to the this region. Secondly, there is replicated sampling at each site. This replication provides the information needed to estimate the detection probability of the species.

Each of the two design components will impact the precision of the estimates to a different degree which should depend on the magnitude of the true occurrence probability and the true detection probability. Fore example, if occurrence probability is low across sites and detection probability is high, optimal designs will favor more sites relative to replicates per site. Alternatively, if occurrence probability in high and the detection probability is low, optimal designs will favor replicates per site over sites.

## The Simulation

To demonstrate this trade-off between sampling design components, we will build a simulation that is made up of three functions. The first function will simulate a number of sites that are occupied at a certain rate and the sampling of those sites. When this function is run, it will generate a ‘fake’ data set. We will call this the ‘data_generator’ function. The second function will employ the ‘unmarked’ R package to analyze the fake data set. We will call this the ‘analysis’ function. The third function will call the data_generator function and pass the simulated data set to the analysis function. It will then extract some output from the analysis function that we will use to grade the results. For now, we will ask it to output the standard error of the occurrence probability parameter (*psi*). To run this simulation in R, you will need to install and load the unmarked package.

```
install.packages("unmarked")
library(unmarked)
```

### Generating fake data

The first function that simulates data is as follows:

data_generator = function(psipars,ppars,nSites,nReps){ psi <- plogis(psipars) # occupancy probability p <- plogis(ppars) # detection probability # true occupancy state z <- rbinom(nSites, 1, psi) # sampling of true occupancy state y <- matrix(NA, nSites, nReps) for(i in 1:nSites) { y[i,] <- rbinom(nReps, 1, z[i]*p) } # places data in formate for unmarked umf <- unmarkedFrameOccu(y = y) umf}

This function has four inputs:

(1) The number of sites, *nSites*.

(2) The number of replicate samples per site, *nReps*.

(3) The parameter that describes the pattern in occurrence across sites, *psipars*. This input will be passed to the function as a single value in logit space.

(4) The parameter that describes the pattern in detection probability across sites, *ppars*. Similar to *psipars*, *ppars* is also a single logit-space value that will be passed to the function.

With the following code you can run the function and explore what it is doing.

data_generator(psipars = 0, ppars = -1, nSites = 10, nReps = 2)

Notice that the value of *psipar* is zero in logit space. This indicates that the occupancy probability of all sites is 0.5 (i.e. 0.5 = exp(0)/(1+exp(0))). The detection probability in logit space is specified as -1, which is a detection probability of ~0.27. The output should look something like the figure below.Each row represents a site and the two columns represent the detections from the first and second replicate sample at each site.

### Analyzing fake data

The following is the function that analyzes the fake data outputted by the data_generator function.

analysis = function(datum){ fm1 <- occu(~1 ~1, datum) converge = slot(fm1,"opt")$convergence c(coef(fm1),SE(fm1),convergence=converge) }

The only input for this function is the unmarked data object generated by the data_generator function (labeled *datum*) and the output are the parameter values, their standard errors and a number that indicates whether the model converged or not.

### Putting the simulation together

The next function that we use (*runfun*) puts the data generation and the analysis together and looks like the following:

runfun=function(psipars,ppars,nSites,nReps){ y = data_generator(psipars,ppars,nSites,nReps) out = try(analysis(y),T) output = if (inherits(out,"try-error")) { rep(NA,5) } else { out } params = c(out[1],out[2]) params }

The inputs are the same as the data_generator function and the output is the *psi* and *p* estimates. However, we could choose to output any component of the analysis or data depending on our particular question. Keep in mind that the *psi and p* are in logit space.

Notice the use of the try() function when running the analysis function, followed by a logical test. This code essentially flags when the analysis fails and inserts NAs. The failure of an analysis can become quite problematic when we automate the generation and analysis of the data in a simulation that is reiterated many times. For example, if a data set is randomly generated where a detection is made at every site and replicate, such as the example on the right, the model will produce the following error.

> analysis(y) Error: Hessian is singular. Try providing starting values or using fewer covariates.

If this error occurs when the analysis is embedded in a loop, the loop will stop and the simulation will be prematurely ended. What we would like is for the results to be flagged (perhaps by placing NAs in for the results) and for the simulation to continue. To deal with this problem, I employ a try() function followed by a logical test. The try() function will report a “try error” when the analysis fails, but will not stop the simulation. The logical test ask if there was a “try error” and returns NAs if there was or the actual model results if there wasn’t. Issues such as these errors can complicate simulations. Although the try() function and logical tests can help solve these problems, I am always looking for more efficient solutions.

You can evaluate the bias of *psi* and *p *with the following code.

out = replicate(1000,runfun(psipars=0,ppar=-1,nSites=100,nReps=3),simplify=T) windows(height=4,width=4) prob = exp(out)/(1+exp(out)) boxplot(t(prob))

This code evaluates the expected values of *psi* and *p* for a design that employs 100 sites

and 3 replicate samples per site. The resultant box plots should look something like the graph to the left. You can see that the median values for the *psi* and *p* estimates are approximately *psi* = 0.50 and *p* = 0.27, which are the true values that we simulated. This indicates that the median of the estimates is unbiased.

Next we can change the runfun function to evaluate the precision of the *psi* esitmates. This is simply done by changing the output of the runfun function as follows:

runfun=function(psipars,ppars,nSites,nReps){ y = data_generator(psipars,ppars,nSites,nReps) out = try(analysis(y),T) output = if (inherits(out,"try-error")) { rep(NA,5) } else { out } SE= out[3] SE }

You can evaluate the precision of logit *psi* with the following code.

out = replicate(1000,runfun(psipars=0,ppar=-1,nSites=100,nReps=3),simplify=T) windows(height=4,width=4) hist(out,xlab="Standard error of logit psi",main='') abline(v=median(out,na.rm=T),lwd=3,col='red')

This code evaluates the standard error (SE) of logit *psi* for a design that employs 100 sites and 3 replicate samples per site. The resultant histogram should look something like the one on the left. The SE ranges from near zero to ~40 with a median values of ~0.48 (the median value is represented by the vertical red line). For this particular run, only one NA was produced indicating that there was one incidence of error of the analysis. However, these results will vary some from run to run.

### Evaluating a range of designs

To evaluate the precision of the *psi* estimates across a range of designs, we have to define the range of design elements and loop the simulation across these elements. The code for this follows:

iter = 1000 #number of iterations sites = seq(50,100,by=5) reps = seq(4,22,by=2) home = array(NA,dim=c(iter,length(sites),length(reps))) ptm = proc.time() for(i in 1:length(sites)){ for(j in 1:length(reps)){ home[,i,j] = replicate(iter,runfun(psipars,ppars,nSites=sites[i],nReps=reps[j]),simplify=T) }} endtime = proc.time()-ptm endtime[3]/60

Running 1000 iterations may take nearly an hour. For your first run, I recommend setting iter=100 (or another smaller number).

The results file can be summaries and plotted to visualize the mean SE produced for each design combination.

results = colMeans(home,na.rm=T)/iter windows() contour(sites,reps,results,ylab="Replicates",xlab="Sites")

To the right is an isopleth of the mean SE of *psi* across all design combinations. This graph is a little messy because few iterations that result in unusually large SEs can influence the mean to a large degree. None the less, the graph demonstrates the trade-off between sites and replicates. The largest increases in the precision of the *psi* estimate are made with increases in the number of replicates up to seven. Once the design has greater that 7 replicates the trade-off between sites and replicates is more obvious. For example, a SE of 0.3 can be achieved with a design that employs 100 sites and 7 replicates or 60 sites and 15 replicates.

Another way to visualize this trade-off is to plot the probability of achieving a given SE for each design combination. The following code constructs this plot for a SE ≤ 0.30.

x = (replace(home, is.na(home), 999999)<=.3)+0 results = colSums(x,na.rm=T)/iter windows(height=4,width=4) contour(sites,reps,results,ylab="Replicates",xlab="Sites")

The plot on the left is the output of the above code. The isopleth now represent the probability of achieving a SE ≤ 0.30 for each design. The trade-off between design elements is much more obvious when presented this way. We can look at the results in different ways on the graph. One approach is to pick a value for one of the design elements and see how changes in the other element impact the probability of a SE ≤ 0.3. For example, for designs with 90 sites, we see a quick increase in the probability of a SE ≤ .3 when we increase the number of replicates from 2 to ~8. Beyond 8 replicates there is no noticeable increase in this probability. This is similar for designs that emplogy 15 replicates. We observe a large increase in the probability of SE ≤ 0.3 between 50 and 70 sites. However, increasing beyond 70 sites does not noticeably increase this probability. You can also follow one line on the graph to see all the design combinations that will result in that probability. For example, an 0.8 probability of SE ≤ 0.3 can be achieved with sites = 80 & reps = 8 or sites = 70 and reps = 10. In other words, the same precision can be achieved with 10 fewer sites if 2 more replicate samples are collected at each site. Depending on the relative cost of additional sites versus additional replicates, this design change may represent a significant saving for a field sampling program.

## Concluding remarks

With this module we demonstrated how to visualize the trade-off between occupancy design elements, which is useful to understand, but is not quite useful enough to inform an actual field sampling program. In future posts we will explore how to extend this simulation code to conduct a power analysis and optimize the efficiency of the sampling design. This is important because the relative influence of the number of sites and replicate samples per site on the statistical power will be situation dependent. Furthermore the relative costs of additional sites will be different from the relative cost of additional replicate samples per site. Thus, there will be one combination of sites and replicates that will achieve a desired statistical power for the cheapest cost, which is our goal. Hope you enjoyed this post and will check back for the next module.

Pingback: Virtual ecologist module #3: occupancy model power analysis | Dan Gwinn