Recently, I reviewed several papers in which the statistical analysis was pseudoreplicated. I pointed this out to the authors and made some suggestions to remedy the problem. The responses to the reviewer (i.e. me) came back with several reasons why the authors thought the design was OK (and even preferred). Some of the responses seemed like simple misunderstandings of pseudoreplication, while others I wasn't so certain about. Since I don't want the review process to be unnecessarily difficult for anyone, and I didn't want to read a dozen papers on the topic, this seemed like a perfect opportunity to use the Virtual Ecologist approach to explore and demonstrate the impacts of a pseudoreplicated statistical design on inference.

To keep this investigation of pseudoreplication tractable, I want to focus on several questions in the context of using linear models for the statistical analysis. The questions are:

- What is the impact of pseudoreplication on type-I error rates?
- What is the impact of pseudoreplication on effect size estimates?
- What is the impact of pseudoreplication on statistical power?
- What is the impact of pseudoreplication on correlation coefficients (or r²)?
- Can I statistically detect and demonstrate that pseudoreplication is or is not a problem in my analysis?

We will cover the first four questions in this blog and leave the last one for the next. But first, we should probably review what pseudoreplication is and establish a model example to work with.

### What is pseudoreplication?

I like to think of pseudoreplication not as a sampling/statistical design unto itself, but as a *consequence* of a sampling/statistical design because, in the end, it's the consequence for inference that is important. From this perspective, pseudoreplication is when the p-value of a statistical analysis is artificially low because there are unaccounted-for group effects or dependencies in the data set. Said differently, the statistical analysis requires the assumption of independence among data points, but there are groups of data points that are not (quite) independent. This is not an official definition and is likely imperfect in some ways, so let's move on to an example to demonstrate the sort of sampling/statistical design where pseudoreplication could impact inference.

### An Australian fish example.

Let's say you are evaluating the effects of flood history on an index of the fish community in the Murray-Darling Basin in southeast Australia. You sample the fish community at a variety of locations across the basin and obtain flow information for each location. Your spatial sampling design isn't quite random because you are confined by access to the river (a common problem), and it looks like the following figure.

You can immediately see that the logistical limitation of river access has resulted in 15 clusters of four sample sites. This is a total of 60 data points, and you intend to analyze your data with a linear regression of the form y = b + mx, where y is the fish community metric, x is a measure of flood history, b is the intercept and m is the slope…i.e. the effect of flood history on the fish community metric. A lot could go wrong with a design like this because the fish communities may be mixing among sample sites, causing spatial auto-correlation among the fish community metrics. The flood history could also be spatially auto-correlated. For the sake of learning, we need to simplify our problem, so let's say that there is no mixing of the fish communities among sample sites. But let's add an additional limitation: let's say that there is only one flow gauge per group…perhaps it is located at the river access point. This means that the four sample sites in each group share a value for the x data point.

So let’s review the design: 60 data points located in 15 spatial clusters (groups). The four sample sites in each group share the same measure of flood history.
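As a quick sanity check on this layout, here is a minimal base-R sketch of the data structure (illustration only; the actual simulation function comes later):

```r
# Hypothetical layout: 15 groups (river-access clusters) of 4 sites each.
ng = 15; ns = 4
group = rep(1:ng, each = ns)    # group id for each of the 60 sites
x = rep(rnorm(ng), each = ns)   # flood history: one gauge value shared per group
length(x)                       # 60 data points...
length(unique(x))               # ...but only 15 unique x values
```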

This sort of design should immediately grab your attention because there is a mismatch between the unit of inference of the y variable (i.e. the site) and the x variable (i.e. the group). If you ignore the groupings, as in our linear regression, this design is at risk of pseudoreplication, but that does not mean that inference will be impacted. This is what we will explore with our simulation. So let's start building it.

### Building the simulation.

We are going to build a function that simulates data from our example sampling design and analyzes it with a linear regression which assumes that all data points are independent. The function will have five inputs:

- ng = the total number of groups.
- ns = the number of samples per group (ns*ng = total sample size).
- effect = the slope of the linear regression or flood effect size.
- tsig = standard deviation describing variation in fish metric among samples.
- gsig = standard deviation describing variation in the mean fish metric among groups.

The code is below:

```r
statsig = function(ng, ns, effect, tsig, gsig){
  x = rnorm(ng, 0, 1)
  x1 = rep(x, each = ns)
  y = .2 + effect*x + rnorm(length(x), 0, gsig)
  y1 = rnorm(length(x1), rep(y, each = ns), tsig)
  fit1 = lm(y1 ~ x1)
  s1 = summary(fit1)
  round(c(slope1 = s1$coefficients[2,1], R1 = s1$adj.r.squared,
          Type_I_error_1 = s1[[4]][2,4] <= 0.05), 3)
}
```

So, now let's step through what each line of code does. The first line simulates a single flood variable for each of the ng groups (15 in our example). It's meant to mimic a generic centered and scaled covariate. The second line expands the vector x to cover each of the ns sample sites in a group, by repeating each value of x ns times. The third line predicts the expected (or true mean) fish metric for each group based on the flood variable x and the effect size (I set the intercept arbitrarily to a value of 0.2), and then adds some random variation among the expected fish metrics for the groups (with a standard deviation of gsig).

This is a very important part of the function. If gsig ~ 0, the expected values of the fish metric y will only be a function of the flood variable. This formalizes the assumption that there are no other things influencing the variation in the fish metric among groups. But if gsig > 0, the expected y will vary beyond the effect of floods. This simulates a situation where there are other things in the environment varying among the groups and influencing the fish metric. Make sure this makes sense to you because it is key.

The next line (line 4) adds some additional "residual" variation among the samples within the groups. Our resulting simulated data are x1 (flood variable) and y1 (fish metric). Line 5 fits them with a standard linear regression. This analysis does not account for any potential variation in the expected values of the fish metric among groups beyond that predicted by the effect of x1. The next line (6) defines the summary of the model fit, from which we will extract an effect size (slope1), an r-squared value (R1), and an indicator of whether the p-value was less than or equal to 0.05 (Type_I_error_1). Type_I_error_1 will equal 0 when the p-value is > 0.05 and 1 when the p-value is ≤ 0.05. And the final line puts those three outputs into a vector.

So let's run this thing and see what the output looks like. Simply run the function code and then the following code.

```r
statsig(ns=4, ng=15, effect=0, tsig=.5, gsig=0)
```

This code simulates an effect size of zero, a residual standard deviation of 0.5, and no variation in the true group means of the fish metric beyond what is explained by the flood variable (which in this case is nothing, because effect=0).

My output looks like this but yours will look different because of the random components:

The estimated slope and correlation coefficient are near zero and the p-value was greater than 0.05 (indicated by a zero). This is the correct answer, but we can't know whether the analysis is working correctly until we repeat the simulation for many iterations and look at the average of the outputs. The following code uses the replicate function to do this:

```r
out = replicate(1000, statsig(ns=4, ng=15, effect=0, tsig=.5, gsig=0), simplify=TRUE)
rowMeans(out)
```

My output looks like the following:

This is telling me that the average estimated slope is near zero, the average R1 is near zero, and the proportion of iterations that resulted in a p-value ≤ 0.05 was ~0.05. In other words, the type-I error rate, as defined by an alpha of 0.05, was truly 0.05. This means that the analysis is working perfectly. But how can that be? The design is pseudoreplicated, right? It works because the consequences of an apparently pseudoreplicated design will not be realized unless there is variation in the group means beyond what can be explained by the independent variables and the residual error.

**The consequences of an apparent pseudoreplicated design will not be realized unless there is variation in the group means beyond what can be explained by the independent variables and the residual error.**

In our case, the analysis performed perfectly because gsig was set to zero. So let's set it to a non-zero value to simulate some variation in the group means and see what happens. Try setting gsig to the same value as tsig, 0.5.
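Assuming the statsig function and replicate pattern from above, the only change is the gsig argument:

```r
# Same simulation, but now with group-level variation (gsig = 0.5)
out = replicate(1000, statsig(ns=4, ng=15, effect=0, tsig=.5, gsig=.5), simplify=TRUE)
rowMeans(out)
```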

My output looks like the following:

Notice that the big effect of this is that the realized type-I error rate has increased to 0.22. This is the main consequence of, and concern with, pseudoreplicated designs: you are more likely to get a p-value ≤ 0.05 when, in fact, there is no statistical effect. Furthermore, it's not possible to predict how much your type-I error rate will become inflated. This answers our first question about the effect of pseudoreplication on type-I error rates.

**Pseudoreplicated designs result in artificially inflated type-I error rates when group effects are unaccounted for. In other words, you are more likely to falsely conclude a statistical effect when, in fact, there is none.**
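To see how the inflation grows with the amount of unexplained group-level variation, here is a minimal, self-contained sketch (base R only; the helper name type1 and the gsig values I loop over are my own choices, and exact numbers will vary run to run):

```r
# Realized type-I error rate of the naive regression as gsig increases.
set.seed(1)
type1 = function(gsig, ng = 15, ns = 4, tsig = 0.5, nsim = 500) {
  hits = replicate(nsim, {
    x  = rnorm(ng)                        # one flood value per group
    x1 = rep(x, each = ns)                # shared by all sites in a group
    mu = 0.2 + rnorm(ng, 0, gsig)         # true effect = 0, plus group noise
    y1 = rnorm(ng * ns, rep(mu, each = ns), tsig)
    summary(lm(y1 ~ x1))$coefficients[2, 4] <= 0.05
  })
  mean(hits)                              # proportion of "significant" fits
}
sapply(c(0, 0.25, 0.5, 1), type1)         # error rate climbs well above 0.05
```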

Now let's modify our simulation to include an appropriate analysis of the data that accounts for the group effects. There are two ways to achieve this:

- **Use a mixed effects model that includes a random group effect.**
- **Average the y and x variables for each group and use this as a reduced data set.**

Let's start with the mixed-model approach. For this we will need to load the packages 'lme4' and 'lmerTest', and add several lines of code to our simulation function. First, we need to add code that creates a variable identifying each group (line 3). This variable is then used in a mixed model (line 7), with results summarized on line 9. We then add the mixed-model slope and an indicator of statistical significance to the output vector. Note that the lmer summary does not provide r-squared values, so we will skip this comparison for now. The new function looks like the following:

```r
library(lme4)
library(lmerTest)

statsig = function(ns, ng, effect, tsig, gsig){
  x = rnorm(ng, 0, 1)
  x1 = rep(x, each = ns)
  group = as.factor(rep(1:ng, each = ns))
  y = .2 + effect*x + rnorm(length(x), 0, gsig)
  y1 = rnorm(length(x1), rep(y, each = ns), tsig)
  fit1 = lm(y1 ~ x1)
  fit2 = lmer(y1 ~ x1 + (1|group), REML = FALSE)
  s1 = summary(fit1)
  s2 = summary(fit2)
  round(c(slope1 = s1$coefficients[2,1], R1 = s1$adj.r.squared,
          Type_I_error_1 = s1[[4]][2,4] <= 0.05,
          slope2 = s2$coefficients[2,1],
          Type_I_error_2 = s2$coefficients[2,5] <= 0.05), 3)
}
```

Now let's run the simulation with a non-zero value for gsig and look at the output. My output looks like this:

There are two important things to notice. First, notice that the type-I error rate of the mixed model (Type_I_error_2) is close to the appropriate value of 0.05 now that we have accounted for the group effect. Also notice that the average effect estimates are exactly the same (i.e. slope1 = slope2). In other words, there is no effect of pseudoreplication on the point estimate of the effect size. This is the answer to question #2.

**There is no effect of pseudoreplication on the point estimates of the effect sizes.**

This means that you cannot compare the effect size of a pseudoreplicated analysis to an appropriate analysis to determine if pseudoreplication is a problem.

Now let's look at how pseudoreplication impacts statistical power (question #3). For this we need to simulate an actual effect that is different from zero. Let's start by rerunning our simulation with an effect size of 0.2 and having a look at the output.
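With the expanded function loaded, the rerun would look like this (same settings as before, but with a true effect of 0.2):

```r
out = replicate(1000, statsig(ns=4, ng=15, effect=.2, tsig=.5, gsig=.5), simplify=TRUE)
rowMeans(out)
```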

First we have to change our interpretation of the "Type_I_error" output. This output is, literally, the proportion of simulation iterations that resulted in a p-value ≤ 0.05. When we simulate an effect = 0, it is a measure of the realized type-I error rate; however, when we simulate an effect ≠ 0, it is a measure of statistical power…i.e. the proportion of times we conclude a non-zero slope when it is truly non-zero. So, we can see that analyzing the pseudoreplicated design with the linear regression results in greater statistical power than accounting for the group effect with the mixed model.

However, our output could be misleading because some p-values could be associated with effect estimates whose confidence intervals do not include the true effect size. For example, what if an iteration resulted in an estimate of slope1 = -0.2 with a 95% confidence interval of -0.29, -0.02? This would be counted as a statistically significant result and falsely inflate our estimate of statistical power. So, let's alter the simulation a bit and take a different approach to determining whether the analyses are getting appropriately correct answers. We are going to add two logical tests that ask whether the 95% confidence intervals of the effect estimates exclude zero and whether the true effect is within these confidence intervals. If the true value of the effect is within the confidence interval and the confidence interval does not include zero, then we conclude that the effect was *correctly* detected. This is probably a more appropriate measure of statistical power for this context. The code with the new logical tests (pw1 and pw2) is below.

```r
statsig = function(ns, ng, effect, tsig, gsig){
  x = rnorm(ng, 0, 1)
  x1 = rep(x, each = ns)
  group = as.factor(rep(1:ng, each = ns))
  y = .2 + effect*x + rnorm(length(x), 0, gsig)
  y1 = rnorm(length(x1), rep(y, each = ns), tsig)
  fit1 = lm(y1 ~ x1)
  fit2 = lmer(y1 ~ x1 + (1|group), REML = FALSE)
  s1 = summary(fit1)
  s2 = summary(fit2)
  pw1 = effect <= (s1[[4]][2,1] + 1.96*s1[[4]][2,2]) &
        effect >= (s1[[4]][2,1] - 1.96*s1[[4]][2,2]) &
        0 <= (s1[[4]][2,1] + 1.96*s1[[4]][2,2]) * (s1[[4]][2,1] - 1.96*s1[[4]][2,2])
  pw2 = effect <= (s2$coefficients[2,1] + 1.96*s2$coefficients[2,2]) &
        effect >= (s2$coefficients[2,1] - 1.96*s2$coefficients[2,2]) &
        0 <= (s2$coefficients[2,1] + 1.96*s2$coefficients[2,2]) *
             (s2$coefficients[2,1] - 1.96*s2$coefficients[2,2])
  round(c(slope1 = s1$coefficients[2,1], R1 = s1$adj.r.squared,
          Type_I_error_1 = s1[[4]][2,4] <= 0.05, pw1 = pw1,
          slope2 = s2$coefficients[2,1],
          Type_I_error_2 = s2$coefficients[2,5] <= 0.05, pw2 = pw2), 3)
}
```

Now let's run our simulations with this code.

We can see that the values of pw1 and pw2 are slightly different from the Type_I_error outputs. The difference between pw2 and Type_I_error_2 is quite small and likely due to my approximate confidence interval calculations (±1.96 standard errors rather than the exact quantiles behind the p-values). On the other hand, the difference between pw1 and Type_I_error_1 is much larger and is due to statistically significant results where the slope estimate is badly biased relative to the true effect (i.e. the truth is not contained within the confidence intervals).

So, we might conclude that statistical power will be greater if we are willing to accept a pseudoreplicated design. This was one argument for not accounting for pseudoreplication put forth to me in one of the reviewer responses, and, seductive as it is (I want that statistical significance so bad!!), it is not what it seems. For one, your analysis will suffer inflated type-I error rates of unknown magnitude, and two, the increased statistical power is just due to the random chance that the analysis mistook noise for a pattern similar to the true effect. It is not reliable.

**A pseudoreplicated analysis gives the false impression of increased statistical power compared to an appropriate analysis.**

If you feel that detecting the potential effect is more important than the risk of a type-I error, I would advise using the appropriate analysis, choosing a larger alpha such as 0.1, and making this choice explicit in your methods. The benefit of doing this is that you will have a known type-I error rate rather than an unknown one…and you get the coveted increased statistical power. Furthermore, this practice is not that uncommon in the ecological literature and is likely more appropriate than alpha = 0.05 in a conservation context (Mapstone 1995, Field et al. 2004). I do it sometimes…e.g. Beesley et al. (2014) and King et al. (2016). Let's try it in our simulation. Let's increase the threshold for statistical significance in our mixed model to alpha = 0.1 and see how it impacts the results.

We simply need to change the critical value in the pw2 logical test from 1.96 to 1.645 and change the alpha in Type_I_error_2 from 0.05 to 0.1, and then rerun the simulation.

```r
statsig = function(ns, ng, effect, tsig, gsig){
  x = rnorm(ng, 0, 1)
  x1 = rep(x, each = ns)
  group = as.factor(rep(1:ng, each = ns))
  y = .2 + effect*x + rnorm(length(x), 0, gsig)
  y1 = rnorm(length(x1), rep(y, each = ns), tsig)
  fit1 = lm(y1 ~ x1)
  fit2 = lmer(y1 ~ x1 + (1|group), REML = FALSE)
  s1 = summary(fit1)
  s2 = summary(fit2)
  pw1 = effect <= (s1[[4]][2,1] + 1.96*s1[[4]][2,2]) &
        effect >= (s1[[4]][2,1] - 1.96*s1[[4]][2,2]) &
        0 <= (s1[[4]][2,1] + 1.96*s1[[4]][2,2]) * (s1[[4]][2,1] - 1.96*s1[[4]][2,2])
  pw2 = effect <= (s2$coefficients[2,1] + 1.645*s2$coefficients[2,2]) &
        effect >= (s2$coefficients[2,1] - 1.645*s2$coefficients[2,2]) &
        0 <= (s2$coefficients[2,1] + 1.645*s2$coefficients[2,2]) *
             (s2$coefficients[2,1] - 1.645*s2$coefficients[2,2])
  round(c(slope1 = s1$coefficients[2,1], R1 = s1$adj.r.squared,
          Type_I_error_1 = s1[[4]][2,4] <= 0.05, pw1 = pw1,
          slope2 = s2$coefficients[2,1],
          Type_I_error_2 = s2$coefficients[2,5] <= 0.1, pw2 = pw2), 3)
}
```

The following are two outputs. The first is with a true effect = 0.2 to evaluate statistical power and the second is with a true effect = 0 to evaluate the type-I error rate.
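For reference, the two runs would look like this (assuming the alpha-adjusted statsig above; as noted below, the pw logical tests are not meaningful when effect = 0):

```r
# Power run: true effect = 0.2
out = replicate(1000, statsig(ns=4, ng=15, effect=.2, tsig=.5, gsig=.5), simplify=TRUE)
rowMeans(out)

# Type-I error run: true effect = 0 (ignore the pw outputs here)
out = replicate(1000, statsig(ns=4, ng=15, effect=0, tsig=.5, gsig=.5), simplify=TRUE)
rowMeans(out)
```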

And…

Inspection of the first output shows us that the use of the mixed model with an alpha = 0.1 produces a statistical power nearly as high as the apparent statistical power of the pseudoreplicated linear regression. The second output shows us that the type-I error rate of the mixed model is very close to the expected value of 0.1 (I removed the logical test because it does not work properly with an effect=0). This is a nice option to keep in mind.

Let's wrap this lesson up with an evaluation of the effects of pseudoreplication on correlation coefficients, or in our case the coefficient of determination…i.e. r² (question #4). For this investigation, let's alter the code to account for the group effects by averaging the y and x variables within each group and applying a simple linear regression to this reduced data set.

```r
statsig = function(ns, ng, effect, tsig, gsig){
  x = rnorm(ng, 0, 1)
  x1 = rep(x, each = ns)
  group = as.factor(rep(1:ng, each = ns))
  y = .2 + effect*x + rnorm(length(x), 0, gsig)
  y1 = rnorm(length(x1), rep(y, each = ns), tsig)
  y2 = aggregate(y1, list(group), mean)$x
  x2 = aggregate(x1, list(group), mean)$x
  fit1 = lm(y1 ~ x1)
  fit2 = lm(y2 ~ x2)
  s1 = summary(fit1)
  s2 = summary(fit2)
  round(c(slope1 = s1$coefficients[2,1], R1 = s1$adj.r.squared,
          Type_I_error_1 = s1[[4]][2,4] <= 0.05,
          slope2 = s2$coefficients[2,1], R2 = s2$adj.r.squared,
          Type_I_error_2 = s2[[4]][2,4] <= 0.05), 3)
}
```

Notice how I aggregated the x and y variables by the groups (i.e. y2 and x2) and replaced the mixed model with a standard regression of the group averages in the above code. Now run the simulation with an effect size of zero.
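That is (assuming the aggregated-regression version of statsig above):

```r
out = replicate(1000, statsig(ns=4, ng=15, effect=0, tsig=.5, gsig=.5), simplify=TRUE)
rowMeans(out)
```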

My output looks like the following:

The true r² should be zero, and both analyses result in a mean r² value that is fairly close to zero. Let's run the simulation with a slope of 0.2 and look at the results.

Hmm…the r² from the pseudoreplicated analysis is slightly smaller than that from the analysis that accounted for the group effect, but the difference is small. Let's automate this simulation to repeat itself across a vector of different effect sizes to speed up the learning process. We will do this by defining a vector of effect sizes and then looping the simulation across the vector.

```r
ptm = proc.time()
effect_vec = seq(0, 2, length = 20)
out = replicate(1000, statsig(ns=4, ng=15, effect=effect_vec[1], tsig=.5, gsig=.5), simplify=TRUE)
results = rowMeans(out)
for(i in 2:length(effect_vec)){
  out = replicate(1000, statsig(ns=4, ng=15, effect=effect_vec[i], tsig=.5, gsig=.5), simplify=TRUE)
  results = rbind(results, rowMeans(out))
}
endtime = proc.time() - ptm
endtime[3]/60
results
```

I added several lines of code at the beginning and end so that we can see how long it takes the simulation to run. It could take a few minutes so I’ll go get a cup of coffee.

OK…it took about 2.5 minutes, not quite enough time to get that cup of coffee. Let's have a look at the output by looking at the results matrix.

We can see that the mean R1 of the pseudoreplicated analysis is not exactly the same as R2 of the appropriate analysis. Let's make a graph so we can visualize it better.

```r
windows(4, 4)  # opens a plot window on Windows; on Mac/Linux use quartz() or x11()
plot(effect_vec, results[,'R2'], type='l', ylim=0:1,
     xlab='Effect size', ylab='Coefficient of determination')
lines(effect_vec, results[,'R1'], type='l', lty=2)
legend(.3, .2, bty='n', lty=1:2,
       legend=c('Correct analysis','Pseudoreplicated analysis'), cex=.85)
```

**A pseudoreplicated analysis produces systematically biased coefficients of determination.**

We can see from the graph that the pseudoreplicated analysis produces coefficients of determination that are systematically biased in the negative direction. This is likely the result of the residual variance being misspecified in the pseudoreplicated analysis. In other words, the linear regression assumes that the residual errors are independent, but they are not: the residuals contain dependencies due to the group effects that are not being explained by the linear model. In the analysis of the reduced data set, however, we removed those dependencies by averaging the y and x variables within each group. So this answers our fourth question and concludes this module of the virtual ecologist.

### Summary

Let's summarize what we have learned.

Analyzing a data set with linear models that do not account for groupings will be at risk of pseudoreplication if there is variation in the group means that cannot be explained by the model parameters or the residual variance.

If this variation exists, the pseudoreplicated analysis will (1) result in inflated type-I error rates of unknown magnitude, (2) produce the exact same effect size estimates as the appropriate analysis, (3) give the false impression of increased statistical power, and (4) result in negatively biased r² values for the linear model.

We saw how using a mixed effects model that models the intercept of the regression as a random effect across groups OR averaging the y and x variables within groups removes the potential effects of pseudoreplication.

We also saw how accounting for pseudoreplication and adjusting the alpha level of the analysis (e.g. α = 0.05 to α = 0.10) may be a more honest way to truly increase statistical power while maintaining a known type-I error rate.

So, in the future when I review a paper with a design that is at risk of pseudoreplication, I will provide three suggestions: (1) Use a mixed effects model to account for the group effect, (2) reduce the data set to the group means of the y and x variables, or (3) provide a convincing justification for why it is unrealistic to expect variation in the group means beyond that explained by the model residual and covariates. And lastly, if #3 is not possible and doing #1 or #2 is too painful because of the loss of perceived statistical power, then consider if relaxing the alpha level is appropriate for the context of the study.

In the next module, we will explore a fourth option which is to perform a test to detect if group effects are present in the data set. Hope you enjoyed.

Really nice post Dan. Good work.

Fantastic post Dan! Wondering if you’re thinking of using this as the starting point for a useful paper for young ecologists. I think it reads really quite well to demonstrate and teach some good take home points on pseudoreplication, and how to solve these problems.

Also, for the stats jargon, you define the lmer model as a “mixture model” but did you mean a “mixed model”? I truly don’t know if there is a difference in your application but I think it implies a difference in their broader usages.

I did mean a mixed model! Good catch, Kyle. All fixed. I really enjoy writing these little lesson blogs, but I really don’t know if there is an appropriate publication forum for something like this. Any ideas?….

I’m not sure. I see maybe a PLoS One or Methods in Ecology avenue? I think it reads really well in your conversational style, which reminds me of some of Alain Zuur’s and Ben Bolker’s articles. I think Zuur et al. 2010 in MEE and Bolker et al. 2009 in TREE come to mind (although they do get into the more typical academic writing at times too). But you put together a really good lesson on pseudoreplication, inference, statistical power, and why we care about these sorts of things with a cool case study… so I feel it’s all there!