**The Biotelligences Glossary**

This is our new initiative to help you with your biostatistics:

**The Biotelligences Glossary**. We will regularly post definitions of statistical words and concepts (in plain English!). Entries are in alphabetical order. Use this glossary for your own lectures, publications, lab meetings... Read, copy, paste, re-write, quote, etc.

This glossary is yours.

Multiple comparisons (the inflation of the alpha threshold and the FWER)

P-values (see this entry) quantify the probability of observing rare events, namely a result as large as an observed difference or correlation, when there is in fact no effect to be detected. In this context, the larger the number of tests we perform on a dataset, the more likely we are to find rare events by chance alone. This problem is called the **inflation of the alpha threshold** or the cumulative Type 1 error. Although this issue concerns all branches of science, some domains are particularly vulnerable (e.g. omics or functional brain imaging).

To understand this inflation, we must remember that even when the null hypothesis is true (meaning that there is no difference or no correlation), a statistical test has a small probability of erroneously rejecting it and therefore concluding that there is a likely effect. This probability is the alpha threshold, which is very often set at 0.05 in life sciences. Alpha is valid only if ONE comparison is made between two groups. When more than one comparison is performed, and therefore more than two groups are compared, the probability of obtaining AT LEAST one significant result by chance alone, and thus at least one false positive result, is inflated.

Here is the mathematical demonstration of this problem:
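The demonstration itself does not appear in the text as extracted; the standard result, assuming k independent comparisons each tested at threshold α, is:

```latex
\Pr(\text{at least one false positive}) \;=\; 1 - (1 - \alpha)^{k}
```

For example, with α = 0.05 and k = 10 independent comparisons, this probability is 1 − 0.95¹⁰ ≈ 0.40, i.e. roughly a 40% chance of at least one false positive.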

The series of comparisons made is called the “family”; the probability of making at least one Type 1 error is therefore called the **Familywise Error Rate (FWER)**. The FWER quickly tends to 1 as the number of groups increases, meaning that we can be almost certain that at least one false positive result will appear among all our comparisons. Various strategies have been devised to overcome this problem and, although none is perfect, a correction for multiple comparisons is strongly recommended to prevent false positives.
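As a minimal illustration (not part of the original entry; the Bonferroni correction is just one of the correction strategies alluded to above), the inflation and a corrected per-test threshold can be computed directly:

```python
# The probability of at least one false positive among k independent tests,
# and the Bonferroni-corrected per-test threshold that restores it.

def fwer(alpha: float, k: int) -> float:
    """Probability of at least one Type 1 error among k independent tests."""
    return 1 - (1 - alpha) ** k

alpha, k = 0.05, 10
print(f"FWER for {k} comparisons at alpha = {alpha}: {fwer(alpha, k):.3f}")

# Bonferroni: test each comparison at alpha / k to keep the FWER near alpha.
print(f"Per-test threshold: {alpha / k}")
print(f"FWER after correction: {fwer(alpha / k, k):.3f}")
```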

Power


To start with a definition:

*The power of a statistical procedure can be defined as the probability of detecting an effect if there is an actual effect to be detected.*

The notion of power is intimately interconnected with the notion of **sample size**. Indeed, as variability is inevitable in biology, the probability of erroneously concluding that there is a difference solely due to sampling inconsistency increases as sample sizes shrink. Imagine a study that aims to evaluate whether males and females have different heights. A design based on samples of only 3 males and 3 females has a substantial chance of concluding that there is no difference, especially if one subject has an unusual stature. The larger the sample, the more accurate the estimation of the true means of the populations, and the more confident the experimenter can be when assuming the difference observed between the samples is genuine.

But sample size is not the only factor that influences statistical power. Firstly, power increases when the **effects are larger** (e.g. if males and females differ by 20 cm, the chances of detecting this difference are higher than if the difference is 10 cm). Secondly, power increases when **variability** decreases (if the difference is 10 cm, the chances of detecting it are higher if the average individual deviation from the mean height is 2 cm than if it is 15 cm). Finally, power is increased when the **alpha** threshold is higher, although this means that the probability of detecting a true effect is augmented along with the risk of getting a false positive.

In conclusion, always optimize the power of your design by increasing experimental group sizes (of course this must be balanced with technical, ethical and financial constraints), along with your efforts to reduce variability where possible.
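These three levers (effect size, variability, sample size) can be seen in a small sketch using a normal approximation for a two-sample comparison of means; the numbers are illustrative and this is not a substitute for a proper power analysis:

```python
# Approximate power of a two-sided two-sample comparison of means,
# using a normal (z) approximation rather than a full t-test.
from statistics import NormalDist

def approx_power(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability of detecting a true mean difference `delta` with two groups
    of size n, each with standard deviation sigma (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    se = sigma * (2 / n) ** 0.5                    # SE of the difference in means
    return 1 - NormalDist().cdf(z_crit - delta / se)

# Height example from the text: power rises with larger effects,
# lower variability and larger samples.
print(round(approx_power(delta=10, sigma=15, n=20), 2))   # baseline
print(round(approx_power(delta=20, sigma=15, n=20), 2))   # larger effect
print(round(approx_power(delta=10, sigma=2,  n=20), 2))   # less variability
print(round(approx_power(delta=10, sigma=15, n=100), 2))  # bigger samples
```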

We recommend the following reading:

Krzywinski M. and Altman N.

*Power and sample size*. Nature Methods. 2013. *LINK*

P-Value


The concept of the p-value is difficult to understand and is one of the most misused in biostatistics. Yet, it is omnipresent in publications and virtually every statistical analysis culminates with a p-value. The definition of p-values may be developed from different angles, but the easiest and most intelligible definition to non-mathematicians is:

*The probability of observing a result (statisticians say “a test statistic”) that is at least as large as the one observed, assuming that there is in fact no difference to be detected (statisticians say “assuming the null hypothesis H0 is true”).*

Let’s now have a closer look at what p-values are not.
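As an aside, the definition can be made concrete with a small permutation sketch (the data and group sizes below are invented for illustration): shuffle the group labels many times, as H0 would permit, and count how often the shuffled difference is at least as large as the observed one.

```python
# A p-value computed exactly as defined: the proportion of label shufflings
# (valid under H0, "no difference") whose test statistic is at least as
# large as the observed one. Data are invented for illustration.
import random

random.seed(1)
group_a = [86, 84, 90, 88, 85, 87]   # e.g. scores under condition A
group_b = [78, 80, 77, 81, 79, 82]   # e.g. scores under condition B

def mean_diff(xs, ys):
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

observed = mean_diff(group_a, group_b)

pooled = group_a + group_b
hits, n_shuffles = 0, 10_000
for _ in range(n_shuffles):
    random.shuffle(pooled)           # relabel the data, as H0 permits
    if mean_diff(pooled[:6], pooled[6:]) >= observed:
        hits += 1

p_value = hits / n_shuffles
print(p_value)                       # small: the observed gap is rare under H0
```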

**P-values are not probabilities that the observed results are false.** Addressing the probability that findings reflect real effects means taking into account the prior probability that H0 (the no-effect hypothesis) is true, an issue completely ignored by the calculation of p-values. Indeed, as a p-value is computed assuming that H0 is true, it cannot simultaneously be a probability that H0 is false. The detailed calculation of the probability of H0 uses Bayesian statistics and gives estimations of around 20% for p=0.05, but this is beyond the scope of this glossary entry.

**P-values only quantify the evidence against H0, not the magnitude of the detected effect.** A p-value of 0.01 does not in any way indicate a more intense effect than a p-value of 0.06, only that the observed result is less likely to occur if there is no difference, irrespective of its magnitude. The magnitude of an effect is estimated by computing the effect size and the confidence interval.

**P-values deemed “statistically significant” do not show whether the effect is biologically important.** First, as mentioned above, the magnitude of an effect is not given by the p-value; as such, the p-value does not shed any light on whether the observed difference may lead to a biological impact. Secondly, the biological or medical importance of the difference is linked to the importance of the scientific end point of the study. There is no helpful statistical procedure for this issue, just your wisdom and intuition as a scientist.

There are many more subtleties in p-values and we recommend the following additional reading:

Goodman S.

*A dirty dozen: twelve p-value misconceptions*. Semin. Hematol. 2008. *LINK*

Colquhoun D.

*An investigation of the false discovery rate and the misinterpretation of p-values*. Royal Society Open Science. 2014. *LINK*

Samples, sampling and sample size


Sometimes, our objective is simply to describe the characteristics of a given set of data (i.e. a ‘*sample*’). For this, we use *descriptive statistics*. Yet, quite often, there is a deeper question at stake: what do these data tell us about the processes that generated them? In other words, we want to use our sample to extrapolate and draw inferences about the characteristics of the more general dataset (i.e. the ‘*population*’) comprising all possible values our sample might have taken. For this, we use *inferential statistics*.

*A sample is a subset of observations (i.e. data) drawn from a population, and sampling refers to the process by which samples are collected.*

For a sample to be useful, it should be representative of the broader population from which it was taken. If it is not, our inferences about the population could be incorrect. Given that *insufficient* or *biased* sampling can easily ruin an entire study, it is imperative that sampling be planned carefully.

*Insufficient sampling:* Choosing an appropriate sample size is not trivial. Sample sizes must be large enough that we can get a good idea of the population’s characteristics, but at the same time we are often limited by ethical, technical and/or financial constraints.

- Small samples are likely to misrepresent the population, just by chance alone. They carry an intrinsically higher risk of ‘**sampling errors**’.
- Because we must be wary of small samples, we cannot be as confident about the inferences we draw from them. Small samples lead to inherently **lower** *statistical power*, and hence a higher chance of obtaining ‘false negative’ results (see the glossary entry on statistical power for more information).
- With small samples, it is harder to judge whether the assumptions about the underlying distributions are being met. Therefore, we may be more restricted in the **choice of tests** we can use to investigate patterns of interest.

*Biased sampling:* We saw above that insufficient sampling could misrepresent the population simply by chance alone. However, sampling methodologies often involve systematic biases too. For example, we may be more likely to sample patients that live close to a testing centre; or animals that, for whatever reason, are more noticeable or more easily caught. Also, when experimenters are assembling a set of potential subjects for a test of a new medical procedure, they might unconsciously favour healthier-looking subjects more likely to achieve a positive outcome. Such biases can lead to samples that are strongly misrepresentative of our actual population of interest.

Sampling methodologies should therefore explicitly be designed to minimize bias. For example, subjects could be strictly randomly assigned to treatments. Such methods are typically costlier and harder to implement, but extremely useful for reducing biases.
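For instance, strict random assignment can be done by shuffling the complete subject list before splitting it, so that no characteristic of a subject can influence which group it ends up in (the subject IDs below are invented):

```python
# Strict random assignment: shuffle the complete subject list, then split it,
# so no characteristic of a subject can influence its treatment group.
# Subject IDs are invented for illustration.
import random

random.seed(42)          # fixed seed only to make this example reproducible
subjects = [f"mouse_{i:02d}" for i in range(1, 21)]

shuffled = subjects[:]   # copy, so the original roster is left untouched
random.shuffle(shuffled)
treatment, control = shuffled[:10], shuffled[10:]

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```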

Given their importance, sample sizes and sampling methods must be clearly indicated in all publications and grant applications.

Sampling Distribution


*The Sampling Distribution of the Mean is the hypothetical probability distribution of the mean when independent and random sampling is repeated.*

This statement may sound a bit like mumbo jumbo to most of us, but its meaning can be deciphered as follows:

Experimentally, researchers do not have access to the entirety of the population they wish to study. Investigations are therefore conducted on **samples**, which are small groups of subjects supposedly representative of the entire population. However, independent samples will inevitably show differences and have different average measures. Nevertheless, each of these mean values will cluster around the population mean. A graph may then be plotted showing how often you can expect to encounter the different values of the mean if you sample the population many times. The curve has a peak centred on the population mean and a width that indicates the means you are most likely to get in future samples. This graphical representation is the so-called sampling distribution of the mean.

The shape of the sampling distribution is influenced by the distribution of the variable in the population. For example, if the values are normally distributed within the population (Gaussian), the sampling distribution of the mean will be normal. One important feature of the sampling distribution is that it may be considered Gaussian if the sample size is large enough, regardless of the shape of the population (this is due to the so-called Central Limit Theorem).

Importantly, the mean of the sampling distribution is the same as the mean of the population, but they do not have the same standard deviation. The standard deviation of the sampling distribution is the Standard Error of the Mean (**SEM**), which is estimated by dividing the standard deviation of the initial population by the square root of the sample size (see the SEM entry in this Glossary). This makes sense: the more subjects there are in your sample, the more precisely you estimate the true mean of the population.

The sampling distribution is a key concept because statistical tests strongly rely on its parameters. For instance, there is a frequent misconception that normality-dependent parametric tests can only be conducted if the values are normally distributed within the population or sample. This is untrue, since these tests consider the normality of the sampling distribution.
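Both properties (the Gaussian shape that emerges with larger samples, and the SEM as the spread of the sampling distribution) can be checked with a short simulation; the exponential population and the sample sizes here are arbitrary choices for illustration:

```python
# Repeatedly sample a clearly non-Gaussian population (exponential, mean = 1,
# SD = 1) and record the mean of each sample: the spread of those means
# shrinks as 1/sqrt(n), matching the SEM, and their distribution becomes
# increasingly Gaussian (Central Limit Theorem).
import random
from statistics import mean, stdev

random.seed(0)

def sample_means(n: int, n_samples: int = 5000):
    """Means of n_samples independent samples of size n from Exp(1)."""
    return [mean(random.expovariate(1.0) for _ in range(n))
            for _ in range(n_samples)]

for n in (4, 16, 64):
    means = sample_means(n)
    print(f"n={n:3d}  mean of means={mean(means):.2f}  "
          f"SD of means={stdev(means):.3f}  1/sqrt(n)={1 / n ** 0.5:.3f}")
```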

We recommend the following online video material:

https://www.youtube.com/watch?v=uPX0NBrJfRI

Standard Error of the Mean (S.E.M.)


The Standard Error of the Mean (SEM) is probably the most widely used representation of error in the life sciences. However, although virtually all biologists know the SEM, very few actually comprehend its real meaning. As a consequence, the SEM is almost invariably misused and misinterpreted.

*The SEM is the standard deviation of the sampling distribution, which is the probability distribution of the mean. The SEM is therefore a measure of how precisely the mean of the population has been estimated.*

The SEM is approximated by dividing the standard deviation of the sample by the square root of the sample size. Theoretically, the calculation should use the standard deviation of the population, but the latter is rarely accessible. Any increase in the sample size leads to a smaller SEM, a subsequent narrowing of the sampling distribution (see the corresponding entry in the Glossary) and a more precise estimation of the population mean.

The first frequent misuse of the SEM is as a measure of variability. The SEM indicates how precisely the mean of the population has been estimated but does not indicate the spread of the values in this population. It misleads the reader about a possible clustering of individual values around the mean, since the size of the SEM is an artificial consequence of the sample size.

The second misinterpretation is the belief that ± 1 SEM gives an interval in which the population mean is “very likely” to be truly located. The SEM is the standard deviation of the sampling distribution of the mean. Assuming a Gaussian sampling distribution, this would imply that the actual mean has only around 68% chance of being within an interval of ± 1 SEM (the usual display in publications). In order to increase this chance to 95%, the biostatistician must consider an interval of ± 2 SEM centred on the sample mean (if the sample size is large enough, otherwise the interval is even larger!).
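To make the distinction tangible, here is a tiny sketch on invented height data contrasting the SD (spread of individuals) with the SEM (precision of the mean), together with the ± 1 SEM and ± 2 SEM intervals discussed above:

```python
# SD versus SEM on an invented sample of heights (cm): the SD describes the
# spread of individual values; the SEM = SD / sqrt(n) describes how precisely
# the population mean has been estimated.
from statistics import mean, stdev

heights = [172, 168, 181, 175, 169, 178, 174, 171, 177, 170]
n = len(heights)

m = mean(heights)
sd = stdev(heights)        # variability of individuals
sem = sd / n ** 0.5        # precision of the estimated mean

print(f"mean = {m:.1f}, SD = {sd:.1f}, SEM = {sem:.1f}")
print(f"mean +/- 1 SEM (~68%): [{m - sem:.1f}, {m + sem:.1f}]")
print(f"mean +/- 2 SEM (~95%): [{m - 2 * sem:.1f}, {m + 2 * sem:.1f}]")
```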

In conclusion, **the use of SEM should be avoided**. Whenever possible, it should be replaced by 95% confidence intervals to indicate the estimation of the population mean and by standard deviations to show variability.

We recommend the following reading:

Cumming G. et al.

*Error Bars in Experimental Biology*. Journal of Cell Biology. 2007. *LINK*

Variable


A variable is a **characteristic, quantity or number that can take different values from individual to individual**. Variables are the very essence of statistical testing, since their nature and number largely drive the correct choice of a statistical test.

The term **independent** (a.k.a. experimental or predictor) variables (I.V.) refers to variables that influence **dependent** (a.k.a. outcome) variables (D.V.), an influence that is the primary focus of the conducted research. For example, in an experimental study that addresses the influence of a specific gene on neuronal death in mice, the independent variable may be the genotype (e.g. wild-type or knock-out) and the dependent variable may be the number of apoptotic neurons (or any other measure quantifying cell death/survival).

When choosing a statistical test, one must consider the so-called scales of measurement (or levels of measurement). There are two main scales of measurement that represent two types of variables:

**discrete** and **continuous** variables, each in turn subdivided into two sub-categories.

Discrete variables have a countable set of possible values. A first subgroup of discrete variables are **nominal** variables (sometimes named categorical or qualitative), for which no order exists between the different values (e.g. gender, ethnicity, colour…). Another subgroup are **ordinal** variables, whose possible values have a natural order (e.g. disease score, visual analogue scale).

Continuous variables have an uncountable set of possible values and can take on any value between their maximum and minimum (note that, strictly speaking, the probability of observing any particular value is zero). A further assumption is that an increase of 1 unit is equivalent in any region of the scale, which makes intervals of values meaningful.

**Interval** variables are continuous variables for which “zero” does not mean “none of the variable” (e.g. temperature in Celsius), as opposed to **ratio** variables, which have a clear definition of “zero” (e.g. temperature in Kelvin, weight, height).

You may encounter difficulty in applying some of the above descriptions, since the categories are not as clear-cut as they look. Firstly, some characteristics may fall into different categories depending on how the experimental design is constructed, such as time (e.g. exact time in seconds and tenths would be continuous, whereas time intervals would be categorical) or colour (e.g. continuous wavelength or nominal categories). Secondly, some variables have almost all attributes of one type of scale, but not all. For instance, a cell count looks like a ratio variable, except that it must be an integer, so it cannot take just any value! If only small counts are possible, say between 0 and 10, an ordinal scale is likely more relevant.

In a nutshell, the key question is not the exact nature of the scale but rather whether it is helpful for the purposes at hand: assumptions are made, but never fully achieved.