Statistical Analysis of Biomedical Data — an Overview
From p-values and regression to clustering and classification
The last decade has seen an enormous boom in the production of patient data. From wearables like smartwatches logging our heart rates to RNA sequencing for differential expression analysis, monitoring an individual patient’s health has never been more data-intensive. While this enables extensive statistical analysis of clinical data for diagnosis and research, interpreting the statistics can be tricky, to say the least. Understanding statistical tools and using them correctly can be a powerful addition to your arsenal. Former British Prime Minister Benjamin Disraeli supposedly once said, “There are three kinds of lies: lies, damned lies, and statistics”, although the actual origin of this quote is still unclear.
All data is either numerical or categorical. Numerical data is either continuous (can take any value within a range and be divided into ever smaller units, like drug concentrations) or discrete (takes only distinct, countable values, like the number of hospital visits). Categorical data, on the other hand, comes in three types: binary (yes or no, present or absent), nominal (named with no inherent order, like the 20 amino acids), and ordinal (named with an innate order, like disease severity).
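To make these categories concrete, here is a minimal sketch of how they might map onto Python types. The record, its field names, and the two enum classes are hypothetical and exist only for illustration. The point is that ordinal values can be ranked, while nominal values can only be compared for equality.

```python
from enum import Enum, IntEnum


class AminoAcid(Enum):
    """Nominal: named categories with no inherent order (three shown for brevity)."""
    ALANINE = "Ala"
    GLYCINE = "Gly"
    SERINE = "Ser"


class Severity(IntEnum):
    """Ordinal: named categories with an innate order."""
    MILD = 1
    MODERATE = 2
    SEVERE = 3


# Hypothetical patient record, one field per data type.
record = {
    "drug_concentration_mg_per_l": 12.75,  # numerical, continuous
    "hospital_visits": 3,                  # numerical, discrete
    "smoker": False,                       # categorical, binary
    "mutated_residue": AminoAcid.GLYCINE,  # categorical, nominal
    "severity": Severity.MODERATE,         # categorical, ordinal
}

# Ordinal values can be ranked against each other.
is_worse = Severity.SEVERE > record["severity"]


def nominal_supports_ordering() -> bool:
    """Return True only if two nominal values can be ranked with '<'."""
    try:
        AminoAcid.ALANINE < AminoAcid.GLYCINE
        return True
    except TypeError:
        # A plain Enum defines equality but no ordering.
        return False
```

Choosing the right type matters later on: a test or model that treats nominal labels as if they were ordered numbers will produce results that look valid but mean nothing.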
Univariate statistics deals with one variable of interest, while multivariate statistics deals with multiple variables. If we look at the distribution of heart rates of a group of patients, for example, it’s univariate. But if we plot heart rate on one axis against the age of the patients on another, then that’s multivariate (more specifically bivariate, since there are exactly two variables).
To describe the central tendency of a distribution, statisticians use point estimates like the mean, median, or mode. To describe the spread of the distribution, the standard deviation or the variance is commonly used. The standard deviation, which is the square root of the variance, measures how far values typically lie from the mean, and it is expressed in the same units as the data.
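These summaries are easy to compute with Python's built-in `statistics` module. The sketch below uses a small, made-up list of resting heart rates (the values are hypothetical, chosen so that one high reading pulls the mean above the median).

```python
import statistics

# Hypothetical resting heart rates (beats per minute) for nine patients.
heart_rates = [62, 64, 66, 68, 68, 70, 72, 74, 95]

# Central tendency
mean_hr = statistics.mean(heart_rates)      # 71
median_hr = statistics.median(heart_rates)  # 68
mode_hr = statistics.mode(heart_rates)      # 68

# Spread (sample versions, which divide by n - 1)
variance_hr = statistics.variance(heart_rates)  # 95
stdev_hr = statistics.stdev(heart_rates)        # square root of 95, about 9.75

print(f"mean={mean_hr}, median={median_hr}, mode={mode_hr}")
print(f"variance={variance_hr}, standard deviation={stdev_hr:.2f}")
```

Notice that the single reading of 95 raises the mean to 71 while the median stays at 68. This is why the median is often preferred for skewed data or data with outliers.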
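The normal distribution, with its familiar bell-shaped curve, is also called the Gaussian distribution, after the German mathematician Carl Friedrich Gauss. Many clinical measurements are approximately normally distributed. Because it is observed in numerous instances in nature, it has become an underlying assumption in many statistical approaches. Why does it show up so much? The central limit theorem: when you average many independent observations, the distribution of that average approaches a normal distribution as the sample size grows, whatever the shape of the underlying distribution (as long as its variance is finite). Relatedly, the larger the sample, the more closely its statistics reflect those of the target population.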
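A quick simulation can make the central limit theorem tangible. This is a rough sketch under stated assumptions, not a proof: it draws from an exponential distribution (chosen here only because it is strongly skewed), averages samples of different sizes, and measures how lopsided the resulting sample means are. The helper names `sample_means` and `skewness`, the seed, and the number of repeats are all illustrative choices.

```python
import random
import statistics


def sample_means(sample_size, n_repeats, rng):
    """Means of n_repeats independent samples drawn from a skewed
    exponential distribution whose population mean is 1.0."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_repeats)
    ]


def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)


rng = random.Random(42)  # fixed seed so the run is repeatable
means_by_size = {n: sample_means(n, 5000, rng) for n in (1, 5, 50)}
skew_by_size = {n: skewness(means) for n, means in means_by_size.items()}

for n, skew in skew_by_size.items():
    print(f"sample size {n:>2}: skewness of sample means = {skew:.2f}")
```

With a sample size of 1, the "means" are just the raw skewed draws. As the sample size increases, the skewness of the sample means should shrink toward 0, which is what a bell-shaped distribution looks like by this measure.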