5 Basic statistics

This chapter introduces a practical “starter kit” for working in R from a statistician’s perspective. Before we discuss formal statistical concepts, we need a stable workflow: how to create objects, inspect and clean vectors, manipulate data frames, summarize data, and fit a basic model. These fundamentals are what you will repeatedly use in real projects—whether you are cleaning EHR data, summarizing trial endpoints, or building an analysis dataset for modeling.

The goal here is not to memorize functions, but to understand what each operation is doing and why it matters. Most errors in applied statistics are not due to a complex model; they come from small issues early in the pipeline: missing values, incorrect variable types, accidental coercion to character, or merges that silently duplicate rows. This chapter makes those risks explicit and gives you a set of reliable patterns.


5.1 The essentials of R

R is built around objects. You create an object (vector, matrix, list, data frame), inspect it, transform it, and then use it as input to another function. When you become comfortable with object types and common manipulations, statistical workflows become much faster and safer.

5.1.1 Manipulating vectors

A vector is the simplest data structure in R: an ordered collection of values. However, vectors can hide common pitfalls—especially when they contain mixed types (numbers + characters + missing values). In practice, mixed-type vectors appear when importing data (e.g., a numeric column contains "O" due to a data entry issue).

The code below demonstrates several key diagnostics:

  • unique() and length() are used to quickly inspect distinct values and count how many unique entries exist—useful when checking a categorical variable, or spotting unexpected values.
  • as.numeric() converts the vector to numeric, but any non-numeric values become NA. This is one of the most common sources of “silent data loss” in analyses.
  • log() illustrates that once coercion introduces NA, downstream transformations may produce missing results.
  • sum(..., na.rm=TRUE) shows a safe pattern for aggregation in the presence of missing values.
  • sort(decreasing=TRUE) is a quick way to inspect extremes and potential outliers.
  • is.na() and indexing (x[!is.na(x)]) demonstrate a standard workflow for filtering out missing values.
  • %in% tests membership (very useful for validation checks).
  • grepl() performs pattern matching and is helpful for detecting problematic strings during cleaning.
library(tidyverse)  # loads dplyr and the other packages used below
vec <- c(3, 5, 2, 1, 5, "O", NA)
length(unique(vec))
## [1] 6
num_vec <- as.numeric(vec)
log(num_vec)
## [1] 1.0986123 1.6094379 0.6931472 0.0000000 1.6094379        NA        NA
sum(c(num_vec, NA), na.rm = TRUE)
## [1] 16
sort(num_vec, decreasing = TRUE)
## [1] 5 5 3 2 1
is.na(num_vec)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
num_vec[!is.na(num_vec)]
## [1] 3 5 2 1 5
c(5,6) %in% vec
## [1]  TRUE FALSE
grepl("5", vec)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

A practical habit: when you coerce types (e.g., as.numeric()), always check how many NAs were created and why. If a numeric variable suddenly has many NAs, the root cause is usually dirty input values (spaces, commas, symbols, or typos like "O" instead of 0).
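The habit above can be sketched in a few lines; `raw` here is a hypothetical dirty vector, not data from this chapter:

```r
# Count and inspect values that became NA only because of coercion
raw <- c("3", "5", "O", "1,200", "7")      # hypothetical dirty input
num <- suppressWarnings(as.numeric(raw))   # "O" and "1,200" cannot parse
bad <- is.na(num) & !is.na(raw)            # NA created by coercion, not originally missing
sum(bad)
## [1] 2
raw[bad]
## [1] "O"     "1,200"
```

Inspecting the offending values usually tells you whether the fix is a string cleanup (remove commas, trim symbols) or a data-entry correction.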


5.1.2 Generate sequences or repeated sequences

Simulating data, creating index variables, and generating repeated patterns are extremely common tasks in statistics. Two workhorses are:

  • seq() to generate sequences (e.g., time points, dose levels, grid search values).
  • rep() to repeat values by cycles (times) or in blocks (each), often used to build study designs or longitudinal datasets.
seq(from = 0, to = 10, by = 0.5)
##  [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
## [16]  7.5  8.0  8.5  9.0  9.5 10.0
rep(x = 1:3, times = 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(x = 1:3, each = 4)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3

Conceptually:

  • times repeats the whole vector multiple times.
  • each repeats each element multiple times before moving to the next.
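As a sketch of how the two arguments combine in practice (a hypothetical 3-subject, 4-visit design):

```r
# Long-format design: subject varies in blocks, visit cycles within subject
subject <- rep(1:3, each = 4)    # 1 1 1 1 2 2 2 2 3 3 3 3
visit   <- rep(1:4, times = 3)   # 1 2 3 4 1 2 3 4 1 2 3 4
design  <- data.frame(subject, visit)
nrow(design)
## [1] 12
```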


5.1.3 Get the working directory and write/read data

A reproducible workflow needs a stable approach to file paths. getwd() tells you the working directory, and setwd() sets it. Writing and reading data are also routine steps when sharing outputs, debugging, or building analysis datasets.

Important practice notes:

  • If your project grows, prefer project-based workflows (e.g., RStudio projects) rather than repeatedly calling setwd().
  • When exporting, keep track of whether row names are included; they can accidentally become a new column on import.

getwd()
## [1] "C:/Users/hed2/Downloads/others/mybook2/mybook2"
setwd(getwd())  # a no-op here; shown only to illustrate setwd()
write.csv(cars, "cars.csv", row.names = FALSE)
dataframe <- read.csv("cars.csv")
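One way to verify the row-name caution above is a quick round trip (a sketch; it uses a temporary file rather than the working directory):

```r
# Write without row names, read back, and confirm no extra column appeared
tmp <- tempfile(fileext = ".csv")
write.csv(cars, tmp, row.names = FALSE)
reimported <- read.csv(tmp)
identical(names(reimported), names(cars))
## [1] TRUE
```

If row names had been written out, read.csv() would have added a leading column (typically named "X"), and the check would return FALSE.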

5.1.4 Functions

Functions let you encapsulate repeated logic and ensure consistency. In applied statistics, functions are often used to:

  • standardize transformations,
  • compute derived variables,
  • generate reports,
  • run simulation loops.

The function below transforms x into a modified value. This is intentionally simple, but the pattern is the same for more complex analysis utilities.

my_func <- function(x){
  x_mod <- (x + 7) * 4
  return(x_mod)
}

my_func(num_vec)
## [1] 40 48 36 32 48 NA NA

Practical note: a function is safest when it handles missing values and validates input types. Even when you don’t add validation now, it helps to remember that your “future self” (or collaborator) will appreciate defensive checks.
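A minimal defensive variant might look like this (a sketch, not a required style):

```r
# Validate input type and make NA handling an explicit choice
my_func_safe <- function(x, na.rm = FALSE) {
  stopifnot(is.numeric(x))          # fail fast on non-numeric input
  if (na.rm) x <- x[!is.na(x)]      # drop NAs only when asked
  (x + 7) * 4
}
my_func_safe(c(3, NA), na.rm = TRUE)
## [1] 40
```

With na.rm = FALSE (the default) the NA would propagate through, matching the behavior of my_func() above; the caller chooses explicitly.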


5.1.5 Plot

Exploratory plots help you understand distributions, detect outliers, and identify relationships before modeling. Base R plotting is fast and lightweight, which is why it remains common in statistical practice.

  • A scatterplot (plot(y ~ x, data=...)) is the basic tool for relationships.
  • A histogram (hist()) checks distribution shape, skewness, and potential anomalies.
plot(dist ~ speed, data=cars)

hist(cars$dist)


5.1.6 Build model and plot

A linear model (lm) is often the first modeling step: it provides a baseline, helps you understand effect size and direction, and reveals whether a relationship is approximately linear.

This section fits a simple model and overlays the fitted regression line on the scatterplot. The additional vertical and horizontal lines serve as reference thresholds (e.g., a clinically meaningful cutoff, or a design constraint).

model <- lm(dist ~ speed, data=cars)
plot(dist ~ speed, data=cars)
abline(model)
abline(v = 25)
abline(h = 15)

In practice, it is common to annotate plots with reference lines—especially when discussing thresholds, eligibility criteria, or operational boundaries.
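A sketch of a more readable annotated version, with styled lines and a legend (the colors and labels are illustrative choices, not from the text above):

```r
# Distinguish the fitted line from reference thresholds
plot(dist ~ speed, data = cars)
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "blue")
abline(v = 25, lty = 2)   # dashed vertical threshold
abline(h = 15, lty = 2)   # dashed horizontal threshold
legend("topleft", legend = c("fitted line", "reference"),
       col = c("blue", "black"), lty = c(1, 2))
```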


5.1.7 Rename columns

Clean variable names are more than aesthetics: they affect model formulas, joining keys, and the readability of analysis code. The code below inspects column names and then renames them.

A caution: introducing spaces (e.g., "speed per hour") makes later coding more cumbersome because you must use backticks in formulas and selection. In many applied projects, analysts prefer names like speed_per_hour for reliability.
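A sketch of the underscore alternative using dplyr::rename (the new names here are illustrative):

```r
# Underscore names need no backticks in formulas or select()
library(dplyr)
cars2 <- rename(cars, speed_per_hour = speed, total_dist = dist)
names(cars2)
## [1] "speed_per_hour" "total_dist"
```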

names(cars)
## [1] "speed" "dist"
names(cars) <- c("speed per hour", "total dist")

5.1.8 Class of dataframe

Understanding classes is crucial because many R functions behave differently depending on the object type.

  • matrix and data.frame look similar but differ in important ways:
    • A matrix is homogeneous (all values must be the same type).
    • A data frame can store different types across columns (numeric, factor, character).

The code below converts cars to a matrix and back to a data frame, then checks classes. It also demonstrates transposition (t()), which is defined for matrices.

matrix <- as.matrix(cars)  # note: the name "matrix" masks base R's matrix() function
df <- as.data.frame(matrix)
class(matrix)
## [1] "matrix" "array"
class(df)
## [1] "data.frame"
# transpose
t(matrix)
## speed per hour: 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
## total dist:     2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34 34 46 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76 84 36 46 68 32 48 52 56 64 66 54 70 92 93 120 85

(Output condensed: one row per original column, fifty values each.)

Practical warning: converting a data frame with mixed types to a matrix often forces everything to character. That can break models and summaries if you do not convert back carefully.
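The coercion can be seen directly on a tiny mixed-type example (a sketch):

```r
# One character column forces the whole matrix to character
df_mixed <- data.frame(x = 1:2, y = c("a", "b"))
m <- as.matrix(df_mixed)
typeof(m)
## [1] "character"
m[1, "x"]   # the number 1 is now the string "1"
```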


5.1.9 Generate a new character variable for a data frame

Identifiers and grouping variables are often created using string concatenation. paste0() is a clean way to build IDs without spaces.

The examples below create patterned labels like "raster_1", then attach them to the data frame. These patterns are useful when simulating repeated measures or defining cluster membership.

paste0("raster_", 1:10)
##  [1] "raster_1"  "raster_2"  "raster_3"  "raster_4"  "raster_5"  "raster_6" 
##  [7] "raster_7"  "raster_8"  "raster_9"  "raster_10"
paste0("raster_", rep(x = 1:5, times = 10))
##  [1] "raster_1" "raster_2" "raster_3" "raster_4" "raster_5" "raster_1"
##  [7] "raster_2" "raster_3" "raster_4" "raster_5" "raster_1" "raster_2"
## [13] "raster_3" "raster_4" "raster_5" "raster_1" "raster_2" "raster_3"
## [19] "raster_4" "raster_5" "raster_1" "raster_2" "raster_3" "raster_4"
## [25] "raster_5" "raster_1" "raster_2" "raster_3" "raster_4" "raster_5"
## [31] "raster_1" "raster_2" "raster_3" "raster_4" "raster_5" "raster_1"
## [37] "raster_2" "raster_3" "raster_4" "raster_5" "raster_1" "raster_2"
## [43] "raster_3" "raster_4" "raster_5" "raster_1" "raster_2" "raster_3"
## [49] "raster_4" "raster_5"
df$group <- paste0("raster_", rep(x = 1:5, times = 10))
df$id <-  paste0("raster_",  1:50)
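One refinement worth knowing: sprintf() can zero-pad the numeric part so that IDs sort correctly as strings (a sketch):

```r
# As plain strings, "raster_10" sorts before "raster_2"; zero-padding avoids that
sprintf("raster_%02d", c(1, 5, 10))
## [1] "raster_01" "raster_05" "raster_10"
```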

5.1.10 Create a new dataframe using ‘rnorm’ - random numbers from a distribution

Simulation is a core skill in modern statistical practice. Here we generate:

  • a numeric variable (sample) from a normal distribution,
  • a grouping variable,
  • an ID variable to support merging.

The function rnorm(n, mean, sd) generates normal random variables. Rounding is used for readability. Because the draws are random, calling set.seed() first makes results reproducible across runs.

sample <- round(rnorm(50, 0, 1), 2)  # note: the name "sample" masks base R's sample() function
group <- paste0("raster_", rep(x = 1:5, times = 10))

df_join <- data.frame(sample, group)
df_join$id <-  paste0("raster_",  1:50)
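Because rnorm() is random, each run of the chunk above produces different values. A seed makes the simulation reproducible (a sketch; the seed value is arbitrary):

```r
set.seed(123)                               # any fixed value works
sim <- round(rnorm(5, mean = 0, sd = 1), 2) # same five values on every run
sim
## [1] -0.56 -0.23  1.56  0.07  0.13
```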

5.1.11 Left join two dataframes

Merging tables is one of the most error-prone steps in applied analysis. left_join() keeps all rows from the left table and adds matching columns from the right table.

Key practices:

  • Always confirm uniqueness of the key (id) in each table before joining.
  • After joining, check row counts and inspect for accidental duplication.

library(dplyr)
data_all <- left_join(df, df_join, by="id")
head(data_all)
speed per hour total dist  group.x       id sample  group.y
             4          2 raster_1 raster_1   0.84 raster_1
             4         10 raster_2 raster_2   0.15 raster_2
             7          4 raster_3 raster_3  -1.14 raster_3
             7         22 raster_4 raster_4   1.25 raster_4
             8         16 raster_5 raster_5   0.43 raster_5
             9         10 raster_1 raster_6  -0.30 raster_1
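The two key checks can be sketched on small stand-in tables (left and right are hypothetical, built with the same id scheme as above):

```r
library(dplyr)
left  <- data.frame(id = paste0("raster_", 1:5), value = 1:5)
right <- data.frame(id = paste0("raster_", 1:5), sample = rnorm(5))
stopifnot(anyDuplicated(left$id) == 0,
          anyDuplicated(right$id) == 0)    # keys unique in both tables
joined <- left_join(left, right, by = "id")
stopifnot(nrow(joined) == nrow(left))      # left join added no rows
```

If the right table had duplicate keys, the join would silently multiply rows and the nrow() check would fail, which is exactly the failure mode you want to catch early.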

5.1.12 Select variables

Selecting columns is a common step for building analysis-ready datasets. This also helps reduce clutter when checking intermediate results.

select(data_all, group.x, id)
group.x id
raster_1 raster_1
raster_2 raster_2
raster_3 raster_3
raster_4 raster_4
raster_5 raster_5
raster_1 raster_6
raster_2 raster_7
raster_3 raster_8
raster_4 raster_9
raster_5 raster_10
raster_1 raster_11
raster_2 raster_12
raster_3 raster_13
raster_4 raster_14
raster_5 raster_15
raster_1 raster_16
raster_2 raster_17
raster_3 raster_18
raster_4 raster_19
raster_5 raster_20
raster_1 raster_21
raster_2 raster_22
raster_3 raster_23
raster_4 raster_24
raster_5 raster_25
raster_1 raster_26
raster_2 raster_27
raster_3 raster_28
raster_4 raster_29
raster_5 raster_30
raster_1 raster_31
raster_2 raster_32
raster_3 raster_33
raster_4 raster_34
raster_5 raster_35
raster_1 raster_36
raster_2 raster_37
raster_3 raster_38
raster_4 raster_39
raster_5 raster_40
raster_1 raster_41
raster_2 raster_42
raster_3 raster_43
raster_4 raster_44
raster_5 raster_45
raster_1 raster_46
raster_2 raster_47
raster_3 raster_48
raster_4 raster_49
raster_5 raster_50

5.1.13 Filter observations

Filtering creates analytic subsets, such as:

  • a treatment arm,
  • a subgroup,
  • an eligibility population,
  • a set of observations meeting a condition.

This section shows filtering by a grouping string, and filtering by numeric conditions (with a variable name that contains spaces, requiring backticks).

raster_1 <- filter(data_all, group.x == "raster_1")
raster_1
speed per hour total dist group.x id sample group.y
4 2 raster_1 raster_1 0.84 raster_1
9 10 raster_1 raster_6 -0.30 raster_1
11 28 raster_1 raster_11 0.55 raster_1
13 26 raster_1 raster_16 -0.21 raster_1
14 36 raster_1 raster_21 -0.40 raster_1
15 54 raster_1 raster_26 -0.03 raster_1
17 50 raster_1 raster_31 -1.55 raster_1
19 36 raster_1 raster_36 -0.50 raster_1
20 52 raster_1 raster_41 0.45 raster_1
24 70 raster_1 raster_46 -2.31 raster_1
speed_dist <- filter(data_all, `speed per hour` < 11 & `total dist` >= 10)
speed_dist
speed per hour total dist group.x id sample group.y
4 10 raster_2 raster_2 0.15 raster_2
7 22 raster_4 raster_4 1.25 raster_4
8 16 raster_5 raster_5 0.43 raster_5
9 10 raster_1 raster_6 -0.30 raster_1
10 18 raster_2 raster_7 0.90 raster_2
10 26 raster_3 raster_8 0.88 raster_3
10 34 raster_4 raster_9 0.82 raster_4

5.1.14 Append rows

Row-binding is used when you want to stack two datasets with the same structure. This is common when combining:

  • multiple batches,
  • subsets,
  • cohorts.

rbind() requires matching columns (names and order). In tidyverse workflows, bind_rows() is often more forgiving, but rbind() is fine when structures match exactly.

rbind(raster_1,speed_dist)
speed per hour total dist group.x id sample group.y
4 2 raster_1 raster_1 0.84 raster_1
9 10 raster_1 raster_6 -0.30 raster_1
11 28 raster_1 raster_11 0.55 raster_1
13 26 raster_1 raster_16 -0.21 raster_1
14 36 raster_1 raster_21 -0.40 raster_1
15 54 raster_1 raster_26 -0.03 raster_1
17 50 raster_1 raster_31 -1.55 raster_1
19 36 raster_1 raster_36 -0.50 raster_1
20 52 raster_1 raster_41 0.45 raster_1
24 70 raster_1 raster_46 -2.31 raster_1
4 10 raster_2 raster_2 0.15 raster_2
7 22 raster_4 raster_4 1.25 raster_4
8 16 raster_5 raster_5 0.43 raster_5
9 10 raster_1 raster_6 -0.30 raster_1
10 18 raster_2 raster_7 0.90 raster_2
10 26 raster_3 raster_8 0.88 raster_3
10 34 raster_4 raster_9 0.82 raster_4
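A sketch of why bind_rows() is more forgiving: it tolerates a column that is missing from one table, while rbind() errors out (the tiny frames here are hypothetical):

```r
library(dplyr)
a <- data.frame(x = 1, y = "a")
b <- data.frame(x = 2)   # column y is missing here
bind_rows(a, b)          # fills the gap with NA; rbind(a, b) would error
##   x    y
## 1 1    a
## 2 2 <NA>
```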

5.1.15 Create new variables or overwrite old ones

Data cleaning often involves transforming a variable into a more usable form. Here we round sample to one decimal place. Note that mutate() returns a modified data frame; you typically assign it back if you want to keep the change.

mutate(data_all, 
       sample = round(sample,1))
speed per hour total dist group.x id sample group.y
4 2 raster_1 raster_1 0.8 raster_1
4 10 raster_2 raster_2 0.1 raster_2
7 4 raster_3 raster_3 -1.1 raster_3
7 22 raster_4 raster_4 1.2 raster_4
8 16 raster_5 raster_5 0.4 raster_5
9 10 raster_1 raster_6 -0.3 raster_1
10 18 raster_2 raster_7 0.9 raster_2
10 26 raster_3 raster_8 0.9 raster_3
10 34 raster_4 raster_9 0.8 raster_4
11 17 raster_5 raster_10 0.7 raster_5
11 28 raster_1 raster_11 0.6 raster_1
12 14 raster_2 raster_12 -0.1 raster_2
12 20 raster_3 raster_13 -0.3 raster_3
12 24 raster_4 raster_14 -0.4 raster_4
12 28 raster_5 raster_15 -0.7 raster_5
13 26 raster_1 raster_16 -0.2 raster_1
13 34 raster_2 raster_17 -1.3 raster_2
13 34 raster_3 raster_18 2.2 raster_3
13 46 raster_4 raster_19 1.2 raster_4
14 26 raster_5 raster_20 -1.1 raster_5
14 36 raster_1 raster_21 -0.4 raster_1
14 60 raster_2 raster_22 -0.5 raster_2
14 80 raster_3 raster_23 0.8 raster_3
15 20 raster_4 raster_24 -0.1 raster_4
15 26 raster_5 raster_25 0.2 raster_5
15 54 raster_1 raster_26 0.0 raster_1
16 32 raster_2 raster_27 0.0 raster_2
16 40 raster_3 raster_28 1.4 raster_3
17 32 raster_4 raster_29 -0.2 raster_4
17 40 raster_5 raster_30 1.5 raster_5
17 50 raster_1 raster_31 -1.6 raster_1
18 42 raster_2 raster_32 0.6 raster_2
18 56 raster_3 raster_33 0.1 raster_3
18 76 raster_4 raster_34 0.2 raster_4
18 84 raster_5 raster_35 0.4 raster_5
19 36 raster_1 raster_36 -0.5 raster_1
19 46 raster_2 raster_37 -0.3 raster_2
19 68 raster_3 raster_38 -1.0 raster_3
20 32 raster_4 raster_39 -1.1 raster_4
20 48 raster_5 raster_40 0.3 raster_5
20 52 raster_1 raster_41 0.4 raster_1
20 56 raster_2 raster_42 0.0 raster_2
20 64 raster_3 raster_43 0.9 raster_3
22 66 raster_4 raster_44 2.0 raster_4
23 54 raster_5 raster_45 -0.5 raster_5
24 70 raster_1 raster_46 -2.3 raster_1
24 92 raster_2 raster_47 1.0 raster_2
24 93 raster_3 raster_48 -0.7 raster_3
24 120 raster_4 raster_49 -0.7 raster_4
25 85 raster_5 raster_50 1.0 raster_5
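The assign-back point can be sketched on a toy frame (the values are hypothetical):

```r
library(dplyr)
dd  <- data.frame(sample = c(0.84, -1.14))
dd2 <- mutate(dd, sample = round(sample, 1))
dd$sample    # the original is untouched
## [1]  0.84 -1.14
dd2$sample   # the rounded copy
## [1]  0.8 -1.1
```

To keep the change under the original name, write dd <- mutate(dd, ...).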

5.1.16 Summarise statistics

Summarization produces descriptive statistics and quick QA checks. In practice, it is a good idea to confirm that you are summarizing the intended variables and that the variable types are correct.

A practical note for this code chunk: writing max("total dist") would not compute the maximum of the column; it would take the maximum of a one-element character vector. Column names containing spaces must be wrapped in backticks. In real analyses, always verify that your summary outputs look plausible.

summarise(data_all,
          mean_sample = mean(sample),
          max_dist = max(`total dist`))
## mean_sample max_dist
##      0.1104      120

5.1.17 Group dataframe then summarise statistics

Grouping is essential for stratified summaries (by arm, site, subgroup, visit). The typical pattern is:

  1. group_by()
  2. summarise()

This yields one row per group.

data_all_group <- group_by(data_all, group.x)
summarise(data_all_group,
          mean_sample = mean(sample),
          max_dist = max(`total dist`))

The result contains one row per level of group.x (five rows here). Because sample is random, the per-group means will differ from run to run unless a seed is set.

5.1.18 Ungroup then summarise statistics

After group operations, the data may remain grouped. ungroup() removes grouping, which prevents unexpected behavior in later steps.

This is a common best practice: ungroup after grouped summaries unless you intentionally want grouping to persist.

ungroup_data <- ungroup(data_all_group)
summarise(ungroup_data,
          mean_sample = mean(sample),
          max_dist = max(`total dist`))
## mean_sample max_dist
##      0.1104      120

5.1.19 Summary linear regression model

This section fits a linear regression using the renamed columns. The summary() output provides:

  • coefficient estimates,
  • standard errors,
  • t-tests and p-values (under standard assumptions),
  • R-squared and residual standard error.

Even when you plan to use more advanced models, a simple linear regression is a valuable baseline for interpretation and for detecting obvious data issues.

mod1 <- lm(cars$`total dist` ~ cars$`speed per hour` )
summary(mod1) 
## 
## Call:
## lm(formula = cars$`total dist` ~ cars$`speed per hour`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -17.5791     6.7584  -2.601   0.0123 *  
## cars$`speed per hour`   3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

5.1.20 Create frequency table

Frequency tables help you check distributions across groups, detect empty cells, and validate merges.

Two-way tables are also a quick way to identify whether a categorical variable is unevenly distributed across groups.

table(data_all_group$`speed per hour`, data_all_group$group.x)
##      raster_1 raster_2 raster_3 raster_4 raster_5
##   4         1        1        0        0        0
##   7         0        0        1        1        0
##   8         0        0        0        0        1
##   9         1        0        0        0        0
##   10        0        1        1        1        0
##   11        1        0        0        0        1
##   12        0        1        1        1        1
##   13        1        1        1        1        0
##   14        1        1        1        0        1
##   15        1        0        0        1        1
##   16        0        1        1        0        0
##   17        1        0        0        1        1
##   18        0        1        1        1        1
##   19        1        1        1        0        0
##   20        1        1        1        1        1
##   22        0        0        0        1        0
##   23        0        0        0        0        1
##   24        1        1        1        1        0
##   25        0        0        0        0        1

5.1.21 Value and variable label

Labels are especially useful for reporting, tables, and clinical datasets where you want human-readable metadata. This section shows:

  • inspecting levels of a factor,
  • relabeling factor levels,
  • adding a variable label using Hmisc::label().
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
iris$Species <- factor(iris$Species, labels = c("setosanew", "versicolornew", "virginianew"))
table(iris$Species)
## 
##     setosanew versicolornew   virginianew 
##            50            50            50
library(Hmisc)
label(iris$Species) <- "Species types"
table(iris$Species)
## 
##     setosanew versicolornew   virginianew 
##            50            50            50

In applied work, consistent labeling helps downstream reporting tools and reduces ambiguity when sharing datasets with collaborators.


5.1.22 Recode a variable

Recoding is frequently used to:

  • create categorical versions of continuous variables,
  • define risk groups,
  • implement analysis definitions (e.g., responder/non-responder).

This chunk uses nested ifelse() to create a derived variable based on Sepal.Length. While nested ifelse() works, in complex real projects, case_when() is often clearer and less error-prone. The key concept remains: define rules explicitly and validate results with a frequency table.

irisifelse <- iris %>%
  mutate(Sepal.Length2 = ifelse(Sepal.Length < 6, "level1",
                         ifelse(Sepal.Length < 7, "level2", Sepal.Length)))

table(irisifelse$Sepal.Length2)
##      7    7.1    7.2    7.3    7.4    7.6    7.7    7.9 level1 level2 
##      1      1      3      1      1      1      4      1     83     54
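As the text notes, case_when() expresses the same rules more readably; a sketch that produces the same level counts:

```r
library(dplyr)
iris2 <- iris %>%
  mutate(Sepal.Length2 = case_when(
    Sepal.Length < 6 ~ "level1",
    Sepal.Length < 7 ~ "level2",
    TRUE             ~ as.character(Sepal.Length)   # fall-through keeps the raw value
  ))
table(iris2$Sepal.Length2)[c("level1", "level2")]
## level1 level2 
##     83     54
```

Conditions are evaluated top to bottom and the first match wins, so the rules read in the same order they are applied.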

5.2 Central Limit Theorem

The Central Limit Theorem (CLT) is one of the most important ideas in statistics: it justifies why normal-based inference often works even when the underlying data are not normal, as long as sample sizes are reasonably large and observations are independent.

In practice, the CLT supports:

  • approximate confidence intervals for means,
  • normal approximations for many estimators,
  • reasoning about sampling variability.
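A quick simulation illustrates the theorem: individual exponential draws are strongly skewed, yet their sample means cluster symmetrically around the true mean (a sketch; the sample sizes are illustrative):

```r
set.seed(1)
# 2000 sample means, each from n = 30 exponential(rate = 1) draws
means <- replicate(2000, mean(rexp(30, rate = 1)))
c(mean(means), sd(means))   # near 1, and near 1/sqrt(30) ~= 0.18
hist(means, breaks = 40, main = "Sampling distribution of the mean (n = 30)")
```

Increasing n tightens and further normalizes the histogram, which is the CLT in action.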



5.3 Common statistical distribution

Statistical distributions are the language of uncertainty. In applied work, you encounter them in:

  • modeling outcomes (normal, binomial, Poisson),
  • generating simulations,
  • defining priors and likelihoods,
  • interpreting p-values and confidence intervals.
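In R, each distribution comes as a family of four functions with prefixes d (density), p (CDF), q (quantile), and r (random draws); a sketch using the normal and binomial:

```r
dnorm(0)       # N(0,1) density at 0, about 0.3989
pnorm(1.96)    # P(Z <= 1.96), about 0.975
qnorm(0.975)   # inverse of the line above, about 1.96
set.seed(7)
rbinom(3, size = 10, prob = 0.5)   # three random Binomial(10, 0.5) draws
```

The same pattern holds for other distributions (dbinom/pbinom/qbinom/rbinom, dpois/ppois/qpois/rpois, and so on).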



Chapter takeaways

By the end of this chapter, you should be comfortable with:

  • Inspecting vectors, handling missing values, and diagnosing coercion issues
  • Generating sequences and repeated patterns for indexing and simulation
  • Reading/writing data and understanding the working directory
  • Writing simple functions to standardize repeated steps
  • Making quick exploratory plots
  • Fitting and interpreting a basic linear regression
  • Managing variable names, classes, and joins
  • Building group-wise summaries and validating derived variables

These are not “intro programming trivia”—they are the daily tools of statistical practice. Once these fundamentals are stable, you can scale up to robust workflows: reproducible reporting, simulation-based power analysis, and model-based inference.