1 Data Wrangling
1.1 How to do data wrangling
Data wrangling is the process of turning raw data into an analysis-ready dataset. In practice, this includes (1) selecting the right observations and variables, (2) creating or transforming variables, (3) cleaning and standardizing values, and (4) reshaping the dataset into the structure required by downstream analyses and reporting.
In this chapter, we use functions from the tidyverse (mainly dplyr and tidyr) because they provide a consistent grammar for data manipulation. The goal here is not only to learn syntax, but also to understand common patterns that appear repeatedly in real projects.
We will use the built-in iris dataset for simple examples, and then create small simulated datasets (e.g., drug_trial, babies, blood_pressure) to demonstrate typical workflows.
1.1.1 Load data and package
Before doing any wrangling, it is good practice to quickly inspect the dataset. Looking at the first few rows helps confirm variable names, data types, and whether the values look reasonable.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Load the tidyverse package. This automatically loads common packages for wrangling and visualization, including dplyr, tidyr, tibble, and ggplot2.
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ───────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1.1.2 Select certain rows
Row selection is typically used to (1) create analysis populations, (2) subset by treatment, time window, or eligibility criteria, or (3) debug issues using a small subset.
The filter() function keeps rows that satisfy a condition. Here, we select only the setosa species.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
filter the first 5 rows
When you need a small subset for quick checking, you can take the first few rows. slice() selects rows by position rather than by condition.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
1.1.3 Select certain columns
Selecting columns is useful for (1) focusing on variables needed for an analysis, (2) reducing clutter when printing tables, and (3) preparing a dataset before joining or reshaping.
select() keeps only specified variables.
## Sepal.Length Species
## 1 5.1 setosa
## 2 4.9 setosa
## 3 4.7 setosa
## 4 4.6 setosa
## 5 5.0 setosa
## 6 5.4 setosa
## 7 4.6 setosa
## 8 5.0 setosa
## 9 4.4 setosa
## 10 4.9 setosa
## 11 5.4 setosa
## 12 4.8 setosa
## 13 4.8 setosa
## 14 4.3 setosa
## 15 5.8 setosa
## 16 5.7 setosa
## 17 5.4 setosa
## 18 5.1 setosa
## 19 5.7 setosa
## 20 5.1 setosa
## 21 5.4 setosa
## 22 5.1 setosa
## 23 4.6 setosa
## 24 5.1 setosa
## 25 4.8 setosa
## 26 5.0 setosa
## 27 5.0 setosa
## 28 5.2 setosa
## 29 5.2 setosa
## 30 4.7 setosa
## 31 4.8 setosa
## 32 5.4 setosa
## 33 5.2 setosa
## 34 5.5 setosa
## 35 4.9 setosa
## 36 5.0 setosa
## 37 5.5 setosa
## 38 4.9 setosa
## 39 4.4 setosa
## 40 5.1 setosa
## 41 5.0 setosa
## 42 4.5 setosa
## 43 4.4 setosa
## 44 5.0 setosa
## 45 5.1 setosa
## 46 4.8 setosa
## 47 5.1 setosa
## 48 4.6 setosa
## 49 5.3 setosa
## 50 5.0 setosa
## 51 7.0 versicolor
## 52 6.4 versicolor
## 53 6.9 versicolor
## 54 5.5 versicolor
## 55 6.5 versicolor
## 56 5.7 versicolor
## 57 6.3 versicolor
## 58 4.9 versicolor
## 59 6.6 versicolor
## 60 5.2 versicolor
## 61 5.0 versicolor
## 62 5.9 versicolor
## 63 6.0 versicolor
## 64 6.1 versicolor
## 65 5.6 versicolor
## 66 6.7 versicolor
## 67 5.6 versicolor
## 68 5.8 versicolor
## 69 6.2 versicolor
## 70 5.6 versicolor
## 71 5.9 versicolor
## 72 6.1 versicolor
## 73 6.3 versicolor
## 74 6.1 versicolor
## 75 6.4 versicolor
## 76 6.6 versicolor
## 77 6.8 versicolor
## 78 6.7 versicolor
## 79 6.0 versicolor
## 80 5.7 versicolor
## 81 5.5 versicolor
## 82 5.5 versicolor
## 83 5.8 versicolor
## 84 6.0 versicolor
## 85 5.4 versicolor
## 86 6.0 versicolor
## 87 6.7 versicolor
## 88 6.3 versicolor
## 89 5.6 versicolor
## 90 5.5 versicolor
## 91 5.5 versicolor
## 92 6.1 versicolor
## 93 5.8 versicolor
## 94 5.0 versicolor
## 95 5.6 versicolor
## 96 5.7 versicolor
## 97 5.7 versicolor
## 98 6.2 versicolor
## 99 5.1 versicolor
## 100 5.7 versicolor
## 101 6.3 virginica
## 102 5.8 virginica
## 103 7.1 virginica
## 104 6.3 virginica
## 105 6.5 virginica
## 106 7.6 virginica
## 107 4.9 virginica
## 108 7.3 virginica
## 109 6.7 virginica
## 110 7.2 virginica
## 111 6.5 virginica
## 112 6.4 virginica
## 113 6.8 virginica
## 114 5.7 virginica
## 115 5.8 virginica
## 116 6.4 virginica
## 117 6.5 virginica
## 118 7.7 virginica
## 119 7.7 virginica
## 120 6.0 virginica
## 121 6.9 virginica
## 122 5.6 virginica
## 123 7.7 virginica
## 124 6.3 virginica
## 125 6.7 virginica
## 126 7.2 virginica
## 127 6.2 virginica
## 128 6.1 virginica
## 129 6.4 virginica
## 130 7.2 virginica
## 131 7.4 virginica
## 132 7.9 virginica
## 133 6.4 virginica
## 134 6.3 virginica
## 135 6.1 virginica
## 136 7.7 virginica
## 137 6.3 virginica
## 138 6.4 virginica
## 139 6.0 virginica
## 140 6.9 virginica
## 141 6.7 virginica
## 142 6.9 virginica
## 143 5.8 virginica
## 144 6.8 virginica
## 145 6.7 virginica
## 146 6.7 virginica
## 147 6.3 virginica
## 148 6.5 virginica
## 149 6.2 virginica
## 150 5.9 virginica
You can also exclude variables using a minus sign. This is helpful when you want to drop only a few variables while keeping most columns.
## Sepal.Width Petal.Length Petal.Width
## 1 3.5 1.4 0.2
## 2 3.0 1.4 0.2
## 3 3.2 1.3 0.2
## 4 3.1 1.5 0.2
## 5 3.6 1.4 0.2
## 6 3.9 1.7 0.4
## 7 3.4 1.4 0.3
## 8 3.4 1.5 0.2
## 9 2.9 1.4 0.2
## 10 3.1 1.5 0.1
## 11 3.7 1.5 0.2
## 12 3.4 1.6 0.2
## 13 3.0 1.4 0.1
## 14 3.0 1.1 0.1
## 15 4.0 1.2 0.2
## 16 4.4 1.5 0.4
## 17 3.9 1.3 0.4
## 18 3.5 1.4 0.3
## 19 3.8 1.7 0.3
## 20 3.8 1.5 0.3
## 21 3.4 1.7 0.2
## 22 3.7 1.5 0.4
## 23 3.6 1.0 0.2
## 24 3.3 1.7 0.5
## 25 3.4 1.9 0.2
## 26 3.0 1.6 0.2
## 27 3.4 1.6 0.4
## 28 3.5 1.5 0.2
## 29 3.4 1.4 0.2
## 30 3.2 1.6 0.2
## 31 3.1 1.6 0.2
## 32 3.4 1.5 0.4
## 33 4.1 1.5 0.1
## 34 4.2 1.4 0.2
## 35 3.1 1.5 0.2
## 36 3.2 1.2 0.2
## 37 3.5 1.3 0.2
## 38 3.6 1.4 0.1
## 39 3.0 1.3 0.2
## 40 3.4 1.5 0.2
## 41 3.5 1.3 0.3
## 42 2.3 1.3 0.3
## 43 3.2 1.3 0.2
## 44 3.5 1.6 0.6
## 45 3.8 1.9 0.4
## 46 3.0 1.4 0.3
## 47 3.8 1.6 0.2
## 48 3.2 1.4 0.2
## 49 3.7 1.5 0.2
## 50 3.3 1.4 0.2
## 51 3.2 4.7 1.4
## 52 3.2 4.5 1.5
## 53 3.1 4.9 1.5
## 54 2.3 4.0 1.3
## 55 2.8 4.6 1.5
## 56 2.8 4.5 1.3
## 57 3.3 4.7 1.6
## 58 2.4 3.3 1.0
## 59 2.9 4.6 1.3
## 60 2.7 3.9 1.4
## 61 2.0 3.5 1.0
## 62 3.0 4.2 1.5
## 63 2.2 4.0 1.0
## 64 2.9 4.7 1.4
## 65 2.9 3.6 1.3
## 66 3.1 4.4 1.4
## 67 3.0 4.5 1.5
## 68 2.7 4.1 1.0
## 69 2.2 4.5 1.5
## 70 2.5 3.9 1.1
## 71 3.2 4.8 1.8
## 72 2.8 4.0 1.3
## 73 2.5 4.9 1.5
## 74 2.8 4.7 1.2
## 75 2.9 4.3 1.3
## 76 3.0 4.4 1.4
## 77 2.8 4.8 1.4
## 78 3.0 5.0 1.7
## 79 2.9 4.5 1.5
## 80 2.6 3.5 1.0
## 81 2.4 3.8 1.1
## 82 2.4 3.7 1.0
## 83 2.7 3.9 1.2
## 84 2.7 5.1 1.6
## 85 3.0 4.5 1.5
## 86 3.4 4.5 1.6
## 87 3.1 4.7 1.5
## 88 2.3 4.4 1.3
## 89 3.0 4.1 1.3
## 90 2.5 4.0 1.3
## 91 2.6 4.4 1.2
## 92 3.0 4.6 1.4
## 93 2.6 4.0 1.2
## 94 2.3 3.3 1.0
## 95 2.7 4.2 1.3
## 96 3.0 4.2 1.2
## 97 2.9 4.2 1.3
## 98 2.9 4.3 1.3
## 99 2.5 3.0 1.1
## 100 2.8 4.1 1.3
## 101 3.3 6.0 2.5
## 102 2.7 5.1 1.9
## 103 3.0 5.9 2.1
## 104 2.9 5.6 1.8
## 105 3.0 5.8 2.2
## 106 3.0 6.6 2.1
## 107 2.5 4.5 1.7
## 108 2.9 6.3 1.8
## 109 2.5 5.8 1.8
## 110 3.6 6.1 2.5
## 111 3.2 5.1 2.0
## 112 2.7 5.3 1.9
## 113 3.0 5.5 2.1
## 114 2.5 5.0 2.0
## 115 2.8 5.1 2.4
## 116 3.2 5.3 2.3
## 117 3.0 5.5 1.8
## 118 3.8 6.7 2.2
## 119 2.6 6.9 2.3
## 120 2.2 5.0 1.5
## 121 3.2 5.7 2.3
## 122 2.8 4.9 2.0
## 123 2.8 6.7 2.0
## 124 2.7 4.9 1.8
## 125 3.3 5.7 2.1
## 126 3.2 6.0 1.8
## 127 2.8 4.8 1.8
## 128 3.0 4.9 1.8
## 129 2.8 5.6 2.1
## 130 3.0 5.8 1.6
## 131 2.8 6.1 1.9
## 132 3.8 6.4 2.0
## 133 2.8 5.6 2.2
## 134 2.8 5.1 1.5
## 135 2.6 5.6 1.4
## 136 3.0 6.1 2.3
## 137 3.4 5.6 2.4
## 138 3.1 5.5 1.8
## 139 3.0 4.8 1.8
## 140 3.1 5.4 2.1
## 141 3.1 5.6 2.4
## 142 3.1 5.1 2.3
## 143 2.7 5.1 1.9
## 144 3.2 5.9 2.3
## 145 3.3 5.7 2.5
## 146 3.0 5.2 2.3
## 147 2.5 5.0 1.9
## 148 3.0 5.2 2.0
## 149 3.4 5.4 2.3
## 150 3.0 5.1 1.8
reorder the variables
Reordering columns is often done for reporting or for readability (e.g., keep key identifiers first). everything() selects all remaining variables in their current order.
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.1 3.5 1.4 0.2
## 2 setosa 4.9 3.0 1.4 0.2
## 3 setosa 4.7 3.2 1.3 0.2
## 4 setosa 4.6 3.1 1.5 0.2
## 5 setosa 5.0 3.6 1.4 0.2
## 6 setosa 5.4 3.9 1.7 0.4
## 7 setosa 4.6 3.4 1.4 0.3
## 8 setosa 5.0 3.4 1.5 0.2
## 9 setosa 4.4 2.9 1.4 0.2
## 10 setosa 4.9 3.1 1.5 0.1
## 11 setosa 5.4 3.7 1.5 0.2
## 12 setosa 4.8 3.4 1.6 0.2
## 13 setosa 4.8 3.0 1.4 0.1
## 14 setosa 4.3 3.0 1.1 0.1
## 15 setosa 5.8 4.0 1.2 0.2
## 16 setosa 5.7 4.4 1.5 0.4
## 17 setosa 5.4 3.9 1.3 0.4
## 18 setosa 5.1 3.5 1.4 0.3
## 19 setosa 5.7 3.8 1.7 0.3
## 20 setosa 5.1 3.8 1.5 0.3
## 21 setosa 5.4 3.4 1.7 0.2
## 22 setosa 5.1 3.7 1.5 0.4
## 23 setosa 4.6 3.6 1.0 0.2
## 24 setosa 5.1 3.3 1.7 0.5
## 25 setosa 4.8 3.4 1.9 0.2
## 26 setosa 5.0 3.0 1.6 0.2
## 27 setosa 5.0 3.4 1.6 0.4
## 28 setosa 5.2 3.5 1.5 0.2
## 29 setosa 5.2 3.4 1.4 0.2
## 30 setosa 4.7 3.2 1.6 0.2
## 31 setosa 4.8 3.1 1.6 0.2
## 32 setosa 5.4 3.4 1.5 0.4
## 33 setosa 5.2 4.1 1.5 0.1
## 34 setosa 5.5 4.2 1.4 0.2
## 35 setosa 4.9 3.1 1.5 0.2
## 36 setosa 5.0 3.2 1.2 0.2
## 37 setosa 5.5 3.5 1.3 0.2
## 38 setosa 4.9 3.6 1.4 0.1
## 39 setosa 4.4 3.0 1.3 0.2
## 40 setosa 5.1 3.4 1.5 0.2
## 41 setosa 5.0 3.5 1.3 0.3
## 42 setosa 4.5 2.3 1.3 0.3
## 43 setosa 4.4 3.2 1.3 0.2
## 44 setosa 5.0 3.5 1.6 0.6
## 45 setosa 5.1 3.8 1.9 0.4
## 46 setosa 4.8 3.0 1.4 0.3
## 47 setosa 5.1 3.8 1.6 0.2
## 48 setosa 4.6 3.2 1.4 0.2
## 49 setosa 5.3 3.7 1.5 0.2
## 50 setosa 5.0 3.3 1.4 0.2
## 51 versicolor 7.0 3.2 4.7 1.4
## 52 versicolor 6.4 3.2 4.5 1.5
## 53 versicolor 6.9 3.1 4.9 1.5
## 54 versicolor 5.5 2.3 4.0 1.3
## 55 versicolor 6.5 2.8 4.6 1.5
## 56 versicolor 5.7 2.8 4.5 1.3
## 57 versicolor 6.3 3.3 4.7 1.6
## 58 versicolor 4.9 2.4 3.3 1.0
## 59 versicolor 6.6 2.9 4.6 1.3
## 60 versicolor 5.2 2.7 3.9 1.4
## 61 versicolor 5.0 2.0 3.5 1.0
## 62 versicolor 5.9 3.0 4.2 1.5
## 63 versicolor 6.0 2.2 4.0 1.0
## 64 versicolor 6.1 2.9 4.7 1.4
## 65 versicolor 5.6 2.9 3.6 1.3
## 66 versicolor 6.7 3.1 4.4 1.4
## 67 versicolor 5.6 3.0 4.5 1.5
## 68 versicolor 5.8 2.7 4.1 1.0
## 69 versicolor 6.2 2.2 4.5 1.5
## 70 versicolor 5.6 2.5 3.9 1.1
## 71 versicolor 5.9 3.2 4.8 1.8
## 72 versicolor 6.1 2.8 4.0 1.3
## 73 versicolor 6.3 2.5 4.9 1.5
## 74 versicolor 6.1 2.8 4.7 1.2
## 75 versicolor 6.4 2.9 4.3 1.3
## 76 versicolor 6.6 3.0 4.4 1.4
## 77 versicolor 6.8 2.8 4.8 1.4
## 78 versicolor 6.7 3.0 5.0 1.7
## 79 versicolor 6.0 2.9 4.5 1.5
## 80 versicolor 5.7 2.6 3.5 1.0
## 81 versicolor 5.5 2.4 3.8 1.1
## 82 versicolor 5.5 2.4 3.7 1.0
## 83 versicolor 5.8 2.7 3.9 1.2
## 84 versicolor 6.0 2.7 5.1 1.6
## 85 versicolor 5.4 3.0 4.5 1.5
## 86 versicolor 6.0 3.4 4.5 1.6
## 87 versicolor 6.7 3.1 4.7 1.5
## 88 versicolor 6.3 2.3 4.4 1.3
## 89 versicolor 5.6 3.0 4.1 1.3
## 90 versicolor 5.5 2.5 4.0 1.3
## 91 versicolor 5.5 2.6 4.4 1.2
## 92 versicolor 6.1 3.0 4.6 1.4
## 93 versicolor 5.8 2.6 4.0 1.2
## 94 versicolor 5.0 2.3 3.3 1.0
## 95 versicolor 5.6 2.7 4.2 1.3
## 96 versicolor 5.7 3.0 4.2 1.2
## 97 versicolor 5.7 2.9 4.2 1.3
## 98 versicolor 6.2 2.9 4.3 1.3
## 99 versicolor 5.1 2.5 3.0 1.1
## 100 versicolor 5.7 2.8 4.1 1.3
## 101 virginica 6.3 3.3 6.0 2.5
## 102 virginica 5.8 2.7 5.1 1.9
## 103 virginica 7.1 3.0 5.9 2.1
## 104 virginica 6.3 2.9 5.6 1.8
## 105 virginica 6.5 3.0 5.8 2.2
## 106 virginica 7.6 3.0 6.6 2.1
## 107 virginica 4.9 2.5 4.5 1.7
## 108 virginica 7.3 2.9 6.3 1.8
## 109 virginica 6.7 2.5 5.8 1.8
## 110 virginica 7.2 3.6 6.1 2.5
## 111 virginica 6.5 3.2 5.1 2.0
## 112 virginica 6.4 2.7 5.3 1.9
## 113 virginica 6.8 3.0 5.5 2.1
## 114 virginica 5.7 2.5 5.0 2.0
## 115 virginica 5.8 2.8 5.1 2.4
## 116 virginica 6.4 3.2 5.3 2.3
## 117 virginica 6.5 3.0 5.5 1.8
## 118 virginica 7.7 3.8 6.7 2.2
## 119 virginica 7.7 2.6 6.9 2.3
## 120 virginica 6.0 2.2 5.0 1.5
## 121 virginica 6.9 3.2 5.7 2.3
## 122 virginica 5.6 2.8 4.9 2.0
## 123 virginica 7.7 2.8 6.7 2.0
## 124 virginica 6.3 2.7 4.9 1.8
## 125 virginica 6.7 3.3 5.7 2.1
## 126 virginica 7.2 3.2 6.0 1.8
## 127 virginica 6.2 2.8 4.8 1.8
## 128 virginica 6.1 3.0 4.9 1.8
## 129 virginica 6.4 2.8 5.6 2.1
## 130 virginica 7.2 3.0 5.8 1.6
## 131 virginica 7.4 2.8 6.1 1.9
## 132 virginica 7.9 3.8 6.4 2.0
## 133 virginica 6.4 2.8 5.6 2.2
## 134 virginica 6.3 2.8 5.1 1.5
## 135 virginica 6.1 2.6 5.6 1.4
## 136 virginica 7.7 3.0 6.1 2.3
## 137 virginica 6.3 3.4 5.6 2.4
## 138 virginica 6.4 3.1 5.5 1.8
## 139 virginica 6.0 3.0 4.8 1.8
## 140 virginica 6.9 3.1 5.4 2.1
## 141 virginica 6.7 3.1 5.6 2.4
## 142 virginica 6.9 3.1 5.1 2.3
## 143 virginica 5.8 2.7 5.1 1.9
## 144 virginica 6.8 3.2 5.9 2.3
## 145 virginica 6.7 3.3 5.7 2.5
## 146 virginica 6.7 3.0 5.2 2.3
## 147 virginica 6.3 2.5 5.0 1.9
## 148 virginica 6.5 3.0 5.2 2.0
## 149 virginica 6.2 3.4 5.4 2.3
## 150 virginica 5.9 3.0 5.1 1.8
You can also select a range of variables using the order they appear in the dataset. This is a quick way to take a contiguous block of variables.
## Species Petal.Width Petal.Length Sepal.Width Sepal.Length
## 1 setosa 0.2 1.4 3.5 5.1
## 2 setosa 0.2 1.4 3.0 4.9
## 3 setosa 0.2 1.3 3.2 4.7
## 4 setosa 0.2 1.5 3.1 4.6
## 5 setosa 0.2 1.4 3.6 5.0
## 6 setosa 0.4 1.7 3.9 5.4
## 7 setosa 0.3 1.4 3.4 4.6
## 8 setosa 0.2 1.5 3.4 5.0
## 9 setosa 0.2 1.4 2.9 4.4
## 10 setosa 0.1 1.5 3.1 4.9
## 11 setosa 0.2 1.5 3.7 5.4
## 12 setosa 0.2 1.6 3.4 4.8
## 13 setosa 0.1 1.4 3.0 4.8
## 14 setosa 0.1 1.1 3.0 4.3
## 15 setosa 0.2 1.2 4.0 5.8
## 16 setosa 0.4 1.5 4.4 5.7
## 17 setosa 0.4 1.3 3.9 5.4
## 18 setosa 0.3 1.4 3.5 5.1
## 19 setosa 0.3 1.7 3.8 5.7
## 20 setosa 0.3 1.5 3.8 5.1
## 21 setosa 0.2 1.7 3.4 5.4
## 22 setosa 0.4 1.5 3.7 5.1
## 23 setosa 0.2 1.0 3.6 4.6
## 24 setosa 0.5 1.7 3.3 5.1
## 25 setosa 0.2 1.9 3.4 4.8
## 26 setosa 0.2 1.6 3.0 5.0
## 27 setosa 0.4 1.6 3.4 5.0
## 28 setosa 0.2 1.5 3.5 5.2
## 29 setosa 0.2 1.4 3.4 5.2
## 30 setosa 0.2 1.6 3.2 4.7
## 31 setosa 0.2 1.6 3.1 4.8
## 32 setosa 0.4 1.5 3.4 5.4
## 33 setosa 0.1 1.5 4.1 5.2
## 34 setosa 0.2 1.4 4.2 5.5
## 35 setosa 0.2 1.5 3.1 4.9
## 36 setosa 0.2 1.2 3.2 5.0
## 37 setosa 0.2 1.3 3.5 5.5
## 38 setosa 0.1 1.4 3.6 4.9
## 39 setosa 0.2 1.3 3.0 4.4
## 40 setosa 0.2 1.5 3.4 5.1
## 41 setosa 0.3 1.3 3.5 5.0
## 42 setosa 0.3 1.3 2.3 4.5
## 43 setosa 0.2 1.3 3.2 4.4
## 44 setosa 0.6 1.6 3.5 5.0
## 45 setosa 0.4 1.9 3.8 5.1
## 46 setosa 0.3 1.4 3.0 4.8
## 47 setosa 0.2 1.6 3.8 5.1
## 48 setosa 0.2 1.4 3.2 4.6
## 49 setosa 0.2 1.5 3.7 5.3
## 50 setosa 0.2 1.4 3.3 5.0
## 51 versicolor 1.4 4.7 3.2 7.0
## 52 versicolor 1.5 4.5 3.2 6.4
## 53 versicolor 1.5 4.9 3.1 6.9
## 54 versicolor 1.3 4.0 2.3 5.5
## 55 versicolor 1.5 4.6 2.8 6.5
## 56 versicolor 1.3 4.5 2.8 5.7
## 57 versicolor 1.6 4.7 3.3 6.3
## 58 versicolor 1.0 3.3 2.4 4.9
## 59 versicolor 1.3 4.6 2.9 6.6
## 60 versicolor 1.4 3.9 2.7 5.2
## 61 versicolor 1.0 3.5 2.0 5.0
## 62 versicolor 1.5 4.2 3.0 5.9
## 63 versicolor 1.0 4.0 2.2 6.0
## 64 versicolor 1.4 4.7 2.9 6.1
## 65 versicolor 1.3 3.6 2.9 5.6
## 66 versicolor 1.4 4.4 3.1 6.7
## 67 versicolor 1.5 4.5 3.0 5.6
## 68 versicolor 1.0 4.1 2.7 5.8
## 69 versicolor 1.5 4.5 2.2 6.2
## 70 versicolor 1.1 3.9 2.5 5.6
## 71 versicolor 1.8 4.8 3.2 5.9
## 72 versicolor 1.3 4.0 2.8 6.1
## 73 versicolor 1.5 4.9 2.5 6.3
## 74 versicolor 1.2 4.7 2.8 6.1
## 75 versicolor 1.3 4.3 2.9 6.4
## 76 versicolor 1.4 4.4 3.0 6.6
## 77 versicolor 1.4 4.8 2.8 6.8
## 78 versicolor 1.7 5.0 3.0 6.7
## 79 versicolor 1.5 4.5 2.9 6.0
## 80 versicolor 1.0 3.5 2.6 5.7
## 81 versicolor 1.1 3.8 2.4 5.5
## 82 versicolor 1.0 3.7 2.4 5.5
## 83 versicolor 1.2 3.9 2.7 5.8
## 84 versicolor 1.6 5.1 2.7 6.0
## 85 versicolor 1.5 4.5 3.0 5.4
## 86 versicolor 1.6 4.5 3.4 6.0
## 87 versicolor 1.5 4.7 3.1 6.7
## 88 versicolor 1.3 4.4 2.3 6.3
## 89 versicolor 1.3 4.1 3.0 5.6
## 90 versicolor 1.3 4.0 2.5 5.5
## 91 versicolor 1.2 4.4 2.6 5.5
## 92 versicolor 1.4 4.6 3.0 6.1
## 93 versicolor 1.2 4.0 2.6 5.8
## 94 versicolor 1.0 3.3 2.3 5.0
## 95 versicolor 1.3 4.2 2.7 5.6
## 96 versicolor 1.2 4.2 3.0 5.7
## 97 versicolor 1.3 4.2 2.9 5.7
## 98 versicolor 1.3 4.3 2.9 6.2
## 99 versicolor 1.1 3.0 2.5 5.1
## 100 versicolor 1.3 4.1 2.8 5.7
## 101 virginica 2.5 6.0 3.3 6.3
## 102 virginica 1.9 5.1 2.7 5.8
## 103 virginica 2.1 5.9 3.0 7.1
## 104 virginica 1.8 5.6 2.9 6.3
## 105 virginica 2.2 5.8 3.0 6.5
## 106 virginica 2.1 6.6 3.0 7.6
## 107 virginica 1.7 4.5 2.5 4.9
## 108 virginica 1.8 6.3 2.9 7.3
## 109 virginica 1.8 5.8 2.5 6.7
## 110 virginica 2.5 6.1 3.6 7.2
## 111 virginica 2.0 5.1 3.2 6.5
## 112 virginica 1.9 5.3 2.7 6.4
## 113 virginica 2.1 5.5 3.0 6.8
## 114 virginica 2.0 5.0 2.5 5.7
## 115 virginica 2.4 5.1 2.8 5.8
## 116 virginica 2.3 5.3 3.2 6.4
## 117 virginica 1.8 5.5 3.0 6.5
## 118 virginica 2.2 6.7 3.8 7.7
## 119 virginica 2.3 6.9 2.6 7.7
## 120 virginica 1.5 5.0 2.2 6.0
## 121 virginica 2.3 5.7 3.2 6.9
## 122 virginica 2.0 4.9 2.8 5.6
## 123 virginica 2.0 6.7 2.8 7.7
## 124 virginica 1.8 4.9 2.7 6.3
## 125 virginica 2.1 5.7 3.3 6.7
## 126 virginica 1.8 6.0 3.2 7.2
## 127 virginica 1.8 4.8 2.8 6.2
## 128 virginica 1.8 4.9 3.0 6.1
## 129 virginica 2.1 5.6 2.8 6.4
## 130 virginica 1.6 5.8 3.0 7.2
## 131 virginica 1.9 6.1 2.8 7.4
## 132 virginica 2.0 6.4 3.8 7.9
## 133 virginica 2.2 5.6 2.8 6.4
## 134 virginica 1.5 5.1 2.8 6.3
## 135 virginica 1.4 5.6 2.6 6.1
## 136 virginica 2.3 6.1 3.0 7.7
## 137 virginica 2.4 5.6 3.4 6.3
## 138 virginica 1.8 5.5 3.1 6.4
## 139 virginica 1.8 4.8 3.0 6.0
## 140 virginica 2.1 5.4 3.1 6.9
## 141 virginica 2.4 5.6 3.1 6.7
## 142 virginica 2.3 5.1 3.1 6.9
## 143 virginica 1.9 5.1 2.7 5.8
## 144 virginica 2.3 5.9 3.2 6.8
## 145 virginica 2.5 5.7 3.3 6.7
## 146 virginica 2.3 5.2 3.0 6.7
## 147 virginica 1.9 5.0 2.5 6.3
## 148 virginica 2.0 5.2 3.0 6.5
## 149 virginica 2.3 5.4 3.4 6.2
## 150 virginica 1.8 5.1 3.0 5.9
1.1.4 Rename variables
Renaming variables improves clarity and standardization, especially when variable names will be used in tables, plots, or models. A common practice is to use consistent naming conventions (e.g., snake_case) and avoid special characters.
rename() changes variable names explicitly.
## Sepal_Length Sepal_Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
rename_with() applies a function to multiple variable names. For example, converting to lowercase can help standardize variable names when merging datasets from different sources.
## sepal.length sepal.width petal.length petal.width species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
1.1.5 Sorting in ascending or descending order
Sorting is commonly used for (1) identifying extreme values, (2) reviewing data quality, and (3) preparing ordered outputs for tables.
- put a minus in front of a variable for descending order
Here we sort byPetal.Lengthascending, and within that, byPetal.Widthdescending.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.6 3.6 1.0 0.2 setosa
## 2 4.3 3.0 1.1 0.1 setosa
## 3 5.8 4.0 1.2 0.2 setosa
## 4 5.0 3.2 1.2 0.2 setosa
## 5 5.4 3.9 1.3 0.4 setosa
## 6 5.0 3.5 1.3 0.3 setosa
## 7 4.5 2.3 1.3 0.3 setosa
## 8 4.7 3.2 1.3 0.2 setosa
## 9 5.5 3.5 1.3 0.2 setosa
## 10 4.4 3.0 1.3 0.2 setosa
## 11 4.4 3.2 1.3 0.2 setosa
## 12 4.6 3.4 1.4 0.3 setosa
## 13 5.1 3.5 1.4 0.3 setosa
## 14 4.8 3.0 1.4 0.3 setosa
## 15 5.1 3.5 1.4 0.2 setosa
## 16 4.9 3.0 1.4 0.2 setosa
## 17 5.0 3.6 1.4 0.2 setosa
## 18 4.4 2.9 1.4 0.2 setosa
## 19 5.2 3.4 1.4 0.2 setosa
## 20 5.5 4.2 1.4 0.2 setosa
## 21 4.6 3.2 1.4 0.2 setosa
## 22 5.0 3.3 1.4 0.2 setosa
## 23 4.8 3.0 1.4 0.1 setosa
## 24 4.9 3.6 1.4 0.1 setosa
## 25 5.7 4.4 1.5 0.4 setosa
## 26 5.1 3.7 1.5 0.4 setosa
## 27 5.4 3.4 1.5 0.4 setosa
## 28 5.1 3.8 1.5 0.3 setosa
## 29 4.6 3.1 1.5 0.2 setosa
## 30 5.0 3.4 1.5 0.2 setosa
## 31 5.4 3.7 1.5 0.2 setosa
## 32 5.2 3.5 1.5 0.2 setosa
## 33 4.9 3.1 1.5 0.2 setosa
## 34 5.1 3.4 1.5 0.2 setosa
## 35 5.3 3.7 1.5 0.2 setosa
## 36 4.9 3.1 1.5 0.1 setosa
## 37 5.2 4.1 1.5 0.1 setosa
## 38 5.0 3.5 1.6 0.6 setosa
## 39 5.0 3.4 1.6 0.4 setosa
## 40 4.8 3.4 1.6 0.2 setosa
## 41 5.0 3.0 1.6 0.2 setosa
## 42 4.7 3.2 1.6 0.2 setosa
## 43 4.8 3.1 1.6 0.2 setosa
## 44 5.1 3.8 1.6 0.2 setosa
## 45 5.1 3.3 1.7 0.5 setosa
## 46 5.4 3.9 1.7 0.4 setosa
## 47 5.7 3.8 1.7 0.3 setosa
## 48 5.4 3.4 1.7 0.2 setosa
## 49 5.1 3.8 1.9 0.4 setosa
## 50 4.8 3.4 1.9 0.2 setosa
## 51 5.1 2.5 3.0 1.1 versicolor
## 52 4.9 2.4 3.3 1.0 versicolor
## 53 5.0 2.3 3.3 1.0 versicolor
## 54 5.0 2.0 3.5 1.0 versicolor
## 55 5.7 2.6 3.5 1.0 versicolor
## 56 5.6 2.9 3.6 1.3 versicolor
## 57 5.5 2.4 3.7 1.0 versicolor
## 58 5.5 2.4 3.8 1.1 versicolor
## 59 5.2 2.7 3.9 1.4 versicolor
## 60 5.8 2.7 3.9 1.2 versicolor
## 61 5.6 2.5 3.9 1.1 versicolor
## 62 5.5 2.3 4.0 1.3 versicolor
## 63 6.1 2.8 4.0 1.3 versicolor
## 64 5.5 2.5 4.0 1.3 versicolor
## 65 5.8 2.6 4.0 1.2 versicolor
## 66 6.0 2.2 4.0 1.0 versicolor
## 67 5.6 3.0 4.1 1.3 versicolor
## 68 5.7 2.8 4.1 1.3 versicolor
## 69 5.8 2.7 4.1 1.0 versicolor
## 70 5.9 3.0 4.2 1.5 versicolor
## 71 5.6 2.7 4.2 1.3 versicolor
## 72 5.7 2.9 4.2 1.3 versicolor
## 73 5.7 3.0 4.2 1.2 versicolor
## 74 6.4 2.9 4.3 1.3 versicolor
## 75 6.2 2.9 4.3 1.3 versicolor
## 76 6.7 3.1 4.4 1.4 versicolor
## 77 6.6 3.0 4.4 1.4 versicolor
## 78 6.3 2.3 4.4 1.3 versicolor
## 79 5.5 2.6 4.4 1.2 versicolor
## 80 4.9 2.5 4.5 1.7 virginica
## 81 6.0 3.4 4.5 1.6 versicolor
## 82 6.4 3.2 4.5 1.5 versicolor
## 83 5.6 3.0 4.5 1.5 versicolor
## 84 6.2 2.2 4.5 1.5 versicolor
## 85 6.0 2.9 4.5 1.5 versicolor
## 86 5.4 3.0 4.5 1.5 versicolor
## 87 5.7 2.8 4.5 1.3 versicolor
## 88 6.5 2.8 4.6 1.5 versicolor
## 89 6.1 3.0 4.6 1.4 versicolor
## 90 6.6 2.9 4.6 1.3 versicolor
## 91 6.3 3.3 4.7 1.6 versicolor
## 92 6.7 3.1 4.7 1.5 versicolor
## 93 7.0 3.2 4.7 1.4 versicolor
## 94 6.1 2.9 4.7 1.4 versicolor
## 95 6.1 2.8 4.7 1.2 versicolor
## 96 5.9 3.2 4.8 1.8 versicolor
## 97 6.2 2.8 4.8 1.8 virginica
## 98 6.0 3.0 4.8 1.8 virginica
## 99 6.8 2.8 4.8 1.4 versicolor
## 100 5.6 2.8 4.9 2.0 virginica
## 101 6.3 2.7 4.9 1.8 virginica
## 102 6.1 3.0 4.9 1.8 virginica
## 103 6.9 3.1 4.9 1.5 versicolor
## 104 6.3 2.5 4.9 1.5 versicolor
## 105 5.7 2.5 5.0 2.0 virginica
## 106 6.3 2.5 5.0 1.9 virginica
## 107 6.7 3.0 5.0 1.7 versicolor
## 108 6.0 2.2 5.0 1.5 virginica
## 109 5.8 2.8 5.1 2.4 virginica
## 110 6.9 3.1 5.1 2.3 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 5.8 2.7 5.1 1.9 virginica
## 113 5.8 2.7 5.1 1.9 virginica
## 114 5.9 3.0 5.1 1.8 virginica
## 115 6.0 2.7 5.1 1.6 versicolor
## 116 6.3 2.8 5.1 1.5 virginica
## 117 6.7 3.0 5.2 2.3 virginica
## 118 6.5 3.0 5.2 2.0 virginica
## 119 6.4 3.2 5.3 2.3 virginica
## 120 6.4 2.7 5.3 1.9 virginica
## 121 6.2 3.4 5.4 2.3 virginica
## 122 6.9 3.1 5.4 2.1 virginica
## 123 6.8 3.0 5.5 2.1 virginica
## 124 6.5 3.0 5.5 1.8 virginica
## 125 6.4 3.1 5.5 1.8 virginica
## 126 6.3 3.4 5.6 2.4 virginica
## 127 6.7 3.1 5.6 2.4 virginica
## 128 6.4 2.8 5.6 2.2 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 6.3 2.9 5.6 1.8 virginica
## 131 6.1 2.6 5.6 1.4 virginica
## 132 6.7 3.3 5.7 2.5 virginica
## 133 6.9 3.2 5.7 2.3 virginica
## 134 6.7 3.3 5.7 2.1 virginica
## 135 6.5 3.0 5.8 2.2 virginica
## 136 6.7 2.5 5.8 1.8 virginica
## 137 7.2 3.0 5.8 1.6 virginica
## 138 6.8 3.2 5.9 2.3 virginica
## 139 7.1 3.0 5.9 2.1 virginica
## 140 6.3 3.3 6.0 2.5 virginica
## 141 7.2 3.2 6.0 1.8 virginica
## 142 7.2 3.6 6.1 2.5 virginica
## 143 7.7 3.0 6.1 2.3 virginica
## 144 7.4 2.8 6.1 1.9 virginica
## 145 7.3 2.9 6.3 1.8 virginica
## 146 7.9 3.8 6.4 2.0 virginica
## 147 7.6 3.0 6.6 2.1 virginica
## 148 7.7 3.8 6.7 2.2 virginica
## 149 7.7 2.8 6.7 2.0 virginica
## 150 7.7 2.6 6.9 2.3 virginica
1.1.6 Transform variables
Creating new variables is one of the most common wrangling tasks. Typical use cases include unit conversion, categorization, deriving endpoints, and creating analysis flags.
mutate() creates new variables (or overwrites existing variables). Here, we create a new variable and also rescale an existing variable.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species newvar
## 1 5.1 3.5 0.014 0.2 setosa 35
## 2 4.9 3.0 0.014 0.2 setosa 30
## 3 4.7 3.2 0.013 0.2 setosa 32
## 4 4.6 3.1 0.015 0.2 setosa 31
## 5 5.0 3.6 0.014 0.2 setosa 36
## 6 5.4 3.9 0.017 0.4 setosa 39
## 7 4.6 3.4 0.014 0.3 setosa 34
## 8 5.0 3.4 0.015 0.2 setosa 34
## 9 4.4 2.9 0.014 0.2 setosa 29
## 10 4.9 3.1 0.015 0.1 setosa 31
## 11 5.4 3.7 0.015 0.2 setosa 37
## 12 4.8 3.4 0.016 0.2 setosa 34
## 13 4.8 3.0 0.014 0.1 setosa 30
## 14 4.3 3.0 0.011 0.1 setosa 30
## 15 5.8 4.0 0.012 0.2 setosa 40
## 16 5.7 4.4 0.015 0.4 setosa 44
## 17 5.4 3.9 0.013 0.4 setosa 39
## 18 5.1 3.5 0.014 0.3 setosa 35
## 19 5.7 3.8 0.017 0.3 setosa 38
## 20 5.1 3.8 0.015 0.3 setosa 38
## 21 5.4 3.4 0.017 0.2 setosa 34
## 22 5.1 3.7 0.015 0.4 setosa 37
## 23 4.6 3.6 0.010 0.2 setosa 36
## 24 5.1 3.3 0.017 0.5 setosa 33
## 25 4.8 3.4 0.019 0.2 setosa 34
## 26 5.0 3.0 0.016 0.2 setosa 30
## 27 5.0 3.4 0.016 0.4 setosa 34
## 28 5.2 3.5 0.015 0.2 setosa 35
## 29 5.2 3.4 0.014 0.2 setosa 34
## 30 4.7 3.2 0.016 0.2 setosa 32
## 31 4.8 3.1 0.016 0.2 setosa 31
## 32 5.4 3.4 0.015 0.4 setosa 34
## 33 5.2 4.1 0.015 0.1 setosa 41
## 34 5.5 4.2 0.014 0.2 setosa 42
## 35 4.9 3.1 0.015 0.2 setosa 31
## 36 5.0 3.2 0.012 0.2 setosa 32
## 37 5.5 3.5 0.013 0.2 setosa 35
## 38 4.9 3.6 0.014 0.1 setosa 36
## 39 4.4 3.0 0.013 0.2 setosa 30
## 40 5.1 3.4 0.015 0.2 setosa 34
## 41 5.0 3.5 0.013 0.3 setosa 35
## 42 4.5 2.3 0.013 0.3 setosa 23
## 43 4.4 3.2 0.013 0.2 setosa 32
## 44 5.0 3.5 0.016 0.6 setosa 35
## 45 5.1 3.8 0.019 0.4 setosa 38
## 46 4.8 3.0 0.014 0.3 setosa 30
## 47 5.1 3.8 0.016 0.2 setosa 38
## 48 4.6 3.2 0.014 0.2 setosa 32
## 49 5.3 3.7 0.015 0.2 setosa 37
## 50 5.0 3.3 0.014 0.2 setosa 33
## 51 7.0 3.2 0.047 1.4 versicolor 32
## 52 6.4 3.2 0.045 1.5 versicolor 32
## 53 6.9 3.1 0.049 1.5 versicolor 31
## 54 5.5 2.3 0.040 1.3 versicolor 23
## 55 6.5 2.8 0.046 1.5 versicolor 28
## 56 5.7 2.8 0.045 1.3 versicolor 28
## 57 6.3 3.3 0.047 1.6 versicolor 33
## 58 4.9 2.4 0.033 1.0 versicolor 24
## 59 6.6 2.9 0.046 1.3 versicolor 29
## 60 5.2 2.7 0.039 1.4 versicolor 27
## 61 5.0 2.0 0.035 1.0 versicolor 20
## 62 5.9 3.0 0.042 1.5 versicolor 30
## 63 6.0 2.2 0.040 1.0 versicolor 22
## 64 6.1 2.9 0.047 1.4 versicolor 29
## 65 5.6 2.9 0.036 1.3 versicolor 29
## 66 6.7 3.1 0.044 1.4 versicolor 31
## 67 5.6 3.0 0.045 1.5 versicolor 30
## 68 5.8 2.7 0.041 1.0 versicolor 27
## 69 6.2 2.2 0.045 1.5 versicolor 22
## 70 5.6 2.5 0.039 1.1 versicolor 25
## 71 5.9 3.2 0.048 1.8 versicolor 32
## 72 6.1 2.8 0.040 1.3 versicolor 28
## 73 6.3 2.5 0.049 1.5 versicolor 25
## 74 6.1 2.8 0.047 1.2 versicolor 28
## 75 6.4 2.9 0.043 1.3 versicolor 29
## 76 6.6 3.0 0.044 1.4 versicolor 30
## 77 6.8 2.8 0.048 1.4 versicolor 28
## 78 6.7 3.0 0.050 1.7 versicolor 30
## 79 6.0 2.9 0.045 1.5 versicolor 29
## 80 5.7 2.6 0.035 1.0 versicolor 26
## 81 5.5 2.4 0.038 1.1 versicolor 24
## 82 5.5 2.4 0.037 1.0 versicolor 24
## 83 5.8 2.7 0.039 1.2 versicolor 27
## 84 6.0 2.7 0.051 1.6 versicolor 27
## 85 5.4 3.0 0.045 1.5 versicolor 30
## 86 6.0 3.4 0.045 1.6 versicolor 34
## 87 6.7 3.1 0.047 1.5 versicolor 31
## 88 6.3 2.3 0.044 1.3 versicolor 23
## 89 5.6 3.0 0.041 1.3 versicolor 30
## 90 5.5 2.5 0.040 1.3 versicolor 25
## 91 5.5 2.6 0.044 1.2 versicolor 26
## 92 6.1 3.0 0.046 1.4 versicolor 30
## 93 5.8 2.6 0.040 1.2 versicolor 26
## 94 5.0 2.3 0.033 1.0 versicolor 23
## 95 5.6 2.7 0.042 1.3 versicolor 27
## 96 5.7 3.0 0.042 1.2 versicolor 30
## 97 5.7 2.9 0.042 1.3 versicolor 29
## 98 6.2 2.9 0.043 1.3 versicolor 29
## 99 5.1 2.5 0.030 1.1 versicolor 25
## 100 5.7 2.8 0.041 1.3 versicolor 28
## 101 6.3 3.3 0.060 2.5 virginica 33
## 102 5.8 2.7 0.051 1.9 virginica 27
## 103 7.1 3.0 0.059 2.1 virginica 30
## 104 6.3 2.9 0.056 1.8 virginica 29
## 105 6.5 3.0 0.058 2.2 virginica 30
## 106 7.6 3.0 0.066 2.1 virginica 30
## 107 4.9 2.5 0.045 1.7 virginica 25
## 108 7.3 2.9 0.063 1.8 virginica 29
## 109 6.7 2.5 0.058 1.8 virginica 25
## 110 7.2 3.6 0.061 2.5 virginica 36
## 111 6.5 3.2 0.051 2.0 virginica 32
## 112 6.4 2.7 0.053 1.9 virginica 27
## 113 6.8 3.0 0.055 2.1 virginica 30
## 114 5.7 2.5 0.050 2.0 virginica 25
## 115 5.8 2.8 0.051 2.4 virginica 28
## 116 6.4 3.2 0.053 2.3 virginica 32
## 117 6.5 3.0 0.055 1.8 virginica 30
## 118 7.7 3.8 0.067 2.2 virginica 38
## 119 7.7 2.6 0.069 2.3 virginica 26
## 120 6.0 2.2 0.050 1.5 virginica 22
## 121 6.9 3.2 0.057 2.3 virginica 32
## 122 5.6 2.8 0.049 2.0 virginica 28
## 123 7.7 2.8 0.067 2.0 virginica 28
## 124 6.3 2.7 0.049 1.8 virginica 27
## 125 6.7 3.3 0.057 2.1 virginica 33
## 126 7.2 3.2 0.060 1.8 virginica 32
## 127 6.2 2.8 0.048 1.8 virginica 28
## 128 6.1 3.0 0.049 1.8 virginica 30
## 129 6.4 2.8 0.056 2.1 virginica 28
## 130 7.2 3.0 0.058 1.6 virginica 30
## 131 7.4 2.8 0.061 1.9 virginica 28
## 132 7.9 3.8 0.064 2.0 virginica 38
## 133 6.4 2.8 0.056 2.2 virginica 28
## 134 6.3 2.8 0.051 1.5 virginica 28
## 135 6.1 2.6 0.056 1.4 virginica 26
## 136 7.7 3.0 0.061 2.3 virginica 30
## 137 6.3 3.4 0.056 2.4 virginica 34
## 138 6.4 3.1 0.055 1.8 virginica 31
## 139 6.0 3.0 0.048 1.8 virginica 30
## 140 6.9 3.1 0.054 2.1 virginica 31
## 141 6.7 3.1 0.056 2.4 virginica 31
## 142 6.9 3.1 0.051 2.3 virginica 31
## 143 5.8 2.7 0.051 1.9 virginica 27
## 144 6.8 3.2 0.059 2.3 virginica 32
## 145 6.7 3.3 0.057 2.5 virginica 33
## 146 6.7 3.0 0.052 2.3 virginica 30
## 147 6.3 2.5 0.050 1.9 virginica 25
## 148 6.5 3.0 0.052 2.0 virginica 30
## 149 6.2 3.4 0.054 2.3 virginica 34
## 150 5.9 3.0 0.051 1.8 virginica 30
rep, sample
In practice, you often need to test code before real study data are available. Simulated toy datasets are useful for prototyping pipelines, checking assumptions, and training new team members.
The following example creates a simple longitudinal dataset with three records per subject (year = 0, 1, 2). We also simulate a treatment assignment and a binary adverse event (se_headache).
set.seed(123)
drug_trial <- tibble(
# each id has three
id = rep(1:20, each = 3),
# all has 20
year = rep(0:2, times = 20),
# sample 20 and each repeat 3
age = sample(35:75, 20, TRUE) %>% rep(each = 3),
#
drug = sample(c("Placebo", "Active"), 20, TRUE) %>%
rep(each = 3),
#
se_headache = if_else(
drug == "Placebo",
# the possibility that outcome is 1
sample(0:1, 60, TRUE, c(.95,.05)),
sample(0:1, 60, TRUE, c(.10, .90))
)
#
)
head(drug_trial)## # A tibble: 6 × 5
## id year age drug se_headache
## <int> <int> <int> <chr> <int>
## 1 1 0 65 Active 0
## 2 1 1 65 Active 1
## 3 1 2 65 Active 1
## 4 2 0 49 Active 1
## 5 2 1 49 Active 0
## 6 2 2 49 Active 1
A quick check with head() helps confirm the dataset structure and that variables were created as expected.
## # A tibble: 60 × 6
## id year age drug se_headache complete
## <int> <int> <int> <chr> <int> <dbl>
## 1 1 0 65 Active 0 1
## 2 1 1 65 Active 1 1
## 3 1 2 65 Active 1 1
## 4 2 0 49 Active 1 1
## 5 2 1 49 Active 0 1
## 6 2 2 49 Active 1 1
## 7 3 0 48 Placebo 0 1
## 8 3 1 48 Placebo 0 1
## 9 3 2 48 Placebo 0 1
## 10 4 0 37 Placebo 0 1
## # ℹ 50 more rows
factor is just for factor (labeling the value; whilerenamelabeling the variable) and recode is for different types whose outcome depends on the right side of the equation sign. factor can change the reference group.
This is an important concept in statistical work:
- rename() changes the variable name (column name).
- factor() changes the data type and attaches value labels to categories. Factors also have an ordering of levels, which can affect summaries and regression reference categories.
- recode() maps old values to new values (useful for harmonizing coding across sources, or mapping numeric codes to analytic codes).
Below we (1) create labeled factors for the headache indicator, (2) create a reversed reference order to demonstrate how factor level order changes the reference group, and (3) create a recoded numeric variable.
drug_trial=drug_trial %>%
mutate(mi_f = factor(se_headache , c(0, 1), c("No", "Yes")))%>%
mutate(mi_f_reverse = factor(se_headache , c(1, 0), c("yes", "no")))%>%
# change the reference group by using factor
mutate(mi = recode(se_headache , "0"=2, "1"=1))
str(drug_trial)## tibble [60 × 8] (S3: tbl_df/tbl/data.frame)
## $ id : int [1:60] 1 1 1 2 2 2 3 3 3 4 ...
## $ year : int [1:60] 0 1 2 0 1 2 0 1 2 0 ...
## $ age : int [1:60] 65 65 65 49 49 49 48 48 48 37 ...
## $ drug : chr [1:60] "Active" "Active" "Active" "Active" ...
## $ se_headache : int [1:60] 0 1 1 1 0 1 0 0 0 0 ...
## $ mi_f : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 1 1 1 1 ...
## $ mi_f_reverse: Factor w/ 2 levels "yes","no": 2 1 1 1 2 1 2 2 2 2 ...
## $ mi : num [1:60] 2 1 1 1 2 1 2 2 2 2 ...
##
## No Yes
## 40 20
##
## yes no
## 20 40
Row-wise operations apply calculations to each row independently. This is helpful when you need to combine multiple columns within a row (e.g., “any adverse event occurred”, “max lab value within row”, etc.).
However, for most group-level summaries (e.g., by subject), group_by() is usually more efficient and more common.
## # A tibble: 60 × 9
## # Rowwise:
## id year age drug se_headache mi_f mi_f_reverse mi any_se_year
## <int> <int> <int> <chr> <int> <fct> <fct> <dbl> <lgl>
## 1 1 0 65 Active 0 No no 2 FALSE
## 2 1 1 65 Active 1 Yes yes 1 TRUE
## 3 1 2 65 Active 1 Yes yes 1 TRUE
## 4 2 0 49 Active 1 Yes yes 1 TRUE
## 5 2 1 49 Active 0 No no 2 TRUE
## 6 2 2 49 Active 1 Yes yes 1 TRUE
## 7 3 0 48 Placebo 0 No no 2 FALSE
## 8 3 1 48 Placebo 0 No no 2 TRUE
## 9 3 2 48 Placebo 0 No no 2 TRUE
## 10 4 0 37 Placebo 0 No no 2 FALSE
## # ℹ 50 more rows
what is the difference of rowwise and group_by
- group_by(id) means operations are performed within each subject group. This is typically used for subject-level summaries, longitudinal flags, and group totals.
- rowwise() means operations are performed for each row. This is best when combining multiple variables across columns within the same row.
In the examples below, compare what happens when you compute sum(se_headache, year) under different grouping structures.
## # A tibble: 60 × 9
## # Groups: id [20]
## id year age drug se_headache mi_f mi_f_reverse mi any_se_year
## <int> <int> <int> <chr> <int> <fct> <fct> <dbl> <int>
## 1 1 0 65 Active 0 No no 2 5
## 2 1 1 65 Active 1 Yes yes 1 5
## 3 1 2 65 Active 1 Yes yes 1 5
## 4 2 0 49 Active 1 Yes yes 1 5
## 5 2 1 49 Active 0 No no 2 5
## 6 2 2 49 Active 1 Yes yes 1 5
## 7 3 0 48 Placebo 0 No no 2 3
## 8 3 1 48 Placebo 0 No no 2 3
## 9 3 2 48 Placebo 0 No no 2 3
## 10 4 0 37 Placebo 0 No no 2 3
## # ℹ 50 more rows
## # A tibble: 60 × 9
## # Rowwise:
## id year age drug se_headache mi_f mi_f_reverse mi any_se_year
## <int> <int> <int> <chr> <int> <fct> <fct> <dbl> <int>
## 1 1 0 65 Active 0 No no 2 0
## 2 1 1 65 Active 1 Yes yes 1 2
## 3 1 2 65 Active 1 Yes yes 1 3
## 4 2 0 49 Active 1 Yes yes 1 1
## 5 2 1 49 Active 0 No no 2 1
## 6 2 2 49 Active 1 Yes yes 1 3
## 7 3 0 48 Placebo 0 No no 2 0
## 8 3 1 48 Placebo 0 No no 2 1
## 9 3 2 48 Placebo 0 No no 2 2
## 10 4 0 37 Placebo 0 No no 2 0
## # ℹ 50 more rows
starts_with and end_with
Selector helpers such as starts_with() are useful when working with datasets that have many related variables with consistent naming patterns (e.g., adverse event indicators, lab variables, visit-level measurements).
## # A tibble: 60 × 3
## id year se_headache
## <int> <int> <int>
## 1 1 0 0
## 2 1 1 1
## 3 1 2 1
## 4 2 0 1
## 5 2 1 0
## 6 2 2 1
## 7 3 0 0
## 8 3 1 0
## 9 3 2 0
## 10 4 0 0
## # ℹ 50 more rows
keep all combination levels even 0 times
When summarizing counts for categorical combinations, you sometimes want to display categories even if the count is zero. This is particularly important in reporting (e.g., safety tables, disposition tables).
Here, .drop = FALSE keeps all factor combinations in the grouped output.
drug_trial %>%
filter(year == 0) %>%
filter(age < 65) %>%
group_by(drug, se_headache, .drop = FALSE) %>%
summarise(n = n())## `summarise()` has grouped output by 'drug'. You can
## override using the `.groups` argument.
## # A tibble: 3 × 3
## # Groups: drug [2]
## drug se_headache n
## <chr> <int> <int>
## 1 Active 1 5
## 2 Placebo 0 11
## 3 Placebo 1 1
group_by
A very common pattern is: filter to the analytic dataset, group by a key variable (e.g., treatment), then summarize.
## # A tibble: 2 × 2
## drug mean_age
## <chr> <dbl>
## 1 Active 54.7
## 2 Placebo 53.2
1.1.7 Working with pipes %>%
The pipe operator %>% is used to chain steps in a readable sequence. This style matches how we typically describe data processing: “filter, then transform, then select, then sort.”
In practice, pipes help build reproducible and auditable wrangling pipelines.
iris %>%
filter(Species=="setosa") %>%
mutate (newvar=Sepal.Width*10) %>%
select (-Sepal.Width, -Petal.Width) %>%
arrange(-Sepal.Length, newvar)## Sepal.Length Petal.Length Species newvar
## 1 5.8 1.2 setosa 40
## 2 5.7 1.7 setosa 38
## 3 5.7 1.5 setosa 44
## 4 5.5 1.3 setosa 35
## 5 5.5 1.4 setosa 42
## 6 5.4 1.7 setosa 34
## 7 5.4 1.5 setosa 34
## 8 5.4 1.5 setosa 37
## 9 5.4 1.7 setosa 39
## 10 5.4 1.3 setosa 39
## 11 5.3 1.5 setosa 37
## 12 5.2 1.4 setosa 34
## 13 5.2 1.5 setosa 35
## 14 5.2 1.5 setosa 41
## 15 5.1 1.7 setosa 33
## 16 5.1 1.5 setosa 34
## 17 5.1 1.4 setosa 35
## 18 5.1 1.4 setosa 35
## 19 5.1 1.5 setosa 37
## 20 5.1 1.5 setosa 38
## 21 5.1 1.9 setosa 38
## 22 5.1 1.6 setosa 38
## 23 5.0 1.6 setosa 30
## 24 5.0 1.2 setosa 32
## 25 5.0 1.4 setosa 33
## 26 5.0 1.5 setosa 34
## 27 5.0 1.6 setosa 34
## 28 5.0 1.3 setosa 35
## 29 5.0 1.6 setosa 35
## 30 5.0 1.4 setosa 36
## 31 4.9 1.4 setosa 30
## 32 4.9 1.5 setosa 31
## 33 4.9 1.5 setosa 31
## 34 4.9 1.4 setosa 36
## 35 4.8 1.4 setosa 30
## 36 4.8 1.4 setosa 30
## 37 4.8 1.6 setosa 31
## 38 4.8 1.6 setosa 34
## 39 4.8 1.9 setosa 34
## 40 4.7 1.3 setosa 32
## 41 4.7 1.6 setosa 32
## 42 4.6 1.5 setosa 31
## 43 4.6 1.4 setosa 32
## 44 4.6 1.4 setosa 34
## 45 4.6 1.0 setosa 36
## 46 4.5 1.3 setosa 23
## 47 4.4 1.4 setosa 29
## 48 4.4 1.3 setosa 30
## 49 4.4 1.3 setosa 32
## 50 4.3 1.1 setosa 30
1.1.8 Pivot wider (long to wide)
Reshaping data is common in longitudinal analysis and reporting.
- Long format (one row per subject-visit) is common for modeling and plotting.
- Wide format (one row per subject, with multiple columns for repeated measures) is sometimes required for specific reporting formats or certain algorithms.
- If no unique identifier row in each group doesn’t work
Pivoting wider requires a unique key for each output row. If the key is not unique, values can collide and pivoting may fail or require aggregation.
## Warning: Values from `Sepal.Length` are not uniquely
## identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise
## duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} |>
## dplyr::summarise(n = dplyr::n(), .by =
## c(Sepal.Width, Petal.Length, Petal.Width, Species))
## |>
## dplyr::filter(n > 1L)
## # A tibble: 143 × 6
## Sepal.Width Petal.Length Petal.Width setosa versicolor virginica
## <dbl> <dbl> <dbl> <list> <list> <list>
## 1 3.5 1.4 0.2 <dbl [1]> <NULL> <NULL>
## 2 3 1.4 0.2 <dbl [1]> <NULL> <NULL>
## 3 3.2 1.3 0.2 <dbl [2]> <NULL> <NULL>
## 4 3.1 1.5 0.2 <dbl [2]> <NULL> <NULL>
## 5 3.6 1.4 0.2 <dbl [1]> <NULL> <NULL>
## 6 3.9 1.7 0.4 <dbl [1]> <NULL> <NULL>
## 7 3.4 1.4 0.3 <dbl [1]> <NULL> <NULL>
## 8 3.4 1.5 0.2 <dbl [2]> <NULL> <NULL>
## 9 2.9 1.4 0.2 <dbl [1]> <NULL> <NULL>
## 10 3.1 1.5 0.1 <dbl [1]> <NULL> <NULL>
## # ℹ 133 more rows
- Create a unique identifier row for each name and then use pivot_wider
Here we create a row number within eachSpeciesgroup, which becomes part of the unique key.
widedata <- iris %>%
# create groups then assign unique identifier row number in each group
group_by(Species) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from=Species, values_from= c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width))
widedata## # A tibble: 50 × 13
## row Petal.Length_setosa Petal.Length_versicolor Petal.Length_virginica
## <int> <dbl> <dbl> <dbl>
## 1 1 1.4 4.7 6
## 2 2 1.4 4.5 5.1
## 3 3 1.3 4.9 5.9
## 4 4 1.5 4 5.6
## 5 5 1.4 4.6 5.8
## 6 6 1.7 4.5 6.6
## 7 7 1.4 4.7 4.5
## 8 8 1.5 3.3 6.3
## 9 9 1.4 4.6 5.8
## 10 10 1.5 3.9 6.1
## # ℹ 40 more rows
## # ℹ 9 more variables: Sepal.Length_setosa <dbl>, Sepal.Length_versicolor <dbl>,
## # Sepal.Length_virginica <dbl>, Petal.Width_setosa <dbl>,
## # Petal.Width_versicolor <dbl>, Petal.Width_virginica <dbl>,
## # Sepal.Width_setosa <dbl>, Sepal.Width_versicolor <dbl>,
## # Sepal.Width_virginica <dbl>
This is a smaller example, pivoting only a subset of variables.
iris %>%
group_by(Species) %>%
mutate(row = row_number()) %>%
pivot_wider( names_from=Species, values_from= c(Petal.Length, Petal.Width))## # A tibble: 150 × 9
## Sepal.Length Sepal.Width row Petal.Length_setosa Petal.Length_versicolor
## <dbl> <dbl> <int> <dbl> <dbl>
## 1 5.1 3.5 1 1.4 NA
## 2 4.9 3 2 1.4 NA
## 3 4.7 3.2 3 1.3 NA
## 4 4.6 3.1 4 1.5 NA
## 5 5 3.6 5 1.4 NA
## 6 5.4 3.9 6 1.7 NA
## 7 4.6 3.4 7 1.4 NA
## 8 5 3.4 8 1.5 NA
## 9 4.4 2.9 9 1.4 NA
## 10 4.9 3.1 10 1.5 NA
## # ℹ 140 more rows
## # ℹ 4 more variables: Petal.Length_virginica <dbl>, Petal.Width_setosa <dbl>,
## # Petal.Width_versicolor <dbl>, Petal.Width_virginica <dbl>
1.1.9 Pivot longer (wide to long)
why long format? there are many analytic techniques that require our data to be in this format, like plotting.
In clinical trial and observational research, long format is often the default because it supports:
- longitudinal models (e.g., MMRM, mixed models)
- visit-based summaries
- plotting repeated measures over time
- tidy workflows where “one observation per row” is preferred
The next example converts selected wide columns into a long structure.
longdata = pivot_longer(widedata, -
c( "row" , "Petal.Length_setosa" , "Petal.Length_versicolor",
"Petal.Length_virginica", "Sepal.Length_setosa" , "Sepal.Length_versicolor",
"Sepal.Length_virginica" , "Petal.Width_setosa" , "Petal.Width_versicolor" ,
"Petal.Width_virginica" ) , names_to="Sepal.Width", values_to="Sepal.Width.value")
longdata## # A tibble: 150 × 12
## row Petal.Length_setosa Petal.Length_versicolor Petal.Length_virginica
## <int> <dbl> <dbl> <dbl>
## 1 1 1.4 4.7 6
## 2 1 1.4 4.7 6
## 3 1 1.4 4.7 6
## 4 2 1.4 4.5 5.1
## 5 2 1.4 4.5 5.1
## 6 2 1.4 4.5 5.1
## 7 3 1.3 4.9 5.9
## 8 3 1.3 4.9 5.9
## 9 3 1.3 4.9 5.9
## 10 4 1.5 4 5.6
## # ℹ 140 more rows
## # ℹ 8 more variables: Sepal.Length_setosa <dbl>, Sepal.Length_versicolor <dbl>,
## # Sepal.Length_virginica <dbl>, Petal.Width_setosa <dbl>,
## # Petal.Width_versicolor <dbl>, Petal.Width_virginica <dbl>,
## # Sepal.Width <chr>, Sepal.Width.value <dbl>
The following babies dataset is a simple example of repeated measurements stored in wide format (e.g., weight_3, weight_6). This naming pattern is common in real datasets: variable name + underscore + timepoint.
babies <- tibble(
id = 1001:1008,
sex = c("F", "F", "M", "F", "M", "M", "M", "F"),
weight_3 = c(9, 11, 17, 16, 11, 17, 16, 15),
weight_6 = c(13, 16, 20, 18, 15, 21, 17, 16),
weight_9 = c(16, 17, 23, 21, 16, 25, 19, 18),
weight_12 = c(17, 20, 24, 22, 18, 26, 21, 19)
) %>%
print()## # A tibble: 8 × 6
## id sex weight_3 weight_6 weight_9 weight_12
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1001 F 9 13 16 17
## 2 1002 F 11 16 17 20
## 3 1003 M 17 20 23 24
## 4 1004 F 16 18 21 22
## 5 1005 M 11 15 16 18
## 6 1006 M 17 21 25 26
## 7 1007 M 16 17 19 21
## 8 1008 F 15 16 18 19
select which cols
pivot_longer() converts multiple columns into key-value pairs. Here, we reshape all columns that start with "weight".
babies %>%
pivot_longer(
cols = starts_with("weight"), #which cols
names_to = "months",
names_prefix = "weight_", #remove prefix
values_to = "weight"
)## # A tibble: 32 × 4
## id sex months weight
## <int> <chr> <chr> <dbl>
## 1 1001 F 3 9
## 2 1001 F 6 13
## 3 1001 F 9 16
## 4 1001 F 12 17
## 5 1002 F 3 11
## 6 1002 F 6 16
## 7 1002 F 9 17
## 8 1002 F 12 20
## 9 1003 M 3 17
## 10 1003 M 6 20
## # ℹ 22 more rows
using names_prefix argument
names_prefix removes a fixed prefix from the variable names. This is useful to keep only the meaningful portion (e.g., the timepoint).
“it is a regular expression used to remove matching text from the start of each variable name.”
babies %>%
pivot_longer(
cols = starts_with("weight"),
names_to = "months",
names_prefix = "\\w+_"
)## # A tibble: 32 × 4
## id sex months value
## <int> <chr> <chr> <dbl>
## 1 1001 F 3 9
## 2 1001 F 6 13
## 3 1001 F 9 16
## 4 1001 F 12 17
## 5 1002 F 3 11
## 6 1002 F 6 16
## 7 1002 F 9 17
## 8 1002 F 12 20
## 9 1003 M 3 17
## 10 1003 M 6 20
## # ℹ 22 more rows
using default argument
If you do not specify names_to or values_to, pivot_longer() uses defaults. This is convenient for quick exploration, but in practice you usually want explicit names for clarity.
## # A tibble: 32 × 4
## id sex name value
## <int> <chr> <chr> <dbl>
## 1 1001 F weight_3 9
## 2 1001 F weight_6 13
## 3 1001 F weight_9 16
## 4 1001 F weight_12 17
## 5 1002 F weight_3 11
## 6 1002 F weight_6 16
## 7 1002 F weight_9 17
## 8 1002 F weight_12 20
## 9 1003 M weight_3 17
## 10 1003 M weight_6 20
## # ℹ 22 more rows
This extended example includes both weight and length measured at multiple timepoints. We will demonstrate how to reshape multiple measurement types in one step.
set.seed(123)
babies <- tibble(
id = 1001:1008,
sex = c("F", "F", "M", "F", "M", "M", "M", "F"),
weight_3 = c(9, 11, 17, 16, 11, 17, 16, 15),
weight_6 = c(13, 16, 20, 18, 15, 21, 17, 16),
weight_9 = c(16, 17, 23, 21, 16, 25, 19, 18),
weight_12 = c(17, 20, 24, 22, 18, 26, 21, 19),
length_3 = c(17, 19, 23, 20, 18, 22, 21, 18),
length_6 = round(length_3 + rnorm(8, 2, 1)),
length_9 = round(length_6 + rnorm(8, 2, 1)),
length_12 = round(length_9 + rnorm(8, 2, 1)),
) %>%
print()## # A tibble: 8 × 10
## id sex weight_3 weight_6 weight_9 weight_12 length_3 length_6 length_9
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 F 9 13 16 17 17 18 19
## 2 1002 F 11 16 17 20 19 21 23
## 3 1003 M 17 20 23 24 23 27 30
## 4 1004 F 16 18 21 22 20 22 24
## 5 1005 M 11 15 16 18 18 20 22
## 6 1006 M 17 21 25 26 22 26 28
## 7 1007 M 16 17 19 21 21 23 24
## 8 1008 F 15 16 18 19 18 19 23
## # ℹ 1 more variable: length_12 <dbl>
- pivoting multiple sets of cols
When variable names contain multiple components (e.g.,weight_3), you can split them into separate columns usingnames_sep.
Using.valueis a powerful technique: it tellspivot_longer()to create a separate output column for each measurement type (e.g., weight and length).
This format is especially helpful for longitudinal plotting and modeling where you want one row per subject-month.
## # A tibble: 32 × 5
## id sex months weight length
## <int> <chr> <chr> <dbl> <dbl>
## 1 1001 F 3 9 17
## 2 1001 F 6 13 18
## 3 1001 F 9 16 19
## 4 1001 F 12 17 21
## 5 1002 F 3 11 19
## 6 1002 F 6 16 21
## 7 1002 F 9 17 23
## 8 1002 F 12 20 23
## 9 1003 M 3 17 23
## 10 1003 M 6 20 27
## # ℹ 22 more rows
This alternative example reshapes into a simpler long format with one value column. This is useful when you want a single measurement (e.g., weight only).
babies %>%
pivot_longer(
cols = c(-id, -sex),
names_to = "months",
values_to = "weight"
) %>%
print()## # A tibble: 64 × 4
## id sex months weight
## <int> <chr> <chr> <dbl>
## 1 1001 F weight_3 9
## 2 1001 F weight_6 13
## 3 1001 F weight_9 16
## 4 1001 F weight_12 17
## 5 1001 F length_3 17
## 6 1001 F length_6 18
## 7 1001 F length_9 19
## 8 1001 F length_12 21
## 9 1002 F weight_3 11
## 10 1002 F weight_6 16
## # ℹ 54 more rows
using names_sep and .value argument
Here we keep both parts of the name as separate columns: measure and months. This creates a “tidy” dataset with one value column and a measure identifier.
## # A tibble: 64 × 5
## id sex measure months value
## <int> <chr> <chr> <chr> <dbl>
## 1 1001 F weight 3 9
## 2 1001 F weight 6 13
## 3 1001 F weight 9 16
## 4 1001 F weight 12 17
## 5 1001 F length 3 17
## 6 1001 F length 6 18
## 7 1001 F length 9 19
## 8 1001 F length 12 21
## 9 1002 F weight 3 11
## 10 1002 F weight 6 16
## # ℹ 54 more rows
.value tells pivot_longer() to create a new column for each unique character string that is in front of the underscore
This is often the most practical output when you have multiple repeated-measure variables collected at the same timepoints.
## # A tibble: 32 × 5
## id sex months weight length
## <int> <chr> <chr> <dbl> <dbl>
## 1 1001 F 3 9 17
## 2 1001 F 6 13 18
## 3 1001 F 9 16 19
## 4 1001 F 12 17 21
## 5 1002 F 3 11 19
## 6 1002 F 6 16 21
## 7 1002 F 9 17 23
## 8 1002 F 12 20 23
## 9 1003 M 3 17 23
## 10 1003 M 6 20 27
## # ℹ 22 more rows
Swapping the order changes the interpretation of the split name pieces. This example is mainly to show that the mapping is positional.
## # A tibble: 16 × 7
## id sex months `3` `6` `9` `12`
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1001 F weight 9 13 16 17
## 2 1001 F length 17 18 19 21
## 3 1002 F weight 11 16 17 20
## 4 1002 F length 19 21 23 23
## 5 1003 M weight 17 20 23 24
## 6 1003 M length 23 27 30 33
## 7 1004 F weight 16 18 21 22
## 8 1004 F length 20 22 24 26
## 9 1005 M weight 11 15 16 18
## 10 1005 M length 18 20 22 23
## 11 1006 M weight 17 21 25 26
## 12 1006 M length 22 26 28 30
## 13 1007 M weight 16 17 19 21
## 14 1007 M length 21 23 24 25
## 15 1008 F weight 15 16 18 19
## 16 1008 F length 18 19 23 24
- Pivot wider again (long to wide)
After processing data in long format (e.g., computing summaries by month), you may want to convert back to wide format for reporting.
pivoting to multiple sets of cols
This demonstrates a common workflow: wide → long (for analysis) → wide (for reporting).
babies %>%
pivot_longer(
cols = c(-id, -sex),
names_to = c( ".value","months" ),
names_sep = "_"
) %>%
pivot_wider( names_from=months, values_from= c(weight, length ))## # A tibble: 8 × 10
## id sex weight_3 weight_6 weight_9 weight_12 length_3 length_6 length_9
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 F 9 13 16 17 17 18 19
## 2 1002 F 11 16 17 20 19 21 23
## 3 1003 M 17 20 23 24 23 27 30
## 4 1004 F 16 18 21 22 20 22 24
## 5 1005 M 11 15 16 18 18 20 22
## 6 1006 M 17 21 25 26 22 26 28
## 7 1007 M 16 17 19 21 21 23 24
## 8 1008 F 15 16 18 19 18 19 23
## # ℹ 1 more variable: length_12 <dbl>
This example pivots the previously created longdata object back to wide.
## # A tibble: 50 × 13
## row Petal.Length_setosa Petal.Length_versicolor Petal.Length_virginica
## <int> <dbl> <dbl> <dbl>
## 1 1 1.4 4.7 6
## 2 2 1.4 4.5 5.1
## 3 3 1.3 4.9 5.9
## 4 4 1.5 4 5.6
## 5 5 1.4 4.6 5.8
## 6 6 1.7 4.5 6.6
## 7 7 1.4 4.7 4.5
## 8 8 1.5 3.3 6.3
## 9 9 1.4 4.6 5.8
## 10 10 1.5 3.9 6.1
## # ℹ 40 more rows
## # ℹ 9 more variables: Sepal.Length_setosa <dbl>, Sepal.Length_versicolor <dbl>,
## # Sepal.Length_virginica <dbl>, Petal.Width_setosa <dbl>,
## # Petal.Width_versicolor <dbl>, Petal.Width_virginica <dbl>,
## # Sepal.Width_setosa <dbl>, Sepal.Width_versicolor <dbl>,
## # Sepal.Width_virginica <dbl>
1.1.10 Separate columns
Separating columns is useful when a single string variable contains multiple components. In practice, this appears in fields like codes (e.g., lab test codes), composite identifiers, or delimited values.
Here we split Species into three parts using "o" as the separator. This is a toy example to demonstrate the function.
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in
## 100 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
## 14, 15, 16, 17, 18, 19, 20, ...].
## Sepal.Length Sepal.Width Petal.Length Petal.Width integer decimal third
## 1 5.1 3.5 1.4 0.2 set sa <NA>
## 2 4.9 3.0 1.4 0.2 set sa <NA>
## 3 4.7 3.2 1.3 0.2 set sa <NA>
## 4 4.6 3.1 1.5 0.2 set sa <NA>
## 5 5.0 3.6 1.4 0.2 set sa <NA>
## 6 5.4 3.9 1.7 0.4 set sa <NA>
## 7 4.6 3.4 1.4 0.3 set sa <NA>
## 8 5.0 3.4 1.5 0.2 set sa <NA>
## 9 4.4 2.9 1.4 0.2 set sa <NA>
## 10 4.9 3.1 1.5 0.1 set sa <NA>
## 11 5.4 3.7 1.5 0.2 set sa <NA>
## 12 4.8 3.4 1.6 0.2 set sa <NA>
## 13 4.8 3.0 1.4 0.1 set sa <NA>
## 14 4.3 3.0 1.1 0.1 set sa <NA>
## 15 5.8 4.0 1.2 0.2 set sa <NA>
## 16 5.7 4.4 1.5 0.4 set sa <NA>
## 17 5.4 3.9 1.3 0.4 set sa <NA>
## 18 5.1 3.5 1.4 0.3 set sa <NA>
## 19 5.7 3.8 1.7 0.3 set sa <NA>
## 20 5.1 3.8 1.5 0.3 set sa <NA>
## 21 5.4 3.4 1.7 0.2 set sa <NA>
## 22 5.1 3.7 1.5 0.4 set sa <NA>
## 23 4.6 3.6 1.0 0.2 set sa <NA>
## 24 5.1 3.3 1.7 0.5 set sa <NA>
## 25 4.8 3.4 1.9 0.2 set sa <NA>
## 26 5.0 3.0 1.6 0.2 set sa <NA>
## 27 5.0 3.4 1.6 0.4 set sa <NA>
## 28 5.2 3.5 1.5 0.2 set sa <NA>
## 29 5.2 3.4 1.4 0.2 set sa <NA>
## 30 4.7 3.2 1.6 0.2 set sa <NA>
## 31 4.8 3.1 1.6 0.2 set sa <NA>
## 32 5.4 3.4 1.5 0.4 set sa <NA>
## 33 5.2 4.1 1.5 0.1 set sa <NA>
## 34 5.5 4.2 1.4 0.2 set sa <NA>
## 35 4.9 3.1 1.5 0.2 set sa <NA>
## 36 5.0 3.2 1.2 0.2 set sa <NA>
## 37 5.5 3.5 1.3 0.2 set sa <NA>
## 38 4.9 3.6 1.4 0.1 set sa <NA>
## 39 4.4 3.0 1.3 0.2 set sa <NA>
## 40 5.1 3.4 1.5 0.2 set sa <NA>
## 41 5.0 3.5 1.3 0.3 set sa <NA>
## 42 4.5 2.3 1.3 0.3 set sa <NA>
## 43 4.4 3.2 1.3 0.2 set sa <NA>
## 44 5.0 3.5 1.6 0.6 set sa <NA>
## 45 5.1 3.8 1.9 0.4 set sa <NA>
## 46 4.8 3.0 1.4 0.3 set sa <NA>
## 47 5.1 3.8 1.6 0.2 set sa <NA>
## 48 4.6 3.2 1.4 0.2 set sa <NA>
## 49 5.3 3.7 1.5 0.2 set sa <NA>
## 50 5.0 3.3 1.4 0.2 set sa <NA>
## 51 7.0 3.2 4.7 1.4 versic l r
## 52 6.4 3.2 4.5 1.5 versic l r
## 53 6.9 3.1 4.9 1.5 versic l r
## 54 5.5 2.3 4.0 1.3 versic l r
## 55 6.5 2.8 4.6 1.5 versic l r
## 56 5.7 2.8 4.5 1.3 versic l r
## 57 6.3 3.3 4.7 1.6 versic l r
## 58 4.9 2.4 3.3 1.0 versic l r
## 59 6.6 2.9 4.6 1.3 versic l r
## 60 5.2 2.7 3.9 1.4 versic l r
## 61 5.0 2.0 3.5 1.0 versic l r
## 62 5.9 3.0 4.2 1.5 versic l r
## 63 6.0 2.2 4.0 1.0 versic l r
## 64 6.1 2.9 4.7 1.4 versic l r
## 65 5.6 2.9 3.6 1.3 versic l r
## 66 6.7 3.1 4.4 1.4 versic l r
## 67 5.6 3.0 4.5 1.5 versic l r
## 68 5.8 2.7 4.1 1.0 versic l r
## 69 6.2 2.2 4.5 1.5 versic l r
## 70 5.6 2.5 3.9 1.1 versic l r
## 71 5.9 3.2 4.8 1.8 versic l r
## 72 6.1 2.8 4.0 1.3 versic l r
## 73 6.3 2.5 4.9 1.5 versic l r
## 74 6.1 2.8 4.7 1.2 versic l r
## 75 6.4 2.9 4.3 1.3 versic l r
## 76 6.6 3.0 4.4 1.4 versic l r
## 77 6.8 2.8 4.8 1.4 versic l r
## 78 6.7 3.0 5.0 1.7 versic l r
## 79 6.0 2.9 4.5 1.5 versic l r
## 80 5.7 2.6 3.5 1.0 versic l r
## 81 5.5 2.4 3.8 1.1 versic l r
## 82 5.5 2.4 3.7 1.0 versic l r
## 83 5.8 2.7 3.9 1.2 versic l r
## 84 6.0 2.7 5.1 1.6 versic l r
## 85 5.4 3.0 4.5 1.5 versic l r
## 86 6.0 3.4 4.5 1.6 versic l r
## 87 6.7 3.1 4.7 1.5 versic l r
## 88 6.3 2.3 4.4 1.3 versic l r
## 89 5.6 3.0 4.1 1.3 versic l r
## 90 5.5 2.5 4.0 1.3 versic l r
## 91 5.5 2.6 4.4 1.2 versic l r
## 92 6.1 3.0 4.6 1.4 versic l r
## 93 5.8 2.6 4.0 1.2 versic l r
## 94 5.0 2.3 3.3 1.0 versic l r
## 95 5.6 2.7 4.2 1.3 versic l r
## 96 5.7 3.0 4.2 1.2 versic l r
## 97 5.7 2.9 4.2 1.3 versic l r
## 98 6.2 2.9 4.3 1.3 versic l r
## 99 5.1 2.5 3.0 1.1 versic l r
## 100 5.7 2.8 4.1 1.3 versic l r
## 101 6.3 3.3 6.0 2.5 virginica <NA> <NA>
## 102 5.8 2.7 5.1 1.9 virginica <NA> <NA>
## 103 7.1 3.0 5.9 2.1 virginica <NA> <NA>
## 104 6.3 2.9 5.6 1.8 virginica <NA> <NA>
## 105 6.5 3.0 5.8 2.2 virginica <NA> <NA>
## 106 7.6 3.0 6.6 2.1 virginica <NA> <NA>
## 107 4.9 2.5 4.5 1.7 virginica <NA> <NA>
## 108 7.3 2.9 6.3 1.8 virginica <NA> <NA>
## 109 6.7 2.5 5.8 1.8 virginica <NA> <NA>
## 110 7.2 3.6 6.1 2.5 virginica <NA> <NA>
## 111 6.5 3.2 5.1 2.0 virginica <NA> <NA>
## 112 6.4 2.7 5.3 1.9 virginica <NA> <NA>
## 113 6.8 3.0 5.5 2.1 virginica <NA> <NA>
## 114 5.7 2.5 5.0 2.0 virginica <NA> <NA>
## 115 5.8 2.8 5.1 2.4 virginica <NA> <NA>
## 116 6.4 3.2 5.3 2.3 virginica <NA> <NA>
## 117 6.5 3.0 5.5 1.8 virginica <NA> <NA>
## 118 7.7 3.8 6.7 2.2 virginica <NA> <NA>
## 119 7.7 2.6 6.9 2.3 virginica <NA> <NA>
## 120 6.0 2.2 5.0 1.5 virginica <NA> <NA>
## 121 6.9 3.2 5.7 2.3 virginica <NA> <NA>
## 122 5.6 2.8 4.9 2.0 virginica <NA> <NA>
## 123 7.7 2.8 6.7 2.0 virginica <NA> <NA>
## 124 6.3 2.7 4.9 1.8 virginica <NA> <NA>
## 125 6.7 3.3 5.7 2.1 virginica <NA> <NA>
## 126 7.2 3.2 6.0 1.8 virginica <NA> <NA>
## 127 6.2 2.8 4.8 1.8 virginica <NA> <NA>
## 128 6.1 3.0 4.9 1.8 virginica <NA> <NA>
## 129 6.4 2.8 5.6 2.1 virginica <NA> <NA>
## 130 7.2 3.0 5.8 1.6 virginica <NA> <NA>
## 131 7.4 2.8 6.1 1.9 virginica <NA> <NA>
## 132 7.9 3.8 6.4 2.0 virginica <NA> <NA>
## 133 6.4 2.8 5.6 2.2 virginica <NA> <NA>
## 134 6.3 2.8 5.1 1.5 virginica <NA> <NA>
## 135 6.1 2.6 5.6 1.4 virginica <NA> <NA>
## 136 7.7 3.0 6.1 2.3 virginica <NA> <NA>
## 137 6.3 3.4 5.6 2.4 virginica <NA> <NA>
## 138 6.4 3.1 5.5 1.8 virginica <NA> <NA>
## 139 6.0 3.0 4.8 1.8 virginica <NA> <NA>
## 140 6.9 3.1 5.4 2.1 virginica <NA> <NA>
## 141 6.7 3.1 5.6 2.4 virginica <NA> <NA>
## 142 6.9 3.1 5.1 2.3 virginica <NA> <NA>
## 143 5.8 2.7 5.1 1.9 virginica <NA> <NA>
## 144 6.8 3.2 5.9 2.3 virginica <NA> <NA>
## 145 6.7 3.3 5.7 2.5 virginica <NA> <NA>
## 146 6.7 3.0 5.2 2.3 virginica <NA> <NA>
## 147 6.3 2.5 5.0 1.9 virginica <NA> <NA>
## 148 6.5 3.0 5.2 2.0 virginica <NA> <NA>
## 149 6.2 3.4 5.4 2.3 virginica <NA> <NA>
## 150 5.9 3.0 5.1 1.8 virginica <NA> <NA>
1.1.11 Recode/relabel data
Recoding is commonly used to standardize categories, shorten labels, or map source system codes to analysis-friendly values.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
## 1 5.1 3.5 1.4 0.2 setosa seto
## 2 4.9 3.0 1.4 0.2 setosa seto
## 3 4.7 3.2 1.3 0.2 setosa seto
## 4 4.6 3.1 1.5 0.2 setosa seto
## 5 5.0 3.6 1.4 0.2 setosa seto
## 6 5.4 3.9 1.7 0.4 setosa seto
## 7 4.6 3.4 1.4 0.3 setosa seto
## 8 5.0 3.4 1.5 0.2 setosa seto
## 9 4.4 2.9 1.4 0.2 setosa seto
## 10 4.9 3.1 1.5 0.1 setosa seto
## 11 5.4 3.7 1.5 0.2 setosa seto
## 12 4.8 3.4 1.6 0.2 setosa seto
## 13 4.8 3.0 1.4 0.1 setosa seto
## 14 4.3 3.0 1.1 0.1 setosa seto
## 15 5.8 4.0 1.2 0.2 setosa seto
## 16 5.7 4.4 1.5 0.4 setosa seto
## 17 5.4 3.9 1.3 0.4 setosa seto
## 18 5.1 3.5 1.4 0.3 setosa seto
## 19 5.7 3.8 1.7 0.3 setosa seto
## 20 5.1 3.8 1.5 0.3 setosa seto
## 21 5.4 3.4 1.7 0.2 setosa seto
## 22 5.1 3.7 1.5 0.4 setosa seto
## 23 4.6 3.6 1.0 0.2 setosa seto
## 24 5.1 3.3 1.7 0.5 setosa seto
## 25 4.8 3.4 1.9 0.2 setosa seto
## 26 5.0 3.0 1.6 0.2 setosa seto
## 27 5.0 3.4 1.6 0.4 setosa seto
## 28 5.2 3.5 1.5 0.2 setosa seto
## 29 5.2 3.4 1.4 0.2 setosa seto
## 30 4.7 3.2 1.6 0.2 setosa seto
## 31 4.8 3.1 1.6 0.2 setosa seto
## 32 5.4 3.4 1.5 0.4 setosa seto
## 33 5.2 4.1 1.5 0.1 setosa seto
## 34 5.5 4.2 1.4 0.2 setosa seto
## 35 4.9 3.1 1.5 0.2 setosa seto
## 36 5.0 3.2 1.2 0.2 setosa seto
## 37 5.5 3.5 1.3 0.2 setosa seto
## 38 4.9 3.6 1.4 0.1 setosa seto
## 39 4.4 3.0 1.3 0.2 setosa seto
## 40 5.1 3.4 1.5 0.2 setosa seto
## 41 5.0 3.5 1.3 0.3 setosa seto
## 42 4.5 2.3 1.3 0.3 setosa seto
## 43 4.4 3.2 1.3 0.2 setosa seto
## 44 5.0 3.5 1.6 0.6 setosa seto
## 45 5.1 3.8 1.9 0.4 setosa seto
## 46 4.8 3.0 1.4 0.3 setosa seto
## 47 5.1 3.8 1.6 0.2 setosa seto
## 48 4.6 3.2 1.4 0.2 setosa seto
## 49 5.3 3.7 1.5 0.2 setosa seto
## 50 5.0 3.3 1.4 0.2 setosa seto
## 51 7.0 3.2 4.7 1.4 versicolor versi
## 52 6.4 3.2 4.5 1.5 versicolor versi
## 53 6.9 3.1 4.9 1.5 versicolor versi
## 54 5.5 2.3 4.0 1.3 versicolor versi
## 55 6.5 2.8 4.6 1.5 versicolor versi
## 56 5.7 2.8 4.5 1.3 versicolor versi
## 57 6.3 3.3 4.7 1.6 versicolor versi
## 58 4.9 2.4 3.3 1.0 versicolor versi
## 59 6.6 2.9 4.6 1.3 versicolor versi
## 60 5.2 2.7 3.9 1.4 versicolor versi
## 61 5.0 2.0 3.5 1.0 versicolor versi
## 62 5.9 3.0 4.2 1.5 versicolor versi
## 63 6.0 2.2 4.0 1.0 versicolor versi
## 64 6.1 2.9 4.7 1.4 versicolor versi
## 65 5.6 2.9 3.6 1.3 versicolor versi
## 66 6.7 3.1 4.4 1.4 versicolor versi
## 67 5.6 3.0 4.5 1.5 versicolor versi
## 68 5.8 2.7 4.1 1.0 versicolor versi
## 69 6.2 2.2 4.5 1.5 versicolor versi
## 70 5.6 2.5 3.9 1.1 versicolor versi
## 71 5.9 3.2 4.8 1.8 versicolor versi
## 72 6.1 2.8 4.0 1.3 versicolor versi
## 73 6.3 2.5 4.9 1.5 versicolor versi
## 74 6.1 2.8 4.7 1.2 versicolor versi
## 75 6.4 2.9 4.3 1.3 versicolor versi
## 76 6.6 3.0 4.4 1.4 versicolor versi
## 77 6.8 2.8 4.8 1.4 versicolor versi
## 78 6.7 3.0 5.0 1.7 versicolor versi
## 79 6.0 2.9 4.5 1.5 versicolor versi
## 80 5.7 2.6 3.5 1.0 versicolor versi
## 81 5.5 2.4 3.8 1.1 versicolor versi
## 82 5.5 2.4 3.7 1.0 versicolor versi
## 83 5.8 2.7 3.9 1.2 versicolor versi
## 84 6.0 2.7 5.1 1.6 versicolor versi
## 85 5.4 3.0 4.5 1.5 versicolor versi
## 86 6.0 3.4 4.5 1.6 versicolor versi
## 87 6.7 3.1 4.7 1.5 versicolor versi
## 88 6.3 2.3 4.4 1.3 versicolor versi
## 89 5.6 3.0 4.1 1.3 versicolor versi
## 90 5.5 2.5 4.0 1.3 versicolor versi
## 91 5.5 2.6 4.4 1.2 versicolor versi
## 92 6.1 3.0 4.6 1.4 versicolor versi
## 93 5.8 2.6 4.0 1.2 versicolor versi
## 94 5.0 2.3 3.3 1.0 versicolor versi
## 95 5.6 2.7 4.2 1.3 versicolor versi
## 96 5.7 3.0 4.2 1.2 versicolor versi
## 97 5.7 2.9 4.2 1.3 versicolor versi
## 98 6.2 2.9 4.3 1.3 versicolor versi
## 99 5.1 2.5 3.0 1.1 versicolor versi
## 100 5.7 2.8 4.1 1.3 versicolor versi
## 101 6.3 3.3 6.0 2.5 virginica virginica
## 102 5.8 2.7 5.1 1.9 virginica virginica
## 103 7.1 3.0 5.9 2.1 virginica virginica
## 104 6.3 2.9 5.6 1.8 virginica virginica
## 105 6.5 3.0 5.8 2.2 virginica virginica
## 106 7.6 3.0 6.6 2.1 virginica virginica
## 107 4.9 2.5 4.5 1.7 virginica virginica
## 108 7.3 2.9 6.3 1.8 virginica virginica
## 109 6.7 2.5 5.8 1.8 virginica virginica
## 110 7.2 3.6 6.1 2.5 virginica virginica
## 111 6.5 3.2 5.1 2.0 virginica virginica
## 112 6.4 2.7 5.3 1.9 virginica virginica
## 113 6.8 3.0 5.5 2.1 virginica virginica
## 114 5.7 2.5 5.0 2.0 virginica virginica
## 115 5.8 2.8 5.1 2.4 virginica virginica
## 116 6.4 3.2 5.3 2.3 virginica virginica
## 117 6.5 3.0 5.5 1.8 virginica virginica
## 118 7.7 3.8 6.7 2.2 virginica virginica
## 119 7.7 2.6 6.9 2.3 virginica virginica
## 120 6.0 2.2 5.0 1.5 virginica virginica
## 121 6.9 3.2 5.7 2.3 virginica virginica
## 122 5.6 2.8 4.9 2.0 virginica virginica
## 123 7.7 2.8 6.7 2.0 virginica virginica
## 124 6.3 2.7 4.9 1.8 virginica virginica
## 125 6.7 3.3 5.7 2.1 virginica virginica
## 126 7.2 3.2 6.0 1.8 virginica virginica
## 127 6.2 2.8 4.8 1.8 virginica virginica
## 128 6.1 3.0 4.9 1.8 virginica virginica
## 129 6.4 2.8 5.6 2.1 virginica virginica
## 130 7.2 3.0 5.8 1.6 virginica virginica
## 131 7.4 2.8 6.1 1.9 virginica virginica
## 132 7.9 3.8 6.4 2.0 virginica virginica
## 133 6.4 2.8 5.6 2.2 virginica virginica
## 134 6.3 2.8 5.1 1.5 virginica virginica
## 135 6.1 2.6 5.6 1.4 virginica virginica
## 136 7.7 3.0 6.1 2.3 virginica virginica
## 137 6.3 3.4 5.6 2.4 virginica virginica
## 138 6.4 3.1 5.5 1.8 virginica virginica
## 139 6.0 3.0 4.8 1.8 virginica virginica
## 140 6.9 3.1 5.4 2.1 virginica virginica
## 141 6.7 3.1 5.6 2.4 virginica virginica
## 142 6.9 3.1 5.1 2.3 virginica virginica
## 143 5.8 2.7 5.1 1.9 virginica virginica
## 144 6.8 3.2 5.9 2.3 virginica virginica
## 145 6.7 3.3 5.7 2.5 virginica virginica
## 146 6.7 3.0 5.2 2.3 virginica virginica
## 147 6.3 2.5 5.0 1.9 virginica virginica
## 148 6.5 3.0 5.2 2.0 virginica virginica
## 149 6.2 3.4 5.4 2.3 virginica virginica
## 150 5.9 3.0 5.1 1.8 virginica virginica
1.1.12 deduplication
Duplicate rows can occur due to data entry issues, merges that unintentionally replicate records, or repeated extracts.
A typical cleaning step is to either remove duplicates or flag them for review.
1.1.12.1 Complete duplicate row
Complete duplicates are rows that match across all variables.
distinct() keeps only unique rows.
df <- tribble(
~id, ~day, ~x,
1, 1, 1,
1, 2, 11,
2, 1, 12,
2, 2, 13,
2, 2, 14,
3, 1, 12,
3, 1, 12,
3, 2, 13,
4, 1, 13,
5, 1, 10,
5, 2, 11,
5, 1, 10
)
df %>%
distinct()## # A tibble: 10 × 3
## id day x
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 1 2 11
## 3 2 1 12
## 4 2 2 13
## 5 2 2 14
## 6 3 1 12
## 7 3 2 13
## 8 4 1 13
## 9 5 1 10
## 10 5 2 11
mark duplication
The duplicated() function identifies whether each row has appeared previously. This is useful when you want to keep the full dataset but flag duplicates for investigation.
## # A tibble: 12 × 4
## id day x dup
## <dbl> <dbl> <dbl> <lgl>
## 1 1 1 1 FALSE
## 2 1 2 11 FALSE
## 3 2 1 12 FALSE
## 4 2 2 13 FALSE
## 5 2 2 14 FALSE
## 6 3 1 12 FALSE
## 7 3 1 12 TRUE
## 8 3 2 13 FALSE
## 9 4 1 13 FALSE
## 10 5 1 10 FALSE
## 11 5 2 11 FALSE
## 12 5 1 10 TRUE
This approach creates a row counter within identical rows and flags the second and later occurrences as duplicates.
## # A tibble: 12 × 5
## # Groups: id, day, x [10]
## id day x n_row dup
## <dbl> <dbl> <dbl> <int> <lgl>
## 1 1 1 1 1 FALSE
## 2 1 2 11 1 FALSE
## 3 2 1 12 1 FALSE
## 4 2 2 13 1 FALSE
## 5 2 2 14 1 FALSE
## 6 3 1 12 1 FALSE
## 7 3 1 12 2 TRUE
## 8 3 2 13 1 FALSE
## 9 4 1 13 1 FALSE
## 10 5 1 10 1 FALSE
## 11 5 2 11 1 FALSE
## 12 5 1 10 2 TRUE
1.1.12.2 Partial duplicate rows
Sometimes duplication is defined only by a key (e.g., subject + visit), not by all columns.
Here we keep only the first record for each id, day pair and keep all columns (.keep_all = TRUE).
## # A tibble: 9 × 3
## id day x
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 1 2 11
## 3 2 1 12
## 4 2 2 13
## 5 3 1 12
## 6 3 2 13
## 7 4 1 13
## 8 5 1 10
## 9 5 2 11
Another way is to explicitly count duplicates within the key groups. This is useful if you want to decide how to handle duplicates (e.g., keep latest, average, or review manually).
df %>%
group_by(id, day) %>%
mutate(
count = row_number(), # Counts rows by group
dup = count > 1 # TRUE if there is more than one row per group
)## # A tibble: 12 × 5
## # Groups: id, day [9]
## id day x count dup
## <dbl> <dbl> <dbl> <int> <lgl>
## 1 1 1 1 1 FALSE
## 2 1 2 11 1 FALSE
## 3 2 1 12 1 FALSE
## 4 2 2 13 1 FALSE
## 5 2 2 14 2 TRUE
## 6 3 1 12 1 FALSE
## 7 3 1 12 2 TRUE
## 8 3 2 13 1 FALSE
## 9 4 1 13 1 FALSE
## 10 5 1 10 1 FALSE
## 11 5 2 11 1 FALSE
## 12 5 1 10 2 TRUE
select duplicates
This returns only the rows that are duplicated (including both the first and later occurrences).
## # A tibble: 4 × 4
## id day x dup
## <dbl> <dbl> <dbl> <lgl>
## 1 3 1 12 TRUE
## 2 3 1 12 TRUE
## 3 5 1 10 TRUE
## 4 5 1 10 TRUE
1.1.13 Combine data sets
Combining datasets is a core step in analysis workflows. Common scenarios include:
- merging demographics with outcomes
- attaching treatment assignment to longitudinal records
- linking subject-level and visit-level datasets
Below we demonstrate common join types in dplyr. The key concept is to be explicit about which table is the “left” table and which key variables define a match.
- prepare data sets
data1 <- data.frame(ID = 1:4,
X1 = c("a1", "a2","a3", "a4"),
stringsAsFactors = FALSE)
data2 <- data.frame(ID = 2:5,
X2 = c("b1", "b2","b3", "b4"),
stringsAsFactors = FALSE)- inner join
Keeps only IDs that appear in both datasets.
## ID X1 X2
## 1 2 a2 b1
## 2 3 a3 b2
## 3 4 a4 b3
- left join
Keeps all rows fromdata1and attaches matches fromdata2. Unmatched rows get missing values fordata2variables.
there is not ID 1 in data2, thence, x2 is missing.
## ID X1 X2
## 1 1 a1 <NA>
## 2 2 a2 b1
## 3 3 a3 b2
## 4 4 a4 b3
differing key col names
When key columns have different names across datasets, you can map them with by = c("left_key"="right_key").
## ID X1 X2
## 1 1 a1 <NA>
## 2 2 a2 b1
## 3 3 a3 b2
## 4 4 a4 b3
One-to-many relationship has the same statement
When the right table has multiple rows per key, the join will expand rows in the left table accordingly. This is expected but should be checked carefully in real analyses.
## ID X1 X2
## 1 1 a1 <NA>
## 2 2 a2 b1
## 3 3 a3 b2
## 4 4 a4 b3
multiple data frames
You can chain joins to attach multiple datasets step by step. In practice, verify keys and row counts after each join.
## ID X1 X2.x X2.y
## 1 1 a1 <NA> <NA>
## 2 2 a2 b1 b1
## 3 3 a3 b2 b2
## 4 4 a4 b3 b3
multiple key values
In many clinical datasets, joins use multiple keys such as subject ID and visit number.
- right join
Keeps all rows from the right table. This is less common in practice, because most workflows start from a primary “analysis dataset” on the left.
## ID X1 X2
## 1 2 a2 b1
## 2 3 a3 b2
## 3 4 a4 b3
## 4 5 <NA> b4
- full join
Keeps all rows from both tables. This is useful for reconciliation or when building a master key list.
## ID X1 X2
## 1 1 a1 <NA>
## 2 2 a2 b1
## 3 3 a3 b2
## 4 4 a4 b3
## 5 5 <NA> b4
- keep cases of left data table without in right data table
anti_join()is useful for checking which keys did not match, which is a common QC step after merges.
## ID X1
## 1 1 a1
- keep cases of left data table in right data table
semi_join()keeps rows from the left table that have a match in the right table, without bringing in right-table columns.
## ID X1
## 1 2 a2
## 2 3 a3
## 3 4 a4
- multiple full join
This demonstrates that joins can be chained. In practice, consider whether repeated joins are intended and check for row multiplication.
## ID X1 X2.x X2.y
## 1 1 a1 <NA> <NA>
## 2 2 a2 b1 b1
## 3 3 a3 b2 b2
## 4 4 a4 b3 b3
## 5 5 <NA> b4 b4
- append two data tables by using join and merge
Base Rmerge()is still commonly seen in legacy code.all = TRUEcorresponds to a full join.
data_frame1 <- data.frame(col1 = c(6:8),
col2 = letters[1:3],
col3 = c(1,4,NA))
data_frame2 <- data.frame(col1 = c(5:6),
col5 = letters[7:8])
data_frame_merge <- merge(data_frame1, data_frame2,
by = 'col1', all = TRUE)
print (data_frame_merge)## col1 col2 col3 col5
## 1 5 <NA> NA g
## 2 6 a 1 h
## 3 7 b 4 <NA>
## 4 8 c NA <NA>
This is the tidyverse equivalent of the merge above.
## col1 col2 col3 col5
## 1 6 a 1 h
## 2 7 b 4 <NA>
## 3 8 c NA <NA>
## 4 5 <NA> NA g
- adding rows using
bind_ function
Row-binding is used when two datasets have the same meaning (same variable definitions) and you want to stack them.
In practice, this is common when appending data from multiple sites, multiple batches, or multiple extracts.
rbind doesn’t work
Base R rbind() requires the same column names and order.
bind_ function conbines the data framesbased on column names; having our columns in a different order isn’t a problem.
df1 <- data.frame(col1 = LETTERS[1:6],
col2a = c(5:10),
col3a = TRUE)
df2 <- data.frame(col1 = LETTERS[4:8],
col2b= c(4:8),
col3b = FALSE)
# rbind(df1,df2)
df1 %>%
bind_rows(df2)## col1 col2a col3a col2b col3b
## 1 A 5 TRUE NA NA
## 2 B 6 TRUE NA NA
## 3 C 7 TRUE NA NA
## 4 D 8 TRUE NA NA
## 5 E 9 TRUE NA NA
## 6 F 10 TRUE NA NA
## 7 D NA NA 4 FALSE
## 8 E NA NA 5 FALSE
## 9 F NA NA 6 FALSE
## 10 G NA NA 7 FALSE
## 11 H NA NA 8 FALSE
rename the differing cols
Before binding rows, make sure variable names represent the same concept. Here we align names so that columns stack correctly.
## col1 col2a col3a
## 1 A 5 TRUE
## 2 B 6 TRUE
## 3 C 7 TRUE
## 4 D 8 TRUE
## 5 E 9 TRUE
## 6 F 10 TRUE
## 7 D 4 FALSE
## 8 E 5 FALSE
## 9 F 6 FALSE
## 10 G 7 FALSE
## 11 H 8 FALSE
combining more than 2 data frames
You can bind multiple datasets at once. In practice, you may bind a list of datasets after validating that they share the same structure.
## col1 col2a col3a col2b col3b
## 1 A 5 TRUE NA NA
## 2 B 6 TRUE NA NA
## 3 C 7 TRUE NA NA
## 4 D 8 TRUE NA NA
## 5 E 9 TRUE NA NA
## 6 F 10 TRUE NA NA
## 7 D NA NA 4 FALSE
## 8 E NA NA 5 FALSE
## 9 F NA NA 6 FALSE
## 10 G NA NA 7 FALSE
## 11 H NA NA 8 FALSE
## 12 D NA NA 4 FALSE
## 13 E NA NA 5 FALSE
## 14 F NA NA 6 FALSE
## 15 G NA NA 7 FALSE
## 16 H NA NA 8 FALSE
1.1.14 Working with character strings
Character string cleaning is common in real datasets: names, sites, lab test labels, free-text fields, and coded fields may contain inconsistent capitalization, extra spaces, or punctuation.
The stringr package provides consistent and vectorized string operations, which fit naturally in dplyr pipelines.
- Look at the values
A good first step is to inspect the raw values and identify patterns (e.g., trailing commas, mixed case, inconsistent coding).
ehr <- read_excel("C:\\Users\\hed2\\Downloads\\others\\mybook2\\mybook2/excel.xlsx")
ehr %>%
arrange(sex) %>%
pull(sex)## [1] "Female" "Female" "Male" "Male"
- Coerce to lowercase
Standardizing case helps reduce duplication (e.g.,"Male"vs"MALE").
## [1] "female" "female" "male" "male"
- Coerce to upper case
Uppercase is sometimes used for code lists or standardized labels.
## [1] "FEMALE" "FEMALE" "MALE" "MALE"
- Title case
Title case improves readability for display fields such as names.
## [1] "Female" "Female" "Male" "Male"
- Sentence case
Sentence case is useful for free text or display labels.
## [1] "Female" "Female" "Male" "Male"
- Trim white space from the beginning and end
Extra spaces are very common and can break joins or comparisons if not removed.
## [1] "Ryan Edwards"
- Remove the comma
This shows a simple string replacement.
## [1] "weston fox"
Applying the replacement to a dataset column is a typical cleaning step.
## # A tibble: 4 × 4
## ID sex ht_in wgt_lbs
## <chr> <chr> <dbl> <dbl>
## 1 001 Male 71 190
## 2 002 Male 69 176
## 3 003 Female 64 130
## 4 004 Female 65 154
- Separate values into component parts
Regular expressions allow flexible extraction of patterns within strings. This is powerful for cleaning IDs, parsing codes, or extracting components from composite variables.
regular express, ^\\w+ look for one or more consecutive word characters at the start of the character string and extract them.
## [1] "zariah"
## [1] "hernandez"
str_match() returns captured groups based on parentheses in the pattern. This is useful when you need to parse structured expressions.
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "(weight_3)" "(" "weight" "_" "3" ")"
- Detect a special string to category
str_detect()returns TRUE/FALSE based on whether the pattern is present. This is commonly used to create indicator variables.
index a special string
ehr %>%
mutate(
man = str_detect(sex, "F"),
woman = str_detect(sex, "M"),
all = str_detect(sex, "ale")
) %>%
mutate(man_dummy = as.numeric(man))## # A tibble: 4 × 8
## ID sex ht_in wgt_lbs man woman all man_dummy
## <chr> <chr> <dbl> <dbl> <lgl> <lgl> <lgl> <dbl>
## 1 001 Male 71 190 FALSE TRUE TRUE 0
## 2 002 Male 69 176 FALSE TRUE TRUE 0
## 3 003 Female 64 130 TRUE FALSE TRUE 1
## 4 004 Female 65 154 TRUE FALSE TRUE 1
This is a similar example using ifelse() to create a numeric dummy variable. In practice, if_else() is preferred because it is stricter about types, but both patterns are frequently seen.
## # A tibble: 4 × 5
## ID sex ht_in wgt_lbs dummy
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 001 Male 71 190 1
## 2 002 Male 69 176 1
## 3 003 Female 64 130 0
## 4 004 Female 65 154 0
1.1.15 Conditional operations
Conditional logic is used to define categories, flags, and derived endpoints. In clinical and observational data, this is often required for defining analysis populations, endpoint derivations, or clinical classifications.
- Testing multiple conditions simultaneously
Here we construct a toy blood pressure dataset and classify observations based on systolic and diastolic thresholds.
blood_pressure <- tibble(
id = 1:10,
sysbp = c(152, 120, 119, 123, 135, 83, 191, 147, 209, 166),
diasbp = c(78, 60, 88, 76, 85, 54, 116, 95, 100, 106)
) %>%
print()## # A tibble: 10 × 3
## id sysbp diasbp
## <int> <dbl> <dbl>
## 1 1 152 78
## 2 2 120 60
## 3 3 119 88
## 4 4 123 76
## 5 5 135 85
## 6 6 83 54
## 7 7 191 116
## 8 8 147 95
## 9 9 209 100
## 10 10 166 106
if_else() is a tidyverse version of ifelse() and is often preferred for clearer type handling.
## # A tibble: 10 × 4
## id sysbp diasbp bp
## <int> <dbl> <dbl> <chr>
## 1 1 152 78 Not Normal
## 2 2 120 60 Not Normal
## 3 3 119 88 Not Normal
## 4 4 123 76 Not Normal
## 5 5 135 85 Not Normal
## 6 6 83 54 Normal
## 7 7 191 116 Not Normal
## 8 8 147 95 Not Normal
## 9 9 209 100 Not Normal
## 10 10 166 106 Not Normal
- Using case_when function
case_when()is the standard approach for multiple mutually exclusive conditions. It improves readability compared with nestedif_else()statements.
blood_pressure %>%
mutate(
bp = case_when(
sysbp < 120 & diasbp < 80 ~ "Normal",
sysbp >= 120 & sysbp < 130 & diasbp < 80 ~ "Elevated",
sysbp >= 130 & sysbp < 140 | diasbp >= 80 & diasbp < 90 ~ "Hypertension Stage 1",
sysbp >= 140 | diasbp >= 90 ~ "Hypertension Stage 2"
)
)## # A tibble: 10 × 4
## id sysbp diasbp bp
## <int> <dbl> <dbl> <chr>
## 1 1 152 78 Hypertension Stage 2
## 2 2 120 60 Elevated
## 3 3 119 88 Hypertension Stage 1
## 4 4 123 76 Elevated
## 5 5 135 85 Hypertension Stage 1
## 6 6 83 54 Normal
## 7 7 191 116 Hypertension Stage 2
## 8 8 147 95 Hypertension Stage 2
## 9 9 209 100 Hypertension Stage 2
## 10 10 166 106 Hypertension Stage 2
- Recoding variables
Sometimes you want a numeric coding for modeling, and a labeled factor for interpretability in tables. Here we create both in one step.
blood_pressure %>%
mutate(
bp= case_when(
sysbp < 120 & diasbp < 80 ~ 1,
sysbp >= 120 & sysbp < 130 & diasbp < 80 ~ 2,
sysbp >= 130 & sysbp < 140 | diasbp >= 80 & diasbp < 90 ~ 3,
sysbp >= 140 | diasbp >= 90 ~ 4
)
,
bp_f = factor( bp,
labels = c(
"normal", "elevated", "stage 1",
"stage 2")
)
)## # A tibble: 10 × 5
## id sysbp diasbp bp bp_f
## <int> <dbl> <dbl> <dbl> <fct>
## 1 1 152 78 4 stage 2
## 2 2 120 60 2 elevated
## 3 3 119 88 3 stage 1
## 4 4 123 76 2 elevated
## 5 5 135 85 3 stage 1
## 6 6 83 54 1 normal
## 7 7 191 116 4 stage 2
## 8 8 147 95 4 stage 2
## 9 9 209 100 4 stage 2
## 10 10 166 106 4 stage 2
- Recoding missing
Missing values are common in real datasets. A typical step is to recode special numeric values (e.g., 7, 9, 99) to missing.
using NA_real instead of NA.
NA_real_ ensures the result stays numeric (double). This avoids unintended type conversion.
demographics <- tibble(
race = c(1, 2, 1, 4, 7, 1, 2, 9, 1, 3),
hispanic = c(7, 0, 1, 0, 1, 0, 1, 9, 0, 1)
)
demographics %>%
mutate(
# Recode 7 and 9 to missing
race_recode = if_else(race == 7 | race == 9, NA_real_, race),
hispanic_recode = if_else(hispanic == 7 | hispanic == 9, NA_real_, hispanic)
) ## # A tibble: 10 × 4
## race hispanic race_recode hispanic_recode
## <dbl> <dbl> <dbl> <dbl>
## 1 1 7 1 NA
## 2 2 0 2 0
## 3 1 1 1 1
## 4 4 0 4 0
## 5 7 1 NA 1
## 6 1 0 1 0
## 7 2 1 2 1
## 8 9 9 NA NA
## 9 1 0 1 0
## 10 3 1 3 1
check whether or not
Missing values can propagate through logical checks. This pattern shows one way to replace missing logical values with FALSE.
## # A tibble: 10 × 5
## id sysbp diasbp match n_match
## <int> <dbl> <dbl> <lgl> <lgl>
## 1 1 152 78 TRUE TRUE
## 2 2 120 60 TRUE TRUE
## 3 3 119 88 TRUE TRUE
## 4 4 123 76 TRUE TRUE
## 5 5 135 85 TRUE TRUE
## 6 6 83 54 TRUE TRUE
## 7 7 191 116 TRUE TRUE
## 8 8 147 95 TRUE TRUE
## 9 9 209 100 TRUE TRUE
## 10 10 166 106 TRUE TRUE
1.2 How to do aggregation/ summarization
Aggregation is used to produce summary statistics for reporting and exploratory analysis. Typical examples include baseline summaries by treatment, counts by category, and descriptive statistics by subgroup.
1.2.1 Summarization after grouping
We use group_by() to define groups, then summarise() (or summarize()) to compute summary statistics within each group.
This chunk is hidden but ensures the package is available for the following code blocks in some knitted settings.
This example calculates the mean Sepal.Length for each species, then sorts from highest to lowest.
iris %>%
group_by(Species) %>%
summarize(Support = mean(Sepal.Length)) %>% # average
arrange(-Support) # sort## # A tibble: 3 × 2
## Species Support
## <fct> <dbl>
## 1 virginica 6.59
## 2 versicolor 5.94
## 3 setosa 5.01
This example shows how to compute multiple summary statistics at once. Creating intermediate metrics (like diff) is useful for quick comparisons across groups.
iris %>%
group_by(Species) %>%
summarize(mean_s = mean(Sepal.Width),
meas_p = mean(Petal.Length),
diff = mean(Sepal.Width-Petal.Length)) %>%
arrange(-diff) ## # A tibble: 3 × 4
## Species mean_s meas_p diff
## <fct> <dbl> <dbl> <dbl>
## 1 setosa 3.43 1.46 1.97
## 2 versicolor 2.77 4.26 -1.49
## 3 virginica 2.97 5.55 -2.58
This example adds sample size (n) and standard deviation (sd). These are standard descriptive summaries for continuous variables.
iris %>%
group_by(Species) %>%
summarize(n = n(),
meas_p = mean(Petal.Length),
sd = sd(Petal.Length)) ## # A tibble: 3 × 4
## Species n meas_p sd
## <fct> <int> <dbl> <dbl>
## 1 setosa 50 1.46 0.174
## 2 versicolor 50 4.26 0.470
## 3 virginica 50 5.55 0.552
1.2.2 Summarization with upgroup
If you have grouped data but want overall summaries across all rows, you can remove the grouping structure using ungroup().
## n meas_p sd
## 1 150 3.758 1.765298
1.2.3 Mutate new variables after grouping
mutate() after group_by() adds group-level summaries back to each row. This is useful when you want both row-level values and group-level context in the same dataset (e.g., for plotting or QC checks).
## # A tibble: 150 × 8
## # Groups: Species [3]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species n meas_p sd
## <dbl> <dbl> <dbl> <dbl> <fct> <int> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 50 1.46 0.174
## 2 4.9 3 1.4 0.2 setosa 50 1.46 0.174
## 3 4.7 3.2 1.3 0.2 setosa 50 1.46 0.174
## 4 4.6 3.1 1.5 0.2 setosa 50 1.46 0.174
## 5 5 3.6 1.4 0.2 setosa 50 1.46 0.174
## 6 5.4 3.9 1.7 0.4 setosa 50 1.46 0.174
## 7 4.6 3.4 1.4 0.3 setosa 50 1.46 0.174
## 8 5 3.4 1.5 0.2 setosa 50 1.46 0.174
## 9 4.4 2.9 1.4 0.2 setosa 50 1.46 0.174
## 10 4.9 3.1 1.5 0.1 setosa 50 1.46 0.174
## # ℹ 140 more rows
This example demonstrates:
- computing group-level summaries with missing handling (na.rm = T), and
- then summarizing again to produce one row per group.
In practice, this pattern appears in QC workflows and layered summary derivations.
iris %>%
group_by(Species) %>%
mutate(n = n(),
meas_p = mean(Petal.Length, na.rm = T),
sd = sd(Petal.Length)) %>%
summarize (n_mean = paste ("sample size:",mean(n)),
meas_p = mean(Petal.Length),
sd = sd(Petal.Length))## # A tibble: 3 × 4
## Species n_mean meas_p sd
## <fct> <chr> <dbl> <dbl>
## 1 setosa sample size: 50 1.46 0.174
## 2 versicolor sample size: 50 4.26 0.470
## 3 virginica sample size: 50 5.55 0.552
1.2.4 Recode and generate new variables, then value label
This is a common pattern: create a derived variable with conditional logic, then attach labels for interpretability.
Here, we set one category to missing and then relabel remaining categories using a factor.
irisifelse <- iris %>%
mutate(Species2 = ifelse(Species == "setosa", NA, Species))
# relabel values
irisifelse$Species2 <- factor(irisifelse$Species2,labels = c( "versi","virg"))
irisifelse## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
## 1 5.1 3.5 1.4 0.2 setosa <NA>
## 2 4.9 3.0 1.4 0.2 setosa <NA>
## 3 4.7 3.2 1.3 0.2 setosa <NA>
## 4 4.6 3.1 1.5 0.2 setosa <NA>
## 5 5.0 3.6 1.4 0.2 setosa <NA>
## 6 5.4 3.9 1.7 0.4 setosa <NA>
## 7 4.6 3.4 1.4 0.3 setosa <NA>
## 8 5.0 3.4 1.5 0.2 setosa <NA>
## 9 4.4 2.9 1.4 0.2 setosa <NA>
## 10 4.9 3.1 1.5 0.1 setosa <NA>
## 11 5.4 3.7 1.5 0.2 setosa <NA>
## 12 4.8 3.4 1.6 0.2 setosa <NA>
## 13 4.8 3.0 1.4 0.1 setosa <NA>
## 14 4.3 3.0 1.1 0.1 setosa <NA>
## 15 5.8 4.0 1.2 0.2 setosa <NA>
## 16 5.7 4.4 1.5 0.4 setosa <NA>
## 17 5.4 3.9 1.3 0.4 setosa <NA>
## 18 5.1 3.5 1.4 0.3 setosa <NA>
## 19 5.7 3.8 1.7 0.3 setosa <NA>
## 20 5.1 3.8 1.5 0.3 setosa <NA>
## 21 5.4 3.4 1.7 0.2 setosa <NA>
## 22 5.1 3.7 1.5 0.4 setosa <NA>
## 23 4.6 3.6 1.0 0.2 setosa <NA>
## 24 5.1 3.3 1.7 0.5 setosa <NA>
## 25 4.8 3.4 1.9 0.2 setosa <NA>
## 26 5.0 3.0 1.6 0.2 setosa <NA>
## 27 5.0 3.4 1.6 0.4 setosa <NA>
## 28 5.2 3.5 1.5 0.2 setosa <NA>
## 29 5.2 3.4 1.4 0.2 setosa <NA>
## 30 4.7 3.2 1.6 0.2 setosa <NA>
## 31 4.8 3.1 1.6 0.2 setosa <NA>
## 32 5.4 3.4 1.5 0.4 setosa <NA>
## 33 5.2 4.1 1.5 0.1 setosa <NA>
## 34 5.5 4.2 1.4 0.2 setosa <NA>
## 35 4.9 3.1 1.5 0.2 setosa <NA>
## 36 5.0 3.2 1.2 0.2 setosa <NA>
## 37 5.5 3.5 1.3 0.2 setosa <NA>
## 38 4.9 3.6 1.4 0.1 setosa <NA>
## 39 4.4 3.0 1.3 0.2 setosa <NA>
## 40 5.1 3.4 1.5 0.2 setosa <NA>
## 41 5.0 3.5 1.3 0.3 setosa <NA>
## 42 4.5 2.3 1.3 0.3 setosa <NA>
## 43 4.4 3.2 1.3 0.2 setosa <NA>
## 44 5.0 3.5 1.6 0.6 setosa <NA>
## 45 5.1 3.8 1.9 0.4 setosa <NA>
## 46 4.8 3.0 1.4 0.3 setosa <NA>
## 47 5.1 3.8 1.6 0.2 setosa <NA>
## 48 4.6 3.2 1.4 0.2 setosa <NA>
## 49 5.3 3.7 1.5 0.2 setosa <NA>
## 50 5.0 3.3 1.4 0.2 setosa <NA>
## 51 7.0 3.2 4.7 1.4 versicolor versi
## 52 6.4 3.2 4.5 1.5 versicolor versi
## 53 6.9 3.1 4.9 1.5 versicolor versi
## 54 5.5 2.3 4.0 1.3 versicolor versi
## 55 6.5 2.8 4.6 1.5 versicolor versi
## 56 5.7 2.8 4.5 1.3 versicolor versi
## 57 6.3 3.3 4.7 1.6 versicolor versi
## 58 4.9 2.4 3.3 1.0 versicolor versi
## 59 6.6 2.9 4.6 1.3 versicolor versi
## 60 5.2 2.7 3.9 1.4 versicolor versi
## 61 5.0 2.0 3.5 1.0 versicolor versi
## 62 5.9 3.0 4.2 1.5 versicolor versi
## 63 6.0 2.2 4.0 1.0 versicolor versi
## 64 6.1 2.9 4.7 1.4 versicolor versi
## 65 5.6 2.9 3.6 1.3 versicolor versi
## 66 6.7 3.1 4.4 1.4 versicolor versi
## 67 5.6 3.0 4.5 1.5 versicolor versi
## 68 5.8 2.7 4.1 1.0 versicolor versi
## 69 6.2 2.2 4.5 1.5 versicolor versi
## 70 5.6 2.5 3.9 1.1 versicolor versi
## 71 5.9 3.2 4.8 1.8 versicolor versi
## 72 6.1 2.8 4.0 1.3 versicolor versi
## 73 6.3 2.5 4.9 1.5 versicolor versi
## 74 6.1 2.8 4.7 1.2 versicolor versi
## 75 6.4 2.9 4.3 1.3 versicolor versi
## 76 6.6 3.0 4.4 1.4 versicolor versi
## 77 6.8 2.8 4.8 1.4 versicolor versi
## 78 6.7 3.0 5.0 1.7 versicolor versi
## 79 6.0 2.9 4.5 1.5 versicolor versi
## 80 5.7 2.6 3.5 1.0 versicolor versi
## 81 5.5 2.4 3.8 1.1 versicolor versi
## 82 5.5 2.4 3.7 1.0 versicolor versi
## 83 5.8 2.7 3.9 1.2 versicolor versi
## 84 6.0 2.7 5.1 1.6 versicolor versi
## 85 5.4 3.0 4.5 1.5 versicolor versi
## 86 6.0 3.4 4.5 1.6 versicolor versi
## 87 6.7 3.1 4.7 1.5 versicolor versi
## 88 6.3 2.3 4.4 1.3 versicolor versi
## 89 5.6 3.0 4.1 1.3 versicolor versi
## 90 5.5 2.5 4.0 1.3 versicolor versi
## 91 5.5 2.6 4.4 1.2 versicolor versi
## 92 6.1 3.0 4.6 1.4 versicolor versi
## 93 5.8 2.6 4.0 1.2 versicolor versi
## 94 5.0 2.3 3.3 1.0 versicolor versi
## 95 5.6 2.7 4.2 1.3 versicolor versi
## 96 5.7 3.0 4.2 1.2 versicolor versi
## 97 5.7 2.9 4.2 1.3 versicolor versi
## 98 6.2 2.9 4.3 1.3 versicolor versi
## 99 5.1 2.5 3.0 1.1 versicolor versi
## 100 5.7 2.8 4.1 1.3 versicolor versi
## 101 6.3 3.3 6.0 2.5 virginica virg
## 102 5.8 2.7 5.1 1.9 virginica virg
## 103 7.1 3.0 5.9 2.1 virginica virg
## 104 6.3 2.9 5.6 1.8 virginica virg
## 105 6.5 3.0 5.8 2.2 virginica virg
## 106 7.6 3.0 6.6 2.1 virginica virg
## 107 4.9 2.5 4.5 1.7 virginica virg
## 108 7.3 2.9 6.3 1.8 virginica virg
## 109 6.7 2.5 5.8 1.8 virginica virg
## 110 7.2 3.6 6.1 2.5 virginica virg
## 111 6.5 3.2 5.1 2.0 virginica virg
## 112 6.4 2.7 5.3 1.9 virginica virg
## 113 6.8 3.0 5.5 2.1 virginica virg
## 114 5.7 2.5 5.0 2.0 virginica virg
## 115 5.8 2.8 5.1 2.4 virginica virg
## 116 6.4 3.2 5.3 2.3 virginica virg
## 117 6.5 3.0 5.5 1.8 virginica virg
## 118 7.7 3.8 6.7 2.2 virginica virg
## 119 7.7 2.6 6.9 2.3 virginica virg
## 120 6.0 2.2 5.0 1.5 virginica virg
## 121 6.9 3.2 5.7 2.3 virginica virg
## 122 5.6 2.8 4.9 2.0 virginica virg
## 123 7.7 2.8 6.7 2.0 virginica virg
## 124 6.3 2.7 4.9 1.8 virginica virg
## 125 6.7 3.3 5.7 2.1 virginica virg
## 126 7.2 3.2 6.0 1.8 virginica virg
## 127 6.2 2.8 4.8 1.8 virginica virg
## 128 6.1 3.0 4.9 1.8 virginica virg
## 129 6.4 2.8 5.6 2.1 virginica virg
## 130 7.2 3.0 5.8 1.6 virginica virg
## 131 7.4 2.8 6.1 1.9 virginica virg
## 132 7.9 3.8 6.4 2.0 virginica virg
## 133 6.4 2.8 5.6 2.2 virginica virg
## 134 6.3 2.8 5.1 1.5 virginica virg
## 135 6.1 2.6 5.6 1.4 virginica virg
## 136 7.7 3.0 6.1 2.3 virginica virg
## 137 6.3 3.4 5.6 2.4 virginica virg
## 138 6.4 3.1 5.5 1.8 virginica virg
## 139 6.0 3.0 4.8 1.8 virginica virg
## 140 6.9 3.1 5.4 2.1 virginica virg
## 141 6.7 3.1 5.6 2.4 virginica virg
## 142 6.9 3.1 5.1 2.3 virginica virg
## 143 5.8 2.7 5.1 1.9 virginica virg
## 144 6.8 3.2 5.9 2.3 virginica virg
## 145 6.7 3.3 5.7 2.5 virginica virg
## 146 6.7 3.0 5.2 2.3 virginica virg
## 147 6.3 2.5 5.0 1.9 virginica virg
## 148 6.5 3.0 5.2 2.0 virginica virg
## 149 6.2 3.4 5.4 2.3 virginica virg
## 150 5.9 3.0 5.1 1.8 virginica virg
Always check structure after transformations, especially when factors and missing values are involved.
## 'data.frame': 150 obs. of 6 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Species2 : Factor w/ 2 levels "versi","virg": NA NA NA NA NA NA NA NA NA NA ...
1.3 How to creat table 1 with test
In clinical trial reporting, “Table 1” typically refers to baseline characteristics (e.g., age, sex, race, key labs) summarized by treatment group. This often includes statistical tests (or standardized differences) to describe balance between groups.
The link below provides a full example workflow.
1.4 Imputing Missing Data with MICE
Missing data are common in clinical and observational studies. Multiple imputation (e.g., using MICE) is a practical approach under missing-at-random assumptions, and it allows uncertainty due to missingness to be propagated into final inference.
The link below provides an example implementation.