4 Data visualization

Data visualization is not only about making plots look “pretty.” In real projects, visualization has three practical jobs:

  1. Understand the data: distributions, outliers, missingness, and relationships.
  2. Communicate findings: help others quickly grasp patterns and results.
  3. Validate assumptions: especially before modeling (linearity, normality, separation, influential points, etc.).

This chapter focuses on ggplot2 (in the tidyverse ecosystem) and a few practical companion tools. The goal is to build a workflow that is repeatable: summarize → reshape → visualize → refine → annotate → communicate.

4.1 Data visualization introduction

Before plotting, it is often useful to summarize, reshape, and clean your data. Many plotting tasks become straightforward once the dataset is in a tidy (“long”) format. The section below shows common operations that prepare your data for visualization.

4.1.1 Summarization

A typical first step is to compute group-level summaries. In practice, these summaries help you:

  • sanity-check units and ranges,
  • compare groups quickly,
  • decide which plots are appropriate (boxplots, line charts, etc.).

The following example converts miles per gallon (mpg) into an approximate metric unit, kilometers per liter (1 mpg ≈ 0.425 km/L, rounded to 0.42 in the code), and summarizes by cylinder group.

library(tidyverse)  # loads dplyr, tidyr, and ggplot2

mtcars %>%
  mutate(kml = mpg * 0.42) %>%  # rough mpg -> km/L conversion
  group_by(cyl) %>%
  summarise(avg_US = mean(mpg), avg_metric = mean(kml))
## # A tibble: 3 × 3
##     cyl avg_US avg_metric
##   <dbl>  <dbl>      <dbl>
## 1     4   26.7      11.2 
## 2     6   19.7       8.29
## 3     8   15.1       6.34

The next summary computes average city and highway mileage by manufacturer and year.
In reporting, this kind of grouped summary often becomes the basis for trend plots and comparisons.

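mpg %>%
  group_by(manufacturer, year) %>%
  # across() supersedes summarise_at(); "drop_last" keeps the manufacturer grouping
  summarise(across(c(cty, hwy), mean), .groups = "drop_last")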
## # A tibble: 30 × 4
## # Groups:   manufacturer [15]
##    manufacturer  year   cty   hwy
##    <chr>        <int> <dbl> <dbl>
##  1 audi          1999  17.1  26.1
##  2 audi          2008  18.1  26.8
##  3 chevrolet     1999  15.1  21.6
##  4 chevrolet     2008  14.9  22.1
##  5 dodge         1999  13.4  18.4
##  6 dodge         2008  13.0  17.6
##  7 ford          1999  13.9  18.6
##  8 ford          2008  14.1  20.5
##  9 honda         1999  24.8  31.6
## 10 honda         2008  24    33.8
## # ℹ 20 more rows
  • change layout
    Sometimes you want a table for quick comparison rather than a plot.
    Here we count cars by class and year, then spread into a wide layout (years as rows, classes as columns).
    This is helpful for producing “Table 1 style” counts or for feeding into heatmaps.
mpg %>%
  count(class, year) %>%
  pivot_wider(names_from = class, values_from = n)  # pivot_wider() supersedes spread()
## # A tibble: 2 × 8
##    year `2seater` compact midsize minivan pickup subcompact   suv
##   <int>     <int>   <int>   <int>   <int>  <int>      <int> <int>
## 1  1999         2      25      20       6     16         19    29
## 2  2008         3      22      21       5     17         16    33
  • change all characters into factors
    Many plotting functions treat characters as discrete categories anyway, but converting to factor can make your intent explicit and allows you to control ordering and labeling more deliberately.
    This is especially useful for facets and legends.
mpg <- mpg %>%
  mutate(across(where(is.character), as.factor))  # convert every character column to a factor
  • wide to long data
    A common visualization trick is to stack multiple measurement columns into a long format.
    Here we convert cty and hwy into a single numeric column with a key indicating which metric it came from.
    Long format is the natural input format for ggplot because it plays well with color, linetype, and facets.
mpg1 <- mpg %>% 
  gather("key", "value", cty, hwy)
  • convert wide data to long data using pivot_longer()
    pivot_longer() is the modern tidyverse replacement for gather().
    This example:
    1. pivots cty and hwy into long form,
    2. recodes the indicator variable into friendly labels,
    3. overlays points and smooth curves for each measurement type.

This is a classic real-world pattern: put the data in long format, then map the “type” to color.

dta <- mpg %>%
  # stack cty and hwy into one column; "var" records which one each value came from
  pivot_longer(c(cty, hwy), names_to = "var", values_to = "value") %>%
  # recode the indicator into readable labels
  mutate(var = ifelse(var == "cty", "city", "highway"))

ggplot(dta, aes(x = displ, y = value)) +
  geom_point(aes(color = var)) +
  geom_smooth(aes(color = var), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula =
## 'y ~ x'

  • explore distribution
    Exploratory distribution plots help you quickly identify:
    • skewed variables,
    • implausible values,
    • heavy tails,
    • and whether transformations (log, sqrt) might be helpful.

Packages like DataExplorer and psych provide rapid EDA utilities.
(The riskfactors data set used here ships with the naniar package, so it is available once naniar is loaded.)

library(DataExplorer)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(naniar)
plot_histogram(riskfactors)

  • explore relationship/correlation
    Pairwise plots are very useful early in analysis because they reveal:
    • correlations,
    • nonlinear patterns,
    • clustering,
    • and outliers that may dominate a model.

These are especially valuable before regression or machine learning.

library(psych)
pairs.panels(riskfactors[,1:10])
## Warning in cor(x, y, use = "pairwise", method = method): the standard deviation
## is zero
## Warning in cor(x, y, use = "pairwise", method = method): the standard deviation
## is zero

  • create a custom theme
    In a book or internal training material, consistent formatting matters.
    Creating your own theme function lets you apply the same style across all figures:
    • consistent font sizes,
    • consistent background,
    • consistent axis formatting.

This is a practical “quality of life” improvement when you generate many plots.

my_theme <- function() {
  theme_bw() +
    theme(
      axis.title = element_text(size = 16),
      axis.text  = element_text(size = 14),
      text       = element_text(size = 14)
    )
}
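
A theme function like this can be used in two ways: added to a single plot, or set as the session default. The sketch below shows both. It repeats the my_theme() definition so that it runs on its own, and it uses the built-in mtcars data purely as a stand-in.

```r
library(ggplot2)

# Same definition as in the text, repeated so the sketch is self-contained
my_theme <- function() {
  theme_bw() +
    theme(axis.title = element_text(size = 16),
          axis.text  = element_text(size = 14),
          text       = element_text(size = 14))
}

# Option 1: add the theme to a single plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  my_theme()

# Option 2: make it the session default; theme_set() returns the previous theme
old_theme <- theme_set(my_theme())
# ... every plot created here uses my_theme() without adding it explicitly ...
theme_set(old_theme)  # restore the previous default
```

Option 1 keeps the styling visible in each plot's code; option 2 is convenient when a whole report should share one look.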

4.1.2 Explore missing values

Missingness is not just a data cleaning issue—it can be informative.
Before you impute or drop data, you should understand:

  • which variables have missingness,
  • whether missingness co-occurs across variables,
  • and whether missingness is related to groups (e.g., sites, time periods).

The following blocks use naniar and related tools to visualize missing patterns.

# install.packages("naniar")
library(naniar)
# riskfactors ships with naniar; copy it into the workspace
riskfactors <- naniar::riskfactors
# head(riskfactors)
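
Before drawing anything, a plain count of missing values per column is a useful baseline. The following is a minimal sketch in base R; it uses the built-in airquality data as a stand-in for riskfactors, and the column names of the summary table are illustrative choices. naniar::miss_var_summary() provides a comparable table.

```r
# airquality is a built-in data set with missing values in some columns
miss_summary <- data.frame(
  variable  = names(airquality),
  n_missing = unname(colSums(is.na(airquality)))
)
miss_summary$pct_missing <- round(100 * miss_summary$n_missing / nrow(airquality), 1)

# Sort so the variables with the most missing values come first
miss_summary <- miss_summary[order(-miss_summary$n_missing), ]
print(miss_summary)

# Do missing values co-occur? Cross-tabulate the missingness of two columns
print(table(ozone_missing = is.na(airquality$Ozone),
            solar_missing = is.na(airquality$Solar.R)))
```

The cross-tabulation is the two-variable version of what an UpSet plot shows for many variables at once.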

gg_miss_upset() is useful for seeing combinations of missing variables (like an Upset plot). This helps identify patterns such as “these three labs are often missing together.”

gg_miss_upset(riskfactors,nsets=10)

plot_missing() gives a quick overall missingness view, often used at the start of EDA.

# install.packages("DataExplorer")

plot_missing(riskfactors)

visdat::vis_dat() is a fast way to inspect column types and missingness at once.
It’s particularly helpful to detect:

  • columns accidentally read as character,
  • columns with mixed types,
  • or unexpected missing patterns.

# take a quick look at the data types of each column
visdat::vis_dat(riskfactors)

4.1.3 Add statistical test

Plots often become more persuasive when paired with a simple statistical comparison, but you should add tests carefully:

  • avoid fishing/over-testing,
  • ensure assumptions are reasonable,
  • and interpret p-values in context.

Here we build a faceted boxplot and add a comparison between years (1999 vs 2008) for each manufacturer.
This is a practical template for “small multiple” visual comparisons with annotated tests.

library(ggpubr)
plt <- ggplot(
  data = mpg,
  mapping = aes(x = as.factor(year), y = cty, color = as.factor(year))
) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  labs(x = "Year", y = "City mpg") +
  my_theme() +
  facet_wrap(~ manufacturer, nrow = 2)

# add statistical tests
my_comparisons <- list(c("1999", "2008"))
plt +
  stat_compare_means() +                            # overall test in each panel
  stat_compare_means(comparisons = my_comparisons)  # bracket for 1999 vs 2008
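
It helps to know what stands behind an annotated p-value. According to the ggpubr documentation, stat_compare_means() defaults to a Wilcoxon rank-sum test when two groups are compared. The sketch below runs that test directly with base R, first for one manufacturer and then for every panel with a Holm adjustment. The adjustment step is an illustrative addition, not something the plot above does.

```r
library(ggplot2)  # provides the mpg data set

# Compare city mileage between the two model years for one manufacturer
audi <- subset(mpg, manufacturer == "audi")

# exact = FALSE requests the normal approximation, which avoids
# warnings about ties in the mileage values
res <- wilcox.test(cty ~ year, data = audi, exact = FALSE)
res$p.value

# One test per panel means many tests, so adjust the p-values
p_raw <- sapply(split(mpg, mpg$manufacturer), function(d) {
  # skip manufacturers observed in only one model year
  if (length(unique(d$year)) < 2) return(NA_real_)
  wilcox.test(cty ~ year, data = d, exact = FALSE)$p.value
})
p_adj <- p.adjust(p_raw, method = "holm")
```

With fifteen panels, comparing p_raw with p_adj shows how much the multiplicity correction changes the picture.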

4.1.4 Add text labels to points

When plotting many points, labels can clutter quickly.
But for small datasets (or when identifying key outliers), labels add interpretability.
This example labels each state for a scatter plot of urban population vs murder arrests.

USArrests <- USArrests %>% rownames_to_column("State")

ggplot(USArrests, aes(x = UrbanPop, y = Murder)) +
  geom_point() +
  labs(x = "Percent of population that is urban",
       y = "Murder arrests (per 100,000)",
       caption = "McNeil (1977). Interactive Data Analysis") +
  geom_text(aes(label = State), size = 3)
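
When every point is labelled, the labels overlap the points and each other. One way to follow the advice above is to label only a few key observations. The sketch below is one possible approach using only ggplot2; the object names (arrests, top5) and the choice of "top five by murder arrest rate" are illustrative assumptions.

```r
library(ggplot2)

# Move the row names (state names) into a regular column
arrests <- data.frame(State = rownames(datasets::USArrests),
                      datasets::USArrests,
                      row.names = NULL)

# Keep only the five states with the highest murder arrest rates for labelling
top5 <- head(arrests[order(-arrests$Murder), ], 5)

p <- ggplot(arrests, aes(x = UrbanPop, y = Murder)) +
  geom_point(colour = "grey50") +
  geom_point(data = top5, colour = "red") +
  # vjust moves each label above its point;
  # check_overlap drops labels that would collide with an earlier label
  geom_text(data = top5, aes(label = State), size = 3, vjust = -0.8,
            check_overlap = TRUE)
p
```

Passing a different data frame to a single layer (data = top5) is what lets the labels cover a subset while the points cover everything.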

4.1.5 Set the legend

Legends are often the difference between a confusing figure and a publishable one.
Here we demonstrate manual control of:

  • fill colors,
  • linetypes,
  • legend title and labels.

In practice, manual scales are useful when you want consistent colors across plots in a report or book.

ggplot(iris, aes(x = Sepal.Length, fill = Species)) +             # plot-wide mapping
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5,
                 position = "identity", bins = 50) +
  geom_density(aes(linetype = Species), alpha = 0.1) +            # layer-specific mapping
  # same name and labels on both scales, so the two legends merge into one
  scale_fill_manual(name = "Groups",
                    values = c("grey", "black", "skyblue"),
                    labels = c("setosa", "versicolor", "virginica")) +
  scale_linetype_manual(name = "Groups",
                        values = c(1, 3, 5),
                        labels = c("setosa", "versicolor", "virginica")) +
  labs(x = "Sepal.Length",
       y = "Density",
       title = "")

4.1.6 Create a panel of plots

A panel (multi-plot layout) is a very effective way to summarize multiple variables at once.
In exploratory analysis, it helps you quickly see:

  • distributions (histogram, bar),
  • group differences (boxplot),
  • and composition (stacked percent bars).

Here we build four plots and arrange them into a 2×2 panel.

p1 <- ggplot(data = riskfactors, aes(x = age)) +
  geom_histogram(bins = 30)

p2 <- ggplot(data = riskfactors, aes(x = sex)) +
  geom_bar()

p3 <- ggplot(riskfactors, aes(x = education, y = bmi)) +
  geom_boxplot()

# stacked bars: each marital/education cell as a share of all respondents
p4 <- ggplot(riskfactors, aes(x = marital)) +
  geom_bar(aes(group = education, y = after_stat(count / sum(count)), fill = education)) +
  scale_y_continuous(labels = scales::percent)

# install.packages("ggpubr")
library(ggpubr)
ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
## Warning: Removed 11 rows containing non-finite outside the
## scale range (`stat_boxplot()`).

4.1.7 Plots in regression

Visualization is a powerful companion to regression:

  • diagnostic plots check assumptions,
  • coefficient/OR plots communicate model results,
  • and fitted relationships can be compared across groups.

  • create linear regression model
    We fit a linear regression on Boston housing as a demonstration.
    In practice, you would check residuals, leverage, and outliers; this section mainly sets up the model for coefficient visualization.
data("Boston", package = "MASS")
linear_reg <- glm(medv ~ ., data=Boston , family = gaussian())
summary(linear_reg)
## 
## Call:
## glm(formula = medv ~ ., family = gaussian(), data = Boston)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 22.51785)
## 
##     Null deviance: 42716  on 505  degrees of freedom
## Residual deviance: 11079  on 492  degrees of freedom
## AIC: 3027.6
## 
## Number of Fisher Scoring iterations: 2
  • summary
    broom::tidy() produces a clean summary table that is easy to include in reports.
    Using knitr::kable() formats it neatly in HTML/PDF output.
knitr::kable(broom::tidy(linear_reg))
term            estimate  std.error    statistic    p.value
(Intercept)   36.4594884  5.1034588    7.1440742  0.0000000
crim          -0.1080114  0.0328650   -3.2865169  0.0010868
zn             0.0464205  0.0137275    3.3815763  0.0007781
indus          0.0205586  0.0614957    0.3343100  0.7382881
chas           2.6867338  0.8615798    3.1183809  0.0019250
nox          -17.7666112  3.8197437   -4.6512574  0.0000042
rm             3.8098652  0.4179253    9.1161402  0.0000000
age            0.0006922  0.0132098    0.0524024  0.9582293
dis           -1.4755668  0.1994547   -7.3980036  0.0000000
rad            0.3060495  0.0663464    4.6128998  0.0000051
tax           -0.0123346  0.0037605   -3.2800091  0.0011116
ptratio       -0.9527472  0.1308268   -7.2825106  0.0000000
black          0.0093117  0.0026860    3.4667926  0.0005729
lstat         -0.5247584  0.0507153  -10.3471458  0.0000000
  • create logistic regression
    We fit logistic regression on the diabetes dataset. This is a classic example where coefficient plots and OR plots become very helpful for communication.
# load the Pima Indians diabetes data from the mlbench package
library(mlbench)
data(PimaIndiansDiabetes)
# copy to a shorter name for convenience
diabetes <- PimaIndiansDiabetes

logistic_reg <- glm(diabetes ~ ., data=diabetes, family = binomial)

summary(logistic_reg)
## 
## Call:
## glm(formula = diabetes ~ ., family = binomial, data = diabetes)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
## pregnant     0.1231823  0.0320776   3.840 0.000123 ***
## glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
## pressure    -0.0132955  0.0052336  -2.540 0.011072 *  
## triceps      0.0006190  0.0068994   0.090 0.928515    
## insulin     -0.0011917  0.0009012  -1.322 0.186065    
## mass         0.0897010  0.0150876   5.945 2.76e-09 ***
## pedigree     0.9451797  0.2991475   3.160 0.001580 ** 
## age          0.0148690  0.0093348   1.593 0.111192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 759  degrees of freedom
## AIC: 741.45
## 
## Number of Fisher Scoring iterations: 5
  • summary
    Again, broom::tidy() provides a clear coefficient table.
knitr::kable(broom::tidy(logistic_reg))
term           estimate  std.error    statistic    p.value
(Intercept)  -8.4046964  0.7166359  -11.7279870  0.0000000
pregnant      0.1231823  0.0320776    3.8401403  0.0001230
glucose       0.0351637  0.0037087    9.4813935  0.0000000
pressure     -0.0132955  0.0052336   -2.5404160  0.0110721
triceps       0.0006190  0.0068994    0.0897131  0.9285152
insulin      -0.0011917  0.0009012   -1.3223094  0.1860652
mass          0.0897010  0.0150876    5.9453340  0.0000000
pedigree      0.9451797  0.2991475    3.1595780  0.0015800
age           0.0148690  0.0093348    1.5928584  0.1111920

4.1.7.1 Create forest plots for coefficients or OR

Forest plots are one of the most practical ways to present regression results:

  • For linear regression, coefficients are on the original outcome scale.
  • For logistic regression, coefficients are usually transformed to odds ratios (OR).

sjPlot::plot_model() offers fast, clean coefficient plots.

library(sjPlot)
## 
## Attaching package: 'sjPlot'
## The following object is masked from 'package:ggplot2':
## 
##     set_theme
plot_model(linear_reg, show.values = TRUE, value.offset = 0.5)

Here we plot logistic regression results similarly.
According to the sjPlot documentation, plot_model() exponentiates the coefficients of logistic models by default, so this plot shows odds ratios (exp(beta)); pass transform = NULL to show the raw log-odds instead.

plot_model(logistic_reg, show.values = TRUE, value.offset = .5, vline.color = "black")
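
To see what such a plot contains, the odds ratios and their intervals can also be computed by hand and drawn with plain ggplot2. The sketch below is not how sjPlot works internally; it is an illustration under stated assumptions: the built-in infert data stand in for the diabetes model, Wald intervals from confint.default() are used instead of profile-likelihood intervals, and the or_table column names are arbitrary.

```r
library(ggplot2)

# Toy logistic model on a built-in data set (stand-in for the diabetes model)
fit <- glm(case ~ age + spontaneous + induced, data = infert, family = binomial)

est <- coef(fit)
ci  <- confint.default(fit)  # Wald intervals on the log-odds scale

# Exponentiate estimates and interval bounds to get odds ratios
or_table <- data.frame(
  term  = names(est),
  or    = exp(est),
  lower = exp(ci[, 1]),
  upper = exp(ci[, 2]),
  row.names = NULL
)
or_table <- or_table[or_table$term != "(Intercept)", ]

p <- ggplot(or_table, aes(x = or, y = term)) +
  geom_vline(xintercept = 1, linetype = "dashed") +   # OR = 1 means no association
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  scale_x_log10() +                                   # ratios are symmetric on a log axis
  labs(x = "Odds ratio (95% Wald CI, log scale)", y = NULL)
p
```

An interval that crosses the dashed line at 1 corresponds to a coefficient whose Wald test is not significant at the 5% level.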

Another approach: finalfit
finalfit provides publication-friendly coefficient/OR plots with tables.
This is especially useful for clinical or epidemiology style reporting.

library(finalfit)
# note: chas, a predictor in the glm above, is not included in this list
explanatory <- c("crim", "zn", "indus", "nox", "rm", "age", "dis",
                 "rad", "tax", "ptratio", "black", "lstat")
dependent <- "medv"

Boston %>%
  coefficient_plot(dependent, explanatory, table_text_size = 3,
                   title_text_size = 12,
                   plot_opts = list(xlab("Beta, 95% CI"),
                                    theme(axis.title = element_text(size = 12))))
## `height` was translated to `width`.

And the OR plot for logistic regression.
This is a practical template to create a clean “OR (95% CI)” figure plus table in one call.

library(finalfit)

explanatory <- c("pregnant", "glucose", "pressure", "triceps",
                 "insulin", "mass", "pedigree", "age")
dependent <- "diabetes"

diabetes %>%
  or_plot(dependent, explanatory, table_text_size = 3,
          title_text_size = 12,
          plot_opts = list(xlab("OR, 95% CI"),
                           theme(axis.title = element_text(size = 12))))
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## `height` was translated to `width`.

  • qq plot
    A QQ plot is a quick check of distributional assumptions.
    Here we show a QQ plot for the Boston target variable.
    In practice, you might also check residuals from the fitted model, not only the raw outcome.
ggqqplot( (Boston$medv))
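
For a regression model, the normality assumption concerns the residuals rather than the raw outcome. A minimal sketch of a residual QQ plot follows. It refits the Boston model with lm(), which gives the same fit as the gaussian glm() used earlier, and it uses ggplot2's stat_qq() and stat_qq_line() rather than ggpubr; the name diag_df is an arbitrary choice.

```r
library(ggplot2)

data("Boston", package = "MASS")
fit <- lm(medv ~ ., data = Boston)

# Standardized residuals share a common scale, which makes the QQ plot easier to read
diag_df <- data.frame(std_resid = rstandard(fit))

p <- ggplot(diag_df, aes(sample = std_resid)) +
  stat_qq() +        # sample quantiles against theoretical normal quantiles
  stat_qq_line() +   # reference line through the first and third quartiles
  labs(x = "Theoretical normal quantiles", y = "Standardized residuals")
p
```

Points that bend away from the reference line in the tails suggest heavier or lighter tails than a normal distribution.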

  • Loading data set
    We now shift into ggplot2 fundamentals using the classic iris dataset.
    This section builds intuition: start from an empty canvas, add layers, add mappings, then customize.
library(printr)
library(tidyverse)
head(iris)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         5.4          3.9           1.7          0.4  setosa

4.2 Scatter plot

Scatter plots are ideal for exploring relationships between two continuous variables.
In practice, they are often your first step before fitting correlation/regression models.

4.2.1 Create an empty canvas

  • tell ggplot() which dataset to use
  • then create the aesthetic mapping: which variables go on x and y
    This creates a ggplot “object” (canvas) that shows only the axes and background; no data appear until a geom layer is added.
# x and y are mapped to columns of the data; individual geoms can add or override mappings
ggplot(data = iris,                               # which data set?
       aes(x = Sepal.Length, y = Petal.Length))   # which variables as aesthetics?

4.2.2 Add a layer/geom of points to the canvas

A plot becomes visible once we add geom_point().
This is the standard pattern in ggplot: define the canvas + add one or more layers.

ggplot(data = iris,
       mapping = aes(x=Sepal.Length , y=Petal.Length  )) + 
  geom_point()   # adding the geometrical representation 

This alternative style places the mapping in the geom.
It is useful when different layers use different mappings.

# same plot as above
ggplot(data = iris) + 
  geom_point(  aes(x=Sepal.Length , y=Petal.Length  )) 

4.2.3 Add another aesthetic

  • add a curve/straight line to fit these points
  • each geom adds its own layer and can take its own aesthetics

A smooth curve helps reveal the trend.
By default, geom_smooth() uses a loess smoother when there are fewer than 1,000 observations and switches to a GAM (mgcv::gam()) for larger datasets.

# Loess curve
ggplot(data = iris, 
       mapping = aes(x=Sepal.Length , y=Petal.Length  )) + 
  geom_point() +
  geom_smooth() 
## `geom_smooth()` using method = 'loess' and formula =
## 'y ~ x'

If you want an interpretable linear trend line, specify method = "lm".

# Linear regression line
ggplot(data = iris, 
       mapping = aes(x=Sepal.Length , y=Petal.Length)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

4.2.4 Add other aesthetics

  • set other aesthetics: colour, alpha (transparency), and size of points

Aesthetic mapping is powerful:

  • mapping inside aes() means “vary by data,”
  • setting outside aes() means “fixed constant.”

Here we map point size to Sepal.Width and set a fixed color + alpha.

ggplot(data = iris) + 
  geom_point(aes(x=Sepal.Length , y=Petal.Length, size = Sepal.Width     ), 
             alpha = .5, 
             colour = "red")

Here we map color to Species, producing a grouped scatter plot.

ggplot(data = iris) +
  # colour is mapped to a variable (Species) here, so it goes inside aes()
  geom_point(aes(x = Sepal.Length, y = Petal.Length, size = Sepal.Width, colour = Species),
             alpha = .9)
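
A common slip with the inside/outside distinction is to put a constant inside aes(). The sketch below contrasts the two cases; the object names p_fixed and p_mapped are illustrative.

```r
library(ggplot2)

# Fixed constant: colour is set outside aes(), so every point is drawn in red
p_fixed <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length), colour = "red")

# Slip: inside aes(), "red" is treated as a data value with a single level.
# The points get the default discrete colour and a legend entry labelled "red".
p_mapped <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length, colour = "red"))
```

Printing both plots side by side makes the difference obvious: only p_fixed is actually red, and only p_mapped has a legend.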

  • categorize Petal.Width then map colour to this new variable
    Creating derived categorical variables is common in applied analysis (e.g., risk categories).
    Then you can visualize patterns by category.
iris <- iris %>% 
  mutate(growth = ifelse(Petal.Width   > 1.5, "Wide", "Normal"))

ggplot(data=iris) + 
  geom_point(aes(x=Sepal.Length , y=Petal.Length, size = Sepal.Width , colour=growth), 
             alpha=.9)

4.3 Bar chart

Bar charts summarize counts for categorical variables.
Here we count the number of observations in each growth category.

ggplot(data = iris) + 
  geom_bar(aes(x = growth))

  • bar chart after group_by
  • then use stat='identity'

If you want bars to represent pre-computed values (means, totals, proportions), you must provide y explicitly and set stat = "identity".
Here we summarize mean Sepal.Length by Species and growth category.

library(dplyr)
results  <- iris %>% 
  group_by(Species, growth) %>% 
   summarise(Sepal.Length.mean=mean (Sepal.Length ))
## `summarise()` has grouped output by 'Species'. You can
## override using the `.groups` argument.
# "setosa_null" is not a level of Species, so this filter keeps every row
gop <- results %>%
  filter(Species != "setosa_null")
gop
## # A tibble: 5 × 3
## # Groups:   Species [3]
##   Species    growth Sepal.Length.mean
##   <fct>      <chr>              <dbl>
## 1 setosa     Normal              5.01
## 2 versicolor Normal              5.91
## 3 versicolor Wide                6.18
## 4 virginica  Normal              6.13
## 5 virginica  Wide                6.62
  • identity bar chart (for demonstration only: from here until the line chart section, the group means are stacked and compared as if they were totals, so the bar heights have no real meaning)
    This shows the mechanics of an identity bar chart.
# We can also store parts of a plot in an object
plot1 <- ggplot(gop) + 
  geom_bar(aes(x=growth , y=Sepal.Length.mean), 
           
           stat='identity')
plot1
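
ggplot2 also provides geom_col(), which plots pre-computed y values directly and so does not need stat = "identity". A minimal sketch follows, with one mean per species so that each bar shows a single value; the name species_means is an arbitrary choice.

```r
library(ggplot2)
library(dplyr)

# One pre-computed value per species
species_means <- iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))

# geom_col() takes y as given; it behaves like geom_bar(stat = "identity")
p <- ggplot(species_means) +
  geom_col(aes(x = Species, y = mean_sepal_length))
p
```

Because there is exactly one row per x value here, nothing is stacked and the bar height equals the group mean.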

4.3.1 Add some options for the whole ggplot rather than layers

  • switch the x and y axes
    coord_flip() is frequently used when category labels are long.
plot1 + 
  coord_flip()

  • reorder x categories (the minus sign inside reorder() sorts in descending order)
    Ordering categories by a statistic is one of the easiest ways to improve readability.
ggplot( gop) + 
  geom_bar(aes(x=reorder(growth, -Sepal.Length.mean), y=Sepal.Length.mean, fill=growth), 
           stat='identity') + 
  coord_flip()

  • add x axis label and a theme
    This is a typical “publication polish” pattern: label axes, remove redundant legends, apply a clean theme.
ggplot(gop) +
  geom_bar(aes(x = reorder(growth, -Sepal.Length.mean), y = Sepal.Length.mean, fill = growth),
           stat = "identity") +
  coord_flip() +
  xlab("Growth categories") +
  guides(fill = "none") +  # hide the redundant fill legend
  theme_minimal()

  • set theme
    Themes can be standardized across reports. ggthemes offers a set of recognizable styles.
library(ggthemes)
ggplot(data = iris) + 
  geom_bar(aes(x = growth)) + 
  theme_economist()

4.3.2 Grouped bar chart

  • bar chart with different panels
    Faceting is a quick way to compare the same plot structure across groups.

ggplot(mpg, aes(x = class)) + 
  geom_bar() + 
  facet_wrap( ~ year)

  • actual number (groups are stacked by default)
    Stacked bars are useful for composition, but can be harder to compare across groups.
ggplot(gop) + 
  geom_bar(aes(x=growth, y=Sepal.Length.mean, fill=Species)
           , stat='identity'
           )

This shows a stacked count plot by default.

ggplot(mpg, aes(x = class )) + 
  geom_bar(aes(group = year, fill = year), position = "stack")

  • percentage
    Position “fill” converts to proportions (each bar sums to 1).
    This is ideal for comparing composition across categories.
ggplot(gop) + 
  geom_bar(aes(x=growth, y=Sepal.Length.mean, fill=Species), 
           stat='identity', 
           position='fill')

  • dodged groups with actual numbers
    Dodged bars are easier for comparing group values directly.
ggplot(gop) + 
  geom_bar(aes(x=growth, y=Sepal.Length.mean, fill=Species), 
           stat='identity', 
           position='dodge')

  • dodged groups with proportions
    Here we compute within-growth proportions manually, then plot as dodged bars.
    This is a useful pattern when you want side-by-side proportions rather than stacked proportions.
gop2 <- gop %>%
  group_by(growth) %>%
  mutate(Sepal.Length.prop = Sepal.Length.mean / sum(Sepal.Length.mean))

ggplot(gop2) +
  geom_bar(aes(x = growth, y = Sepal.Length.prop, fill = Species),
           stat = "identity",
           position = "dodge") +
  ylab("Proportion within growth category")

4.4 Line charts

Line charts are best when x represents an ordered scale (time, dose, sequence).
If x is not truly ordered, lines can mislead, so use carefully.

This simple line plot demonstrates the syntax, though in real applications you usually group lines by an ID or category.

ggplot(iris) + 
  geom_line(aes(x=Sepal.Length , y=Petal.Length))

4.4.1 Grouped by colour variable

Mapping color to Species creates separate lines for each group.
This is a common pattern for showing group trends.

ggplot(iris) + 
  geom_line(aes(x=Sepal.Length , y=Petal.Length, colour = Species))

  • faceted by Species, then set how many rows or columns
    Faceting creates separate small panels for each group.
    This can improve readability when multiple lines overlap in one plot.
ggplot(iris) + 
  geom_line(aes(x=Sepal.Length , y=Petal.Length) ) + 
  
  facet_wrap(~Species, nrow = 1) +      #set how many rows
  coord_flip()  

4.4.2 Multiple aesthetics

Here we show multiple mappings and transformations:

  • a thicker line,
  • log scaling on x,
  • point size and color mapped to variables,
  • and faceting by Species.

This is a realistic “final figure” style example—dense but informative.

iris <- iris %>% 
  mutate(growth = ifelse(Petal.Width   > 1.5, "Wide", "Normal"))

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_line(linewidth = 2, color = "purple") +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  # log-scaled x axis with plain number labels
  scale_x_log10(labels = scales::label_number()) +
  geom_point(aes(size = Sepal.Length, colour = as.factor(growth)), show.legend = FALSE) +
  facet_wrap(~ Species)

4.5 ggplot2 parameters

To go further, it helps to understand the “grammar of graphics” idea behind ggplot2:
- data provides the rows,
- aes maps variables to visual properties,
- geoms draw shapes,
- scales control transformations and legends,
- guides refine legends,
- themes control appearance.

For detail, please read this article and this one.

Many companion packages support advanced visualization workflows.
In practice, you do not need all of them for basic plots, but it is useful to know what exists:

  • ggrepel for non-overlapping labels,
  • patchwork for combining plots,
  • gganimate for animation,
  • sf for spatial data,
  • ggthemes for themes, etc.

This example is a classic demonstration that summary statistics can be misleading.
Different datasets can share the same mean, variance, and correlation—but look completely different when plotted.
Always plot your data.

library(datasauRus)
# plot only the "dino" member of the Datasaurus Dozen
ggplot() +
  geom_point(data = datasaurus_dozen[datasaurus_dozen$dataset == "dino", ],
             aes(x = x, y = y), color = "#7CAE00") +
  theme_void() +
  theme(legend.position = "none")
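
The claim about matching summary statistics is easy to check numerically. The sketch below uses Anscombe's quartet, a different but closely related example that ships with base R as anscombe, so it runs without the datasauRus package. The helper names stats and long are arbitrary.

```r
library(ggplot2)

# anscombe stores four data sets side by side as x1..x4 and y1..y4
stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i, mean_x = mean(x), mean_y = mean(y),
             var_x = var(x), var_y = var(y), cor_xy = cor(x, y))
}))
print(round(stats, 2))  # the four rows are nearly identical

# Reshape to long format so each data set gets its own facet
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = paste("Set", i),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

p <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)
p
```

The table suggests four interchangeable data sets; the faceted plot shows a linear trend, a curve, and two sets dominated by a single unusual point.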

4.5.1 Components of plot

A practical way to think about building plots:

  • Components of plot
    • Data: is a data frame
    • Aesthetics: is used to indicate x and y variables and to control the color, the size or the shape …
    • Geometry: the type of graphics (bar plot, line plot, scatter plot…)
  • adjust parameters
    • adjust legend using guide_
    • adjust color, size, and shape using scale_, guide_ can further adjust scale_
    • adjust panel, background, axis (font, color, size, angle), title, legend (position), caption using theme
  • types of plots
    • geom_boxplot(): Box plot
    • geom_violin(): Violin plot
    • geom_dotplot(): Dot plot
    • geom_jitter(): Jitter charts
    • geom_line(): Line plot
    • geom_bar(): Bar plot
    • geom_errorbar(): Error bars
    • geom_point(): Scatter plot
    • geom_smooth(): Add smoothed curve
    • geom_quantile(): Add quantile lines
    • geom_text(): Textual annotations
    • geom_density(): Create a smooth density curve
    • geom_histogram(): Histogram
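
The components listed above map onto code roughly as follows. This is a minimal sketch using ggplot2's mpg data; the palette, legend title, and legend layout are arbitrary illustrative choices.

```r
library(ggplot2)

p <- ggplot(data = mpg,                          # data: a data frame
            mapping = aes(x = displ, y = hwy,    # aesthetics: x, y, and colour
                          colour = drv)) +
  geom_point() +                                 # geometry: scatter plot
  scale_colour_brewer(palette = "Dark2",         # scale_: which colours, legend title
                      name = "Drive train") +
  guides(colour = guide_legend(nrow = 1)) +      # guide_: how the legend is laid out
  theme(legend.position = "bottom")              # theme: non-data appearance
p
```

Each line after the first can be removed or swapped without touching the others, which is the practical benefit of the layered grammar.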

example
This example uses the Old Faithful geyser dataset and overlays a 2D density estimate.
This is a useful technique when points overlap heavily and you want to show concentration.

data("faithful")
# Scatterplot with a filled 2D density estimate drawn on top
ggplot(data = faithful,
       mapping = aes(x = eruptions, y = waiting)) +
  geom_point() +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")

# Data and mapping can be given both as global (in ggplot()) or per layer
# ggplot() + 
#   geom_point(mapping = aes(x = eruptions, y = waiting),
#              data = faithful)

4.5.2 Create main title, axis labels, caption

Titles and captions turn plots into “report-ready” figures.
Axis labels should make units explicit, and captions should record the data source if applicable.

Pay attention to whether each variable is a factor (discrete) or continuous, since this determines which scales and legends apply.

ggplot(data = faithful, 
       mapping = aes(x = eruptions, y = waiting)) + 
  geom_point()+
  
  labs(title = "Number of xxx",
       caption = "source: http://xxx",
       x = "Eruptions" ,
       y = "Waiting time" ) +
# customize title, axis, caption  
  theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
plot.caption = element_text(color="red", size=10, face="italic"),
axis.title.x = element_text(color="blue", size=14, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)+
  
# hide main title  
theme(plot.title = element_blank() )

4.5.3 Create legend title, position

Legends should explain the mapping clearly (what does color mean? what does size mean?).
You can also move the legend depending on layout constraints (e.g., left for tall plots, bottom for wide panels).

p <- ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting, colour = eruptions < 3))+
  labs(color="Numbers of eruptions") +
  
  theme(legend.position = "left")+
# Change the appearance of legend title and labels
  theme(legend.title = element_text(colour="blue"),
          legend.text = element_text(colour="red"))+
# Change legend box background color
  theme(legend.background = element_rect(fill=NULL))
print(p)

  • customize legends using scale functions
    Scales can rename and reorder legend entries.
    This is often necessary when boolean variables show up as “FALSE/TRUE” but you want more meaningful labels.
# how to change order of legend?  
# Set legend title and labels
p+ scale_color_discrete(name = "Numbers of eruptions change", labels = c("F", "T" ))
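
One possible way to control legend order is the `breaks` argument of the scale: entries appear in the order given, and `labels` follows the same order. The sketch below first recodes the logical condition into a character column so that the breaks are plain strings; `faithful_typed` and `eruption_type` are hypothetical names introduced for illustration.

```r
library(ggplot2)

# Recode the condition into readable categories (illustrative column name)
faithful_typed <- transform(
  faithful,
  eruption_type = ifelse(eruptions < 3, "short", "long")
)

p_legend <- ggplot(faithful_typed) +
  geom_point(aes(x = eruptions, y = waiting, colour = eruption_type)) +
  scale_colour_discrete(
    name   = "Eruption length",
    breaks = c("short", "long"),                         # legend order
    labels = c("Short (< 3 min)", "Long (>= 3 min)")     # same order as breaks
  )

p_legend
```

Without `breaks`, the entries would be listed alphabetically ("long" before "short").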

  • customize legend
    guide_colorbar() is typically used for continuous variables;
    guide_legend() is used for discrete variables (colors, shapes, linetypes).
    This block shows how to control legend title and label appearance in detail.
ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting, size = eruptions < 3))+
  guides(size = guide_legend(
    # legend title
    title = "title is too low",
    title.position = "bottom",
    title.vjust = -5,
    # legend label
    label.position = "left",
    label.hjust = 1,
    label.theme = element_text(size = 15,
                               face = "italic",
                               colour = "red",
                               angle = 0),
    # label reverse
    reverse = TRUE,
    
    # width of bin
    keywidth = 1,
    
    ncol = 4
    
   )
  )
## Warning: Using size for a discrete variable is not advised.

  • delete a legend
    A common practical cleanup is to remove redundant legends when multiple aesthetics are shown.
    Here we keep the color legend but remove size legend.
  ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point() +
  guides(
    color = guide_legend("type"),      # keep
    size = "none"                      # remove
  )

  • combine two legends when they use the same variable (mapping)
    Mapping both color and size to the same variable can create duplicate legends.
    Here we unify them by explicitly setting guides.
  ggplot(mpg, aes(x = displ, y = hwy, color = cyl, size = cyl)) +
  geom_point() +
  scale_color_viridis_c() +
  
  guides(
    color = guide_legend("title"),
    size = guide_legend("title")
  )

# guide = "legend"

4.5.4 Change plot colors

This section demonstrates the key concept:

  • Setting color inside aes() maps it to a variable, so the color varies with the data.
  • Setting color outside aes() applies one constant color to the whole layer.

Set color inside aes():

ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting, colour = eruptions < 3))

Set color outside aes():

ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting),
             colour = 'steelblue')
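
A common pitfall follows from this rule: a color name placed inside aes() is treated as data, not as a color. The sketch below uses `layer_data()` to inspect the colors ggplot2 actually assigns; `p_constant` and `p_mapped` are illustrative names.

```r
library(ggplot2)

# Constant set outside aes(): every point is drawn in steelblue
p_constant <- ggplot(faithful) +
  geom_point(aes(x = eruptions, y = waiting), colour = "steelblue")

# Same string inside aes(): treated as a one-level variable and mapped
# through the default colour scale, so the points are not steelblue
p_mapped <- ggplot(faithful) +
  geom_point(aes(x = eruptions, y = waiting, colour = "steelblue"))

unique(layer_data(p_constant)$colour)
unique(layer_data(p_mapped)$colour)
```

The second plot also gains a legend with a single entry labelled "steelblue", which is usually a sign that a constant ended up inside aes().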

  • Colour the histogram with color and fill
    This maps outline color and fill color to logical conditions.
    In practice, you might map fill to a categorical group or a binned continuous variable.
ggplot(faithful) + 
  geom_histogram(aes(x = eruptions,color=eruptions < 3, fill=eruptions < 4))
## `stat_bin()` using `bins = 30`. Pick better value
## `binwidth`.

  • Colour the histogram by waiting and changing position
    Position controls how groups are displayed (stacked, dodged, or overlaid).
ggplot(faithful) + 
  geom_histogram(aes(x = eruptions,color=waiting>60), position = 'dodge')
## `stat_bin()` using `bins = 30`. Pick better value
## `binwidth`.

Overlaying histograms can be informative but can also become messy; using transparency or density curves can help.

ggplot(faithful) + 
  geom_histogram(aes(x = eruptions,color=waiting>60), position = 'identity')
## `stat_bin()` using `bins = 30`. Pick better value
## `binwidth`.
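
A minimal sketch of the two remedies mentioned above, transparency and density curves. The `alpha` and `binwidth` values are arbitrary choices for illustration.

```r
library(ggplot2)

# Semi-transparent fills keep both groups visible where they overlap
p_overlay <- ggplot(faithful, aes(x = eruptions, fill = waiting > 60)) +
  geom_histogram(position = "identity", alpha = 0.4, binwidth = 0.25)

# Density curves avoid the binning choice altogether
p_density <- ggplot(faithful, aes(x = eruptions, colour = waiting > 60)) +
  geom_density()

p_overlay
p_density
```

Setting `binwidth` explicitly also silences the `stat_bin()` message shown above.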

For the fill and stack positions, see Section 4.5.6 (Change bars position).

  • change colors manually using scale
    Manual scales are helpful for consistency across plots in a report.
    This example shows both fill and outline color manual control.
ggplot(faithful) + 
  geom_histogram(aes(x = eruptions, color = eruptions < 3, fill = eruptions < 4)) +
  # fill colors, in the order FALSE, TRUE
  scale_fill_manual(values = c("#999999", "#E69F00")) +
  # outline colors, in the order FALSE, TRUE
  scale_color_manual(values = c("#E69F00", "#56B4E9"))

  • using scale brewer automatically
    Brewer palettes are widely used, especially for categorical palettes that are print-friendly.
ggplot(faithful) + 
  geom_histogram(aes(x = eruptions, color = eruptions < 3, fill = eruptions < 4)) +
  # fill palette
  scale_fill_brewer(palette = "Dark2") +
  # outline palette
  scale_color_brewer(palette = "Set1")
## `stat_bin()` using `bins = 30`. Pick better value
## `binwidth`.

  # using guide to change the color of legend key
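  • using gray colors with scale_*_grey()
    Grayscale is useful for black-and-white printing or when you want to avoid color emphasis.
# p has no fill aesthetic, so scale_fill_grey() changes nothing
p + scale_fill_grey()

# p maps colour, so the points switch to shades of grey
p + scale_color_grey()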

  • Gradient or continuous colors (the midpoint can be set to white)
    Continuous gradients are typically used for continuous variables.
    A diverging gradient is helpful when there is a meaningful midpoint.
# Color by cty values
sp2<-ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty))
sp2

# Change the low and high colors
# Sequential color scheme
sp2+scale_color_gradient(low="blue", high="red")

# Diverging color scheme
mid<-mean(mpg$cty)
sp2+scale_color_gradient2(midpoint=mid, low="blue", mid="white",
                          high="red"  )

4.5.5 Change point shape, transparency, and size

Transparency (alpha) helps when points overlap.
Shape and size can carry additional information, but too many aesthetics can overwhelm the reader, so use thoughtfully.

Make the points larger and slightly transparent.

ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting, shape= eruptions < 3, size=eruptions), color="steelblue", alpha=0.5)

Reversing legend order can improve interpretability when sizes represent magnitude.

# how to reverse the order of the size legend
ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting, shape = eruptions < 3, size = eruptions),
             color = "steelblue", alpha = 0.5) +
  scale_shape_manual(values = c(10, 23)) +
  # reverse = TRUE lists the largest size first
  guides(size = guide_legend(reverse = TRUE)) +
  theme(legend.position = "top")

4.5.6 Change bars position

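Position is one of the most practical controls for bar charts:

  • stack (the default) places groups on top of each other,
  • dodge places them side by side,
  • fill stacks them and normalizes each bar to proportions.

This block draws four panels; p1 (default) and p4 (explicit "stack") are identical, which confirms that stacking is the default.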

p <- ggplot(mpg, aes(fl, fill = drv))

p1 <- p + geom_bar ()

p2 <- p + geom_bar(position = "dodge")
 
p3 <-p + geom_bar(position = "fill")
 
p4 <-p + geom_bar(position = "stack")

library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, p3,p4, ncol=2)

4.5.7 Add text annotations

Text annotations can highlight key points or provide labels.
Be cautious with large datasets; consider ggrepel for label repulsion when needed.

ggplot(data=mpg[(1:100), ], aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty))+
  
  geom_text(aes(label = manufacturer ),
               size = 2, vjust = -1)  # vjust sets the label's vertical justification (its position), not a direction; -1 places it above the point
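
If installing ggrepel is not an option, one lightweight alternative (a swapped-in technique, not label repulsion) is the `check_overlap` argument of `geom_text()`, which simply skips labels that would overlap one drawn earlier. A minimal sketch, with `p_labels` as an illustrative name:

```r
library(ggplot2)

p_labels <- ggplot(mpg[1:100, ], aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty)) +
  # check_overlap = TRUE drops labels that would overlap an earlier label
  geom_text(aes(label = manufacturer), size = 2, vjust = -1,
            check_overlap = TRUE)

p_labels
```

The trade-off is that some points stay unlabelled, whereas ggrepel moves labels instead of dropping them.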

4.5.8 Add a line that (separates points)

Reference lines guide interpretation:

  • geom_abline() for a line defined by slope/intercept,
  • geom_hline() for a horizontal reference,
  • geom_vline() for a vertical reference.

These are often used to mark thresholds or decision boundaries.

ggplot(faithful) + 
  geom_point(aes(x = eruptions, y = waiting)) +
  # Add a line defined by slope and intercept
  geom_abline(slope = -13, intercept = 100, color = "red", linetype = "dashed") +
  # Add horizontal line at y = 20; change line type and color
  geom_hline(yintercept = 20, linetype = "dotted", color = "red") +
  # Add vertical line at x = 3; change line color and width
  # (linewidth replaces the deprecated size argument for lines)
  geom_vline(xintercept = 3, color = "blue", linewidth = 1.5)

# Add regression line
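
A minimal sketch of adding a regression line, assuming a simple linear fit of waiting on eruptions is what is wanted; `p_lm` is an illustrative name.

```r
library(ggplot2)

p_lm <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  # method = "lm" fits waiting ~ eruptions; se = TRUE draws the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red")

p_lm

# The same fit, computed directly, for reading off the coefficients
coef(lm(waiting ~ eruptions, data = faithful))
```

Spelling out `formula = y ~ x` avoids the message that geom_smooth() prints when it has to guess the formula.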

add segment and arrow
Segments and arrows are useful for pointing out specific regions of the plot.
This is common in presentations and teaching.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # Add a horizontal line segment; constants belong in annotate(), not aes()
  annotate("segment", x = 2, y = 15, xend = 3, yend = 15,
           linewidth = 3, color = "red") +
  # Add an arrow pointing at a region of interest
  annotate("segment", x = 3, y = 33, xend = 2.5, yend = 30,
           arrow = arrow(length = unit(0.5, "cm")))

fitted curve
A smooth curve shows the overall pattern.
You can also subset data to remove categories that behave differently, as shown here.

ggplot(data=mpg[mpg$fl!="c",], aes(x = displ, y = hwy)) +
  geom_point( ) +
  geom_smooth(color="red") #fitted curve
## `geom_smooth()` using method = 'loess' and formula =
## 'y ~ x'

Quantile regression (geom_quantile()) can show how relationships differ across conditional quantiles.
This is a useful exploratory tool when variance is not constant.

ggplot(data=mpg[mpg$fl!="c",], aes(x = displ, y = hwy)) +
  geom_point( ) + 
  geom_quantile() +
  geom_rug()+
  theme_minimal()
## Smoothing formula not specified. Using: y ~ x

customize line
This example shows how to control multiple aesthetics and manually customize scales.
It also demonstrates a very important rule: if the plot is too complicated, consider simplifying—readability matters more than showing every possible mapping.

ggplot(mpg, aes(x = displ, y = hwy, color = fl, linetype = fl, size = fl)) +
  geom_point() +
  geom_line() +
  # customize linetype, color, and size (one value per fl level: c, d, e, p, r)
  scale_linetype_manual(values = c("twodash", "dotted", "twodash", "dotted", "twodash")) +
  # breaks must match the levels of fl; labels rename them in the legend
  scale_color_manual(name = "What is fl",
                     breaks = c("c", "d", "e", "p", "r"),
                     labels = c("CNG", "diesel", "ethanol", "premium", "regular"),
                     values = c('#999999', '#E69F00', '#999999', '#E69F00', '#999999')) +
  scale_size_manual(values = seq(1, 4, 0.2)) +
  theme(legend.position = "top") +
  guides(color = "legend")

4.5.9 Using scale_ function

Every aesthetic mapping has an associated scale, even if you do not specify it.
Learning to use scale_*() functions is one of the fastest ways to level up plot quality.

  • using scale_colour_brewer()
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, colour = class)) + 
  scale_colour_brewer(type = 'qual')

RColorBrewer
This displays available palettes—useful when choosing a consistent palette for a report.

RColorBrewer::display.brewer.all()

  • using different palettes
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, colour = class)) + 
  scale_colour_brewer (palette = 'Paired')

  • showing cyl with size
    Combining color and size is common, but be mindful of legend clutter.
    Use breaks to simplify size legend.
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, colour =  class, size=cyl)) + 
  scale_colour_brewer(palette  = 'Set1' ) +
  scale_size (breaks = c(4,6))   

  • using guides to modify the scale_
    Guides can override legend aesthetics.
    This is useful when you want legend points to be clearly visible and consistent.
 ggplot(mpg) +
   geom_point(aes(x = displ, y = hwy, colour = cyl, size=cyl)) +
   # scale_colour_brewer(palette  = 'Set1') +   # Brewer palettes are discrete, so this fails for continuous cyl
   scale_size (breaks = c(4,5,6))  +
  
   guides( size = guide_legend(
            override.aes = list(color = c('red', 'blue', 'black'))))

  • unite legends when multiple aesthetics are mapped to the same variable
    This can reduce legend duplication and improve readability.
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, colour = cyl, size=cyl))   +
  guides(colour="legend")

Categorical variables also work
Discrete mapping works similarly, but some aesthetics (like size) may behave differently and can create awkward legends.

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, colour = fl, size=fl))   +
  guides(colour="legend") # mapping size to a discrete variable triggers the warning below
## Warning: Using size for a discrete variable is not advised.

  • x and y also have associated scales
    Scales can transform axes and control tick marks.
    This example demonstrates manual x breaks and log2 transformation for y.
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  scale_x_continuous(breaks = c(3.5, 5, 6)) + 
  scale_y_continuous(trans = 'log2')

4.5.10 Change coordinates

Coordinate transformations can change the story of a plot.
Use them intentionally—especially for polar charts, which can look impressive but may distort comparisons.

  • coord_polar
    This converts bar charts into a circular layout.
    Use with caution: it is harder to compare lengths in polar coordinates.
ggplot(mpg) + 
  geom_bar(aes(x = class)) + 
  coord_polar()

This version changes the polar mapping.
expand_limits() is used to create extra space.

ggplot(mpg) + 
  geom_bar(aes(x = class)) + 
  coord_polar(theta = 'y') + 
  expand_limits(y = 70)

  • specify the scale of coordinate
    You can set y limits and breaks.
    In practice, avoid truncating axes unless you have a strong reason and clearly indicate it.
require(scales)
## Loading required package: scales
## Warning: package 'scales' was built under R version 4.4.3
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
ggplot(mpg) + 
  geom_bar(aes(x = class)) + 
  scale_y_continuous(limits = c(0, 50), breaks = seq(0, 50, 10)) 
## Warning: Removed 1 row containing missing values or values
## outside the scale range (`geom_bar()`).

  # scale_y_continuous(labels = percent) # labels as percents
  # +
  # scale_x_discrete(labels=c(1:7)  )
  • using coord_cartesian() to zoom in
    coord_cartesian() zooms without dropping data outside the viewing range.
    This is usually preferable to setting scale limits, which can remove data.
# with scale limits, out-of-range data are removed before counting
ggplot(mpg) + 
  geom_bar(aes(x = class)) + 
  scale_y_continuous(limits = c(0, 30)) +
  scale_x_discrete(limits = c("midsize", "compact"))
## Warning: Removed 146 rows containing non-finite outside the
## scale range (`stat_count()`).
## Warning: Removed 2 rows containing missing values or values
## outside the scale range (`geom_bar()`).

This version zooms into y limits while keeping the underlying data.

ggplot(mpg) + 
  geom_bar(aes(x = class)) + 
  coord_cartesian( ylim = c(0, 30))+
  scale_x_discrete(limits = c("midsize", "compact"))
## Warning: Removed 146 rows containing non-finite outside the
## scale range (`stat_count()`).
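
The difference matters most when a stat is involved, because scale limits change the data the stat sees. The sketch below compares group means under both approaches; all object names are illustrative, and the 10 to 30 range is an arbitrary choice.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = drv, y = hwy)) +
  stat_summary(fun = mean, geom = "point")

# Scale limits: hwy values outside 10-30 become NA before the mean is computed
with_scale_limits <- base_plot + scale_y_continuous(limits = c(10, 30))

# coord_cartesian(): only the viewport changes, the mean uses all rows
with_coord_zoom <- base_plot + coord_cartesian(ylim = c(10, 30))

means <- data.frame(
  drv          = sort(unique(mpg$drv)),
  all_data     = layer_data(base_plot)$y,
  scale_limits = suppressWarnings(layer_data(with_scale_limits)$y),
  coord_zoom   = layer_data(with_coord_zoom)$y
)
means
```

The `coord_zoom` column matches `all_data`, while `scale_limits` is pulled down wherever high-mileage cars were dropped.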

  • reverse direction of axes
    Axis reversal is useful for certain interpretive effects (e.g., ranking charts).
    Be careful when combining transformations: a plot keeps only one scale per axis, so adding scale_y_continuous(trans = "log2") after scale_y_reverse() would replace the reversal rather than combine with it.
ggplot(mpg) + 
  geom_point(aes(x = hwy, y = displ)) +
  scale_x_continuous(breaks = c(20, 30, 35, 40)) + 
  scale_y_reverse()

# other transformations: scale_y_log10(), scale_y_sqrt(), scale_y_continuous(trans = "log2")

4.5.11 Customize axis ticks

Small axis choices (angle, size, color) can significantly improve readability.
This is especially important when plots are used in slides or printed reports.

  • change axis text font, color, size, angle using theme
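# theme() styles non-data elements (fonts, angles, colors); scale_*() controls how data map to aesthetics; guides() controls how legends and axes are drawn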
ggplot(mpg) + 
  geom_point(aes(x = hwy, y = displ))+
  
  theme(axis.text.x = element_text(face="bold", color="#993333", 
                           size=14, angle=45),
          axis.text.y = element_text(face="bold", color="blue", 
                           size=7, angle=90))

  • remove axis ticks and tick labels
    Sometimes minimal plots are more effective—especially when the audience already understands the scale or when the plot is purely illustrative.
ggplot(mpg) + 
  geom_point(aes(x = hwy, y = displ))+
theme(
  axis.text.x = element_blank(), # Remove x axis tick labels
  axis.text.y = element_blank(), # Remove y axis tick labels
  axis.ticks = element_blank()) # Remove ticks

4.5.12 Flip and reverse plot

Flipping and reversing can help match how people naturally read categories (top-to-bottom ranking charts) or fit long labels.

  • boxplot and violin
    Violin plots show distributions; boxplots show summary statistics.
    Overlaying both gives a compact distribution + summary view.
ggplot(mpg) + 
  # keep width below 1 so neighbouring violins do not overlap
  geom_violin(aes(x = as.factor(cyl), y = hwy, color = as.factor(cyl)), trim = FALSE, width = 0.9) +
  geom_boxplot(aes(x = as.factor(cyl), y = hwy), notch = FALSE, width = 0.1)

  • dotplot
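    Dotplots can show distributions with less emphasis on smooth density.
    Adding a mean and SD summary helps anchor interpretation. Note that mean_sdl() (which requires the Hmisc package) draws mean ± 2 SD by default; pass fun.args = list(mult = 1) for mean ± 1 SD.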
ggplot(mpg ,aes(x = as.factor(cyl), y=hwy ) )+ 
  geom_dotplot(aes(color =as.factor(cyl), fill = as.factor(cyl)), 
                 binaxis = "y", stackdir = "center") + 
  stat_summary(fun.data="mean_sdl" )  
## Bin width defaults to 1/30 of the range of the data.
## Pick better value with `binwidth`.

  • errorbar
    This section builds a grouped summary dataset (df3) and then plots error bars.
    Error bars are useful for showing variability around means, but always clarify whether it is SD, SE, or CI.
df3 <- data_summary(mpg, varname="hwy", grps= c("cyl" ))
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
## 
##     mutate
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
head(df3)
cyl hwy sd
4 28.80247 4.515030
5 28.75000 0.500000
6 22.82278 3.685590
8 17.62857 3.262307
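
data_summary() is not defined in this section. As a rough stand-in, the sketch below computes the same three columns with dplyr; `data_summary_dplyr` is a hypothetical name and is not the helper used above.

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset

# Hypothetical stand-in for data_summary(): mean and SD of one variable
# per group. The mean is stored under the variable's own name so the
# result has the columns cyl, hwy, sd.
data_summary_dplyr <- function(data, varname, grps) {
  data %>%
    group_by(across(all_of(grps))) %>%
    # dplyr:: is explicit because plyr, if attached, masks summarise()
    dplyr::summarise(
      sd = sd(.data[[varname]], na.rm = TRUE),           # SD first, from raw values
      "{varname}" := mean(.data[[varname]], na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::select(all_of(c(grps, varname)), sd)
}

df3 <- data_summary_dplyr(mpg, varname = "hwy", grps = "cyl")
df3
```

Because this version needs only dplyr, it also avoids the plyr masking messages shown above.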

This plot uses line + error bars.
Grouping is set via aes(group = 0) as a simple way to connect points in one line.

ggplot( df3, aes(as.factor(cyl) ,  (hwy), ymin = hwy-sd, ymax = hwy+sd)  ) +
  geom_line(aes(group = 0 )) +
  geom_errorbar(aes(color = as.factor(cyl) ),width = 0.2) 

This version uses bars plus error bars, then flips coordinates for readability.

ggplot( df3, aes(as.factor(cyl) ,  (hwy), ymin = hwy-sd, ymax = hwy+sd)  ) +
  geom_bar(aes(fill = as.factor(cyl)), stat = "identity",
             position = "dodge") +
  geom_errorbar(aes(   ),width = 0.2) +
  coord_flip()

using original data
Instead of precomputing summaries, you can also rely on stat_summary() directly on raw data.
This is convenient and reduces manual preprocessing.

  ggplot(mpg, aes(cyl, hwy)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar")
## No summary function supplied, defaulting to
## `mean_se()`
## No summary function supplied, defaulting to
## `mean_se()`

flip

ggplot(mpg) + 
  geom_bar(aes(x = class)) +
  coord_flip()

reverse
Be cautious: reversing y for bar charts can confuse readers unless you have a strong reason.

ggplot(mpg) + 
  geom_bar(aes(x = class)) +
  scale_y_reverse()

4.5.13 Create stats

Every geom has a default stat. You can override the stat when the layer needs a different computation.

Understanding stats is a major ggplot skill:

  • bar charts default to counts (stat_count),
  • histograms default to bin counts,
  • smoothing defaults to fitting a curve,
  • stat_summary() computes summary measures.

Using original data (counts for cyl).

ggplot(mpg ) + 
  geom_bar (aes(x = cyl ),position = 'identity' ) #using original data

  • using transformed variables
    This section sets up data transformation patterns (commented out).
    The main idea: if you compute summary statistics yourself, you must use stat='identity' to plot them.
library(dplyr)
library(ggplot2)
# mpg_counted <-    
#   count(mpg,  cyl  )
# head(mpg_counted)

# ggplot(mpg_counted) + 
#   geom_smooth(aes(x =  cyl , y = n)) +
#   geom_bar (aes(x =  cyl , y = n), stat = 'identity') #using summary data 
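
A runnable sketch of the pre-summarized pattern: count first, then plot the counts as given. `geom_col()` is ggplot2's shorthand for `geom_bar(stat = "identity")`; the object names are illustrative.

```r
library(dplyr)
library(ggplot2)

# Pre-compute the counts; dplyr:: is spelled out because plyr, if
# attached, masks count()
mpg_counted <- dplyr::count(mpg, cyl)

# stat = "identity" tells geom_bar() to draw n as given
p_identity <- ggplot(mpg_counted) +
  geom_bar(aes(x = cyl, y = n), stat = "identity")

# geom_col() is shorthand for geom_bar(stat = "identity")
p_col <- ggplot(mpg_counted) +
  geom_col(aes(x = cyl, y = n))

p_col
```
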
  • using the after_stat() function inside aes().
    This computes proportions directly inside the plot call, which is very convenient.
    Here we plot class proportions instead of raw counts.
require(scales)
ggplot(mpg) + 
  geom_bar(aes(x = class, y = after_stat(  count / sum(count))))+
  scale_y_continuous(labels = percent) # labels decimals as percents

  • using density geometric in histogram
    Overlaying histogram density and a smoothed density curve is a classic approach to show distribution shape.
ggplot(mpg,aes(x = hwy)) + 
  geom_histogram(aes(y = after_stat(density))) +  # after_stat() replaces the deprecated ..density.. notation
  geom_density(  ) 
## `stat_bin()` using `bins = 30`. Pick better value
## `binwidth`.

  • Use stat_summary() to add the mean of hwy for each group
    STAT vs. GEOM is a practical concept:
    • geom_* draws shapes,
    • stat_* computes something first (means, SE, CI, etc.) and then draws.

This panel shows different summary layers and a 95% interval by setting mult = 1.96.

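# p1: mean ± SE drawn explicitly as a point range
p1 <- ggplot(mpg, aes(x = class, y = hwy)) + 
  stat_summary(
    geom = "pointrange",
    fun.data = mean_se
  )

# p2: the defaults give the same result (mean_se() with a point range)
p2 <- ggplot(mpg, aes(x = class, y = hwy)) + 
  stat_summary()

# p3: default summary plus error bars at mean ± 1.96 SE (approximate 95% interval)
p3 <- ggplot(mpg, aes(x = class, y = hwy)) + 
  stat_summary() +
  stat_summary(
    fun.data = ~mean_se(., mult = 1.96), # Increase `mult` value for bigger interval!
    geom = "errorbar"
  )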

library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.3
p1+p2+p3
## No summary function supplied, defaulting to
## `mean_se()`
## No summary function supplied, defaulting to
## `mean_se()`

Overlaying the group mean on top of raw points is a great way to show both individual variability and group central tendency.

ggplot(mpg) + 
  # geom_point() has no width parameter; use geom_jitter() to spread points
  geom_point(aes(x = class, y = hwy)) +
  stat_summary(aes(x = class, y = hwy), geom = "point", color = "red", size = 4)
## No summary function supplied, defaulting to
## `mean_se()`

jitter points
Jittering helps reduce overplotting when x is discrete.
This is one of the most common practical improvements for categorical scatter plots.

ggplot(mpg) + 
  geom_jitter(aes(x = class, y = hwy), width = 0.2)+
  stat_summary(aes( x = class,y = hwy), geom="point",color="red",size=4)
## No summary function supplied, defaulting to
## `mean_se()`

4.5.14 Facets

Facets are one of ggplot2’s strongest features. They let you compare the same relationship across groups.

facet_wrap() wraps a one-dimensional sequence of panels into a rectangular layout. facet_grid() arranges panels in rows and/or columns defined by one or two variables.

  • share the axes between the different panels
    Shared axes are helpful for direct comparison across panels.
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class)

Facet grid with rows: drv by row.

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_grid(drv~ .  )

Facet grid with columns: drv by column.

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_grid(~ drv  )

Facet by two variables at once.
This is powerful but can create many small panels; use only when needed.

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_grid(year ~ drv)

  • do not share the axes between the different panels
    Free scales can make each panel readable but can harm comparability across panels.
    Use free scales when ranges differ drastically across groups.
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ drv , scales = "free")

only free y axes
This is a compromise: preserve x comparability but allow y to adjust per group.

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ drv , scales = "free_y")

  • adjust y scale (space) between the panels
    For bar charts with many categories, allowing free space can prevent labels from being squeezed.
ggplot(mpg) + 
  geom_bar(aes(y = manufacturer)) + 
  facet_grid(class ~ .)

This version frees y spacing and y scale to improve panel readability.

ggplot(mpg) + 
  geom_bar(aes(y = manufacturer)) + 
  facet_grid(class ~ ., space = "free_y", scales = "free_y")

  • display by adding multiple variables together
    Facetting by both year and drive type in one call.
    This is useful when you want a grid of comparisons.
ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(year ~ drv)

4.5.15 Theme

Themes control the overall appearance: background, grid lines, fonts, and spacing.

theme_minimal
Minimal themes are clean for slides and modern reports.

ggplot(mpg) + 
  geom_bar(aes(y = class)) + 
  facet_wrap(~year) + 
  theme_minimal()

Further adjustments
theme_bw is a common default for academic-style plots.
This block demonstrates how to customize facet strip text, captions, and grid lines.

ggplot(mpg) + 
  geom_bar(aes(y = class)) + 
  facet_wrap(~year) + 
  
  labs(title = "Number of car models per class",
       caption = "source: http://fueleconomy.gov",
       x = NULL,
       y = NULL) +
   
  theme_bw() + 
  theme(
 
    strip.text = element_text(face = 'bold', hjust = 0),
    
    plot.caption = element_text(face = 'italic'),
    
    panel.grid.major = element_line('white', linewidth = 0.5),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
    # ,
    # panel.ontop = TRUE
  )

theme_classic
Classic themes remove background grids and look clean for publication.
This example also shows manual fill scales with custom legend labels.

ggplot(mpg) + 
  geom_bar(aes(y = class, fill = drv)  ,position = "dodge") + 
  facet_wrap(~year) + 
  labs(title = "Number of car models per class",
       caption = "source: http://fueleconomy.gov",
       x = 'Number of cars',
       y = NULL)+
  
  scale_fill_manual(name = "Drive Models",values=c("black", "grey50",  "grey80"), labels = c("4w","Fw","Rw" )) +
  
  # scale_x_continuous(expand = c(0, NA)) + 
  
  theme_classic() + 
  theme(
    # text = element_text('Avenir Next Condensed'),
    # strip.text = element_text(face = 'bold', hjust = 0),
    
    plot.caption = element_text(face = 'italic'),
    
    panel.grid.major = element_line('white', linewidth = 0.5),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
    
    # panel.ontop = TRUE)
  )

  • using ggthemes
    Themes from ggthemes provide recognizable styles.
    Use them when consistent branding is desired.
library(ggthemes)
ggplot(mpg) + 
  geom_bar(aes(y = class, fill = drv)  ,position = "dodge") + 
  facet_wrap(~year) + 
  labs(title = "Number of car models per class",
       caption = "source: http://fueleconomy.gov",
       x = 'Number of cars',
       y = NULL)+
  
  ggthemes::theme_economist()

  • customized theme
    This block demonstrates a wide range of theme elements you can control:
  • legend background, key, position,
  • panel background and border,
  • plot background,
  • axis line, ticks, and text rotation,
  • facet strip appearance,
  • panel spacing.

In practice, you typically standardize a simpler theme, but it is valuable to see what is possible.

ggplot(mpg) + 
  geom_bar(aes(y = class, fill = drv)  ,position = "dodge") + 
  facet_wrap(~year) + 
  labs(title = "Number of car models per class",
       caption = "source: http://fueleconomy.gov",
       x = 'Number of cars',
       y = NULL)+
  
theme(
  
    # 1 change legend
    legend.background = element_rect(
      fill = "#fff6c2",
      color = "black",
      linetype = "dashed"
    ),
    legend.key = element_rect(fill = "grey", color = "brown"),
    legend.position = "bottom", 
    
    # 2 change panel (middle area) background 
    panel.background = element_rect(
      fill = "#005F59",
      color = "red", linewidth = 3
    ),
    panel.border = element_rect(
      color = "black",
      fill = "transparent",
      linetype = "dashed", linewidth = 3
    ),
    
    # 3 change plot background
    plot.background = element_rect(
      fill = "#a1dce9",
      color = "black",
      linewidth = 1.3
    ),
    
    # 4 change axis elements
    # (linewidth replaces the deprecated size argument for lines and borders)
    axis.line = element_line(color = "orange", linewidth = 2),
    axis.title = element_text(color = "red", face = "italic"),
    axis.ticks = element_line(color = "purple", linewidth = 3),
    axis.text = element_text(color = "blue"),
    axis.text.x = element_text(angle = 45, hjust = 1), 
    
    # 5 change facet panel
    strip.background = element_rect(fill = "orange"),
    strip.text = element_text(color = "red"),
    panel.spacing = unit(0.3, "inch") 
  )
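
To standardize a simpler theme across a report, one option is to wrap the shared settings in a small function. This is only a sketch: `theme_report()` is a hypothetical name, and the specific settings are illustrative choices borrowed from the earlier examples.

```r
library(ggplot2)

# Hypothetical house theme; adjust the elements to match your own style
theme_report <- function(base_size = 12) {
  theme_bw(base_size = base_size) +
    theme(
      legend.position  = "bottom",
      panel.grid.minor = element_blank(),
      strip.text       = element_text(face = "bold", hjust = 0),
      plot.caption     = element_text(face = "italic")
    )
}

p_report <- ggplot(mpg) +
  geom_bar(aes(y = class, fill = drv), position = "dodge") +
  facet_wrap(~year) +
  theme_report()

p_report
```

Every plot that ends with `+ theme_report()` then shares the same look, and a style change needs to be made in one place only.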

4.5.16 How to setup subscripts or superscripts

Scientific writing often needs subscripts/superscripts (e.g., CO_2, R²).
This link provides a focused guide for that formatting in R Markdown and ggplot text elements.

see here
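
As a brief sketch of the idea in ggplot2 labels, R's plotmath syntax can be passed to labs() through expression() or bquote(): square brackets give a subscript and `^` gives a superscript. The label text and object names below are illustrative.

```r
library(ggplot2)

# R-squared of a simple linear fit, used only to have a number to display
r2 <- summary(lm(waiting ~ eruptions, data = faithful))$r.squared

x_lab     <- expression(t[eruption] ~ "(min)")   # [ ] gives a subscript
y_lab     <- expression(t[waiting] ~ "(min)")
title_lab <- bquote(R^2 == .(round(r2, 2)))      # ^ gives a superscript; .() splices in a value

p_math <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  labs(x = x_lab, y = y_lab, title = title_lab)

p_math
```

Use bquote() when part of the label is computed, and expression() when the label is fixed.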

4.6 How to create advanced plots

Once you master the basics (data → aes → geoms → scales → themes), advanced plots become much easier.
This link collects more advanced patterns and examples.

see here