R for data analysis: beginner level

Course Outline

This course is designed for people with no prior experience in data analysis. Over 18 hours, split into six 3-hour sessions, it covers loading datasets, computing basic statistics, and creating data visualizations in R.

Session 1: Introduction to R (3 hours)

1.1 Getting Started with R and RStudio

Introduction to R and RStudio

R is a programming language and software environment designed for statistical computing and data analysis. It is widely used by statisticians, data scientists, and researchers for its data manipulation capabilities, broad range of statistical techniques, and graphical tools.

Key Features of R:

Statistical Analysis:
- R provides a wide range of statistical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
Data Manipulation:
- R includes robust tools for data manipulation and transformation. Packages like dplyr and tidyr allow for efficient data wrangling.
Data Visualization:
- R excels in data visualization. The ggplot2 package, based on the Grammar of Graphics, allows users to create complex and elegant visualizations with ease.
Reproducible Research:
- R supports reproducible research with tools like R Markdown, which integrates code and text in a single document, making it easy to share analysis and results.
Extensible:
- R is highly extensible through packages. The Comprehensive R Archive Network (CRAN) hosts thousands of packages that extend R’s functionality for various domains, from bioinformatics to finance.
Community Support:
- R has a large and active community. Numerous resources, forums, and user-contributed documentation are available, facilitating learning and problem-solving.
Integration:
- R can integrate with other languages and systems. It can call C, C++, and Fortran code and is also capable of interacting with databases and web services.

Installing R and RStudio

Basic RStudio interface

Figure: RStudio basic panes (source: https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html)

1.2 Basic R Syntax

Variables and data types

Introduction to different data types:

Numeric: Numbers, which can be either integers or floating-point.
Character: Text or string values.
Logical: Boolean values, either TRUE or FALSE.

Creating variables:

# Creating a numeric variable
num_var <- 42

# Creating a character variable
char_var <- "Hello, R!"

# Creating a logical variable
log_var <- TRUE

# Display the variables
num_var

## [1] 42

char_var

## [1] "Hello, R!"

log_var

## [1] TRUE
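
To check which type a value has, base R's class() function can be used. A minimal sketch (the variables from above are recreated so the block runs on its own):

```r
num_var <- 42
char_var <- "Hello, R!"
log_var <- TRUE

class(num_var)   # "numeric"
class(char_var)  # "character"
class(log_var)   # "logical"

# Whole numbers are stored as floating-point unless the L suffix is used
class(42L)       # "integer"
```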

Exercises:

Create a variable age and assign it your age.
```
# Creating a numeric variable
age <- 25
```

Create a variable name and assign it your name.

# Creating a character variable
name <- "Pierre"

Create a variable is_student and assign it a logical value indicating whether you are a student or not.
```
# Creating a logical variable
is_student <- FALSE
```

Showing the results:

# Display the variables
age

## [1] 25

name

## [1] "Pierre"

is_student

## [1] FALSE

Basic operations

Performing arithmetic operations:

# Addition
3 + 5

## [1] 8

# Subtraction
10 - 4

## [1] 6

# Multiplication
6 * 7

## [1] 42

# Division
20 / 4

## [1] 5

# Exponentiation
2^3

## [1] 8

Logical operations:

# Equality
3 == 3

## [1] TRUE

# Inequality
5 != 4

## [1] TRUE

# Greater than
7 > 5

## [1] TRUE

# Less than
2 < 6

## [1] TRUE

# Greater than or equal to
4 >= 4

## [1] TRUE

# Less than or equal to
3 <= 4

## [1] TRUE

# Logical AND
TRUE & FALSE

## [1] FALSE

# Logical OR
TRUE | FALSE

## [1] TRUE
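
Comparisons are often stored in variables and then combined. The sketch below is only an illustration: the value of age and the thresholds are made up, and it also uses the ! (NOT) operator, which is not listed above.

```r
age <- 25  # hypothetical value

is_adult <- age >= 18
is_senior <- age >= 65

# Adult AND NOT senior
is_adult & !is_senior   # TRUE

# Adult OR senior
is_adult | is_senior    # TRUE
```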

Using basic functions:

# Sum of numbers
sum(1, 2, 3)

## [1] 6

# Mean of numbers
mean(c(1, 2, 3, 4, 5))

## [1] 3

# Square root
sqrt(16)

## [1] 4
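
Real datasets often contain missing values, which R writes as NA. By default, functions such as mean() and sum() return NA when any input is missing; the na.rm = TRUE argument asks them to ignore missing values. A small sketch with made-up numbers:

```r
values <- c(1, 2, NA, 4)  # NA marks a missing value

mean(values)                 # NA
mean(values, na.rm = TRUE)   # 2.333333
sum(values, na.rm = TRUE)    # 7
```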

Exercises:

Perform addition, subtraction, multiplication, and division on two numeric variables you create.

# Create x and y
x <- 1
y <- 2

x + y

## [1] 3

x - y

## [1] -1

x*y

## [1] 2

x/y

## [1] 0.5

Check if the number 10 is greater than 5 and print the result.

# Check if the number 10 is greater than 5
result <- 10 > 5

# Print the result
print(result)

## [1] TRUE

Calculate the mean of the numbers 4, 8, 15, 16, 23, 42.
```
mean(c(4, 8, 15, 16, 23, 42))
```
```
## [1] 18
```

Writing and running scripts

Creating and executing R scripts within RStudio:

Open RStudio.
Create a new script by clicking on File -> New File -> R Script.
Write your R code in the script editor.
Save your script with a .R extension.
To run the script, highlight the code and click the Run button, or use the Ctrl+Enter shortcut.
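
A saved script can also be run in full from the console with base R's source() function. The sketch below is only an illustration: it writes a small script to a temporary file (a stand-in for a file you saved from the editor) and then runs it.

```r
# Write a three-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 10", "y <- 5", "total <- x + y"), script_path)

# Run every line of the script, in order
source(script_path)

total  # 15
```

With your own script, you would pass its file name instead, for example source("example_script.R"), assuming the file is in the working directory.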

Exercises:

Create a script that assigns two numbers to variables and prints their sum, difference, product, and quotient.
Save the script and run it in RStudio.

Example script (example_script.R):

# This is a comment
# Assign values to variables
x <- 10
y <- 5

# Perform arithmetic operations
total <- x + y
difference <- x - y
product <- x * y
quotient <- x / y

# Print the results
print(total)

## [1] 15

print(difference)

## [1] 5

print(product)

## [1] 50

print(quotient)

## [1] 2

1.3 Working with Vectors and Data Frames

Creating and Manipulating Vectors

Understanding vectors:

Vectors are one-dimensional arrays that can hold numeric, character, or logical data. They are the simplest type of data structure in R and are extremely useful for storing sequences of values.

Creating vectors:

You can create vectors using the c() function, which stands for “combine” or “concatenate.”

# Numeric vector
num_vector <- c(1, 2, 3, 4, 5)

# Character vector
char_vector <- c("apple", "banana", "cherry")

# Logical vector
log_vector <- c(TRUE, FALSE, TRUE)
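
A vector holds a single data type. If types are mixed inside c(), R converts all elements to a common type. A short sketch of this behaviour:

```r
# Numbers and logical values become text when combined with a string
mixed <- c(1, "apple", TRUE)
mixed         # "1" "apple" "TRUE"
class(mixed)  # "character"

# Logical values become numbers when combined with numbers
c(1, TRUE, FALSE)  # 1 1 0
```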

Creating sequences:

You can create sequences of numbers with the seq() function and repeat values with the rep() function.

# Sequence from 1 to 10
seq(1, 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence from 1 to 10 with step size 2
seq(1, 10, by = 2)

## [1] 1 3 5 7 9

# Repeat a value
rep(5, times = 3)

## [1] 5 5 5
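
A few common variations, shown as a sketch: the colon operator is a shortcut for steps of 1, length.out asks seq() for a given number of values, and rep() can repeat a whole vector or each element.

```r
# Shortcut for a sequence with step 1
1:5                          # 1 2 3 4 5

# Five evenly spaced values between 0 and 1
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# Repeat the whole vector twice
rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"

# Repeat each element twice
rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"
```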

Subsetting vectors:

You can access specific elements of a vector using square brackets [].

# Access the second element
num_vector[2]

## [1] 2

# Access multiple elements
num_vector[c(1, 3, 5)]

## [1] 1 3 5
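
Square brackets also accept ranges, negative indices (to drop elements), and logical conditions. A short sketch, recreating num_vector so the block runs on its own:

```r
num_vector <- c(1, 2, 3, 4, 5)

# Elements 2 to 4
num_vector[2:4]             # 2 3 4

# Everything except the first element
num_vector[-1]              # 2 3 4 5

# Only the elements greater than 3
num_vector[num_vector > 3]  # 4 5
```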

Vectorized operations:

R allows you to perform operations on entire vectors without the need for explicit loops. This is called vectorization and makes your code more efficient and concise.

# Add 2 to each element
num_vector + 2

## [1] 3 4 5 6 7

# Multiply each element by 3
num_vector * 3

## [1]  3  6  9 12 15
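
Vectorization also applies between two vectors of the same length: the operation is carried out element by element. The sketch below uses made-up prices and quantities and compares the vectorized form with an explicit loop.

```r
prices <- c(10, 20, 30)     # hypothetical values
quantities <- c(1, 2, 3)    # hypothetical values

# Vectorized: one line, element by element
totals <- prices * quantities
totals  # 10 40 90

# Equivalent explicit loop
totals_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  totals_loop[i] <- prices[i] * quantities[i]
}

identical(totals, totals_loop)  # TRUE
```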

Exercises:

Create a numeric vector with the numbers 1 to 10.
```
vec <- c(1,2,3,4,5,6,7,8,9,10)
```
Create a character vector with the names of three fashion brands.
```
brands <- c("Gucci", "Nike", "Adidas")
```

Access the third element of the numeric vector and print it.

# Access the third element
vec3 <- vec[3]

print(vec3)

## [1] 3

Add 10 to each element of the numeric vector and print the result.

# Add 10 to each element
vec10 <- vec + 10

print(vec10)

##  [1] 11 12 13 14 15 16 17 18 19 20

Introduction to data frames

Creating data frames:

# Create a data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, TRUE)
)

# Display the data frame
print(df)

##      name age is_student
## 1   Alice  25       TRUE
## 2     Bob  30      FALSE
## 3 Charlie  35       TRUE

Accessing rows and columns:

# Access a column
df$name

## [1] "Alice"   "Bob"     "Charlie"

# Access a row
df[1, ]

##    name age is_student
## 1 Alice  25       TRUE

# Access a specific element
df[2, "age"]

## [1] 30
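
Some other ways to look inside a data frame, shown as a sketch on the same example data (recreated here so the block runs on its own):

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, TRUE)
)

# Access a column by name with double brackets (same result as df$age)
df[["age"]]             # 25 30 35

# Keep several columns
df[, c("name", "age")]

# Keep several rows
df[2:3, ]

# Number of rows and columns
nrow(df)  # 3
ncol(df)  # 3
```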

Basic manipulations:

# Add a new column
df$height <- c(5.5, 6.0, 5.8)

# Remove a column
df$height <- NULL
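
New columns can also be computed from existing ones, and rows can be appended with rbind(). A minimal sketch; the column age_in_months and the person "Dana" are made up for illustration.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35)
)

# Add a column computed from an existing one
df$age_in_months <- df$age * 12

# Append a row; the new row needs the same column names
new_row <- data.frame(name = "Dana", age = 28, age_in_months = 28 * 12)
df <- rbind(df, new_row)

nrow(df)  # 4
```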

Exercises:

Create a data frame with three columns (name, age, and is_student) and fill it with your own information and that of your neighbors.

# Create a data frame
df <- data.frame(
  name = c("Pierre"),
  age = c(25),
  is_student = c(FALSE)
)

# Display the data frame
print(df)

##     name age is_student
## 1 Pierre  25      FALSE

Add a new column grade with some values.
```
# Add a new column
df$grade <- c(20)
```

Access and print the age column.

# Access a column
age <- df$age

print(age)

## [1] 25

Access and print the first row.

# Access a row
firstrow <- df[1, ]

print(firstrow)

##     name age is_student grade
## 1 Pierre  25      FALSE    20

Basic data frame operations

Sorting data frames:

# Sort by age
sorted_df <- df[order(df$age), ]
print(sorted_df)

##     name age is_student grade
## 1 Pierre  25      FALSE    20
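
With a single row, sorting has no visible effect. The sketch below uses a small made-up data frame (the names and ages are illustrative only) to show ascending and descending order.

```r
people <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(30, 25, 35)  # hypothetical values, deliberately unsorted
)

# Ascending order of age: Bob, Alice, Charlie
youngest_first <- people[order(people$age), ]
print(youngest_first)

# Descending order of age: Charlie, Alice, Bob
oldest_first <- people[order(people$age, decreasing = TRUE), ]
print(oldest_first)
```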

Filtering data frames:

# Filter students
students <- df[df$is_student == TRUE, ]
print(students)

## [1] name       age        is_student grade     
## <0 rows> (or 0-length row.names)
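
The result above is empty because the only row is not a student. A sketch on a three-row data frame (illustrative values) shows filtering with one condition, with two conditions combined by &, and with base R's subset() function as an alternative.

```r
people <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, TRUE)
)

# Students only: Alice and Charlie
students <- people[people$is_student, ]

# Students older than 30: Charlie
older_students <- people[people$is_student & people$age > 30, ]

# Non-students, using subset(): Bob
non_students <- subset(people, !is_student)

print(students)
print(older_students)
print(non_students)
```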

Summarizing data frames:

# Summary statistics
summary(df)

##      name                age     is_student          grade   
##  Length:1           Min.   :25   Mode :logical   Min.   :20  
##  Class :character   1st Qu.:25   FALSE:1         1st Qu.:20  
##  Mode  :character   Median :25                   Median :20  
##                     Mean   :25                   Mean   :20  
##                     3rd Qu.:25                   3rd Qu.:20  
##                     Max.   :25                   Max.   :20

Exercises

Sort the data frame by the grade column and print the result.

# Sort by grade
sorted_df <- df[order(df$grade), ]
print(sorted_df)

##     name age is_student grade
## 1 Pierre  25      FALSE    20

Filter the data frame to include only students older than 20 and print the result.

# Filter students older than 20
students <- df[df$is_student == TRUE, ]
filteredstudents <- students[students$age > 20, ]
print(filteredstudents)

## [1] name       age        is_student grade     
## <0 rows> (or 0-length row.names)

Summarize the data frame and print the summary.

df_summary <- summary(df)
print(df_summary)

##      name                age     is_student          grade   
##  Length:1           Min.   :25   Mode :logical   Min.   :20  
##  Class :character   1st Qu.:25   FALSE:1         1st Qu.:20  
##  Mode  :character   Median :25                   Median :20  
##                     Mean   :25                   Mean   :20  
##                     3rd Qu.:25                   3rd Qu.:20  
##                     Max.   :25                   Max.   :20

Session 2: Data Import and Export (3 hours)

2.0 Setting up working directory

Before loading data, it’s important to set the working directory. This tells R where to look for files on your computer. You can set the working directory to the folder where your data files are stored.

# Set the working directory
# Replace the path below with the path to your own directory
setwd("/Users/pierrebeaucoral/Documents/Pro/Cours GPE")

# Verify the working directory
getwd()

## [1] "/Users/pierrebeaucoral/Documents/Pro/Cours GPE"
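
Once the working directory is set, files can be referred to with paths relative to it. The sketch below builds such a path with file.path() and checks whether the file is there before trying to read it. The folder and file names mirror the ones used later in this session; whether they exist on your computer is an assumption.

```r
# Build a relative path in a platform-independent way
data_file <- file.path("Data", "SCIM.csv")
data_file  # "Data/SCIM.csv"

# TRUE only if the file is present under the working directory
file.exists(data_file)

# List the files in the Data folder (empty result if the folder is missing)
list.files("Data")
```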

2.1 Loading Data from Files

A CSV (Comma Separated Values) file is a plain text file that contains data separated by commas. It’s a common format for data exchange. We will use the `readr` package to read a CSV file into R.

# Load required package
library(readr)

# Read CSV file
data <- read_csv("./Data/SCIM.csv")

## Rows: 20580 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): SUBJECT, Sujet, LOCATION, Pays, MEASURE, Mesure, FREQUENCY, Fréque...
## dbl  (2): PowerCode Code, Value
## lgl  (4): Reference Period Code, Reference Period, Flag Codes, Flags
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows of the data
head(data)

## # A tibble: 6 × 19
##   SUBJECT  Sujet   LOCATION Pays  MEASURE Mesure FREQUENCY Fréquence TIME  Temps
##   <chr>    <chr>   <chr>    <chr> <chr>   <chr>  <chr>     <chr>     <chr> <chr>
## 1 XTEXVA01 Export… ISR      Isra… CXMLSA  Monna… Q         Trimestr… 2022… T2-2…
## 2 XTEXVA01 Export… ISR      Isra… CXMLSA  Monna… Q         Trimestr… 2022… T3-2…
## 3 XTEXVA01 Export… ISR      Isra… CXMLSA  Monna… Q         Trimestr… 2022… T4-2…
## 4 XTEXVA01 Export… ISR      Isra… CXMLSA  Monna… Q         Trimestr… 2023… T1-2…
## 5 XTEXVA01 Export… ISR      Isra… CXMLSA  Monna… Q         Trimestr… 2023… T2-2…
## 6 XTEXVA01 Export… ISR      Isra… CXMLSA  Monna… Q         Trimestr… 2023… T3-2…
## # ℹ 9 more variables: `Unit Code` <chr>, Unit <chr>, `PowerCode Code` <dbl>,
## #   PowerCode <chr>, `Reference Period Code` <lgl>, `Reference Period` <lgl>,
## #   Value <dbl>, `Flag Codes` <lgl>, Flags <lgl>

# Summary of the data
summary(data)

##    SUBJECT             Sujet             LOCATION             Pays          
##  Length:20580       Length:20580       Length:20580       Length:20580      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    MEASURE             Mesure           FREQUENCY          Fréquence        
##  Length:20580       Length:20580       Length:20580       Length:20580      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      TIME              Temps            Unit Code             Unit          
##  Length:20580       Length:20580       Length:20580       Length:20580      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  PowerCode Code   PowerCode         Reference Period Code Reference Period
##  Min.   :0.000   Length:20580       Mode:logical          Mode:logical    
##  1st Qu.:0.000   Class :character   NA's:20580            NA's:20580      
##  Median :9.000   Mode  :character                                         
##  Mean   :6.715                                                            
##  3rd Qu.:9.000                                                            
##  Max.   :9.000                                                            
##      Value           Flag Codes      Flags        
##  Min.   : -28821.1   Mode:logical   Mode:logical  
##  1st Qu.:     -0.3   NA's:20580     NA's:20580    
##  Median :      8.2                                
##  Mean   :   3878.6                                
##  3rd Qu.:     55.7                                
##  Max.   :1167777.0

# Display the structure of the data
str(data)

## spc_tbl_ [20,580 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SUBJECT              : chr [1:20580] "XTEXVA01" "XTEXVA01" "XTEXVA01" "XTEXVA01" ...
##  $ Sujet                : chr [1:20580] "Exportations des biens (en valeur)" "Exportations des biens (en valeur)" "Exportations des biens (en valeur)" "Exportations des biens (en valeur)" ...
##  $ LOCATION             : chr [1:20580] "ISR" "ISR" "ISR" "ISR" ...
##  $ Pays                 : chr [1:20580] "Israël" "Israël" "Israël" "Israël" ...
##  $ MEASURE              : chr [1:20580] "CXMLSA" "CXMLSA" "CXMLSA" "CXMLSA" ...
##  $ Mesure               : chr [1:20580] "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" ...
##  $ FREQUENCY            : chr [1:20580] "Q" "Q" "Q" "Q" ...
##  $ Fréquence            : chr [1:20580] "Trimestrielle" "Trimestrielle" "Trimestrielle" "Trimestrielle" ...
##  $ TIME                 : chr [1:20580] "2022-Q2" "2022-Q3" "2022-Q4" "2023-Q1" ...
##  $ Temps                : chr [1:20580] "T2-2022" "T3-2022" "T4-2022" "T1-2023" ...
##  $ Unit Code            : chr [1:20580] "USD" "USD" "USD" "USD" ...
##  $ Unit                 : chr [1:20580] "Dollar des États-Unis" "Dollar des États-Unis" "Dollar des États-Unis" "Dollar des États-Unis" ...
##  $ PowerCode Code       : num [1:20580] 9 9 9 9 9 9 9 9 9 9 ...
##  $ PowerCode            : chr [1:20580] "Milliards" "Milliards" "Milliards" "Milliards" ...
##  $ Reference Period Code: logi [1:20580] NA NA NA NA NA NA ...
##  $ Reference Period     : logi [1:20580] NA NA NA NA NA NA ...
##  $ Value                : num [1:20580] 17.2 17.2 16.5 14 15.5 ...
##  $ Flag Codes           : logi [1:20580] NA NA NA NA NA NA ...
##  $ Flags                : logi [1:20580] NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SUBJECT = col_character(),
##   ..   Sujet = col_character(),
##   ..   LOCATION = col_character(),
##   ..   Pays = col_character(),
##   ..   MEASURE = col_character(),
##   ..   Mesure = col_character(),
##   ..   FREQUENCY = col_character(),
##   ..   Fréquence = col_character(),
##   ..   TIME = col_character(),
##   ..   Temps = col_character(),
##   ..   `Unit Code` = col_character(),
##   ..   Unit = col_character(),
##   ..   `PowerCode Code` = col_double(),
##   ..   PowerCode = col_character(),
##   ..   `Reference Period Code` = col_logical(),
##   ..   `Reference Period` = col_logical(),
##   ..   Value = col_double(),
##   ..   `Flag Codes` = col_logical(),
##   ..   Flags = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>

Writing CSV files

We can also write data from R to a CSV file using the write_csv function. This is useful for saving processed data for use in other programs.

# Write data to a CSV file
write_csv(data, "./Data/file.csv")

# Write data with a different delimiter
write_delim(data, "./Data/file.txt", delim = "\t")

# Write data without column names
write_csv(data, "./Data/file_no_header.csv", col_names = FALSE)
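
To try the write-then-read cycle without the course data, here is a minimal sketch. It deliberately uses base R's write.csv() and read.csv() rather than readr so that it runs without extra packages, and the small data frame is made up for illustration.

```r
scores <- data.frame(
  name = c("Alice", "Bob"),
  score = c(12.5, 15)  # hypothetical values
)

# Write to a temporary CSV file, without the row numbers
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Read the file back and inspect it
scores_back <- read.csv(csv_path)
str(scores_back)
```

The readr functions used above work the same way in spirit: write_csv() takes the data and a path, and read_csv() takes a path.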

Reading and Writing Excel files

Excel files are widely used for data storage and analysis. We will use the readxl package to read Excel files and the writexl and openxlsx packages to write Excel files.

# Load required packages
library(readxl)    # For reading Excel files
library(writexl)   # For writing Excel files
library(openxlsx)  # For advanced Excel operations

# Read Excel file
data_excel <- read_excel("./Data/SCIM.xls")

# Display the first few rows of the data
head(data_excel)

## # A tibble: 6 × 18
##   Pays      `Août-2022` `Sept-2022` `Oct-2022` `Nov-2022` `Déc-2022` `Janv-2023`
##   <chr>           <dbl>       <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
## 1 Australie       35.2        36.0       33.9       34.3       35.0        35.5 
## 2 Autriche        17.3        17.2       16.9       18.0       18.0        18.5 
## 3 Belgique        55.1        54.2       51.0       55.0       53.2        49.4 
## 4 Canada          50.9        49.6       48.5       47.6       47.8        49.8 
## 5 Chili            8.07        8.34       8.47       8.13       8.43        8.57
## 6 Colombie         4.29        4.68       4.28       4.68       4.44        4.06
## # ℹ 11 more variables: `Févr-2023` <dbl>, `Mars-2023` <dbl>, `Avr-2023` <dbl>,
## #   `Mai-2023` <dbl>, `Juin-2023` <dbl>, `Juil-2023` <dbl>, `Août-2023` <dbl>,
## #   `Sept-2023` <dbl>, `Oct-2023` <dbl>, `Nov-2023` <chr>, `Déc-2023` <chr>

# Summary of the data
summary(data_excel)

##      Pays             Août-2022           Sept-2022            Oct-2022        
##  Length:50          Min.   :   0.6984   Min.   :   0.6095   Min.   :   0.5321  
##  Class :character   1st Qu.:   6.9794   1st Qu.:   6.7395   1st Qu.:   6.6275  
##  Mode  :character   Median :  24.8843   Median :  24.3044   Median :  20.8480  
##                     Mean   :  81.7001   Mean   :  79.9567   Mean   :  77.7920  
##                     3rd Qu.:  50.8811   3rd Qu.:  51.2174   3rd Qu.:  48.9129  
##                     Max.   :1166.6390   Max.   :1139.9140   Max.   :1101.9380  
##                     NA's   :1           NA's   :1           NA's   :1          
##     Nov-2022            Déc-2022           Janv-2023        
##  Min.   :   0.5568   Min.   :   0.7034   Min.   :   0.5558  
##  1st Qu.:   7.4955   1st Qu.:   6.8820   1st Qu.:   6.9891  
##  Median :  20.8586   Median :  23.0772   Median :  22.1596  
##  Mean   :  79.0860   Mean   :  79.9836   Mean   :  80.4081  
##  3rd Qu.:  48.2815   3rd Qu.:  47.8382   3rd Qu.:  49.4048  
##  Max.   :1125.0540   Max.   :1143.7330   Max.   :1148.4440  
##  NA's   :1           NA's   :1           NA's   :1          
##    Févr-2023           Mars-2023            Avr-2023        
##  Min.   :   0.5883   Min.   :   0.5759   Min.   :   0.5719  
##  1st Qu.:   6.2442   1st Qu.:   7.3807   1st Qu.:   6.9889  
##  Median :  22.0183   Median :  21.7005   Median :  20.5253  
##  Mean   :  79.9061   Mean   :  80.6198   Mean   :  79.2318  
##  3rd Qu.:  48.0585   3rd Qu.:  48.8357   3rd Qu.:  47.9008  
##  Max.   :1135.7370   Max.   :1129.5540   Max.   :1114.3250  
##  NA's   :1           NA's   :1           NA's   :1          
##     Mai-2023           Juin-2023           Juil-2023        
##  Min.   :   0.5317   Min.   :   0.5678   Min.   :   0.5119  
##  1st Qu.:   6.8126   1st Qu.:   6.6007   1st Qu.:   6.5040  
##  Median :  21.0225   Median :  21.5573   Median :  20.7020  
##  Mean   :  78.1063   Mean   :  77.5048   Mean   :  77.6646  
##  3rd Qu.:  46.5487   3rd Qu.:  45.2711   3rd Qu.:  47.6173  
##  Max.   :1108.4710   Max.   :1107.4590   Max.   :1106.7350  
##  NA's   :1           NA's   :1           NA's   :1          
##    Août-2023           Sept-2023            Oct-2023           Nov-2023        
##  Min.   :   0.5284   Min.   :   0.6117   Min.   :   0.4766   Length:50         
##  1st Qu.:   6.8339   1st Qu.:   6.5745   1st Qu.:   6.2254   Class :character  
##  Median :  21.1284   Median :  20.5472   Median :  21.3038   Mode  :character  
##  Mean   :  78.1792   Mean   :  77.4469   Mean   :  76.7197                     
##  3rd Qu.:  47.4622   3rd Qu.:  47.8237   3rd Qu.:  47.7631                     
##  Max.   :1114.5250   Max.   :1102.6130   Max.   :1097.7230                     
##  NA's   :1           NA's   :1           NA's   :1                             
##    Déc-2023        
##  Length:50         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

# Display the structure of the data
str(data_excel)

## tibble [50 × 18] (S3: tbl_df/tbl/data.frame)
##  $ Pays     : chr [1:50] "Australie" "Autriche" "Belgique" "Canada" ...
##  $ Août-2022: num [1:50] 35.2 17.3 55.13 50.88 8.07 ...
##  $ Sept-2022: num [1:50] 35.97 17.16 54.21 49.64 8.34 ...
##  $ Oct-2022 : num [1:50] 33.94 16.86 50.98 48.49 8.47 ...
##  $ Nov-2022 : num [1:50] 34.31 17.95 55 47.59 8.13 ...
##  $ Déc-2022 : num [1:50] 34.97 18 53.23 47.77 8.43 ...
##  $ Janv-2023: num [1:50] 35.46 18.47 49.4 49.85 8.57 ...
##  $ Févr-2023: num [1:50] 33.3 18.2 48.7 48.1 8.4 ...
##  $ Mars-2023: num [1:50] 33.83 19.38 48.84 46.48 8.77 ...
##  $ Avr-2023 : num [1:50] 29.99 19.5 48.73 47.61 7.64 ...
##  $ Mai-2023 : num [1:50] 30.5 18.8 46.1 46.5 7.4 ...
##  $ Juin-2023: num [1:50] 29.3 18.36 45.27 45.02 7.89 ...
##  $ Juil-2023: num [1:50] 27.85 18.81 47.62 46.22 7.56 ...
##  $ Août-2023: num [1:50] 30.2 18.54 46.76 47.46 7.93 ...
##  $ Sept-2023: num [1:50] 28.61 17.95 46.68 47.82 7.86 ...
##  $ Oct-2023 : num [1:50] 29.7 18.2 45.4 47.8 7.9 ...
##  $ Nov-2023 : chr [1:50] "31.088999999999999" "18.0898" "44.97148" "47.747590000000002" ...
##  $ Déc-2023 : chr [1:50] ".." ".." ".." ".." ...
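
In the structure above, the last two columns were read as character (chr) rather than numeric. A likely reason, visible for Déc-2023, is that missing values are written as ".." in the file. The sketch below shows one way to clean such a column after import, using three values copied from the output above as a stand-in for the real column.

```r
# Stand-in for a column imported as text
raw_values <- c("31.088999999999999", "18.0898", "..")

# Replace the ".." placeholder with NA, then convert to numeric
clean_values <- raw_values
clean_values[clean_values == ".."] <- NA
clean_values <- as.numeric(clean_values)

class(clean_values)  # "numeric"
is.na(clean_values)  # FALSE FALSE TRUE
```

read_excel() also has an na argument for declaring such placeholders at import time; whether na = ".." is enough for this particular file is an assumption to check on your own copy.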

# Read a specific sheet by name
data_sheet <- read_excel("./Data/SCIM.xls", sheet = "Sheet1")

# Read a specific range of cells
data_range <- read_excel("./Data/SCIM.xls", range = "A1:D10")

# Write data to an Excel file
write_xlsx(data_excel, "./Data/file.xlsx")

# Write data to a specific sheet
write.xlsx(data_excel, "./Data/file_specific_sheet.xlsx", sheetName = "DataSheet")

# Write multiple data frames to multiple sheets
write.xlsx(list(Sheet1 = data_excel, Sheet2 = data_excel), "./Data/file_multiple_sheets.xlsx")

2.2 Loading Data from Packages

Because R is widely used for data analysis, several data providers offer packages that give direct access to their datasets from R.

Introduction to the WDI package

The WDI package provides access to the World Bank’s World Development Indicators, which include a wide range of economic, social, and environmental data.

Loading Data from WDI

# Load required package
library(WDI)

# Load GDP data for USA, China, and India from 2000 to 2020
gdp_data <- WDI(country = c("US", "CN", "IN"), 
                indicator = "NY.GDP.MKTP.CD", 
                start = 2000, 
                end = 2020)
head(gdp_data)

##   country iso2c iso3c year NY.GDP.MKTP.CD
## 1   China    CN   CHN 2020   1.468774e+13
## 2   China    CN   CHN 2019   1.427997e+13
## 3   China    CN   CHN 2018   1.389491e+13
## 4   China    CN   CHN 2017   1.231049e+13
## 5   China    CN   CHN 2016   1.123331e+13
## 6   China    CN   CHN 2015   1.106157e+13

In the above code, we load the WDI package and then use the WDI function to fetch GDP data for the USA, China, and India from the year 2000 to 2020. The country parameter takes a vector of country codes, the indicator parameter specifies the type of data (in this case, GDP), and start and end define the time range.

# Rename columns for clarity
colnames(gdp_data) <- c("Country", "iso2c", "iso3c", "Year", "GDP")
head(gdp_data)

##   Country iso2c iso3c Year          GDP
## 1   China    CN   CHN 2020 1.468774e+13
## 2   China    CN   CHN 2019 1.427997e+13
## 3   China    CN   CHN 2018 1.389491e+13
## 4   China    CN   CHN 2017 1.231049e+13
## 5   China    CN   CHN 2016 1.123331e+13
## 6   China    CN   CHN 2015 1.106157e+13

Here, we rename the columns to make them more understandable.
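
Assigning all column names at once relies on the columns being in the expected order. A more defensive alternative, sketched here on a small made-up data frame, renames a single column by looking up its current name.

```r
# Toy data frame with the same indicator column name (values are made up)
toy <- data.frame(
  country = c("A", "B"),
  year = c(2019, 2020),
  NY.GDP.MKTP.CD = c(1, 2)
)

# Rename only the column currently called "NY.GDP.MKTP.CD"
names(toy)[names(toy) == "NY.GDP.MKTP.CD"] <- "GDP"

names(toy)  # "country" "year" "GDP"
```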

# Summary of the data
summary(gdp_data)

##    Country             iso2c              iso3c                Year     
##  Length:63          Length:63          Length:63          Min.   :2000  
##  Class :character   Class :character   Class :character   1st Qu.:2005  
##  Mode  :character   Mode  :character   Mode  :character   Median :2010  
##                                                           Mean   :2010  
##                                                           3rd Qu.:2015  
##                                                           Max.   :2020  
##       GDP           
##  Min.   :4.684e+11  
##  1st Qu.:1.825e+12  
##  Median :6.087e+12  
##  Mean   :8.032e+12  
##  3rd Qu.:1.409e+13  
##  Max.   :2.152e+13

# Display the structure of the data
str(gdp_data)

## 'data.frame':    63 obs. of  5 variables:
##  $ Country: chr  "China" "China" "China" "China" ...
##  $ iso2c  : chr  "CN" "CN" "CN" "CN" ...
##  $ iso3c  : chr  "CHN" "CHN" "CHN" "CHN" ...
##  $ Year   : int  2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 ...
##  $ GDP    : num  1.47e+13 1.43e+13 1.39e+13 1.23e+13 1.12e+13 ...
##   ..- attr(*, "label")= chr "GDP (current US$)"
##  - attr(*, "lastupdated")= chr "2024-06-28"
##  - attr(*, "label")= chr [1:63] "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...

The summary function provides basic statistics about the dataset, while the str function displays its structure.

# Plot GDP data
library(ggplot2)

ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
  geom_line() +
  labs(title = "GDP of USA, China, and India (2000-2020)", y = "GDP (current US$)")

Finally, we use the ggplot2 package to create a line plot showing the GDP trends for the three countries over the specified period.

Exercises

Exercise 1: Reading CSV files

Download a CSV file from the internet.
Load the data into R using the readr package.
Display the first 6 rows of the data.
Display the summary and structure of the data.
Write the data to a new CSV file.
Write the data to a text file with a different delimiter.
Write the data to a CSV file without column names.

Exercise 2: Reading and Writing Excel files

Download an Excel file from the internet.
Load the data into R using the readxl package.
Display the first 6 rows of the data.
Display the summary and structure of the data.
Read a specific sheet by name.
Read a specific range of cells.
Write the data to a new Excel file using the writexl package.
Write the data to a specific sheet.
Write multiple data frames to multiple sheets.

Exercise 3: Loading Data from WDI

Install and load the WDI package.
Retrieve data for a different set of countries (e.g., Japan, Germany, Brazil) for a different indicator (e.g., SP.POP.TOTL for total population) from 2000 to 2020.
Rename the columns for clarity.
Display the first 6 rows of the data.
Display the summary and structure of the data.
What is the average yearly value of your chosen indicator for your set of countries from 2000 to 2020?

Session 3: Basic Data Manipulation (3 hours)

3.1 Introduction to dplyr

The dplyr package is one of the most powerful tools for data manipulation in R. It provides a set of functions that perform common data manipulation tasks such as filtering rows, selecting columns, arranging data, adding new columns, and summarizing data. The %>% (pipe) operator is often used to chain multiple functions together in a readable manner.
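
The idea behind the pipe can be tried without loading any package by using base R's native pipe |>, available from R 4.1 (your installed version is an assumption here). In simple chains like this one it behaves like %>%: the result on the left becomes the first argument of the function on the right.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped calls are read from left to right
piped <- values |> sqrt() |> mean() |> round(1)

nested                    # 2
identical(nested, piped)  # TRUE
```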

3.2 Filtering, Selecting, and Arranging Data

Let’s start with some basic operations: filtering, selecting, and arranging data.

# Load required package
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Filter, select, and arrange data
filtered_data <- gdp_data %>%
  filter(Year > 2010) %>%
  dplyr::select(Country, Year, GDP) %>%
  arrange(desc(GDP))

head(filtered_data)

##         Country Year          GDP
## 1 United States 2019 2.152140e+13
## 2 United States 2020 2.132295e+13
## 3 United States 2018 2.065652e+13
## 4 United States 2017 1.961210e+13
## 5 United States 2016 1.880491e+13
## 6 United States 2015 1.829502e+13

In the code above, we use the %>% (pipe) operator to chain multiple dplyr functions together:

filter(Year > 2010) keeps only the rows where the Year is greater than 2010.
select(Country, Year, GDP) keeps only the specified columns.
arrange(desc(GDP)) sorts the data in descending order of GDP.
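
For comparison, the same three steps can be written in base R with the bracket notation from Session 1. This is only a sketch on a made-up data frame (countries "A" and "B" and their values are illustrative), not the dplyr code itself.

```r
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  iso2c = c("AA", "AA", "BB", "BB"),
  Year = c(2010, 2015, 2010, 2015),
  GDP = c(1, 3, 2, 5)
)

# Step 1 (filter): keep rows where Year is greater than 2010
recent <- toy[toy$Year > 2010, ]

# Step 2 (select): keep only three columns
recent <- recent[, c("Country", "Year", "GDP")]

# Step 3 (arrange): sort by GDP in descending order
recent <- recent[order(recent$GDP, decreasing = TRUE), ]

print(recent)
```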

3.3 Adding and Mutating Columns

The mutate function is used to add new columns or modify existing ones.

# Add and mutate columns
gdp_data <- gdp_data %>%
  mutate(GDP_in_Billions = GDP / 1e9)

head(gdp_data)

##   Country iso2c iso3c Year          GDP GDP_in_Billions
## 1   China    CN   CHN 2020 1.468774e+13        14687.74
## 2   China    CN   CHN 2019 1.427997e+13        14279.97
## 3   China    CN   CHN 2018 1.389491e+13        13894.91
## 4   China    CN   CHN 2017 1.231049e+13        12310.49
## 5   China    CN   CHN 2016 1.123331e+13        11233.31
## 6   China    CN   CHN 2015 1.106157e+13        11061.57

Here, we create a new column GDP_in_Billions by dividing the GDP values by 1 billion.
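
The notation 1e9 is scientific notation for 1 followed by nine zeros. A quick check, using the first GDP value from the output above:

```r
# 1e9 is one billion
1e9 == 1000000000    # TRUE

# China's 2020 GDP in billions of US dollars
1.468774e+13 / 1e9   # 14687.74
```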

3.4 Summarizing Data

Grouping data and summarizing it with the summarise function are common tasks in data analysis.

# Summarize data
gdp_stats <- gdp_data %>%
  group_by(Country) %>%
  summarise(mean_gdp = mean(GDP, na.rm = TRUE),
            median_gdp = median(GDP, na.rm = TRUE))

gdp_stats

## # A tibble: 3 × 3
##   Country       mean_gdp median_gdp
##   <chr>            <dbl>      <dbl>
## 1 China          6.93e12    6.09e12
## 2 India          1.56e12    1.68e12
## 3 United States  1.56e13    1.50e13

In this example, we group the data by country and then calculate the mean and median GDP for each country.
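When summarising, it often helps to report how many rows each group contains and how many of them are missing, since na.rm = TRUE silently drops the NAs. The sketch below shows one way to do this with n(), using a made-up data frame toy_gdp (hypothetical values).

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_gdp <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  GDP     = c(10, 20, NA, 5, 15)
)

toy_stats <- toy_gdp %>%
  group_by(Country) %>%
  summarise(n_rows    = n(),               # number of rows in the group
            n_missing = sum(is.na(GDP)),   # how many GDP values are NA
            mean_gdp  = mean(GDP, na.rm = TRUE))

toy_stats
```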

3.5 Advanced dplyr Functions

3.5.1 `mutate` and `transmute`

The transmute function works like mutate but keeps only the new variables.

# Using mutate
mutated_data <- gdp_data %>%
  mutate(GDP_in_Billions = GDP / 1e9,
         GDP_in_Millions = GDP / 1e6)

# Using transmute
transmuted_data <- gdp_data %>%
  transmute(GDP_in_Billions = GDP / 1e9,
            GDP_in_Millions = GDP / 1e6)

head(mutated_data)

##   Country iso2c iso3c Year          GDP GDP_in_Billions GDP_in_Millions
## 1   China    CN   CHN 2020 1.468774e+13        14687.74        14687744
## 2   China    CN   CHN 2019 1.427997e+13        14279.97        14279969
## 3   China    CN   CHN 2018 1.389491e+13        13894.91        13894908
## 4   China    CN   CHN 2017 1.231049e+13        12310.49        12310491
## 5   China    CN   CHN 2016 1.123331e+13        11233.31        11233314
## 6   China    CN   CHN 2015 1.106157e+13        11061.57        11061573

head(transmuted_data)

##   GDP_in_Billions GDP_in_Millions
## 1        14687.74        14687744
## 2        14279.97        14279969
## 3        13894.91        13894908
## 4        12310.49        12310491
## 5        11233.31        11233314
## 6        11061.57        11061573

3.5.2 `filter` with Multiple Conditions

You can filter data using multiple conditions.

# Filter with multiple conditions
filtered_data <- gdp_data %>%
  filter(Year > 2010, GDP > 1e12)

head(filtered_data)

##   Country iso2c iso3c Year          GDP GDP_in_Billions
## 1   China    CN   CHN 2020 1.468774e+13        14687.74
## 2   China    CN   CHN 2019 1.427997e+13        14279.97
## 3   China    CN   CHN 2018 1.389491e+13        13894.91
## 4   China    CN   CHN 2017 1.231049e+13        12310.49
## 5   China    CN   CHN 2016 1.123331e+13        11233.31
## 6   China    CN   CHN 2015 1.106157e+13        11061.57
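Conditions separated by a comma inside filter are combined with a logical AND. The sketch below, on a made-up data frame toy_gdp (hypothetical values), contrasts this with an explicit & and with | for OR.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_gdp <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year    = c(2005, 2015, 2005, 2015),
  GDP     = c(2e12, 3e12, 5e11, 9e11)
)

# A comma and & both mean "every condition must hold"
both_comma <- toy_gdp %>% filter(Year > 2010, GDP > 1e12)
both_and   <- toy_gdp %>% filter(Year > 2010 & GDP > 1e12)

# | means "at least one condition must hold"
either_or  <- toy_gdp %>% filter(Year > 2010 | GDP > 1e12)

nrow(both_comma)  # rows that satisfy both conditions
nrow(either_or)   # rows that satisfy at least one condition
```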

3.5.3 `select` with Helper Functions

The select function supports helper functions to make column selection easier.

# Select columns using helper functions
selected_data <- gdp_data %>%
    dplyr::select(starts_with("G"), contains("Year"))

head(selected_data)

##            GDP GDP_in_Billions Year
## 1 1.468774e+13        14687.74 2020
## 2 1.427997e+13        14279.97 2019
## 3 1.389491e+13        13894.91 2018
## 4 1.231049e+13        12310.49 2017
## 5 1.123331e+13        11233.31 2016
## 6 1.106157e+13        11061.57 2015
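A few other selection helpers follow the same pattern. The sketch below is an illustration on a one-row made-up data frame, toy_gdp (hypothetical values), whose column names mirror those of gdp_data; it shows ends_with, where, and a negative selection.

```r
library(dplyr)

# Hypothetical one-row data frame with the same column names as gdp_data
toy_gdp <- data.frame(Country = "A", iso2c = "AA", iso3c = "AAA",
                      Year = 2020, GDP = 1e12, GDP_in_Billions = 1000)

# Columns whose name ends with "c"
codes <- toy_gdp %>% dplyr::select(ends_with("c"))

# Columns that hold numeric values
numeric_cols <- toy_gdp %>% dplyr::select(where(is.numeric))

# Everything except the columns starting with "GDP"
no_gdp <- toy_gdp %>% dplyr::select(-starts_with("GDP"))

names(codes)
names(numeric_cols)
names(no_gdp)
```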

3.5.4 `summarise` with Multiple Summaries

You can create multiple summaries in one step.

# Multiple summaries
summary_stats <- gdp_data %>%
  group_by(Country) %>%
  summarise(mean_gdp = mean(GDP, na.rm = TRUE),
            median_gdp = median(GDP, na.rm = TRUE),
            total_gdp = sum(GDP, na.rm = TRUE))

summary_stats

## # A tibble: 3 × 4
##   Country       mean_gdp median_gdp total_gdp
##   <chr>            <dbl>      <dbl>     <dbl>
## 1 China          6.93e12    6.09e12   1.46e14
## 2 India          1.56e12    1.68e12   3.28e13
## 3 United States  1.56e13    1.50e13   3.28e14

3.6 Joining Data Frames

Sometimes the variables you need come from different data sources. In those cases, you need to merge the data frames so that all variables end up in a single one. dplyr provides several functions for joining data frames: inner_join, left_join, right_join, and full_join.

3.6.1 Inner Join

An inner_join returns only the rows that have matching values in both data frames.

# Example data frames
data1 <- data.frame(Country = c("US", "CN", "IN"), Value1 = 1:3)
data2 <- data.frame(Country = c("US", "CN", "BR"), Value2 = 4:6)

# Inner join
inner_join(data1, data2, by = "Country")

##   Country Value1 Value2
## 1      US      1      4
## 2      CN      2      5

Explanation:

The result will include only the rows where the Country values match in both data frames.
Here, only “US” and “CN” are common in both data1 and data2, so the result will be:

Country	Value1	Value2
US	1	4
CN	2	5

3.6.2 Left Join

A left_join returns all the rows from the left data frame and the matched rows from the right data frame. If there is no match, the result will contain NA for columns from the right data frame.

# Left join
left_join(data1, data2, by = "Country")

##   Country Value1 Value2
## 1      US      1      4
## 2      CN      2      5
## 3      IN      3     NA

Explanation:

The result will include all rows from data1, and the matching rows from data2.
If there is no match, NA will be used for the missing values from data2.
Here, “IN” from data1 has no match in data2, so the result will be:

Country	Value1	Value2
US	1	4
CN	2	5
IN	3	NA

3.6.3 Right Join

A right_join returns all the rows from the right data frame and the matched rows from the left data frame. If there is no match, the result will contain NA for columns from the left data frame.

# Right join
right_join(data1, data2, by = "Country")

##   Country Value1 Value2
## 1      US      1      4
## 2      CN      2      5
## 3      BR     NA      6

Explanation:

The result will include all rows from data2, and the matching rows from data1.
If there is no match, NA will be used for the missing values from data1.
Here, “BR” from data2 has no match in data1, so the result will be:

Country	Value1	Value2
US	1	4
CN	2	5
BR	NA	6

3.6.4 Full Join

A full_join returns all rows from both data frames, whether or not they have a match in the other one. If there is no match, the result will contain NA for the missing values from either data frame.

# Full join
full_join(data1, data2, by = "Country")

##   Country Value1 Value2
## 1      US      1      4
## 2      CN      2      5
## 3      IN      3     NA
## 4      BR     NA      6

Explanation:

The result will include all rows from both data frames.
If there is no match, NA will be used for the missing values from either data frame.
The result will be:

Country	Value1	Value2
US	1	4
CN	2	5
IN	3	NA
BR	NA	6

Visual Representation

To help visualize these joins, you can think of them as operations on two sets:

Inner Join: Intersection of both sets.
Left Join: All elements of the left set, with the matching information from the right set where it exists.
Right Join: All elements of the right set, with the matching information from the left set where it exists.
Full Join: Union of both sets.
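The set view can be checked by counting the rows each join returns. The sketch below recreates the data1 and data2 example data frames from this section; join_sizes is a hypothetical name introduced only for the comparison.

```r
library(dplyr)

# Same example data frames as earlier in this section
data1 <- data.frame(Country = c("US", "CN", "IN"), Value1 = 1:3)
data2 <- data.frame(Country = c("US", "CN", "BR"), Value2 = 4:6)

# Number of rows returned by each type of join
join_sizes <- c(
  inner = nrow(inner_join(data1, data2, by = "Country")),
  left  = nrow(left_join(data1, data2, by = "Country")),
  right = nrow(right_join(data1, data2, by = "Country")),
  full  = nrow(full_join(data1, data2, by = "Country"))
)

join_sizes
```

The set analogy is exact only when each key appears once per data frame. If a Country value is repeated on either side, the matching rows are multiplied and the counts grow accordingly.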

Exercises

Exercise 1: Basic dplyr Operations

Filter the gdp_data to include only data from the year 2015 onwards.
Select the columns Country, Year, and GDP.
Arrange the data in ascending order of GDP.
Add a new column GDP_in_Trillions by dividing the GDP by 1e12.
Group the data by Country and calculate the mean and total GDP.

Exercise 2: Advanced dplyr Functions

Use mutate to add columns GDP_in_Billions and GDP_in_Millions to gdp_data.
Use transmute to create a new data frame with columns GDP_in_Billions and GDP_in_Millions.
Filter the gdp_data to include only rows where Year is greater than 2010 and GDP is greater than 1e12.
Select the columns that start with “G”, together with the columns that contain “Year”.
Create multiple summaries for mean_gdp, median_gdp, and total_gdp by grouping the data by Country.

Exercise 3: Joining Data Frames

Create two data frames with a common column.
Perform an inner join on the data frames using the common column.
Perform a left join on the data frames.
Perform a right join on the data frames.
Perform a full join on the data frames.

Session 4: Basic Statistics (3 hours)

4.1 Descriptive Statistics

Descriptive statistics provide simple summaries about the sample and the measures. These summaries are crucial for understanding the distribution and central tendency of the data.

Calculate Mean, Median, and Standard Deviation

Let’s start by calculating some basic descriptive statistics: mean, median, and standard deviation.

# Calculate mean, median, and standard deviation
mean_gdp <- mean(gdp_data$GDP, na.rm = TRUE)
median_gdp <- median(gdp_data$GDP, na.rm = TRUE)
sd_gdp <- sd(gdp_data$GDP, na.rm = TRUE)

mean_gdp

## [1] 8.031945e+12

median_gdp

## [1] 6.087192e+12

sd_gdp

## [1] 6.752724e+12

In the code above:

mean(gdp_data$GDP, na.rm = TRUE) calculates the mean of the GDP values, ignoring any missing values (NA).
median(gdp_data$GDP, na.rm = TRUE) calculates the median of the GDP values, also ignoring NAs.
sd(gdp_data$GDP, na.rm = TRUE) calculates the standard deviation of the GDP values, ignoring NAs.
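The effect of na.rm is easiest to see on a tiny vector. The sketch below uses a made-up vector x (hypothetical values): without na.rm = TRUE, a single missing value makes the result NA.

```r
# Hypothetical values with one missing observation
x <- c(2, 4, NA, 10)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # mean of 2, 4 and 10
median(x, na.rm = TRUE)  # median of 2, 4 and 10
sd(x, na.rm = TRUE)      # standard deviation of 2, 4 and 10
```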

Calculate Range, IQR, and Summary Statistics

In addition to mean, median, and standard deviation, other useful descriptive statistics include the range, interquartile range (IQR), and summary statistics.

# Calculate range
range_gdp <- range(gdp_data$GDP, na.rm = TRUE)

# Calculate interquartile range (IQR)
iqr_gdp <- IQR(gdp_data$GDP, na.rm = TRUE)

# Summary statistics
summary_gdp <- summary(gdp_data$GDP)

range_gdp

## [1] 4.683955e+11 2.152140e+13

iqr_gdp

## [1] 1.226209e+13

summary_gdp

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 4.684e+11 1.825e+12 6.087e+12 8.032e+12 1.409e+13 2.152e+13

range(gdp_data$GDP, na.rm = TRUE) gives the minimum and maximum GDP values.
IQR(gdp_data$GDP, na.rm = TRUE) calculates the interquartile range, which measures the spread of the middle 50% of the data.
summary(gdp_data$GDP) provides a summary of the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values.
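As a quick check of what IQR reports, the interquartile range can be rebuilt from the quartiles returned by quantile. The sketch below uses a made-up vector x (hypothetical values); iqr_by_hand is a name introduced only for this illustration.

```r
# Hypothetical values, invented for illustration only
x <- c(1, 3, 5, 7, 9, 11, 13, 15)

# First and third quartiles
q <- quantile(x, probs = c(0.25, 0.75))

# IQR = third quartile minus first quartile
iqr_by_hand <- unname(q[2] - q[1])

all.equal(iqr_by_hand, IQR(x))  # the two values should agree

# The width of the range is the maximum minus the minimum
diff(range(x))
```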

4.2 Visualizing Descriptive Statistics

Visualizations can provide additional insights into the distribution and spread of the data.

Histograms and Density Plots

# Histogram of GDP
hist(gdp_data$GDP, main = "Histogram of GDP", xlab = "GDP", breaks = 30, col = "blue")

# Density plot of GDP
plot(density(gdp_data$GDP, na.rm = TRUE), main = "Density Plot of GDP", xlab = "GDP", col = "red")

hist(gdp_data$GDP, ...) creates a histogram of GDP values.
plot(density(gdp_data$GDP, na.rm = TRUE), ...) creates a density plot, showing the distribution of GDP values.

Boxplots

Boxplots are useful for visualizing the spread and identifying outliers.

# Boxplot of GDP
boxplot(gdp_data$GDP, main = "Boxplot of GDP", ylab = "GDP", col = "green")

4.3 Correlation and Regression

Correlation and regression analysis are used to examine the relationships between variables.

Correlation measures the strength and direction of the linear relationship between two variables. For a pair of random variables $(X,Y)$ (for example, height and weight), the Pearson correlation coefficient $r$ is defined as:

$r_{X,Y}=\frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y}$

where $\operatorname{cov}$ is the covariance, $\sigma_X$ is the standard deviation of $X$, and $\sigma_Y$ is the standard deviation of $Y$. The covariance can be expressed in terms of means and expectations. Since $\operatorname{cov}(X,Y)=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$, the formula for $r$ can also be written as:

$r_{X,Y}=\frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}$

where $\sigma_X$ and $\sigma_Y$ are defined as above, $\mu_X$ is the mean of $X$, $\mu_Y$ is the mean of $Y$, and $\mathbb{E}$ is the expectation. The formula for $r$ can also be expressed in terms of uncentered moments. Since

$\begin{aligned}\mu_X &= \mathbb{E}[X]\\ \mu_Y &= \mathbb{E}[Y]\\ \sigma_X^2 &= \mathbb{E}\left[(X-\mathbb{E}[X])^2\right]=\mathbb{E}[X^2]-(\mathbb{E}[X])^2\\ \sigma_Y^2 &= \mathbb{E}\left[(Y-\mathbb{E}[Y])^2\right]=\mathbb{E}[Y^2]-(\mathbb{E}[Y])^2\\ \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] &= \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y],\end{aligned}$

the formula for $r$ can also be written as

$r_{X,Y}=\frac{\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y]}{\sqrt{\mathbb{E}[X^2]-(\mathbb{E}[X])^2}\,\sqrt{\mathbb{E}[Y^2]-(\mathbb{E}[Y])^2}}.$

The value of $r$ ranges from -1 to 1:

$r=1$ indicates a perfect positive linear relationship.
$r=-1$ indicates a perfect negative linear relationship.
$r=0$ indicates no linear relationship.
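The formulas above can be checked numerically. The sketch below uses the built-in mtcars dataset as a stand-in (an assumption made so that the example runs without the course data) and compares two hand-computed versions of $r$ with the value returned by cor.

```r
# Built-in example data, used here only as a stand-in
x <- mtcars$wt
y <- mtcars$mpg

# Definition: covariance divided by the product of the standard deviations
r_definition <- cov(x, y) / (sd(x) * sd(y))

# Uncentered-moment form: E[XY] - E[X]E[Y] over the two square roots
r_moments <- (mean(x * y) - mean(x) * mean(y)) /
  (sqrt(mean(x^2) - mean(x)^2) * sqrt(mean(y^2) - mean(y)^2))

# Both should agree with cor() up to rounding error
all.equal(r_definition, cor(x, y))
all.equal(r_moments, cor(x, y))
```

The sample covariance and standard deviation in R divide by n - 1, while the moment form divides by n. The two factors cancel in the ratio, so both routes give the same coefficient.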

Calculate Correlation

The cor function computes the correlation coefficient between two variables, indicating the strength and direction of their linear relationship.

# Calculate correlation
correlation <- cor(gdp_data$GDP, gdp_data$Year, use = "complete.obs")
correlation

## [1] 0.4380231

cor(gdp_data$GDP, gdp_data$Year, use = "complete.obs") calculates the correlation between GDP and Year, using only the complete observations (ignoring rows with NAs).

Simple Linear Regression

Simple linear regression models the relationship between a dependent variable ($Y$) and an independent variable ($X$) using the equation:

$Y=\beta_0+\beta_1X+\epsilon$

Where:

$Y$ is the dependent variable (e.g., GDP).
$X$ is the independent variable (e.g., Year).
$\beta_0$ is the intercept (the value of $Y$ when $X=0$).
$\beta_1$ is the slope of the regression line (the change in $Y$ for a one-unit change in $X$).
$\epsilon$ is the error term.

The lm function performs a linear regression, modeling one variable as a function of another.

# Simple linear regression
model <- lm(GDP ~ Year, data = gdp_data)
summary(model)

## 
## Call:
## lm(formula = GDP ~ Year, data = gdp_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.020e+13 -4.602e+12 -1.945e+12  7.041e+12  9.128e+12 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.660e+14  2.559e+14  -3.774 0.000366 ***
## Year         4.846e+11  1.273e+11   3.806 0.000330 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.12e+12 on 61 degrees of freedom
## Multiple R-squared:  0.1919, Adjusted R-squared:  0.1786 
## F-statistic: 14.48 on 1 and 61 DF,  p-value: 0.0003303

lm(GDP ~ Year, data = gdp_data) fits a linear model predicting GDP based on Year.
summary(model) provides detailed information about the fitted model, including coefficients, R-squared value, and p-values.
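For a single predictor, the least-squares estimates have a closed form: the slope is the covariance of $X$ and $Y$ divided by the variance of $X$, and the intercept follows from the two means. The sketch below checks this against lm, using the built-in mtcars dataset as a stand-in (an assumption for illustration); beta0, beta1 and fit are names introduced here.

```r
# Built-in example data, used here only as a stand-in
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least-squares estimates for one predictor
beta1 <- cov(x, y) / var(x)          # slope
beta0 <- mean(y) - beta1 * mean(x)   # intercept

# The same model fitted with lm()
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficients should match the hand-computed values
all.equal(unname(coef(fit)), c(beta0, beta1))

# With one predictor, R-squared equals the squared correlation
all.equal(summary(fit)$r.squared, cor(x, y)^2)
```

The second check also applies to the GDP model above: the reported R-squared of 0.1919 is the square of the correlation of 0.438 between GDP and Year.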

Visualizing Regression Results

Visualizing the results of a regression analysis can help interpret the relationship between variables.

# Scatter plot with regression line
plot(gdp_data$Year, gdp_data$GDP, main = "GDP vs Year", xlab = "Year", ylab = "GDP", pch = 19, col = "blue")
abline(model, col = "red")

plot(gdp_data$Year, gdp_data$GDP, ...) creates a scatter plot of GDP vs. Year.
abline(model, col = "red") adds the regression line to the plot.

4.4 More Advanced Regression Techniques

Multiple Linear Regression

Multiple linear regression extends simple linear regression by including multiple independent variables:

$Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_k X_k+\epsilon$

Where:

$X_1, X_2, \ldots, X_k$ are the independent variables.

The model estimates the coefficients $\beta_0,\beta_1,\ldots,\beta_k$ to describe the relationship between the dependent variable and the independent variables.

The example below extends the earlier simple regression by adding the Gini coefficient, a measure of income inequality, as a second predictor.

# Load required packages
library(dplyr)
library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

data <- fread("./Data/WID_Data.csv")

## Warning in fread("./Data/WID_Data.csv"): Detected 1 column names but the data
## has 5 columns (i.e. invalid file). Added 4 extra default column names at the
## end.

colnames(data) <- c("Country", "indicator", "percentil", "Year", "Gini")


# Merge GDP data with Gini data
gdp_data <- left_join(gdp_data, data, by = c("Country", "Year"))

# Multiple linear regression
model_multi <- lm(GDP ~ Year + Gini, data = gdp_data)
summary(model_multi)

## 
## Call:
## lm(formula = GDP ~ Year + Gini, data = gdp_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.047e+12 -7.968e+11  4.474e+11  8.813e+11  1.892e+12 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.771e+15  9.115e+13  -19.42   <2e-16 ***
## Year         9.137e+11  4.673e+10   19.55   <2e-16 ***
## Gini        -1.079e+14  7.281e+12  -14.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347e+12 on 39 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9084, Adjusted R-squared:  0.9037 
## F-statistic: 193.4 on 2 and 39 DF,  p-value: < 2.2e-16

lm(GDP ~ Year + Gini, data = gdp_data) fits a multiple linear regression model predicting GDP based on Year and Gini.
summary(model_multi) provides detailed information about the fitted model.
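Once a model is fitted, its coefficients can be pulled out with coef and used for prediction. The sketch below uses the built-in mtcars dataset as a stand-in (an assumption for illustration); fit_multi and new_car are hypothetical names, and the values in new_car are invented.

```r
# Multiple regression on built-in example data (a stand-in for the course data)
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector: intercept, then one coefficient per predictor
coef(fit_multi)

# Hypothetical new observation to predict for
new_car <- data.frame(wt = 3, hp = 110)
pred <- predict(fit_multi, newdata = new_car)

# The prediction is beta_0 + beta_1 * wt + beta_2 * hp
by_hand <- sum(coef(fit_multi) * c(1, 3, 110))

all.equal(unname(pred), by_hand)
```

Each coefficient describes the expected change in the outcome for a one-unit change in that predictor while the other predictors are held fixed.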

Exercises

Exercise 1: Descriptive Statistics

Calculate the mean, median, and standard deviation of another variable in the dataset.
Calculate the range and IQR of the same variable.
Create a histogram and a boxplot for the variable.

# Example solution
mean_Gini <- mean(gdp_data$Gini, na.rm = TRUE)
median_Gini <- median(gdp_data$Gini, na.rm = TRUE)
sd_Gini <- sd(gdp_data$Gini, na.rm = TRUE)
range_Gini <- range(gdp_data$Gini, na.rm = TRUE)
iqr_Gini <- IQR(gdp_data$Gini, na.rm = TRUE)

mean_Gini

## [1] 0.5718849

median_Gini

## [1] 0.5620298

sd_Gini

## [1] 0.03933434

range_Gini

## [1] 0.4979644 0.6336183

iqr_Gini

## [1] 0.05114127

# Histogram and Boxplot
hist(gdp_data$Gini, main = "Histogram of Gini", xlab = "Gini", breaks = 30, col = "blue")

boxplot(gdp_data$Gini, main = "Boxplot of Gini", ylab = "Gini", col = "green")

Exercise 2: Correlation and Regression

Calculate the correlation between GDP and another variable.
Fit a linear regression model predicting GDP based on another variable (e.g., Gini) and interpret the results.
Fit a multiple linear regression model predicting GDP based on multiple predictors and interpret the results.
Visualize the regression results with scatter plots and regression lines.

# Example solution
correlation_pop <- cor(gdp_data$GDP, gdp_data$Gini, use = "complete.obs")
correlation_pop

## [1] -0.1022499

# Simple linear regression with Gini
model_pop <- lm(GDP ~ Gini, data = gdp_data)
summary(model_pop)

## 
## Call:
## lm(formula = GDP ~ Gini, data = gdp_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.494e+12 -2.967e+12 -1.577e+12  1.522e+12  1.035e+13 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.070e+13  9.946e+12   1.076    0.289
## Gini        -1.128e+13  1.735e+13  -0.650    0.519
## 
## Residual standard error: 4.37e+12 on 40 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.01046,    Adjusted R-squared:  -0.01428 
## F-statistic: 0.4226 on 1 and 40 DF,  p-value: 0.5193

# Multiple linear regression with Year and Gini
model_multi <- lm(GDP ~ Year + Gini, data = gdp_data)
summary(model_multi)

## 
## Call:
## lm(formula = GDP ~ Year + Gini, data = gdp_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.047e+12 -7.968e+11  4.474e+11  8.813e+11  1.892e+12 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.771e+15  9.115e+13  -19.42   <2e-16 ***
## Year         9.137e+11  4.673e+10   19.55   <2e-16 ***
## Gini        -1.079e+14  7.281e+12  -14.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.347e+12 on 39 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9084, Adjusted R-squared:  0.9037 
## F-statistic: 193.4 on 2 and 39 DF,  p-value: < 2.2e-16

# Scatter plot with regression line
plot(gdp_data$Gini, gdp_data$GDP, main = "GDP vs Gini", xlab = "Gini", ylab = "GDP", pch = 19, col = "blue")
abline(model_pop, col = "red")

4.5 Hypothesis Testing

Hypothesis testing allows us to make inferences about a population based on sample data. Here we apply it to GDP.

T-Test

A t-test is used to compare the means of two groups. The test statistic $t$ is calculated using the formula:

$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

Where:

$\bar{X}_1$ and $\bar{X}_2$ are the sample means of the two groups.
$s_1^2$ and $s_2^2$ are the sample variances of the two groups.
$n_1$ and $n_2$ are the sample sizes of the two groups.

The t-test compares the calculated $t$ value to the critical value from the t-distribution to determine if the difference in means is statistically significant.

The example below performs a t-test comparing mean GDP between two groups of countries.

# Creating a category to get two different samples

gdp_data <- gdp_data %>%
  mutate(region = case_when(
    Country %in% c("China", "India") ~ "Asia",
    Country == "United States" ~ "America",
    TRUE ~ NA_character_  # This handles cases where the country is neither China, India, nor USA
  ))

# Performing T-test between those two categories

t_test <- t.test(gdp_data$GDP ~ gdp_data$region)
t_test

## 
##  Welch Two Sample t-test
## 
## data:  gdp_data$GDP by gdp_data$region
## t = 11.106, df = 48.099, p-value = 7.104e-15
## alternative hypothesis: true difference in means between group America and group Asia is not equal to 0
## 95 percent confidence interval:
##  9.298051e+12 1.340854e+13
## sample estimates:
## mean in group America    mean in group Asia 
##          1.560081e+13          4.247513e+12

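t.test(gdp_data$GDP ~ gdp_data$region) performs a t-test comparing the mean GDP of the two regions. By default, t.test does not assume equal variances, which is why the output is labelled a Welch two-sample t-test.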
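The test statistic reported by t.test can be reproduced from the formula given earlier in this section. The sketch below uses the built-in mtcars dataset as a stand-in (an assumption for illustration), comparing fuel consumption between cars with automatic (am = 0) and manual (am = 1) transmission; g1, g2 and t_by_hand are names introduced here.

```r
# Two groups from built-in example data (a stand-in for the course data)
g1 <- mtcars$mpg[mtcars$am == 0]
g2 <- mtcars$mpg[mtcars$am == 1]

# t = difference in means / sqrt(s1^2 / n1 + s2^2 / n2)
t_by_hand <- (mean(g1) - mean(g2)) /
  sqrt(var(g1) / length(g1) + var(g2) / length(g2))

# Welch two-sample t-test, the default behaviour of t.test()
res <- t.test(mpg ~ am, data = mtcars)

# The hand-computed value should match the reported statistic
all.equal(unname(res$statistic), t_by_hand)
res$p.value
```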

Exercises

Exercise 3: Hypothesis Testing

Perform a t-test to compare the means of GDP for two different groups (e.g., developed vs. developing countries).

4.6 Summarizing Results with Tables and Reports

Summarizing results in tables and generating reports can be useful for communicating findings.

Creating Summary Tables

The dplyr package can be used to build a summary table for each group.

# Summarize data
gdp_summary <- gdp_data %>%
  group_by(region) %>%
  summarise(mean_gdp = mean(GDP, na.rm = TRUE),
            median_gdp = median(GDP, na.rm = TRUE),
            sd_gdp = sd(GDP, na.rm = TRUE))

gdp_summary

## # A tibble: 2 × 4
##   region  mean_gdp median_gdp  sd_gdp
##   <chr>      <dbl>      <dbl>   <dbl>
## 1 America  1.56e13    1.50e13 3.54e12
## 2 Asia     4.25e12    2.19e12 4.34e12

group_by(region) groups the data by region.
summarise(mean_gdp = mean(GDP, na.rm = TRUE), ...) calculates the mean, median, and standard deviation of GDP for each region.

Session 5: Data Visualization with ggplot2 (3 hours)

5.1 Introduction to ggplot2

Overview

The ggplot2 package is one of the most powerful and flexible tools for creating complex, multi-layered graphics in R. It implements the Grammar of Graphics, a framework that breaks down plots into semantic components such as layers, scales, and themes.

Grammar of Graphics: The core idea is to build plots by combining independent components, making it easier to customize and create complex visualizations.
Advantages: Highly customizable, works well with dplyr and other tidyverse packages, and produces publication-quality plots.

Basic Concepts

Aesthetic Mappings (aes()): This function defines how data variables are mapped to visual properties like color, size, and shape.
Geometries (geom_*): Geometries define the type of plot, such as points (geom_point), lines (geom_line), and bars (geom_bar).
Layers: You can add multiple layers of geometries to a plot.
Scales and Coordinate Systems: Adjust the scales and coordinate systems for finer control over the plot appearance.
Themes: Themes allow you to control the non-data elements of the plot, such as background, grid lines, and text formatting.
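These concepts map directly onto code, because a plot is built by adding one component at a time with +. The sketch below is a minimal illustration on the built-in mtcars dataset; p is a name introduced here.

```r
library(ggplot2)

# Data and aesthetic mappings only: an empty canvas with axes
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# First layer: points
p <- p + geom_point(color = "blue")

# Second layer: a fitted straight line on top of the points
p <- p + geom_smooth(method = "lm", se = FALSE, color = "black")

# Scale and theme: non-data adjustments
p <- p + scale_x_continuous(name = "Weight (1000 lbs)") + theme_minimal()

# Printing the object draws the plot
print(p)
```

Because each component is independent, any line can be removed or swapped (for example geom_point for geom_line) without touching the rest.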

Example: Basic Scatter Plot

# Load ggplot2
library(ggplot2)

# Scatter plot example
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = 'blue') +
  labs(title = "Car Weight vs. Miles Per Gallon",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon") +
  theme_minimal()

Exercise 1: Customizing Your First Plot

Objective: Create a scatter plot showing the relationship between disp (displacement) and mpg (miles per gallon) in the mtcars dataset. Customize the plot by changing the color, size, and shape of the points.

Instructions:

Change the color of the points to red.
Adjust the size of the points to 3.
Use triangles for the point shapes.

# Scatter plot of displacement vs. miles per gallon with customized points
ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(color = 'red', size = 3, shape = 17) +
  labs(title = "Engine Displacement vs. Miles Per Gallon",
       x = "Displacement (cu. in.)",
       y = "Miles per Gallon") +
  theme_minimal()

5.2 Scatter Plots and Line Plots

Creating scatter plots and line plots can help visualize relationships between variables.

# Scatter plot of GDP vs Year
ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
  geom_point() +
  labs(title = "GDP Over Time",
       x = "Year",
       y = "GDP (in USD)") +
  theme_minimal()

In the scatter plot above, we plot GDP against Year, with different colours for each country.

# Line plot of GDP over time
ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
  geom_line() +
  labs(title = "GDP Over Time",
       x = "Year",
       y = "GDP (in USD)") +
  theme_minimal()

The line plot shows how GDP changes over time for each country. We may also want to show the overall relationship between the two variables, ignoring country, by adding a fitted linear trend.

# Scatter plot of GDP vs Year with a linear trend line
ggplot(gdp_data, aes(x = Year, y = GDP)) +
  geom_point(color="blue") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "GDP Over Time",
       x = "Year",
       y = "GDP (in USD)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Exercise 2: Exploring Relationships with Scatter Plots

Create a scatter plot showing the relationship between horsepower (hp) and miles per gallon (mpg) in the mtcars dataset. Then, add a linear regression line.

# Scatter plot example
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = 'blue') +
  labs(title = "Horsepower vs. Miles Per Gallon",
       x = "Horsepower",
       y = "Miles per Gallon") +
  theme_minimal()

# Adding a linear regression line

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = 'blue') +
  geom_smooth(method = "lm", color = "pink", se = TRUE) +
  labs(title = "Horsepower vs. Miles Per Gallon",
       x = "Horsepower",
       y = "Miles per Gallon") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

5.3 Bar Plots and Histograms

Bar plots and histograms are useful for comparing categorical data and visualizing data distributions.

Bar Plots

Bar plots are used to compare categorical data.

# Bar plot of mean GDP
ggplot(gdp_stats, aes(x = Country, y = mean_gdp, fill = Country)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Mean GDP Comparison",
       x = "Country",
       y = "Mean GDP (in USD)") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exercise 3: Creating Bar Plots

Objective: Create a bar plot to compare the number of cars with different numbers of cylinders in the mtcars dataset.

Instructions:

Use the cyl variable (number of cylinders) in the mtcars dataset to create a bar plot.
Count the number of cars for each cylinder category.
Customize the bar plot by adding appropriate labels, colors, and a title.

# Load necessary libraries
library(ggplot2)

# Create a bar plot for the number of cars with different numbers of cylinders
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "lightblue", color = "black") +
  labs(title = "Number of Cars by Cylinder Count",
       x = "Number of Cylinders",
       y = "Count of Cars") +
  theme_minimal()

Histograms

Histograms display the distribution of a single variable.

# Histogram of GDP
ggplot(gdp_data, aes(x = GDP)) +
  geom_histogram(binwidth = 1e12, fill = "skyblue", color = "black") +
  labs(title = "GDP Distribution",
       x = "GDP (in USD)",
       y = "Frequency") +
  theme_minimal()

The histogram shows the distribution of GDP values.

Exercise 4: Creating Histograms

Objective: Create a histogram to explore the distribution of car weights (wt) in the mtcars dataset.

Instructions:

Create a histogram of the wt variable.
Adjust the bin width to show a more detailed distribution.
Customize the histogram by changing the colors, adding labels, and giving it a title.

# Create a histogram for the weight distribution of cars
ggplot(mtcars, aes(x = wt)) +
  geom_histogram(binwidth = 0.25, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Car Weights",
       x = "Weight (1000 lbs)",
       y = "Frequency") +
  theme_minimal()

5.4 Creating Maps with ggplot2

The ggplot2 package allows for the creation of maps in R. By combining it with the sf package for working with spatial data and incorporating variables like GDP, we can create maps that visualize different metrics over geographic regions.

Example: World Map Colored by GDP

In this example, we will create a map that colors countries according to their GDP values. For simplicity, let’s assume you have GDP data for countries in a dataframe.

Step 1: Prepare GDP Data

Ensure that you have country-level GDP data. For this example, assume that gdp_data is a dataframe containing the GDP of different countries, where the country names match those in the world dataset.

head(gdp_data)

##   Country iso2c iso3c Year          GDP GDP_in_Billions
## 1   China    CN   CHN 2020 1.468774e+13        14687.74
## 2   China    CN   CHN 2019 1.427997e+13        14279.97
## 3   China    CN   CHN 2018 1.389491e+13        13894.91
## 4   China    CN   CHN 2017 1.231049e+13        12310.49
## 5   China    CN   CHN 2016 1.123331e+13        11233.31
## 6   China    CN   CHN 2015 1.106157e+13        11061.57
##                                                                                                      indicator
## 1 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 2 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 3 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 4 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 5 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 6 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
##   percentil      Gini region
## 1      pall 0.5641360   Asia
## 2      pall 0.5579220   Asia
## 3      pall 0.5591818   Asia
## 4      pall 0.5613709   Asia
## 5      pall 0.5537136   Asia
## 6      pall 0.5554577   Asia

Step 2: Join GDP Data with Spatial Data

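Join the GDP data to the world geometries (an sf object loaded from the rnaturalearth package) using a common identifier, such as the country name. The code below joins gdp_stats, a per-country table with a mean_gdp column (presumably a summary of gdp_data prepared beforehand).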

library(dplyr)    # provides %>%, mutate() and left_join() used below
library(ggplot2)
library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

# Load the world map data from the 'rnaturalearth' package (already an sf object)
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")

# Rename the 'name' column to 'Country' and align the US name with the GDP data
names(world)[names(world) == "name"] <- "Country"
world <- world %>%
  mutate(Country = ifelse(Country == "United States of America", "United States", Country))

# Join GDP data with world data
world_gdp <- left_join(world, gdp_stats, by="Country")
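
Joining on country names leaves NA for any country whose spelling differs between the two tables, and left_join() does not warn about it. A quick way to spot such mismatches is an anti-join. The sketch below uses two small made-up tables (map_names and gdp_names are hypothetical) rather than the real objects.

```r
library(dplyr)

# Hypothetical country names as they might appear in the map and GDP tables
map_names <- data.frame(Country = c("United States", "France", "Czechia"))
gdp_names <- data.frame(
  Country = c("United States", "France", "Czech Republic"),
  mean_gdp = c(3, 2, 1)
)

# Rows of the GDP table that have no partner in the map table
unmatched <- anti_join(gdp_names, map_names, by = "Country")
print(unmatched$Country)
```

With the objects from this step, anti_join(gdp_stats, sf::st_drop_geometry(world), by = "Country") would list the GDP rows that found no match on the map.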

Step 3: Plot the Map with GDP

# Plot the world map colored by GDP
ggplot(data = world_gdp) +
  geom_sf(aes(fill = mean_gdp)) +
  scale_fill_viridis_c(option = "plasma", na.value = "grey") +
  ggtitle("World Map Colored by GDP") +
  theme_minimal()

Customizing the GDP Map

You can further customize the appearance of the map, adjusting titles, labels, and map colors to enhance the visual representation.

Code: Customizing the GDP Map

ggplot(data = world_gdp) +
  geom_sf(aes(fill = mean_gdp), color = "white") +
  scale_fill_viridis_c(option = "plasma", trans = "log", na.value = "grey") +
  labs(title = "Average GDP Distribution from 2000 to 2020",
       fill = "GDP (in billions)") +
  theme_minimal() +
  theme(legend.position = "bottom")

This version uses a logarithmic scale for better visualization of GDP values and adjusts the map’s theme and legend position for clarity.
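
To see why a logarithmic scale helps, compare how raw and log-transformed values are spaced. The numbers below are made up for illustration. Note that trans = "log" applies the natural logarithm; "log10" is a common alternative, and the sketch uses log10() only because its values are easier to read.

```r
# Made-up GDP values in billions, spanning several orders of magnitude
gdp_billions <- c(1, 50, 400, 2000, 20000)

# On the raw scale the largest value dominates the color range
raw_share <- gdp_billions / max(gdp_billions)
print(round(raw_share, 4))

# On a log10 scale the values are spread far more evenly
log_values <- log10(gdp_billions)
print(round(log_values, 2))

# Zero (or negative) values have no finite logarithm
print(log10(0))
```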

Exercise 5: Mapping GDP per Capita

In this exercise, you will create a map that displays countries colored by GDP per capita.

Add a column to the world_gdp dataframe for GDP per capita.
Create a map that colors the countries according to their GDP per capita.

Code Solution

# Assuming the 'pop_est' column contains population estimates
world_gdp$gdp_per_capita <- world_gdp$mean_gdp / world_gdp$pop_est

# Plot the GDP per capita map
ggplot(data = world_gdp) +
  geom_sf(aes(fill = gdp_per_capita)) +
  scale_fill_viridis_c(option = "magma", trans = "log", na.value = "grey") +
  labs(title = "World GDP per Capita",
       fill = "GDP per Capita") +
  theme_minimal()

Exercise 6: Annotating Countries with Highest GDP

In this exercise, you will annotate the country with the highest GDP value.

Modify the GDP map to annotate the country with the highest GDP.
Use the geom_text() function to place the country names on the map.

Code Solution

# Find the country with the highest mean GDP
top_gdp_countries <- world_gdp %>%
  top_n(1, mean_gdp)

# Plot the map and label that country with its name
ggplot(data = world_gdp) +
  geom_sf(aes(fill = mean_gdp)) +
  geom_text(data = top_gdp_countries, aes(label = Country, geometry = geometry),
            stat = "sf_coordinates", size = 3, color = "black") +
  scale_fill_viridis_c(option = "plasma", trans = "log", na.value = "grey") +
  labs(title = "World GDP Map with the Top Country Labeled") +
  theme_minimal()

## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
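
top_n() still works, but the dplyr documentation marks it as superseded in favor of slice_max(). A minimal sketch of the equivalent call on a made-up table (toy_gdp is hypothetical):

```r
library(dplyr)

# Made-up per-country means; one country has no GDP value
toy_gdp <- data.frame(
  Country = c("A", "B", "C"),
  mean_gdp = c(5, 20, NA)
)

# Keep the single row with the largest mean_gdp
top1 <- toy_gdp %>%
  slice_max(mean_gdp, n = 1)

print(top1)
```

Changing n to 5 would keep the five largest values instead.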

Session 6: Advanced Data Visualization (3 hours)

6.1 Faceting and Themes

Concepts:

Faceting: Splits the data into multiple panels, one for each level of a categorical variable.
Themes: Customize the appearance of your plots.

Example 1: Faceting by Country

# Facet plot by country
ggplot(gdp_data, aes(x = Year, y = GDP)) +
  geom_line() +
  facet_wrap(~ Country) +
  labs(title = "GDP Over Time by Country",
       x = "Year",
       y = "GDP (in USD)") +
  theme_minimal()

The above code creates a separate plot for each country, displaying GDP trends over time. One can also create custom themes to standardize the appearance of all plots.
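
When countries differ in GDP by orders of magnitude, a shared y-axis flattens the lines of the smaller economies. facet_wrap() accepts a scales argument that lets each panel choose its own axis range. A small sketch with made-up numbers (toy_trend and its values are hypothetical):

```r
library(ggplot2)

# Made-up GDP series for one very large and one small economy
toy_trend <- data.frame(
  Country = rep(c("Large economy", "Small economy"), each = 3),
  Year = rep(2018:2020, times = 2),
  GDP = c(1.40e13, 1.43e13, 1.47e13, 2.0e10, 2.2e10, 2.1e10)
)

# scales = "free_y" gives every panel its own y-axis range
p_free <- ggplot(toy_trend, aes(x = Year, y = GDP)) +
  geom_line() +
  facet_wrap(~ Country, scales = "free_y") +
  labs(x = "Year", y = "GDP (in USD)") +
  theme_minimal()

p_free
```

The trade-off is that panels can no longer be compared by height alone, so free scales suit questions about each country's trend rather than about relative size.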

Exercise 1: Faceting with mtcars Dataset

Task: Create a faceted plot showing the relationship between displacement (disp) and horsepower (hp) in the mtcars dataset, faceted by the number of cylinders (cyl).
Hints: Use facet_wrap(~ cyl) or facet_grid(cyl ~ .) for different layouts.

Solution:

ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(title = "Displacement vs. Horsepower by Number of Cylinders",
       x = "Displacement (cu.in.)",
       y = "Horsepower") +
  theme_minimal()

Example 2: Creating Custom Themes

custom_theme <- theme_minimal() +
  theme(
    text = element_text(family = "Times New Roman"),
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12)
  )

# Applying custom theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Car Weight vs. MPG") +
  custom_theme

Exercise 2: Create and Apply a Custom Theme

Task: Define your own custom theme that changes font size, axis labels, and background color. Apply this theme to a plot showing the relationship between mpg and hp in mtcars.
Hints: Use theme() to customize various elements.
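
One possible solution is sketched below. The specific sizes and colors are arbitrary choices, not the only correct answer, and my_theme is a name chosen here for illustration.

```r
library(ggplot2)

# A custom theme: larger base font, styled axis titles, light grey panel
my_theme <- theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "italic", color = "grey20"),
    panel.background = element_rect(fill = "grey95", color = NA)
  )

# Apply the theme to a plot of mpg against hp
p_theme <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Horsepower vs. MPG",
       x = "Horsepower",
       y = "Miles per Gallon") +
  my_theme

p_theme
```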

6.2 Combining Multiple Plots

Use cowplot or patchwork to combine multiple ggplot objects into one.

Example 3: Combining Plots with cowplot

library(cowplot)

# Create first plot
p1 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Horsepower vs. Miles Per Gallon",
       x = "Horsepower",
       y = "Miles per Gallon")

# Create second plot
# wt is continuous, so bin it into 1-unit-wide groups to get one box per bin
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_boxplot(aes(group = cut_width(wt, 1))) +
  labs(title = "Car weight vs. Miles Per Gallon",
       x = "Car weight",
       y = "Miles per Gallon")

# Merging two plots
plot_grid(p1, p2, labels = "AUTO")

Exercise 3: Create a Multi-Plot Layout

Task: Combine three plots into a single layout: a scatter plot (hp vs. mpg), a boxplot (wt vs. mpg), and a histogram of mpg.
Hints: Use plot_grid() or patchwork syntax to arrange the plots.
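
For the plot_grid() route, one possible arrangement nests one grid inside another: the scatter plot and boxplot share the top row and the histogram fills the bottom row. The object names (scatter, box, histo) are chosen here for illustration, and the boxplot bins wt with cut_width() so that each bin gets its own box.

```r
library(ggplot2)
library(cowplot)

# The three individual plots
scatter <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()
box <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_boxplot(aes(group = cut_width(wt, 1)))
histo <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

# Top row holds two plots side by side; the histogram sits underneath
top_row <- plot_grid(scatter, box, labels = c("A", "B"))
layout3 <- plot_grid(top_row, histo, labels = c("", "C"), ncol = 1)

layout3
```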

library(patchwork)

## 
## Attaching package: 'patchwork'

## The following object is masked from 'package:cowplot':
## 
##     align_plots

p3 <- ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Miles Per Gallon", x = "Miles Per Gallon")

(p1 | p2) / p3


6.3 Saving and Exporting Plots

You can save your plots to files using the ggsave function.

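# Build the plot and store it in a variable
gdp_plot <- ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
  geom_line() +
  labs(title = "GDP Over Time",
       x = "Year",
       y = "GDP (in USD)") +
  theme_minimal()

# Save plot to file (width and height are in inches by default)
ggsave("gdp_plot.png", plot = gdp_plot, width = 8, height = 6)

This code saves the plot as an 8 x 6 inch PNG file, gdp_plot.png, in the current working directory.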
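
ggsave() picks the output format from the file extension, so the same plot can be written as a PNG for slides or a PDF for print. A minimal sketch, writing to a temporary directory so nothing in your project is overwritten (the file names are arbitrary):

```r
library(ggplot2)

p_save <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Write into a temporary directory
out_dir <- tempdir()
png_path <- file.path(out_dir, "mtcars_plot.png")
pdf_path <- file.path(out_dir, "mtcars_plot.pdf")

# dpi controls the resolution of raster formats such as PNG
ggsave(png_path, plot = p_save, width = 8, height = 6, units = "in", dpi = 300)
ggsave(pdf_path, plot = p_save, width = 8, height = 6, units = "in")

# Confirm that both files were written
print(file.exists(c(png_path, pdf_path)))
```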

6.4 Interactive Visualizations with ggplot2 and Plotly

Introduction to plotly

plotly: A library for interactive graphics. Its ggplotly() function converts ggplot2 visualizations into interactive plots.

Converting ggplot2 to Plotly

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = 'blue') +
  geom_smooth(method = "loess", color = "pink", se = TRUE) +
  labs(title = "Horsepower vs. Miles Per Gallon",
       x = "Horsepower",
       y = "Miles per Gallon") +
  theme_minimal()

ggplotly(p)

## `geom_smooth()` using formula = 'y ~ x'

Exercise 4: Create an Interactive Plot

Task: Convert a faceted plot (from Exercise 1) into an interactive plot using ggplotly. Customize tooltips to show additional information like car model names.
Hints: Use the tooltip argument in ggplotly() to specify which variables to include.

# Load necessary libraries
library(ggplot2)
library(plotly)

# Create a faceted plot
p <- ggplot(mtcars, aes(x = disp, y = hp, text = rownames(mtcars))) +
  geom_point(color = "blue") +
  facet_wrap(~ cyl) +
  labs(title = "Displacement vs. Horsepower by Number of Cylinders",
       x = "Displacement (cu.in.)",
       y = "Horsepower") +
  theme_minimal()

# Convert to an interactive plot with tooltips
interactive_plot <- ggplotly(p, tooltip = c("text", "x", "y"))

# Display the interactive plot
interactive_plot

Explanation:

•   The text = rownames(mtcars) inside the aes() function allows you to include car model names in the tooltips.

•   The ggplotly(p, tooltip = c("text", "x", "y")) command customizes the tooltip to display the car model name along with the displacement and horsepower values.

Additional Exercises and Q&A (Optional Time)

Exercise 5: Explore the crosstalk package for shared data in interactive plots.

# Load necessary libraries
library(ggplot2)
library(plotly)
library(crosstalk)
library(dplyr)

# Create shared data object
shared_mtcars <- SharedData$new(mtcars)

# Create the first plot (Horsepower vs. MPG)
p1 <- ggplot(shared_mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Horsepower vs. MPG", x = "Horsepower", y = "Miles Per Gallon") +
  theme_minimal()

# Create the second plot (Weight vs. MPG)
p2 <- ggplot(shared_mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Weight vs. MPG", x = "Weight", y = "Miles Per Gallon") +
  theme_minimal()

# Convert to interactive plots
interactive_p1 <- ggplotly(p1)
interactive_p2 <- ggplotly(p2)

# Combine the plots with crosstalk filtering
bscols(
  filter_select("cyl", "Cylinders", shared_mtcars, ~cyl),
  interactive_p1,
  interactive_p2
)

Explanation:

The SharedData$new(mtcars) command creates a shared data object, which allows the plots to filter each other interactively.
The filter_select("cyl", "Cylinders", shared_mtcars, ~cyl) call creates a filter control that users can interact with to filter the plots by the number of cylinders.
bscols() is used to combine the plots and the filter control into a responsive layout.

Exercise 6: Use the shiny package to create a web-based app that allows users to interactively explore the mtcars dataset.

# Load necessary libraries
library(shiny)

## 
## Attaching package: 'shiny'

## The following object is masked from 'package:crosstalk':
## 
##     getDefaultReactiveDomain

library(ggplot2)
library(dplyr)

# Define the UI
ui <- fluidPage(
  titlePanel("Interactive mtcars Data Explorer"),
  
  sidebarLayout(
    sidebarPanel(
      selectInput("xvar", "X-axis Variable", choices = names(mtcars)),
      selectInput("yvar", "Y-axis Variable", choices = names(mtcars)),
      sliderInput("cyl", "Number of Cylinders",
                  min = min(mtcars$cyl),
                  max = max(mtcars$cyl),
                  value = range(mtcars$cyl),
                  step = 2)  # cyl only takes the values 4, 6 and 8
    ),
    
    mainPanel(
      plotOutput("scatterPlot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  
  filtered_data <- reactive({
    mtcars %>%
      filter(cyl >= input$cyl[1] & cyl <= input$cyl[2])
  })
  
  output$scatterPlot <- renderPlot({
    # .data[[...]] looks up the column whose name is stored in the input string
    ggplot(filtered_data(), aes(x = .data[[input$xvar]], y = .data[[input$yvar]])) +
      geom_point(color = "blue") +
      labs(title = paste(input$yvar, "vs", input$xvar),
           x = input$xvar,
           y = input$yvar) +
      theme_minimal()
  })
}

# Run the application
shinyApp(ui = ui, server = server)

## 
## Listening on http://127.0.0.1:5806

Explanation:

The selectInput() functions create dropdowns for selecting the x- and y-axis variables.
The sliderInput() allows the user to filter the dataset by the number of cylinders.
filtered_data() is a reactive expression that updates the data based on the selected number of cylinders.
renderPlot() generates the scatter plot based on the selected variables and filtered data.

This shiny app provides an interactive interface for exploring relationships in the mtcars dataset, allowing users to change the plotted variables and filter the data in real time.
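
Shiny inputs such as input$xvar arrive as character strings. A minimal sketch of mapping a column whose name is held in a string, using ggplot2's .data pronoun in a plain function outside Shiny; scatter_by_name is a hypothetical helper written for this illustration.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of two columns named by strings
scatter_by_name <- function(data, xvar, yvar) {
  ggplot(data, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point() +
    labs(x = xvar, y = yvar)
}

# The column names are ordinary strings, just like Shiny input values
p_named <- scatter_by_name(mtcars, "wt", "mpg")
p_named
```

Testing the plotting logic in a function like this, before wiring it into renderPlot(), can make a Shiny app easier to debug.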

Conclusion

This 18-hour course provided a comprehensive introduction to R for beginners, covering data loading, basic statistics, and data visualization. By the end of the course, you should be comfortable working with data in R and creating meaningful visualizations using ggplot2.

Going further

Here are some highly recommended sources to keep you going:

Books:

“R for Data Science” by Hadley Wickham and Garrett Grolemund - This book is freely available online and is highly recommended for beginners.
“Advanced R” by Hadley Wickham - For those who want to dive deeper into R programming concepts.
“The Art of R Programming” by Norman Matloff - A comprehensive guide to R programming.

Websites and Resources:

R Documentation: The official documentation provided by the R Project.
- CRAN - The Comprehensive R Archive Network
RStudio: The IDE commonly used for R programming also provides excellent learning resources.
- RStudio Education
Stack Overflow: A great community for asking specific programming questions related to R.

YouTube Channels:

R Programming 101: Offers tutorials and practical examples.
- R programming 101

Communities:

r-programming subreddit: A community of R programmers where you can ask questions and find resources.
- r-programming
Cross Validated: Stack Exchange’s site for statistics, data analysis, data mining, and machine learning, where R-related statistical questions are common.
- Cross Validated
