R for data analysis: beginner level
Course Outline
This course is designed for those with no prior experience of data analysis. This 18-hour course on R covers loading datasets, performing basic statistics, and creating data visualizations. The course is divided into six 3-hour sessions.
Session 1: Introduction to R (3 hours)
1.1 Getting Started with R and RStudio
Introduction to R and RStudio
R is a programming language and software environment specifically designed for statistical computing and data analysis. It is widely used by statisticians, data scientists, and researchers for its powerful data manipulation capabilities, extensive statistical techniques, and graphical tools. Here are some key aspects of R:
Key Features of R:
Statistical Analysis:
- R provides a wide range of statistical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
Data Manipulation:
- R includes robust tools for data manipulation and transformation. Packages like dplyr and tidyr allow for efficient data wrangling.
Data Visualization:
- R excels in data visualization. The ggplot2 package, based on the Grammar of Graphics, allows users to create complex and elegant visualizations with ease.
Reproducible Research:
- R supports reproducible research with tools like R Markdown, which integrates code and text in a single document, making it easy to share analysis and results.
Extensible:
- R is highly extensible through packages. The Comprehensive R Archive Network (CRAN) hosts thousands of packages that extend R’s functionality for various domains, from bioinformatics to finance.
Community Support:
- R has a large and active community. Numerous resources, forums, and user-contributed documentation are available, facilitating learning and problem-solving.
Integration:
- R can integrate with other languages and systems. It can call C, C++, and Fortran code and is also capable of interacting with databases and web services.
1.2 Basic R Syntax
Variables and data types
Introduction to different data types:
Numeric: Numbers, which can be either integers or floating-point.
Character: Text or string values.
Logical: Boolean values, either TRUE or FALSE.
Creating variables:
# Creating a numeric variable
num_var <- 42
# Creating a character variable
char_var <- "Hello, R!"
# Creating a logical variable
log_var <- TRUE
# Display the variables
num_var
## [1] 42
char_var
## [1] "Hello, R!"
log_var
## [1] TRUE
Exercises:
Create a variable age and assign it your age.
Create a variable name and assign it your name.
Create a variable is_student and assign it a logical value indicating whether you are a student or not.
Showing the results:
## [1] 25
## [1] "Pierre"
## [1] FALSE
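The code behind these results is not shown. A minimal sketch that would produce them, assuming the example values seen in the output (25, "Pierre", and FALSE):

```r
# Example values taken from the output above; replace them with your own
age <- 25
name <- "Pierre"
is_student <- FALSE

# Typing a variable name at the console prints its value
age
name
is_student
```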
Basic operations
Performing arithmetic operations:
## [1] 8
## [1] 6
## [1] 42
## [1] 5
## [1] 8
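The code that produced these results is not shown. One possible set of operations that yields the same five values is sketched below; the operands are assumptions chosen to match the output, not necessarily the ones used in the course.

```r
# Operands are illustrative assumptions, chosen so the results match the output above
5 + 3    # addition
10 - 4   # subtraction
6 * 7    # multiplication
20 / 4   # division
2^3      # exponentiation
```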
Logical operations:
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] TRUE
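The expressions behind these results are not shown either. As a sketch, the following comparisons and logical operators would give the same sequence of TRUE and FALSE values; the operands are assumptions.

```r
# Comparison operators (operands are illustrative assumptions)
5 > 3     # greater than
3 < 5     # less than
5 >= 5    # greater than or equal to
3 <= 5    # less than or equal to
5 == 5    # equal to
5 != 3    # not equal to

# Logical operators
TRUE & FALSE   # AND: TRUE only if both sides are TRUE
TRUE | FALSE   # OR: TRUE if at least one side is TRUE
```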
Using basic functions:
## [1] 6
## [1] 3
## [1] 4
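A few built-in functions that would return these three values are sketched below. The arguments are assumptions; the course's own calls are not shown.

```r
sum(1, 2, 3)            # adds its arguments
mean(c(1, 2, 3, 4, 5))  # arithmetic mean of a vector
sqrt(16)                # square root
```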
Exercises:
Perform addition, subtraction, multiplication, and division on two numeric variables you create.
## [1] 3
## [1] -1
## [1] 2
## [1] 0.5
Check if the number 10 is greater than 5 and print the result.
## [1] TRUE
Calculate the mean of the numbers 4, 8, 15, 16, 23, 42.
## [1] 18
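A possible solution to the three exercises above is sketched here. The values 1 and 2 are inferred from the printed results; any two numbers would do.

```r
# Two numeric variables (values assumed from the printed results)
a <- 1
b <- 2
a + b
a - b
a * b
a / b

# Is 10 greater than 5?
print(10 > 5)

# Mean of the six numbers
mean(c(4, 8, 15, 16, 23, 42))
```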
Writing and running scripts
Creating and executing R scripts within RStudio:
Open RStudio.
Create a new script by clicking on File -> New File -> R Script.
Write your R code in the script editor.
Save your script with a .R extension.
To run the script, highlight the code and click the Run button, or use the Ctrl+Enter shortcut.
Exercises:
Create a script that assigns two numbers to variables and prints their sum, difference, product, and quotient.
Save the script and run it in RStudio.
Example script (example_script.R):
# This is a comment
# Assign values to variables
x <- 10
y <- 5
# Perform arithmetic operations
sum <- x + y
difference <- x - y
product <- x * y
quotient <- x / y
# Print the results
print(sum)
## [1] 15
print(difference)
## [1] 5
print(product)
## [1] 50
print(quotient)
## [1] 2
1.3 Working with Vectors and Data Frames
Creating and Manipulating Vectors
Understanding vectors:
- Vectors are one-dimensional arrays that can hold numeric, character, or logical data. They are the simplest type of data structure in R and are extremely useful for storing sequences of values.
Creating vectors:
You can create vectors using the c() function, which stands for “combine” or “concatenate.”
# Numeric vector
num_vector <- c(1, 2, 3, 4, 5)
# Character vector
char_vector <- c("apple", "banana", "cherry")
# Logical vector
log_vector <- c(TRUE, FALSE, TRUE)
Creating sequences:
You can create sequences of numbers using the seq() and rep() functions.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 3 5 7 9
## [1] 5 5 5
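The calls that produced these three results are not shown. A sketch that would give the same output, with arguments assumed from the printed values:

```r
seq(1, 10)           # 1 to 10 in steps of 1 (same as 1:10)
seq(1, 10, by = 2)   # odd numbers from 1 to 9
rep(5, times = 3)    # the value 5 repeated three times
```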
Subsetting vectors:
You can access specific elements of a vector using square brackets [].
## [1] 2
## [1] 1 3 5
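As a sketch, these results are consistent with subsetting the num_vector defined earlier; the exact indices used in the course are an assumption. Note that R indexing starts at 1, not 0.

```r
num_vector <- c(1, 2, 3, 4, 5)

num_vector[2]            # second element
num_vector[c(1, 3, 5)]   # first, third, and fifth elements
```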
Vectorized operations:
R allows you to perform operations on entire vectors without the need for explicit loops. This is called vectorization and makes your code more efficient and concise.
## [1] 3 4 5 6 7
## [1] 3 6 9 12 15
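The two results above match what you get by adding 2 to, and multiplying by 3, every element of num_vector. A sketch under that assumption:

```r
num_vector <- c(1, 2, 3, 4, 5)

num_vector + 2   # adds 2 to every element
num_vector * 3   # multiplies every element by 3
```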
Exercises:
Create a numeric vector with the numbers 1 to 10.
Create a character vector with the names of three fashion brands.
Access the third element of the numeric vector and print it.
## [1] 3
Add 10 to each element of the numeric vector and print the result.
## [1] 11 12 13 14 15 16 17 18 19 20
Introduction to data frames
Creating data frames:
# Create a data frame
df <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 35),
is_student = c(TRUE, FALSE, TRUE)
)
# Display the data frame
print(df)
## name age is_student
## 1 Alice 25 TRUE
## 2 Bob 30 FALSE
## 3 Charlie 35 TRUE
Accessing rows and columns:
## [1] "Alice" "Bob" "Charlie"
## name age is_student
## 1 Alice 25 TRUE
## [1] 30
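The code for these three results is not shown. A sketch that would reproduce them on the data frame created above (the exact indexing style used in the course is an assumption):

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, TRUE)
)

df$name        # one column, returned as a vector
df[1, ]        # first row, all columns
df[2, "age"]   # the age value in row 2
```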
Basic manipulations:
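No code survives under this heading. As a minimal sketch of common manipulations (the grade values and the age change are hypothetical):

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, TRUE)
)

# Add a new column (the grade values are hypothetical)
df$grade <- c(15, 12, 18)

# Modify an existing value: set Bob's age to 31
df$age[df$name == "Bob"] <- 31

# Remove a column by assigning NULL
df$is_student <- NULL

print(df)
```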
Exercises:
Create a data frame with three columns: name, age, and is_student. Fill it with your information and that of your neighbors.
# Create a data frame
df <- data.frame(
  name = c("Pierre"),
  age = c(25),
  is_student = c(FALSE)
)
# Display the data frame
print(df)
## name age is_student
## 1 Pierre 25 FALSE
Add a new column grade with some values.
Access and print the age column.
## [1] 25
Access and print the first row.
## name age is_student grade
## 1 Pierre 25 FALSE 20
Basic data frame operations
Sorting data frames:
## name age is_student grade
## 1 Pierre 25 FALSE 20
Filtering data frames:
## [1] name age is_student grade
## <0 rows> (or 0-length row.names)
Summarizing data frames:
## name age is_student grade
## Length:1 Min. :25 Mode :logical Min. :20
## Class :character 1st Qu.:25 FALSE:1 1st Qu.:20
## Mode :character Median :25 Median :20
## Mean :25 Mean :20
## 3rd Qu.:25 3rd Qu.:20
## Max. :25 Max. :20
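The sorting, filtering, and summarizing code is not shown above. A base R sketch that is consistent with the printed results, assuming the single-row data frame from the exercise and a filter on is_student:

```r
# Single-row data frame matching the output above
df <- data.frame(name = "Pierre", age = 25, is_student = FALSE, grade = 20)

# Sort rows by age (order() returns the row positions in ascending order)
sorted_df <- df[order(df$age), ]
print(sorted_df)

# Keep only the rows where is_student is TRUE (none here, hence 0 rows)
students <- df[df$is_student == TRUE, ]
print(students)

# Summary statistics for every column
summary(df)
```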
Exercises
Sort the data frame by the grade column and print the result.
## name age is_student grade
## 1 Pierre 25 FALSE 20
Filter the data frame to include only students aged more than 20 years old and print the result.
# Filter students
students <- df[df$is_student == TRUE, ]
# Keep only students older than 20
filteredstudents <- students[students$age > 20, ]
print(filteredstudents)
## [1] name age is_student grade
## <0 rows> (or 0-length row.names)
Summarize the data frame and print the summary.
## name age is_student grade
## Length:1 Min. :25 Mode :logical Min. :20
## Class :character 1st Qu.:25 FALSE:1 1st Qu.:20
## Mode :character Median :25 Median :20
## Mean :25 Mean :20
## 3rd Qu.:25 3rd Qu.:20
## Max. :25 Max. :20
Session 2: Data Import and Export (3 hours)
2.0 Setting up working directory
Before loading data, it’s important to set the working directory. This tells R where to look for files on your computer. You can set the working directory to the folder where your data files are stored.
# Set the working directory
# Replace "path/to/your/directory" with the actual path to your directory
setwd("/Users/pierrebeaucoral/Documents/Pro/Cours GPE")
# Verify the working directory
getwd()
## [1] "/Users/pierrebeaucoral/Documents/Pro/Cours GPE"
2.1 Loading Data from Files
A CSV (Comma Separated Values) file is a plain text file that contains data separated by commas. It’s a common format for data exchange. We will use the `readr` package to read a CSV file into R.
## Rows: 20580 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): SUBJECT, Sujet, LOCATION, Pays, MEASURE, Mesure, FREQUENCY, Fréque...
## dbl (2): PowerCode Code, Value
## lgl (4): Reference Period Code, Reference Period, Flag Codes, Flags
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 19
## SUBJECT Sujet LOCATION Pays MEASURE Mesure FREQUENCY Fréquence TIME Temps
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 XTEXVA01 Export… ISR Isra… CXMLSA Monna… Q Trimestr… 2022… T2-2…
## 2 XTEXVA01 Export… ISR Isra… CXMLSA Monna… Q Trimestr… 2022… T3-2…
## 3 XTEXVA01 Export… ISR Isra… CXMLSA Monna… Q Trimestr… 2022… T4-2…
## 4 XTEXVA01 Export… ISR Isra… CXMLSA Monna… Q Trimestr… 2023… T1-2…
## 5 XTEXVA01 Export… ISR Isra… CXMLSA Monna… Q Trimestr… 2023… T2-2…
## 6 XTEXVA01 Export… ISR Isra… CXMLSA Monna… Q Trimestr… 2023… T3-2…
## # ℹ 9 more variables: `Unit Code` <chr>, Unit <chr>, `PowerCode Code` <dbl>,
## # PowerCode <chr>, `Reference Period Code` <lgl>, `Reference Period` <lgl>,
## # Value <dbl>, `Flag Codes` <lgl>, Flags <lgl>
## SUBJECT Sujet LOCATION Pays
## Length:20580 Length:20580 Length:20580 Length:20580
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## MEASURE Mesure FREQUENCY Fréquence
## Length:20580 Length:20580 Length:20580 Length:20580
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## TIME Temps Unit Code Unit
## Length:20580 Length:20580 Length:20580 Length:20580
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PowerCode Code PowerCode Reference Period Code Reference Period
## Min. :0.000 Length:20580 Mode:logical Mode:logical
## 1st Qu.:0.000 Class :character NA's:20580 NA's:20580
## Median :9.000 Mode :character
## Mean :6.715
## 3rd Qu.:9.000
## Max. :9.000
## Value Flag Codes Flags
## Min. : -28821.1 Mode:logical Mode:logical
## 1st Qu.: -0.3 NA's:20580 NA's:20580
## Median : 8.2
## Mean : 3878.6
## 3rd Qu.: 55.7
## Max. :1167777.0
## spc_tbl_ [20,580 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ SUBJECT : chr [1:20580] "XTEXVA01" "XTEXVA01" "XTEXVA01" "XTEXVA01" ...
## $ Sujet : chr [1:20580] "Exportations des biens (en valeur)" "Exportations des biens (en valeur)" "Exportations des biens (en valeur)" "Exportations des biens (en valeur)" ...
## $ LOCATION : chr [1:20580] "ISR" "ISR" "ISR" "ISR" ...
## $ Pays : chr [1:20580] "Israël" "Israël" "Israël" "Israël" ...
## $ MEASURE : chr [1:20580] "CXMLSA" "CXMLSA" "CXMLSA" "CXMLSA" ...
## $ Mesure : chr [1:20580] "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" "Monnaie nationale convertie en dollars, corrigée des variations saisonnières" ...
## $ FREQUENCY : chr [1:20580] "Q" "Q" "Q" "Q" ...
## $ Fréquence : chr [1:20580] "Trimestrielle" "Trimestrielle" "Trimestrielle" "Trimestrielle" ...
## $ TIME : chr [1:20580] "2022-Q2" "2022-Q3" "2022-Q4" "2023-Q1" ...
## $ Temps : chr [1:20580] "T2-2022" "T3-2022" "T4-2022" "T1-2023" ...
## $ Unit Code : chr [1:20580] "USD" "USD" "USD" "USD" ...
## $ Unit : chr [1:20580] "Dollar des États-Unis" "Dollar des États-Unis" "Dollar des États-Unis" "Dollar des États-Unis" ...
## $ PowerCode Code : num [1:20580] 9 9 9 9 9 9 9 9 9 9 ...
## $ PowerCode : chr [1:20580] "Milliards" "Milliards" "Milliards" "Milliards" ...
## $ Reference Period Code: logi [1:20580] NA NA NA NA NA NA ...
## $ Reference Period : logi [1:20580] NA NA NA NA NA NA ...
## $ Value : num [1:20580] 17.2 17.2 16.5 14 15.5 ...
## $ Flag Codes : logi [1:20580] NA NA NA NA NA NA ...
## $ Flags : logi [1:20580] NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. SUBJECT = col_character(),
## .. Sujet = col_character(),
## .. LOCATION = col_character(),
## .. Pays = col_character(),
## .. MEASURE = col_character(),
## .. Mesure = col_character(),
## .. FREQUENCY = col_character(),
## .. Fréquence = col_character(),
## .. TIME = col_character(),
## .. Temps = col_character(),
## .. `Unit Code` = col_character(),
## .. Unit = col_character(),
## .. `PowerCode Code` = col_double(),
## .. PowerCode = col_character(),
## .. `Reference Period Code` = col_logical(),
## .. `Reference Period` = col_logical(),
## .. Value = col_double(),
## .. `Flag Codes` = col_logical(),
## .. Flags = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
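The output above is what read_csv(), head(), summary(), and str() print, but the calls themselves are not shown and the file name is not given. The following sketch shows the same workflow on a small stand-in file written to a temporary location; the file contents and the name data_csv are assumptions for illustration.

```r
library(readr)

# Hypothetical stand-in file: the course's actual CSV file name is not shown
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("LOCATION,TIME,Value",
    "ISR,2022-Q2,17.2",
    "ISR,2022-Q3,17.2",
    "ISR,2022-Q4,16.5"),
  csv_path
)

# Read the file into a tibble
data_csv <- read_csv(csv_path)

head(data_csv)     # first rows
summary(data_csv)  # per-column summary statistics
str(data_csv)      # structure: column names and types
```

With your own data, replace csv_path with the path to your file, relative to the working directory set in section 2.0.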
Writing CSV files
We can also write data from R to a CSV file using the write_csv function. This is useful for saving processed data for use in other programs.
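No writing code is shown here. As a minimal sketch, assuming a small made-up data frame and temporary output paths, readr's write_csv() and write_delim() can be used as follows:

```r
library(readr)

# Small example data frame (hypothetical values)
df <- data.frame(Country = c("US", "CN", "IN"), Value = c(1.5, 2.5, 3.5))

# Temporary output paths; in practice, use paths inside your working directory
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
nohdr_path <- tempfile(fileext = ".csv")

write_csv(df, csv_path)                       # comma-separated, with a header row
write_delim(df, txt_path, delim = ";")        # text file with a different delimiter
write_csv(df, nohdr_path, col_names = FALSE)  # CSV without column names

# Read the CSV back to confirm the round trip
read_csv(csv_path, show_col_types = FALSE)
```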
Reading and Writing Excel files
Excel files are widely used for data storage and analysis. We will use the readxl package to read Excel files and the writexl and openxlsx packages to write Excel files.
# Load required packages
library(readxl) # For reading Excel files
library(writexl) # For writing Excel files
library(openxlsx) # For advanced Excel operations
# Read Excel file
data_excel <- read_excel("./Data/SCIM.xls")
# Display the first few rows of the data
head(data_excel)
## # A tibble: 6 × 18
## Pays `Août-2022` `Sept-2022` `Oct-2022` `Nov-2022` `Déc-2022` `Janv-2023`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australie 35.2 36.0 33.9 34.3 35.0 35.5
## 2 Autriche 17.3 17.2 16.9 18.0 18.0 18.5
## 3 Belgique 55.1 54.2 51.0 55.0 53.2 49.4
## 4 Canada 50.9 49.6 48.5 47.6 47.8 49.8
## 5 Chili 8.07 8.34 8.47 8.13 8.43 8.57
## 6 Colombie 4.29 4.68 4.28 4.68 4.44 4.06
## # ℹ 11 more variables: `Févr-2023` <dbl>, `Mars-2023` <dbl>, `Avr-2023` <dbl>,
## # `Mai-2023` <dbl>, `Juin-2023` <dbl>, `Juil-2023` <dbl>, `Août-2023` <dbl>,
## # `Sept-2023` <dbl>, `Oct-2023` <dbl>, `Nov-2023` <chr>, `Déc-2023` <chr>
## Pays Août-2022 Sept-2022 Oct-2022
## Length:50 Min. : 0.6984 Min. : 0.6095 Min. : 0.5321
## Class :character 1st Qu.: 6.9794 1st Qu.: 6.7395 1st Qu.: 6.6275
## Mode :character Median : 24.8843 Median : 24.3044 Median : 20.8480
## Mean : 81.7001 Mean : 79.9567 Mean : 77.7920
## 3rd Qu.: 50.8811 3rd Qu.: 51.2174 3rd Qu.: 48.9129
## Max. :1166.6390 Max. :1139.9140 Max. :1101.9380
## NA's :1 NA's :1 NA's :1
## Nov-2022 Déc-2022 Janv-2023
## Min. : 0.5568 Min. : 0.7034 Min. : 0.5558
## 1st Qu.: 7.4955 1st Qu.: 6.8820 1st Qu.: 6.9891
## Median : 20.8586 Median : 23.0772 Median : 22.1596
## Mean : 79.0860 Mean : 79.9836 Mean : 80.4081
## 3rd Qu.: 48.2815 3rd Qu.: 47.8382 3rd Qu.: 49.4048
## Max. :1125.0540 Max. :1143.7330 Max. :1148.4440
## NA's :1 NA's :1 NA's :1
## Févr-2023 Mars-2023 Avr-2023
## Min. : 0.5883 Min. : 0.5759 Min. : 0.5719
## 1st Qu.: 6.2442 1st Qu.: 7.3807 1st Qu.: 6.9889
## Median : 22.0183 Median : 21.7005 Median : 20.5253
## Mean : 79.9061 Mean : 80.6198 Mean : 79.2318
## 3rd Qu.: 48.0585 3rd Qu.: 48.8357 3rd Qu.: 47.9008
## Max. :1135.7370 Max. :1129.5540 Max. :1114.3250
## NA's :1 NA's :1 NA's :1
## Mai-2023 Juin-2023 Juil-2023
## Min. : 0.5317 Min. : 0.5678 Min. : 0.5119
## 1st Qu.: 6.8126 1st Qu.: 6.6007 1st Qu.: 6.5040
## Median : 21.0225 Median : 21.5573 Median : 20.7020
## Mean : 78.1063 Mean : 77.5048 Mean : 77.6646
## 3rd Qu.: 46.5487 3rd Qu.: 45.2711 3rd Qu.: 47.6173
## Max. :1108.4710 Max. :1107.4590 Max. :1106.7350
## NA's :1 NA's :1 NA's :1
## Août-2023 Sept-2023 Oct-2023 Nov-2023
## Min. : 0.5284 Min. : 0.6117 Min. : 0.4766 Length:50
## 1st Qu.: 6.8339 1st Qu.: 6.5745 1st Qu.: 6.2254 Class :character
## Median : 21.1284 Median : 20.5472 Median : 21.3038 Mode :character
## Mean : 78.1792 Mean : 77.4469 Mean : 76.7197
## 3rd Qu.: 47.4622 3rd Qu.: 47.8237 3rd Qu.: 47.7631
## Max. :1114.5250 Max. :1102.6130 Max. :1097.7230
## NA's :1 NA's :1 NA's :1
## Déc-2023
## Length:50
## Class :character
## Mode :character
##
##
##
##
## tibble [50 × 18] (S3: tbl_df/tbl/data.frame)
## $ Pays : chr [1:50] "Australie" "Autriche" "Belgique" "Canada" ...
## $ Août-2022: num [1:50] 35.2 17.3 55.13 50.88 8.07 ...
## $ Sept-2022: num [1:50] 35.97 17.16 54.21 49.64 8.34 ...
## $ Oct-2022 : num [1:50] 33.94 16.86 50.98 48.49 8.47 ...
## $ Nov-2022 : num [1:50] 34.31 17.95 55 47.59 8.13 ...
## $ Déc-2022 : num [1:50] 34.97 18 53.23 47.77 8.43 ...
## $ Janv-2023: num [1:50] 35.46 18.47 49.4 49.85 8.57 ...
## $ Févr-2023: num [1:50] 33.3 18.2 48.7 48.1 8.4 ...
## $ Mars-2023: num [1:50] 33.83 19.38 48.84 46.48 8.77 ...
## $ Avr-2023 : num [1:50] 29.99 19.5 48.73 47.61 7.64 ...
## $ Mai-2023 : num [1:50] 30.5 18.8 46.1 46.5 7.4 ...
## $ Juin-2023: num [1:50] 29.3 18.36 45.27 45.02 7.89 ...
## $ Juil-2023: num [1:50] 27.85 18.81 47.62 46.22 7.56 ...
## $ Août-2023: num [1:50] 30.2 18.54 46.76 47.46 7.93 ...
## $ Sept-2023: num [1:50] 28.61 17.95 46.68 47.82 7.86 ...
## $ Oct-2023 : num [1:50] 29.7 18.2 45.4 47.8 7.9 ...
## $ Nov-2023 : chr [1:50] "31.088999999999999" "18.0898" "44.97148" "47.747590000000002" ...
## $ Déc-2023 : chr [1:50] ".." ".." ".." ".." ...
# Read a specific sheet by name
data_sheet <- read_excel("./Data/SCIM.xls", sheet = "Sheet1")
# Read a specific range of cells
data_range <- read_excel("./Data/SCIM.xls", range = "A1:D10")
# Write data to an Excel file
write_xlsx(data_excel, "./Data/file.xlsx")
# Write data to a specific sheet
write.xlsx(data_excel, "./Data/file_specific_sheet.xlsx", sheetName = "DataSheet")
# Write multiple data frames to multiple sheets
write.xlsx(list(Sheet1 = data_excel, Sheet2 = data_excel), "./Data/file_multiple_sheets.xlsx")
2.2 Loading Data from Packages
Because R is widely used for data analysis, many data providers publish packages that give direct access to their datasets from within R.
Introduction to the WDI package
The WDI package provides access to the World Bank’s World Development Indicators, which include a wide range of economic, social, and environmental data.
Loading Data from WDI
# Load required package
library(WDI)
# Load GDP data for USA, China, and India from 2000 to 2020
gdp_data <- WDI(country = c("US", "CN", "IN"),
indicator = "NY.GDP.MKTP.CD",
start = 2000,
end = 2020)
head(gdp_data)
## country iso2c iso3c year NY.GDP.MKTP.CD
## 1 China CN CHN 2020 1.468774e+13
## 2 China CN CHN 2019 1.427997e+13
## 3 China CN CHN 2018 1.389491e+13
## 4 China CN CHN 2017 1.231049e+13
## 5 China CN CHN 2016 1.123331e+13
## 6 China CN CHN 2015 1.106157e+13
In the above code, we load the WDI package and then use the WDI function to fetch GDP data for the USA, China, and India from 2000 to 2020. The country parameter takes a vector of country codes, the indicator parameter specifies the type of data (in this case, GDP), and start and end define the time range.
# Rename columns for clarity
colnames(gdp_data) <- c("Country", "iso2c", "iso3c", "Year", "GDP")
head(gdp_data)
## Country iso2c iso3c Year GDP
## 1 China CN CHN 2020 1.468774e+13
## 2 China CN CHN 2019 1.427997e+13
## 3 China CN CHN 2018 1.389491e+13
## 4 China CN CHN 2017 1.231049e+13
## 5 China CN CHN 2016 1.123331e+13
## 6 China CN CHN 2015 1.106157e+13
Here, we rename the columns to make them more understandable.
## Country iso2c iso3c Year
## Length:63 Length:63 Length:63 Min. :2000
## Class :character Class :character Class :character 1st Qu.:2005
## Mode :character Mode :character Mode :character Median :2010
## Mean :2010
## 3rd Qu.:2015
## Max. :2020
## GDP
## Min. :4.684e+11
## 1st Qu.:1.825e+12
## Median :6.087e+12
## Mean :8.032e+12
## 3rd Qu.:1.409e+13
## Max. :2.152e+13
## 'data.frame': 63 obs. of 5 variables:
## $ Country: chr "China" "China" "China" "China" ...
## $ iso2c : chr "CN" "CN" "CN" "CN" ...
## $ iso3c : chr "CHN" "CHN" "CHN" "CHN" ...
## $ Year : int 2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 ...
## $ GDP : num 1.47e+13 1.43e+13 1.39e+13 1.23e+13 1.12e+13 ...
## ..- attr(*, "label")= chr "GDP (current US$)"
## - attr(*, "lastupdated")= chr "2024-06-28"
## - attr(*, "label")= chr [1:63] "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
The summary function provides basic statistics about the dataset, while the str function displays its structure.
# Plot GDP data
library(ggplot2)
ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
geom_line() +
labs(title = "GDP of USA, China, and India (2000-2020)", y = "GDP (current US$)")
Finally, we use the ggplot2 package to create a line plot showing the GDP trends for the three countries over the specified period.
Exercises
Exercise 1: Reading CSV files
Download a CSV file from the internet.
Load the data into R using the readr package.
Display the first 6 rows of the data.
Display the summary and structure of the data.
Write the data to a new CSV file.
Write the data to a text file with a different delimiter.
Write the data to a CSV file without column names.
Exercise 2: Reading and Writing Excel files
Download an Excel file from the internet.
Load the data into R using the readxl package.
Display the first 6 rows of the data.
Display the summary and structure of the data.
Read a specific sheet by name.
Read a specific range of cells.
Write the data to a new Excel file using the writexl package.
Write the data to a specific sheet.
Write multiple data frames to multiple sheets.
Exercise 3: Loading Data from WDI
Install and load the WDI package.
Retrieve data for a different set of countries (e.g., Japan, Germany, Brazil) for a different indicator (e.g., SP.POP.TOTL for total population) from 2000 to 2020.
Rename the columns for clarity.
Display the first 6 rows of the data.
Display the summary and structure of the data.
What is the yearly averaged value of your chosen indicator for your set of countries from 2000 to 2020?
Session 3: Basic Data Manipulation (3 hours)
3.1 Introduction to dplyr
The dplyr package is one of the most powerful tools for data manipulation in R. It provides a set of functions that perform common data manipulation tasks such as filtering rows, selecting columns, arranging data, adding new columns, and summarizing data. The %>% (pipe) operator is often used to chain multiple functions together in a readable manner.
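To make the pipe concrete, here is a small sketch on a made-up numeric vector (not part of the course data). The pipe passes the value on its left as the first argument of the function on its right, so a chain reads left to right instead of inside out.

```r
library(dplyr)

x <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(x))

# The same computation with the pipe reads left to right:
# x %>% sqrt() is equivalent to sqrt(x)
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```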
3.2 Filtering, Selecting, and Arranging Data
Let’s start with some basic operations: filtering, selecting, and arranging data.
# Load the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Filter, select, and arrange data
filtered_data <- gdp_data %>%
filter(Year > 2010) %>%
dplyr::select(Country, Year, GDP) %>%
arrange(desc(GDP))
head(filtered_data)
## Country Year GDP
## 1 United States 2019 2.152140e+13
## 2 United States 2020 2.132295e+13
## 3 United States 2018 2.065652e+13
## 4 United States 2017 1.961210e+13
## 5 United States 2016 1.880491e+13
## 6 United States 2015 1.829502e+13
In the code above, we use the %>% (pipe) operator to chain multiple dplyr functions together:
- filter(Year > 2010) keeps only the rows where the Year is greater than 2010.
- select(Country, Year, GDP) keeps only the specified columns.
- arrange(desc(GDP)) sorts the data in descending order of GDP.
3.3 Adding and Mutating Columns
The mutate function is used to add new columns or modify existing ones.
# Add and mutate columns
gdp_data <- gdp_data %>%
mutate(GDP_in_Billions = GDP / 1e9)
head(gdp_data)
## Country iso2c iso3c Year GDP GDP_in_Billions
## 1 China CN CHN 2020 1.468774e+13 14687.74
## 2 China CN CHN 2019 1.427997e+13 14279.97
## 3 China CN CHN 2018 1.389491e+13 13894.91
## 4 China CN CHN 2017 1.231049e+13 12310.49
## 5 China CN CHN 2016 1.123331e+13 11233.31
## 6 China CN CHN 2015 1.106157e+13 11061.57
Here, we create a new column GDP_in_Billions by dividing the GDP values by 1 billion.
3.4 Summarizing Data
Grouping data and summarizing it with the summarise function are common tasks in data analysis.
# Summarize data
gdp_stats <- gdp_data %>%
group_by(Country) %>%
summarise(mean_gdp = mean(GDP, na.rm = TRUE),
median_gdp = median(GDP, na.rm = TRUE))
gdp_stats
## # A tibble: 3 × 3
## Country mean_gdp median_gdp
## <chr> <dbl> <dbl>
## 1 China 6.93e12 6.09e12
## 2 India 1.56e12 1.68e12
## 3 United States 1.56e13 1.50e13
In this example, we group the data by country and then calculate the mean and median GDP for each country.
3.5 Advanced dplyr Functions
3.5.1 mutate and transmute
The transmute function works like mutate but keeps only the new variables.
# Using mutate
mutated_data <- gdp_data %>%
mutate(GDP_in_Billions = GDP / 1e9,
GDP_in_Millions = GDP / 1e6)
# Using transmute
transmuted_data <- gdp_data %>%
transmute(GDP_in_Billions = GDP / 1e9,
GDP_in_Millions = GDP / 1e6)
head(mutated_data)
## Country iso2c iso3c Year GDP GDP_in_Billions GDP_in_Millions
## 1 China CN CHN 2020 1.468774e+13 14687.74 14687744
## 2 China CN CHN 2019 1.427997e+13 14279.97 14279969
## 3 China CN CHN 2018 1.389491e+13 13894.91 13894908
## 4 China CN CHN 2017 1.231049e+13 12310.49 12310491
## 5 China CN CHN 2016 1.123331e+13 11233.31 11233314
## 6 China CN CHN 2015 1.106157e+13 11061.57 11061573
head(transmuted_data)
## GDP_in_Billions GDP_in_Millions
## 1 14687.74 14687744
## 2 14279.97 14279969
## 3 13894.91 13894908
## 4 12310.49 12310491
## 5 11233.31 11233314
## 6 11061.57 11061573
3.5.2 filter with Multiple Conditions
You can filter data using multiple conditions.
# Filter with multiple conditions
filtered_data <- gdp_data %>%
filter(Year > 2010, GDP > 1e12)
head(filtered_data)
## Country iso2c iso3c Year GDP GDP_in_Billions
## 1 China CN CHN 2020 1.468774e+13 14687.74
## 2 China CN CHN 2019 1.427997e+13 14279.97
## 3 China CN CHN 2018 1.389491e+13 13894.91
## 4 China CN CHN 2017 1.231049e+13 12310.49
## 5 China CN CHN 2016 1.123331e+13 11233.31
## 6 China CN CHN 2015 1.106157e+13 11061.57
3.5.3 select with Helper Functions
The select function supports helper functions to make column selection easier.
# Select columns using helper functions
selected_data <- gdp_data %>%
dplyr::select(starts_with("G"), contains("Year"))
head(selected_data)
## GDP GDP_in_Billions Year
## 1 1.468774e+13 14687.74 2020
## 2 1.427997e+13 14279.97 2019
## 3 1.389491e+13 13894.91 2018
## 4 1.231049e+13 12310.49 2017
## 5 1.123331e+13 11233.31 2016
## 6 1.106157e+13 11061.57 2015
3.5.4 summarise with Multiple Summaries
You can create multiple summaries in one step.
# Multiple summaries
summary_stats <- gdp_data %>%
group_by(Country) %>%
summarise(mean_gdp = mean(GDP, na.rm = TRUE),
median_gdp = median(GDP, na.rm = TRUE),
total_gdp = sum(GDP, na.rm = TRUE))
summary_stats
## # A tibble: 3 × 4
## Country mean_gdp median_gdp total_gdp
## <chr> <dbl> <dbl> <dbl>
## 1 China 6.93e12 6.09e12 1.46e14
## 2 India 1.56e12 1.68e12 3.28e13
## 3 United States 1.56e13 1.50e13 3.28e14
3.6 Joining Data Frames
Sometimes you will need variables that come from different data sources. In those cases, you need to merge data frames to bring all the variables together. dplyr provides several functions for joining data frames: inner_join, left_join, right_join, and full_join.
3.6.1 Inner Join
An inner_join returns only the rows that have matching values in both data frames.
# Example data frames
data1 <- data.frame(Country = c("US", "CN", "IN"), Value1 = 1:3)
data2 <- data.frame(Country = c("US", "CN", "BR"), Value2 = 4:6)
# Inner join
inner_join(data1, data2, by = "Country")
## Country Value1 Value2
## 1 US 1 4
## 2 CN 2 5
Explanation:
- The result will include only the rows where the Country values match in both data frames.
- Here, only “US” and “CN” are common to both data1 and data2, so the result will be:
Country | Value1 | Value2 |
---|---|---|
US | 1 | 4 |
CN | 2 | 5 |
3.6.2 Left Join
A left_join returns all the rows from the left data frame and the matched rows from the right data frame. If there is no match, the result will contain NA for columns from the right data frame.
# Left join
left_join(data1, data2, by = "Country")
## Country Value1 Value2
## 1 US 1 4
## 2 CN 2 5
## 3 IN 3 NA
Explanation:
- The result will include all rows from data1, and the matching rows from data2.
- If there is no match, NA will be used for the missing values from data2.
- Here, “IN” from data1 has no match in data2, so the result will be:
Country | Value1 | Value2 |
---|---|---|
US | 1 | 4 |
CN | 2 | 5 |
IN | 3 | NA |
3.6.3 Right Join
A right_join returns all the rows from the right data frame and the matched rows from the left data frame. If there is no match, the result will contain NA for columns from the left data frame.
# Right join
right_join(data1, data2, by = "Country")
## Country Value1 Value2
## 1 US 1 4
## 2 CN 2 5
## 3 BR NA 6
Explanation:
- The result will include all rows from data2, and the matching rows from data1.
- If there is no match, NA will be used for the missing values from data1.
- Here, “BR” from data2 has no match in data1, so the result will be:
Country | Value1 | Value2 |
---|---|---|
US | 1 | 4 |
CN | 2 | 5 |
BR | NA | 6 |
3.6.4 Full Join
A full_join returns all rows from both the left and right data frames. If there is no match, the result will contain NA for the missing values from either data frame.
# Full join
full_join(data1, data2, by = "Country")
## Country Value1 Value2
## 1 US 1 4
## 2 CN 2 5
## 3 IN 3 NA
## 4 BR NA 6
Explanation:
- The result will include all rows from both data frames.
- If there is no match, NA will be used for the missing values from either data frame.
- The result will be:
Country | Value1 | Value2 |
---|---|---|
US | 1 | 4 |
CN | 2 | 5 |
IN | 3 | NA |
BR | NA | 6 |
Visual Representation
To help visualize these joins, you can think of them as operations on two sets:
Inner Join: Intersection of both sets.
Left Join: All elements from the left set and the intersection.
Right Join: All elements from the right set and the intersection.
Full Join: Union of both sets.
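The set view can be checked directly on the join keys. The sketch below applies base R set functions to the Country columns of the two example data frames; it illustrates which keys each join keeps, not how dplyr implements joins.

```r
data1 <- data.frame(Country = c("US", "CN", "IN"), Value1 = 1:3)
data2 <- data.frame(Country = c("US", "CN", "BR"), Value2 = 4:6)

# Keys kept by an inner join: countries present in both data frames
intersect(data1$Country, data2$Country)

# Keys kept by a full join: countries present in either data frame
union(data1$Country, data2$Country)

# Keys found only in data1: a left join keeps them with NA for Value2
setdiff(data1$Country, data2$Country)
```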
Exercises
Exercise 1: Basic dplyr Operations
Filter the gdp_data to include only data from the year 2015 onwards.
Select the columns Country, Year, and GDP.
Arrange the data in ascending order of GDP.
Add a new column GDP_in_Trillions by dividing the GDP by 1e12.
Group the data by Country and calculate the mean and total GDP.
Exercise 2: Advanced dplyr Functions
Use mutate to add columns GDP_in_Billions and GDP_in_Millions to gdp_data.
Use transmute to create a new data frame with columns GDP_in_Billions and GDP_in_Millions.
Filter the gdp_data to include only rows where Year is greater than 2010 and GDP is greater than 1e12.
Select columns that start with “G” or contain “Year”.
Create multiple summaries for mean_gdp, median_gdp, and total_gdp by grouping the data by Country.
Exercise 3: Joining Data Frames
Create two data frames with a common column.
Perform an inner join on the data frames using the common column.
Perform a left join on the data frames.
Perform a right join on the data frames.
Perform a full join on the data frames.
Session 4: Basic Statistics (3 hours)
4.1 Descriptive Statistics
Descriptive statistics provide simple summaries about the sample and the measures. These summaries are crucial for understanding the distribution and central tendency of the data.
Calculate Mean, Median, and Standard Deviation
Let’s start by calculating some basic descriptive statistics: mean, median, and standard deviation.
# Calculate mean, median, and standard deviation
mean_gdp <- mean(gdp_data$GDP, na.rm = TRUE)
median_gdp <- median(gdp_data$GDP, na.rm = TRUE)
sd_gdp <- sd(gdp_data$GDP, na.rm = TRUE)
mean_gdp
## [1] 8.031945e+12
median_gdp
## [1] 6.087192e+12
sd_gdp
## [1] 6.752724e+12
In the code above:
mean(gdp_data$GDP, na.rm = TRUE) calculates the mean of the GDP values, ignoring any missing values (NA).
median(gdp_data$GDP, na.rm = TRUE) calculates the median of the GDP values, also ignoring NAs.
sd(gdp_data$GDP, na.rm = TRUE) calculates the standard deviation of the GDP values, ignoring NAs.
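To see what na.rm = TRUE changes, here is a small sketch on a made-up vector (the values are hypothetical and not taken from gdp_data):

```r
# Hypothetical vector with one missing value
x <- c(2, 4, 4, 10, NA)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 5
median(x, na.rm = TRUE)  # 4
sd(x, na.rm = TRUE)      # about 3.46 (sample standard deviation, divides by n - 1)
```

Without na.rm = TRUE a single NA makes the whole result NA, which is why the calls above always set it.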
Calculate Range, IQR, and Summary Statistics
In addition to mean, median, and standard deviation, other useful descriptive statistics include the range, interquartile range (IQR), and summary statistics.
# Calculate range
range_gdp <- range(gdp_data$GDP, na.rm = TRUE)
# Calculate interquartile range (IQR)
iqr_gdp <- IQR(gdp_data$GDP, na.rm = TRUE)
# Summary statistics
summary_gdp <- summary(gdp_data$GDP)
range_gdp
## [1] 4.683955e+11 2.152140e+13
iqr_gdp
## [1] 1.226209e+13
summary_gdp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.684e+11 1.825e+12 6.087e+12 8.032e+12 1.409e+13 2.152e+13
range(gdp_data$GDP, na.rm = TRUE) gives the minimum and maximum GDP values.
IQR(gdp_data$GDP, na.rm = TRUE) calculates the interquartile range, which measures the spread of the middle 50% of the data.
summary(gdp_data$GDP) provides a summary of the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values.
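The IQR is simply the third quartile minus the first quartile. The sketch below checks this on a hypothetical vector with one extreme value, which also shows why the IQR is less sensitive to outliers than the range:

```r
# Hypothetical values with one outlier (100)
x <- c(1, 3, 5, 7, 9, 11, 13, 100)

q <- quantile(x, c(0.25, 0.75))     # first and third quartiles
iqr_by_hand <- unname(q[2] - q[1])  # Q3 - Q1

iqr_by_hand      # 7
IQR(x)           # 7, same value
diff(range(x))   # 99, dominated by the outlier
```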
4.2 Visualizing Descriptive Statistics
Visualizations can provide additional insights into the distribution and spread of the data.
Histograms and Density Plots
# Histogram of GDP
hist(gdp_data$GDP, main = "Histogram of GDP", xlab = "GDP", breaks = 30, col = "blue")
# Density plot of GDP
plot(density(gdp_data$GDP, na.rm = TRUE), main = "Density Plot of GDP", xlab = "GDP", col = "red")
hist(gdp_data$GDP, ...) creates a histogram of GDP values.
plot(density(gdp_data$GDP, na.rm = TRUE), ...) creates a density plot, showing the distribution of GDP values.
4.3 Correlation and Regression
Correlation and regression analysis are used to examine the relationships between variables.
Correlation measures the strength and direction of the linear relationship between two variables. For a pair of random variables \((X,Y)\) (for example, Height and Weight), the Pearson correlation coefficient \(r\) is defined as:
\(r_{X,Y}=\frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}\)
where \(\operatorname{cov}\) is the covariance, \(\sigma_{X}\) is the standard deviation of \(X\), and \(\sigma_{Y}\) is the standard deviation of \(Y\). The covariance can be expressed in terms of means and expectations. Since \(\operatorname{cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]\), the formula for \(r\) can also be written as:
\(r_{X,Y}=\frac{\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]}{\sigma_{X}\sigma_{Y}}\) where \(\sigma_{X}\) and \(\sigma_{Y}\) are defined as above, \(\mu_{X}\) is the mean of \(X\), \(\mu_{Y}\) is the mean of \(Y\), and \(\mathbb{E}\) is the expectation. The formula for \(r\) can also be expressed in terms of uncentered moments. Since
\(\begin{aligned}\mu_{X}&=\mathbb{E}[X]\\\mu_{Y}&=\mathbb{E}[Y]\\\sigma_{X}^{2}&=\mathbb{E}\left[(X-\mathbb{E}[X])^{2}\right]=\mathbb{E}\left[X^{2}\right]-\left(\mathbb{E}[X]\right)^{2}\\\sigma_{Y}^{2}&=\mathbb{E}\left[(Y-\mathbb{E}[Y])^{2}\right]=\mathbb{E}\left[Y^{2}\right]-\left(\mathbb{E}[Y]\right)^{2}\\\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]&=\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y],\end{aligned}\)
the formula for \(r\) can also be written as \(r_{X,Y}=\frac{\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y]}{\sqrt{\mathbb{E}\left[X^{2}\right]-\left(\mathbb{E}[X]\right)^{2}}\;\sqrt{\mathbb{E}\left[Y^{2}\right]-\left(\mathbb{E}[Y]\right)^{2}}}.\)
The value of \(r\) ranges from -1 to 1:
\(r=1\) indicates a perfect positive linear relationship.
\(r=-1\) indicates a perfect negative linear relationship.
\(r=0\) indicates no linear relationship.
Calculate Correlation
The cor function computes the correlation coefficient between two variables, indicating the strength and direction of their linear relationship.
# Calculate correlation
correlation <- cor(gdp_data$GDP, gdp_data$Year, use = "complete.obs")
correlation
## [1] 0.4380231
cor(gdp_data$GDP, gdp_data$Year, use = "complete.obs") calculates the correlation between GDP and Year, using only the complete observations (ignoring rows with NAs).
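As a check on the definitions of \(r\) given earlier in this section, the sketch below recomputes the coefficient by hand and compares it with cor(). It uses the built-in mtcars dataset (wt and mpg are chosen only for illustration) so that it runs without gdp_data.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Definition: covariance divided by the product of the standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Uncentered-moment form: (E[XY] - E[X]E[Y]) / (sqrt(E[X^2] - E[X]^2) * sqrt(E[Y^2] - E[Y]^2))
r_moments <- (mean(x * y) - mean(x) * mean(y)) /
  (sqrt(mean(x^2) - mean(x)^2) * sqrt(mean(y^2) - mean(y)^2))

all.equal(r_by_hand, cor(x, y))   # TRUE
all.equal(r_moments, cor(x, y))   # TRUE
```

The moment form uses averages that divide by n while cov() and sd() divide by n - 1, but the factor cancels between numerator and denominator, so both routes give the same \(r\).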
Simple Linear Regression
Simple linear regression models the relationship between a dependent variable (\(Y\)) and an independent variable (\(X\)) using the equation:
\(Y=\beta_0+\beta_1X+\epsilon\)
Where:
\(Y\) is the dependent variable (e.g., GDP).
\(X\) is the independent variable (e.g., Year).
\(\beta_0\) is the intercept (the value of \(Y\) when \(X=0\)).
\(\beta_1\) is the slope of the regression line (the change in \(Y\) for a one-unit change in \(X\)).
\(\epsilon\) is the error term.
The lm function performs a linear regression, modeling one variable as a function of another.
# Simple linear regression of GDP on Year
model <- lm(GDP ~ Year, data = gdp_data)
summary(model)
##
## Call:
## lm(formula = GDP ~ Year, data = gdp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.020e+13 -4.602e+12 -1.945e+12 7.041e+12 9.128e+12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.660e+14 2.559e+14 -3.774 0.000366 ***
## Year 4.846e+11 1.273e+11 3.806 0.000330 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.12e+12 on 61 degrees of freedom
## Multiple R-squared: 0.1919, Adjusted R-squared: 0.1786
## F-statistic: 14.48 on 1 and 61 DF, p-value: 0.0003303
lm(GDP ~ Year, data = gdp_data) fits a linear model predicting GDP based on Year.
summary(model) provides detailed information about the fitted model, including coefficients, R-squared value, and p-values.
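For a single predictor, the least-squares estimates have a closed form: the slope is \(\hat\beta_1 = \operatorname{cov}(X,Y)/\operatorname{var}(X)\) and the intercept is \(\hat\beta_0 = \bar{Y} - \hat\beta_1\bar{X}\). The sketch below verifies this against lm() on the built-in mtcars dataset; the choice of mpg and wt is only an illustration.

```r
# Fit with lm()
fit <- lm(mpg ~ wt, data = mtcars)

# Same estimates from the closed-form expressions
b1 <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)   # slope
b0 <- mean(mtcars$mpg) - b1 * mean(mtcars$wt)       # intercept

all.equal(unname(coef(fit)), c(b0, b1))   # TRUE
```

The intercept is the predicted value at \(X=0\). In the GDP model above that means Year 0, far outside the data, so the intercept there has no practical interpretation and only the slope is of interest.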
Visualizing Regression Results
Visualizing the results of a regression analysis can help interpret the relationship between variables.
# Scatter plot with regression line
plot(gdp_data$Year, gdp_data$GDP, main = "GDP vs Year", xlab = "Year", ylab = "GDP", pch = 19, col = "blue")
abline(model, col = "red")
plot(gdp_data$Year, gdp_data$GDP, ...) creates a scatter plot of GDP vs. Year.
abline(model, col = "red") adds the regression line to the plot.
4.4 More Advanced Regression Techniques
Multiple Linear Regression
Multiple linear regression extends simple linear regression by including multiple independent variables:
\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_k X_k+\epsilon\)
Where:
- \(X_1, X_2, \ldots, X_k\) are the independent variables.
The model estimates the coefficients \(\beta_0,\beta_1,\ldots,\beta_k\) to describe the relationship between the dependent variable and the independent variables.
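The example below extends the simple regression with a second predictor, the Gini coefficient, which is read from a separate file.
# Load the Gini data (fread() comes from the data.table package)
library(data.table)
data <- fread("./Data/WID_Data.csv")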
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## Warning in fread("./Data/WID_Data.csv"): Detected 1 column names but the data
## has 5 columns (i.e. invalid file). Added 4 extra default column names at the
## end.
colnames(data) <- c("Country", "indicator", "percentil", "Year", "Gini")
# Merge GDP data with Gini data
gdp_data <- left_join(gdp_data, data, by = c("Country", "Year"))
# Multiple linear regression
model_multi <- lm(GDP ~ Year + Gini, data = gdp_data)
summary(model_multi)
##
## Call:
## lm(formula = GDP ~ Year + Gini, data = gdp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.047e+12 -7.968e+11 4.474e+11 8.813e+11 1.892e+12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.771e+15 9.115e+13 -19.42 <2e-16 ***
## Year 9.137e+11 4.673e+10 19.55 <2e-16 ***
## Gini -1.079e+14 7.281e+12 -14.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.347e+12 on 39 degrees of freedom
## (21 observations deleted due to missingness)
## Multiple R-squared: 0.9084, Adjusted R-squared: 0.9037
## F-statistic: 193.4 on 2 and 39 DF, p-value: < 2.2e-16
lm(GDP ~ Year + Gini, data = gdp_data) fits a multiple linear regression model predicting GDP based on Year and Gini.
summary(model_multi) provides detailed information about the fitted model.
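With several predictors the coefficients solve the normal equations \(X^\top X\,\hat\beta = X^\top y\), where \(X\) is the design matrix with a leading column of ones. The sketch below checks this against lm() on the built-in mtcars dataset (the predictors wt and hp are chosen only for illustration). lm() itself uses a QR decomposition rather than this direct solve, so treat the code as a check of the algebra, not as a description of the internals.

```r
# Fit with lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Design matrix: intercept column plus the two predictors
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

all.equal(unname(coef(fit)), as.vector(beta_hat))   # TRUE
```

Each coefficient describes the change in the response for a one-unit change in that predictor with the others held fixed. This is why Gini can be significant in the model with Year (as in the output above) even though it is not significant on its own in the exercise solution below.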
Exercises
Exercise 1: Descriptive Statistics
Calculate the mean, median, and standard deviation of another variable in the dataset.
Calculate the range and IQR of the same variable.
Create a histogram and a boxplot for the variable.
# Example solution
mean_Gini <- mean(gdp_data$Gini, na.rm = TRUE)
median_Gini <- median(gdp_data$Gini, na.rm = TRUE)
sd_Gini <- sd(gdp_data$Gini, na.rm = TRUE)
range_Gini <- range(gdp_data$Gini, na.rm = TRUE)
iqr_Gini <- IQR(gdp_data$Gini, na.rm = TRUE)
mean_Gini
## [1] 0.5718849
median_Gini
## [1] 0.5620298
sd_Gini
## [1] 0.03933434
range_Gini
## [1] 0.4979644 0.6336183
iqr_Gini
## [1] 0.05114127
Exercise 2: Correlation and Regression
Calculate the correlation between GDP and another variable.
Fit a linear regression model predicting GDP based on another variable (e.g., Gini) and interpret the results.
Fit a multiple linear regression model predicting GDP based on multiple predictors and interpret the results.
Visualize the regression results with scatter plots and regression lines.
# Example solution
correlation_gini <- cor(gdp_data$GDP, gdp_data$Gini, use = "complete.obs")
correlation_gini
## [1] -0.1022499
# Simple linear regression with Gini
model_gini <- lm(GDP ~ Gini, data = gdp_data)
summary(model_gini)
##
## Call:
## lm(formula = GDP ~ Gini, data = gdp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.494e+12 -2.967e+12 -1.577e+12 1.522e+12 1.035e+13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.070e+13 9.946e+12 1.076 0.289
## Gini -1.128e+13 1.735e+13 -0.650 0.519
##
## Residual standard error: 4.37e+12 on 40 degrees of freedom
## (21 observations deleted due to missingness)
## Multiple R-squared: 0.01046, Adjusted R-squared: -0.01428
## F-statistic: 0.4226 on 1 and 40 DF, p-value: 0.5193
# Multiple linear regression with Year and Gini
model_multi <- lm(GDP ~ Year + Gini, data = gdp_data)
summary(model_multi)
##
## Call:
## lm(formula = GDP ~ Year + Gini, data = gdp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.047e+12 -7.968e+11 4.474e+11 8.813e+11 1.892e+12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.771e+15 9.115e+13 -19.42 <2e-16 ***
## Year 9.137e+11 4.673e+10 19.55 <2e-16 ***
## Gini -1.079e+14 7.281e+12 -14.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.347e+12 on 39 degrees of freedom
## (21 observations deleted due to missingness)
## Multiple R-squared: 0.9084, Adjusted R-squared: 0.9037
## F-statistic: 193.4 on 2 and 39 DF, p-value: < 2.2e-16
4.5 Hypothesis Testing
Hypothesis testing allows us to make inferences about a population (here, the GDP of different groups of countries) based on sample data.
T-Test
A t-test is used to compare the means of two groups. The test statistic \(t\) is calculated using the formula:
\(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)
Where:
\(\bar{X}_1\) and \(\bar{X}_2\) are the sample means of the two groups.
\(s_1^2\) and \(s_2^2\) are the sample variances of the two groups.
\(n_1\) and \(n_2\) are the sample sizes of the two groups.
The t-test compares the calculated \(t\) value to the critical value from the t-distribution to determine if the difference in means is statistically significant.
Performing a t-test to compare means.
# Creating a category to get two different samples
gdp_data <- gdp_data %>%
mutate(region = case_when(
Country %in% c("China", "India") ~ "Asia",
Country == "United States" ~ "America",
TRUE ~ NA_character_ # This handles cases where the country is neither China, India, nor USA
))
# Performing T-test between those two categories
t_test <- t.test(gdp_data$GDP ~ gdp_data$region)
t_test
##
## Welch Two Sample t-test
##
## data: gdp_data$GDP by gdp_data$region
## t = 11.106, df = 48.099, p-value = 7.104e-15
## alternative hypothesis: true difference in means between group America and group Asia is not equal to 0
## 95 percent confidence interval:
## 9.298051e+12 1.340854e+13
## sample estimates:
## mean in group America mean in group Asia
## 1.560081e+13 4.247513e+12
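t.test(gdp_data$GDP ~ gdp_data$region) performs a t-test comparing the mean GDP of the two regions. Because R does not assume equal variances by default, this is Welch's version of the test, as the output header shows.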
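The \(t\) formula given at the start of this section can be reproduced by hand. The sketch below uses the built-in mtcars dataset, comparing mpg for automatic (am == 0) and manual (am == 1) cars; the grouping is chosen only for illustration and stands in for the two regions.

```r
# Two groups to compare
g1 <- mtcars$mpg[mtcars$am == 0]   # automatic transmission
g2 <- mtcars$mpg[mtcars$am == 1]   # manual transmission

# t statistic from the formula: difference in means over its standard error
t_by_hand <- (mean(g1) - mean(g2)) /
  sqrt(var(g1) / length(g1) + var(g2) / length(g2))

# t.test() uses the Welch version by default (var.equal = FALSE)
res <- t.test(g1, g2)

all.equal(t_by_hand, unname(res$statistic))   # TRUE
res$p.value
```

The p-value additionally depends on the Welch degrees of freedom, which t.test() reports as df and which are generally not a whole number.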
Exercises
Exercise 3: Hypothesis Testing
- Perform a t-test to compare the means of GDP for two different groups (e.g., developed vs. developing countries).
4.6 Summarizing Results with Tables and Reports
Summarizing results in tables and generating reports can be useful for communicating findings.
Creating Summary Tables
Using the dplyr package to summarize data.
# Summarize data
gdp_summary <- gdp_data %>%
group_by(region) %>%
summarise(mean_gdp = mean(GDP, na.rm = TRUE),
median_gdp = median(GDP, na.rm = TRUE),
sd_gdp = sd(GDP, na.rm = TRUE))
gdp_summary
## # A tibble: 2 × 4
## region mean_gdp median_gdp sd_gdp
## <chr> <dbl> <dbl> <dbl>
## 1 America 1.56e13 1.50e13 3.54e12
## 2 Asia 4.25e12 2.19e12 4.34e12
group_by(region) groups the data by region.
summarise(mean_gdp = mean(GDP, na.rm = TRUE), ...) calculates the mean, median, and standard deviation of GDP for each region.
Session 5: Data Visualization with ggplot2 (3 hours)
5.1 Introduction to ggplot2
Overview
The ggplot2 package is one of the most powerful and flexible tools for creating complex, multi-layered graphics in R. It implements the Grammar of Graphics, a framework that breaks down plots into semantic components such as layers, scales, and themes.
Grammar of Graphics: The core idea is to build plots by combining independent components, making it easier to customize and create complex visualizations.
Advantages: Highly customizable, works well with dplyr and other tidyverse packages, and produces publication-quality plots.
Basic Concepts
Aesthetic Mappings (aes()): This function defines how data variables are mapped to visual properties like color, size, and shape.
Geometries (geom_*): Geometries define the type of plot, such as points (geom_point), lines (geom_line), and bars (geom_bar).
Layers: You can add multiple layers of geometries to a plot.
Scales and Coordinate Systems: Adjust the scales and coordinate systems for finer control over the plot appearance.
Themes: Themes allow you to control the non-data elements of the plot, such as background, grid lines, and text formatting.
Example: Basic Scatter Plot
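A minimal scatter plot might look like the sketch below. It assumes ggplot2 is installed and uses the built-in mtcars dataset; the variables wt and mpg and the object name p_basic are chosen only for illustration.

```r
library(ggplot2)

# Data and aesthetic mappings, then one geometry layer, labels, and a theme
p_basic <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Car Weight vs. Miles per Gallon",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon") +
  theme_minimal()

p_basic
```

Each `+` adds one component from the list above: the ggplot() call supplies the data and aesthetic mappings, geom_point() adds a layer, and theme_minimal() sets the non-data appearance.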
Exercise 1: Customizing Your First Plot
Objective: Create a scatter plot showing the relationship between disp (displacement) and mpg (miles per gallon) in the mtcars dataset. Customize the plot by changing the color, size, and shape of the points.
Instructions:
- Change the color of the points to red.
- Adjust the size of the points to 3.
- Use triangles for the point shapes.
5.2 Scatter Plots and Line Plots
Creating scatter plots and line plots can help visualize relationships between variables.
# Scatter plot of GDP vs Year
ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
geom_point() +
labs(title = "GDP Over Time",
x = "Year",
y = "GDP (in USD)") +
theme_minimal()
In the scatter plot above, we plot GDP against Year, with different colours for each country.
# Line plot of GDP over time
ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
geom_line() +
labs(title = "GDP Over Time",
x = "Year",
y = "GDP (in USD)") +
theme_minimal()
The line plot shows how GDP changes over time for each country. We may also want to show the overall relationship between the two variables, pooling all countries and adding a fitted regression line.
# Scatter plot of GDP vs Year
ggplot(gdp_data, aes(x = Year, y = GDP)) +
geom_point(color="blue") +
geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(title = "GDP Over Time",
x = "Year",
y = "GDP (in USD)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Exercise 2: Exploring Relationships with Scatter Plots
Create a scatter plot showing the relationship between horsepower (hp) and miles per gallon (mpg) in the mtcars dataset. Then, add a linear regression line.
# Scatter plot example
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = 'blue') +
labs(title = "Horsepower vs. Miles Per Gallon",
x = "Horsepower",
y = "Miles per Gallon") +
theme_minimal()
# Adding a linear regression line (method = "lm")
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = 'blue') +
geom_smooth(method = "lm", color = "pink", se = TRUE) +
labs(title = "Horsepower vs. Miles Per Gallon",
x = "Horsepower",
y = "Miles per Gallon") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
5.3 Bar Plots and Histograms
Bar plots and histograms are useful for comparing categorical data and visualizing data distributions.
Bar Plots
Bar plots are used to compare categorical data.
Exercise 3: Creating Bar Plots
Objective: Create a bar plot to compare the number of cars with different numbers of cylinders in the mtcars dataset.
Instructions:
- Use the cyl variable (number of cylinders) in the mtcars dataset to create a bar plot.
- Count the number of cars for each cylinder category.
- Customize the bar plot by adding appropriate labels, colors, and a title.
# Load necessary libraries
library(ggplot2)
# Create a bar plot for the number of cars with different numbers of cylinders
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar(fill = "lightblue", color = "black") +
labs(title = "Number of Cars by Cylinder Count",
x = "Number of Cylinders",
y = "Count of Cars") +
theme_minimal()
Histograms
Histograms display the distribution of a single variable.
# Histogram of GDP
ggplot(gdp_data, aes(x = GDP)) +
geom_histogram(binwidth = 1e12, fill = "skyblue", color = "black") +
labs(title = "GDP Distribution",
x = "GDP (in USD)",
y = "Frequency") +
theme_minimal()
The histogram shows the distribution of GDP values.
Exercise 4: Creating Histograms
Objective: Create a histogram to explore the distribution of car weights (wt) in the mtcars dataset.
Instructions:
- Create a histogram of the wt variable.
- Adjust the bin width to show a more detailed distribution.
- Customize the histogram by changing the colors, adding labels, and giving it a title.
5.4 Creating Maps with ggplot2
The ggplot2 package allows for the creation of maps in R. By combining it with the sf package for working with spatial data and incorporating variables like GDP, we can create maps that visualize different metrics over geographic regions.
Example: World Map Colored by GDP
In this example, we will create a map that colors countries according to their GDP values. For simplicity, let’s assume you have GDP data for countries in a dataframe.
Step 1: Prepare GDP Data
Ensure that you have country-level GDP data. For this example, assume that gdp_data is a dataframe containing the GDP of different countries, where the country names match those in the world dataset.
## Country iso2c iso3c Year GDP GDP_in_Billions
## 1 China CN CHN 2020 1.468774e+13 14687.74
## 2 China CN CHN 2019 1.427997e+13 14279.97
## 3 China CN CHN 2018 1.389491e+13 13894.91
## 4 China CN CHN 2017 1.231049e+13 12310.49
## 5 China CN CHN 2016 1.123331e+13 11233.31
## 6 China CN CHN 2015 1.106157e+13 11061.57
## indicator
## 1 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 2 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 3 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 4 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 5 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## 6 gptinc_pall_992_j_CN\nPre-tax national income \nTotal population | Gini coefficient | adults | equal split\n
## percentil Gini region
## 1 pall 0.5641360 Asia
## 2 pall 0.5579220 Asia
## 3 pall 0.5591818 Asia
## 4 pall 0.5613709 Asia
## 5 pall 0.5537136 Asia
## 6 pall 0.5554577 Asia
Step 2: Join GDP Data with Spatial Data
You will need to join the GDP data with the geographic data from sf using a common identifier, such as the country name.
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
# Load the world map data from the 'rnaturalearth' package
world <- st_as_sf(rnaturalearth::ne_countries(scale = "medium", returnclass = "sf"))
# Rename variable name country
names(world)[names(world) == "name"] <- "Country"
world <- world %>%
mutate(Country = ifelse(Country == "United States of America", "United States", Country))
# Join GDP data with world data
world_gdp <- left_join(world, gdp_stats, by="Country")
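The join above relies on gdp_stats, a per-country summary that contains a mean_gdp column. If that object is not already in your workspace, one way to build such a table is sketched below. The toy input gdp_toy and the name gdp_stats_sketch are invented for illustration, and the definition actually used in the course may differ.

```r
library(dplyr)

# Hypothetical input with the same key columns as gdp_data
gdp_toy <- data.frame(
  Country = c("China", "China", "India", "India"),
  Year    = c(2019, 2020, 2019, 2020),
  GDP     = c(14.3e12, 14.7e12, 2.8e12, 2.7e12)
)

# One row per country with its average GDP
gdp_stats_sketch <- gdp_toy %>%
  group_by(Country) %>%
  summarise(mean_gdp = mean(GDP, na.rm = TRUE))

gdp_stats_sketch
```

Because the map is joined by country name, any country whose spelling differs between the two tables ends up with NA and is drawn in the na.value colour.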
Customizing the GDP Map
You can further customize the appearance of the map, adjusting titles, labels, and map colors to enhance the visual representation.
Code: Customizing the GDP Map
ggplot(data = world_gdp) +
geom_sf(aes(fill = mean_gdp), color = "white") +
scale_fill_viridis_c(option = "plasma", trans = "log", na.value = "grey") +
labs(title = "Average GDP Distribution from 2000 to 2020",
fill = "GDP (in billions)") +
theme_minimal() +
theme(legend.position = "bottom")
This version uses a logarithmic scale for better visualization of GDP values and adjusts the map’s theme and legend position for clarity.
Exercise 5: Mapping GDP per Capita
In this exercise, you will create a map that displays countries colored by GDP per capita.
- Add a column to the world_gdp dataframe for GDP per capita.
- Create a map that colors the countries according to their GDP per capita.
Code Solution
# Assuming the 'pop_est' column contains population estimates
world_gdp$gdp_per_capita <- world_gdp$mean_gdp / world_gdp$pop_est
# Plot the GDP per capita map
ggplot(data = world_gdp) +
geom_sf(aes(fill = gdp_per_capita)) +
scale_fill_viridis_c(option = "magma", trans = "log", na.value = "grey") +
labs(title = "World GDP per Capita",
fill = "GDP per Capita") +
theme_minimal()
Exercise 6: Annotating Countries with Highest GDP
In this exercise, you will annotate the country with the highest GDP.
- Modify the GDP map to annotate the top country by GDP.
- Use the geom_text() function to place the country name on the map.
Code Solution
# Find top 1 countries by GDP
top_gdp_countries <- world_gdp %>%
top_n(1, mean_gdp)
# Plot the map with annotations
ggplot(data = world_gdp) +
geom_sf(aes(fill = mean_gdp)) +
geom_text(data = top_gdp_countries, aes(label = Country, geometry = geometry),
stat = "sf_coordinates", size = 3, color = "black") +
scale_fill_viridis_c(option = "plasma", trans = "log", na.value = "grey") +
labs(title = "World GDP Map with Top Countries") +
theme_minimal()
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
Session 6: Advanced Data Visualization (3 hours)
6.1 Faceting and Themes
Concepts:
Faceting: Allows splitting data into multiple plots based on a factor variable.
Themes: Customize the appearance of your plots.
Example 1: Faceting by Country
# Facet plot by country
ggplot(gdp_data, aes(x = Year, y = GDP)) +
geom_line() +
facet_wrap(~ Country) +
labs(title = "GDP Over Time by Country",
x = "Year",
y = "GDP (in USD)") +
theme_minimal()
The above code creates a separate plot for each country, displaying GDP trends over time. One can also create custom themes to standardize the appearance of all plots.
Exercise 1: Faceting with mtcars Dataset
Task: Create a faceted plot showing the relationship between displacement (disp) and horsepower (hp) in the mtcars dataset, faceted by the number of cylinders (cyl).
Hints: Use facet_wrap(~ cyl) or facet_grid(cyl ~ .) for different layouts.
Solution:
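One possible solution is sketched below. It assumes ggplot2 is installed; the object name p_facet is only illustrative.

```r
library(ggplot2)

# Displacement vs. horsepower, one panel per cylinder count
p_facet <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(title = "Displacement vs. Horsepower by Number of Cylinders",
       x = "Displacement (cu.in.)",
       y = "Horsepower") +
  theme_minimal()

p_facet
```

Replacing facet_wrap(~ cyl) with facet_grid(cyl ~ .) stacks the three panels vertically instead of placing them side by side.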
Example 2: Creating Custom Themes
custom_theme <- theme_minimal() +
theme(
text = element_text(family = "Times New Roman"),
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12)
)
# Applying custom theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(title = "Car Weight vs. MPG") +
custom_theme
Exercise 2: Create and Apply a Custom Theme
Task: Define your own custom theme that changes font size, axis labels, and background color. Apply this theme to a plot showing the relationship between mpg and hp in mtcars.
Hints: Use theme() to customize various elements.
6.2 Combining Multiple Plots
Use cowplot or patchwork to combine multiple ggplot objects into one.
Example 3: Combining Plots with cowplot
library(cowplot)
#Create first plot
p1 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point()+
labs(title = "Horsepower vs. Miles Per Gallon",
x = "Horsepower",
y = "Miles per Gallon")
#Create second plot
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_boxplot()+
labs(title = "Car weight vs. Miles Per Gallon",
x = "Car weight",
y = "Miles per Gallon")
# Merging two plots
plot_grid(p1, p2, labels = "AUTO")
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
Exercise 3: Create a Multi-Plot Layout
Task: Combine three plots into a single layout: a scatter plot (hp vs. mpg), a boxplot (wt vs. mpg), and a histogram of mpg.
Hints: Use plot_grid() or patchwork syntax to arrange the plots.
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:cowplot':
##
## align_plots
p3 <- ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
labs(title = "Distribution of Miles Per Gallon", x = "Miles Per Gallon")
(p1 | p2) / p3
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
6.3 Saving and Exporting Plots
You can save your plots to files using the ggsave function.
# Save plot to file
plot <- ggplot(gdp_data, aes(x = Year, y = GDP, color = Country)) +
geom_line() +
labs(title = "GDP Over Time",
x = "Year",
y = "GDP (in USD)") +
theme_minimal()
ggsave("gdp_plot.png", plot = plot, width = 8, height = 6)
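This code saves the plot as gdp_plot.png in the current working directory, 8 inches wide and 6 inches high (ggsave measures width and height in inches by default). The file format is inferred from the file extension.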
6.4 Interactive Visualizations with ggplot2 and Plotly
Introduction to plotly
- plotly: A library that converts ggplot2 visualizations into interactive plots.
Converting ggplot2 to Plotly
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = 'blue') +
geom_smooth(method = "loess", color= "pink", se = TRUE)+
labs(title = "Horsepower vs. Miles Per Gallon",
x = "Horsepower",
y = "Miles per Gallon") +
theme_minimal()
ggplotly(p)
## `geom_smooth()` using formula = 'y ~ x'
Exercise 4: Create an Interactive Plot
Task: Convert a faceted plot (from Exercise 1) into an interactive plot using ggplotly. Customize tooltips to show additional information like car model names.
Hints: Use the tooltip argument in ggplotly() to specify which variables to include.
# Load necessary libraries
library(ggplot2)
library(plotly)
# Create a faceted plot
p <- ggplot(mtcars, aes(x = disp, y = hp, text = rownames(mtcars))) +
geom_point(color = "blue") +
facet_wrap(~ cyl) +
labs(title = "Displacement vs. Horsepower by Number of Cylinders",
x = "Displacement (cu.in.)",
y = "Horsepower") +
theme_minimal()
# Convert to an interactive plot with tooltips
interactive_plot <- ggplotly(p, tooltip = c("text", "x", "y"))
# Display the interactive plot
interactive_plot
Explanation:
• The text = rownames(mtcars) inside the aes() function allows you to include car model names in the tooltips.
• The ggplotly(p, tooltip = c("text", "x", "y")) command customizes the tooltip to display the car model name along with the displacement and horsepower values.
Additional Exercises and Q&A (Optional Time)
Exercise 6: Use the shiny package to create a web-based app that allows users to interactively explore the mtcars dataset.
##
## Attaching package: 'shiny'
## The following object is masked from 'package:crosstalk':
##
## getDefaultReactiveDomain
library(shiny)
library(ggplot2)
library(dplyr)
# Define the UI
ui <- fluidPage(
titlePanel("Interactive mtcars Data Explorer"),
sidebarLayout(
sidebarPanel(
selectInput("xvar", "X-axis Variable", choices = names(mtcars)),
selectInput("yvar", "Y-axis Variable", choices = names(mtcars)),
sliderInput("cyl", "Number of Cylinders",
min = min(mtcars$cyl),
max = max(mtcars$cyl),
value = range(mtcars$cyl),
step = 1)
),
mainPanel(
plotOutput("scatterPlot")
)
)
)
# Define the server logic
server <- function(input, output) {
filtered_data <- reactive({
mtcars %>%
filter(cyl >= input$cyl[1] & cyl <= input$cyl[2])
})
output$scatterPlot <- renderPlot({
# aes() with the .data pronoun replaces the deprecated aes_string()
ggplot(filtered_data(), aes(x = .data[[input$xvar]], y = .data[[input$yvar]])) +
geom_point(color = "blue") +
labs(title = paste(input$yvar, "vs", input$xvar),
x = input$xvar,
y = input$yvar) +
theme_minimal()
})
}
# Run the application
shinyApp(ui = ui, server = server)
##
## Listening on http://127.0.0.1:5806
Explanation:
The selectInput() functions create dropdowns for selecting the x- and y-axis variables.
The sliderInput() allows the user to filter the dataset by the number of cylinders.
filtered_data() is a reactive expression that updates the data based on the selected number of cylinders.
renderPlot() generates the scatter plot based on the selected variables and filtered data.
This shiny app provides an interactive interface for exploring relationships in the mtcars dataset, allowing users to dynamically change the variables plotted and filter data in real-time.
Conclusion
This 18-hour course provided a comprehensive introduction to R for beginners, covering data loading, basic statistics, and data visualization. By the end of the course, you should be comfortable working with data in R and creating meaningful visualizations using ggplot2.
Going further
Here are some highly recommended sources to keep you going:
Books:
“R for Data Science” by Garrett Grolemund and Hadley Wickham - This book is freely available online and is highly recommended for beginners.
“Advanced R” by Hadley Wickham - For those who want to dive deeper into R programming concepts.
“The Art of R Programming” by Norman Matloff - A comprehensive guide to R programming.
Websites and Resources:
R Documentation: The official documentation provided by the R Project.
- CRAN - The Comprehensive R Archive Network
RStudio: The IDE commonly used for R programming also provides excellent learning resources.
Stack Overflow: A great community for asking specific programming questions related to R.
YouTube Channels:
R Programming 101: Offers tutorials and practical examples.
Communities:
r-programming subreddit: A community of R programmers where you can ask questions and find resources.
Cross Validated: Stack Exchange’s site for statistics, data analysis, data mining, and machine learning using R.