TD 1 — Régression de Poisson (R)

Année universitaire 2025–2026 • Parcours Économie de la santé & Développement durable

Pierre Beaucoral

2025-11-27

1 Objectifs du TD

  • Construire un tableau comptages + exposition (population à risque).
  • Estimer un GLM Poisson avec offset (log-exposition).
  • Évaluer l’ajustement : Déviance & Pearson.
  • Comparer des modèles (LR test).
  • Interpréter en IRR et produire des comptes attendus.
  • Diagnostiquer la sur-dispersion et évoquer Quasi-Poisson / NB.

2 Rappel rapide — Modèle de Poisson

Formulation

  • \(Y_i \sim \text{Poisson}(\mu_i)\), \(\mathbb{E}[Y_i]=\mu_i\), \(\mathrm{Var}(Y_i)=\mu_i\)
  • Lien log : \(\log(\mu_i) ;=; \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \log(\text{exposition}_i)\)
  • L’offset \(\log(\text{exposition}_i)\) ajuste la taille de population/temps.
  • Interprétation : \(e^{\beta_j} = \text{IRR}_j\) (ratio de taux d’incidence).

Diagnostics : Déviance, Pearson, sur-dispersion (sinon Quasi-Poisson / NB).

3 Données & packages

Contexte : décès cancer du poumon × classes d’âge × statut tabagique.
Exposition = population (centaines de milliers → convertir en personnes).

Code
#| label: setup
# installer si nécessaire : install.packages(c("readxl","dplyr","tidyr","janitor","ggplot2","broom","gt","performance","DescTools"))
library(readxl)
library(dplyr)
library(tidyr)
library(janitor)
library(ggplot2)
library(broom)
library(gt)
library(performance)   # check_overdispersion
library(DescTools)     # PChisq for GOF (si besoin)

4 Import & nettoyage

Code
data_path <- "data/smoking_dat.xlsx"

df <- read_excel(data_path) |>
  clean_names() |>
  rename(
    age = matches("^age$|^age_class|^agecat"),
    smoking_status = matches("^smoking|^smoke"),
    population = matches("^pop|^population"),
    deaths = matches("^deaths|^dead")
  )

glimpse(df)
Rows: 36
Columns: 4
$ age            <chr> "40-44", "45-59", "50-54", "55-59", "60-64", "65-69", "…
$ smoking_status <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ population     <dbl> 656, 359, 249, 632, 1067, 897, 668, 361, 274, 145, 104,…
$ deaths         <dbl> 18, 22, 19, 55, 117, 170, 179, 120, 120, 2, 4, 3, 38, 1…

5 Encodage des facteurs

Code
df <- df |>
  mutate(
    age = factor(age, ordered = FALSE),
    smoking_status = factor(smoking_status, ordered = FALSE)
  )

levels(df$age); levels(df$smoking_status)
[1] "40-44" "45-59" "50-54" "55-59" "60-64" "65-69" "70-74" "75-79" "80+"  
[1] "cigarPipeOnly"  "cigarretteOnly" "cigarrettePlus" "no"            

6 Exposition & échelle

Population en centaines de milliers → convertir en personnes.

Code
expo_personnes <- TRUE
scale_factor <- 1e5

df <- df |>
  mutate(
    exposure = if (expo_personnes) population * scale_factor else population
  )

summary(df$exposure)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
  9800000  36925000  85850000 155894444 230550000 605200000 

7 Modèle Poisson (mod1)

Code
mod1 <- glm(deaths ~ smoking_status + age + offset(log(exposure)),
            family = poisson(link = "log"), data = df)
summary(mod1)

Call:
glm(formula = deaths ~ smoking_status + age + offset(log(exposure)), 
    family = poisson(link = "log"), data = df)

Coefficients:
                              Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                  -15.14514    0.06783 -223.293  < 2e-16 ***
smoking_statuscigarretteOnly   0.36915    0.03791    9.737  < 2e-16 ***
smoking_statuscigarrettePlus   0.17015    0.03643    4.671 3.00e-06 ***
smoking_statusno              -0.04781    0.04699   -1.017    0.309    
age45-59                       0.55388    0.07999    6.924 4.38e-12 ***
age50-54                       0.98039    0.07682   12.762  < 2e-16 ***
age55-59                       1.37946    0.06526   21.138  < 2e-16 ***
age60-64                       1.65423    0.06257   26.439  < 2e-16 ***
age65-69                       1.99817    0.06279   31.824  < 2e-16 ***
age70-74                       2.27141    0.06435   35.296  < 2e-16 ***
age75-79                       2.55858    0.06778   37.746  < 2e-16 ***
age80+                         2.84692    0.07242   39.310  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4055.984  on 35  degrees of freedom
Residual deviance:   21.487  on 24  degrees of freedom
AIC: 285.51

Number of Fisher Scoring iterations: 4

Idée : on modélise un taux (décès / exposition). L’offset impose un coefficient de 1 sur \(\log(\text{exposition})\) .

8 Qualité d’ajustement

Déviance vs modèle saturé et Pearson GOF (mêmes df résiduels).

Code
dev_stat  <- deviance(mod1)
dev_df    <- df.residual(mod1)
dev_p     <- pchisq(dev_stat, dev_df, lower.tail = FALSE)

mu_hat    <- fitted(mod1)
y_obs     <- df$deaths
pearson   <- sum((y_obs - mu_hat)^2 / mu_hat)
pearson_p <- pchisq(pearson, dev_df, lower.tail = FALSE)

tibble(
  test = c("Deviance GOF", "Pearson GOF"),
  statistic = c(dev_stat, pearson),
  df = c(dev_df, dev_df),
  p_value = c(dev_p, pearson_p)
) |>
  gt()
test statistic df p_value
Deviance GOF 21.48674 24 0.6098720
Pearson GOF 20.61936 24 0.6610658

9 Test LR — Rôle du tabac

Comparer modèle sans tabac (mod0) vs mod1.

Code
mod0 <- glm(deaths ~ age + offset(log(exposure)),
            family = poisson, data = df)

anova(mod0, mod1, test = "Chisq")
Analysis of Deviance Table

Model 1: deaths ~ age + offset(log(exposure))
Model 2: deaths ~ smoking_status + age + offset(log(exposure))
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        27    191.723                          
2        24     21.487  3   170.24 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion attendue : le statut tabagique améliore significativement l’ajustement.

10 Variante binaire “fumeur de cigarette”

Créer cigarette_user = 1 si l’étiquette contient “cigarette”.

Code
df <- df |>
  mutate(
    cigarette_user = as.integer(grepl("cigarrette", tolower(as.character(smoking_status)))))

table(df$cigarette_user, df$smoking_status)
   
    cigarPipeOnly cigarretteOnly cigarrettePlus no
  0             9              0              0  9
  1             0              9              9  0

10.1 Modèle (mod2)

Code
mod2 <- glm(deaths ~ cigarette_user + age + offset(log(exposure)),
            family = poisson, data = df)

summary(mod2)

Call:
glm(formula = deaths ~ cigarette_user + age + offset(log(exposure)), 
    family = poisson, data = df)

Coefficients:
                Estimate Std. Error  z value Pr(>|z|)    
(Intercept)    -15.15590    0.06378 -237.625  < 2e-16 ***
cigarette_user   0.26910    0.02757    9.762  < 2e-16 ***
age45-59         0.55342    0.07999    6.919 4.56e-12 ***
age50-54         0.98480    0.07682   12.820  < 2e-16 ***
age55-59         1.37640    0.06526   21.092  < 2e-16 ***
age60-64         1.64629    0.06256   26.317  < 2e-16 ***
age65-69         1.99023    0.06277   31.708  < 2e-16 ***
age70-74         2.26143    0.06432   35.161  < 2e-16 ***
age75-79         2.54560    0.06766   37.626  < 2e-16 ***
age80+           2.82907    0.07215   39.211  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4055.984  on 35  degrees of freedom
Residual deviance:   92.237  on 26  degrees of freedom
AIC: 352.26

Number of Fisher Scoring iterations: 4

10.2 Anova

Code
anova(mod2, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: deaths

Terms added sequentially (first to last)

               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                              35     4056.0              
cigarette_user  1     58.3        34     3997.6 2.195e-14 ***
age             8   3905.4        26       92.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10.2.1 Interprétation générale

  • Dans ce modèle, l’effet de l’âge est massif et hautement significatif : les taux de décès augmentent très fortement avec l’âge.

  • En revanche, la variable binaire cigarette_user apporte tout de même de l’information par rapport au modèle nul (elle réduit la déviance de manière significative)

⚠️ Cela peut surprendre :

  • C’est probablement parce que age capte déjà la quasi-totalité de la structure de mortalité (gradient très fort), et que l’effet du tabac n’est visible que lorsqu’on distingue plus finement le type de tabac (cf. smoking_status à 4 classes dans mod1).

11 mod1 vs mod2 (granularité du tabac)

Code
anova(mod2, mod1, test = "Chisq")
Analysis of Deviance Table

Model 1: deaths ~ cigarette_user + age + offset(log(exposure))
Model 2: deaths ~ smoking_status + age + offset(log(exposure))
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        26     92.237                          
2        24     21.487  2    70.75 4.334e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Si mod1 (4 catégories) ≫ mod2 (binaire), alors le type de produit compte.

12 IRR & IC 95%

Code
tidy(mod1, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value) |>
  arrange(desc(estimate)) |>
  gt() |>
  tab_header(title = "IRR (modèle 1) — exp(coef) avec IC 95%")
IRR (modèle 1) — exp(coef) avec IC 95%
term estimate conf.low conf.high p.value
age80+ 1.723470e+01 1.496968e+01 1.988626e+01 0.000000e+00
age75-79 1.291740e+01 1.132626e+01 1.477500e+01 0.000000e+00
age70-74 9.693017e+00 8.558760e+00 1.101560e+01 6.852385e-273
age65-69 7.375556e+00 6.533324e+00 8.357251e+00 2.978913e-222
age60-64 5.229045e+00 4.633991e+00 5.922596e+00 4.943059e-154
age55-59 3.972749e+00 3.501347e+00 4.522490e+00 3.581870e-99
age50-54 2.665487e+00 2.294116e+00 3.100656e+00 2.658489e-37
age45-59 1.739985e+00 1.487796e+00 2.036040e+00 4.376858e-12
smoking_statuscigarretteOnly 1.446509e+00 1.343418e+00 1.558691e+00 2.099342e-22
smoking_statuscigarrettePlus 1.185481e+00 1.104252e+00 1.273769e+00 3.001586e-06
smoking_statusno 9.533182e-01 8.692799e-01 1.045138e+00 3.090007e-01
(Intercept) 2.645745e-07 2.312255e-07 3.016715e-07 0.000000e+00

Lecture : \(IRR>1\) → taux ↑ vs référence ; \(IRR<1\) → taux ↓.

13 Observé vs Attendu

Code
df_preds <- df |>
  mutate(
    y_obs = deaths,
    mu_hat = fitted(mod1)
  )

df_preds |>
  select(age, smoking_status, exposure, y_obs, mu_hat) |>
  arrange(age, smoking_status) |>
  gt() |>
  fmt_number(columns = c(exposure, y_obs, mu_hat), decimals = 2)
age smoking_status exposure y_obs mu_hat
40-44 cigarPipeOnly 14,500,000.00 2.00 3.84
40-44 cigarretteOnly 341,000,000.00 124.00 130.50
40-44 cigarrettePlus 453,100,000.00 149.00 142.11
40-44 no 65,600,000.00 18.00 16.55
45-59 cigarPipeOnly 10,400,000.00 4.00 4.79
45-59 cigarretteOnly 223,900,000.00 140.00 149.10
45-59 cigarrettePlus 303,000,000.00 169.00 165.36
45-59 no 35,900,000.00 22.00 15.76
50-54 cigarPipeOnly 9,800,000.00 3.00 6.91
50-54 cigarretteOnly 185,100,000.00 187.00 188.82
50-54 cigarrettePlus 226,700,000.00 193.00 189.53
50-54 no 24,900,000.00 19.00 16.74
55-59 cigarPipeOnly 37,200,000.00 38.00 39.10
55-59 cigarretteOnly 327,000,000.00 514.00 497.17
55-59 cigarrettePlus 468,200,000.00 576.00 583.40
55-59 no 63,200,000.00 55.00 63.33
60-64 cigarPipeOnly 84,600,000.00 113.00 117.04
60-64 cigarretteOnly 379,100,000.00 778.00 758.66
60-64 cigarrettePlus 605,200,000.00 1,001.00 992.58
60-64 no 106,700,000.00 117.00 140.73
65-69 cigarPipeOnly 94,900,000.00 173.00 185.19
65-69 cigarretteOnly 242,100,000.00 689.00 683.37
65-69 cigarrettePlus 388,000,000.00 901.00 897.57
65-69 no 89,700,000.00 170.00 166.87
70-74 cigarPipeOnly 82,400,000.00 212.00 211.32
70-74 cigarretteOnly 119,500,000.00 432.00 443.30
70-74 cigarrettePlus 203,300,000.00 613.00 618.07
70-74 no 66,800,000.00 179.00 163.31
75-79 cigarPipeOnly 66,700,000.00 243.00 227.95
75-79 cigarretteOnly 43,600,000.00 214.00 215.54
75-79 cigarrettePlus 87,100,000.00 337.00 352.89
75-79 no 36,100,000.00 120.00 117.62
80+ cigarPipeOnly 53,700,000.00 253.00 244.86
80+ cigarretteOnly 11,300,000.00 63.00 74.53
80+ cigarrettePlus 34,500,000.00 189.00 186.49
80+ no 27,400,000.00 120.00 119.11

Graphique

Code
shapes_9 <- c(16, 17, 15, 3, 7, 8, 0, 1, 2)
ggplot(df_preds, aes(mu_hat, y_obs, color = smoking_status, shape = age)) +
  geom_point(size = 2) +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_shape_manual(values = shapes_9) +
  labs(x = "Comptes attendus (mod1)", y = "Comptes observés",
       title = "Observé vs Attendu — GLM Poisson (mod1)")

14 Sur-dispersion ?

Code
check_overdispersion(mod1)
# Overdispersion test

       dispersion ratio =  0.859
  Pearson's Chi-Squared = 20.619
                p-value =  0.661
  • Si sur-dispersion marquée :

    • Quasi-Poisson family = quasipoisson (SE corrigés),

    • NB MASS::glm.nb(...) (variance flexible).

15 Bonnes pratiques (rapide)

  • Toujours documenter l’offset et l’unité d’exposition.

  • Vérifier les cellules rares (μ̂ très petits) avant χ².

  • Comparer modèles (parcimonie vs granularité).

  • Communiquer en IRR + IC.

  • Tester la robustesse (Quasi-Poisson / NB).

16 Fin — Ressources

  • GLM (Poisson) dans R : ?glm, family = poisson

  • Diagnostics : performance::check_overdispersion

  • Alternatives : quasipoisson, MASS::glm.nb

  • Visualisation : ggplot2