Interpreting ANOVA

Teaching Note

Author

Larry Vincent

Published

March 23, 2026

Modified

March 25, 2026

It’s 2017 and actress/comedian Chelsea Handler is mounting a national tour built around a civic engagement objective. Disturbed by the 2016 election cycle, she wants to turn out the vote for the midterms. Though it’s no secret that she leans left, she wants a show that meets audiences where they are politically, makes them laugh, and nudges them toward the voting booth regardless of which direction they lean. She’s convinced the real problem is that not enough people are showing up at the polls.

She lays out an ambitious research brief. She wants to understand how Americans feel about central issues and how those attitudes and perceptions relate to audience segmentation. The findings will be used to plan the route of her tour and to drive the marketing campaign.

The data for this note comes from a real talent research study. Respondents were recruited via Amazon’s Mechanical Turk platform and asked to complete a survey measuring their attitudes on a range of political and social issues, as well as their perceptions of several well-known comedians, including Chelsea Handler. The data provided for this note contains only the Chelsea Handler “familiar” audiences.

Respondents were grouped into three audience segments based on their issue attitudes using k-means clustering. Those cluster assignments have been pre-calculated and are included in the dataset as segment. Your job here is to determine whether those segments differ in a meaningful and statistically defensible way on KPIs such as favorability and perceived relevance.
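The segment labels arrive pre-computed, but it can help to see roughly how labels like these get produced. Here is a minimal, hypothetical sketch using base R’s kmeans(): the ratings below are synthetic stand-ins for the survey’s real issue responses, and the seed, cluster count, and nstart values are illustrative choices, not the study’s actual settings.

```r
set.seed(42)  # k-means depends on random starting centers

# Synthetic stand-in for the survey's eleven issue ratings (1-6 agreement scale);
# the real study clustered respondents' actual answers.
issues <- as.data.frame(matrix(sample(1:6, 200 * 11, replace = TRUE), ncol = 11))
names(issues) <- c("right_to_choose", "voter_participation", "gender_equality",
                   "free_speech", "climate_change", "political_apathy",
                   "middle_class", "marriage_equality", "roe_v_wade",
                   "lgbt", "women_in_politics")

# Standardize the ratings, then ask for three clusters;
# nstart = 25 re-runs from many random starts and keeps the best solution
km <- kmeans(scale(issues), centers = 3, nstart = 25)
segment <- factor(km$cluster)  # the kind of label stored in the dataset's `segment`
table(segment)                 # cluster sizes
```

With random data the three clusters will be arbitrary; with real, correlated issue attitudes they fall into the recognizable profiles described below.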

The Data

The practice dataset (ch-talent-survey.csv) contains 595 respondents. Each row is one survey participant.

Variable Type Description
response_id ID Unique respondent identifier
segment Factor (1–3) Pre-assigned audience segment from cluster analysis
ch_favorability Numeric (1–5) Overall opinion of Chelsea Handler (1 = very unfavorable, 5 = very favorable)
ch_relevance Numeric (1–5) Perceived relevance of Chelsea Handler to audiences today
political_orientation Numeric (1–5) Political self-identification (1 = very conservative, 5 = very liberal)

The three segments have meaningfully different profiles. Segment 2 is the largest and most politically liberal group, with high scores on reproductive rights and marriage equality. Segment 3 is the most conservative, with notably lower scores on LGBT and abortion-related issues. Segment 1 sits in the middle, with a mixed issue profile that values both free speech and women’s rights.

A quick look at segment sizes and mean favorability is a good starting point:

Code
df |>
  group_by(segment) |>
  summarise(
    n = n(),
    mean_favorability = mean(ch_favorability, na.rm = TRUE),
    sd = sd(ch_favorability, na.rm = TRUE),
    mean_relevance = mean(ch_relevance, na.rm = TRUE)
  ) |>
  mutate(across(where(is.numeric), ~round(., 2))) |>
  gt() |> 
  cols_label(
    segment = "Segment",
    n = "N",
    mean_favorability = "Mean Favorability",
    sd = "SD",
    mean_relevance = "Mean Relevance"
  ) |> 
  cols_align(
    columns = segment,
    align = "left"
  ) |> 
  cols_width(
    segment ~ px(200),
    mean_favorability:mean_relevance ~ px(100)
  )
Segment N Mean Favorability SD Mean Relevance
Moderates 192 3.32 1.23 2.80
Progressives 272 3.68 1.10 3.77
Conservatives 131 2.59 1.48 1.91

The means are different. But are they really different? Lots of things seem different until you look closer: politicians, dating profiles, reality TV stars … oh, and also sample means.

The Issue Battery and Audience Segments

Before looking at ANOVA results, it helps to understand what went into the segmentation. Respondents rated their agreement with eleven issue statements on a 1–6 scale. These are the variables the clustering algorithm used to sort people into groups.

Issue Statement Summary
right_to_choose A woman’s right to make decisions about her own body is very important
voter_participation More needs to be done to encourage people to participate in elections
gender_equality Much more can be done to level the playing field for women at work and in government
free_speech Everyone should have the absolute right to say what they choose
climate_change Climate change is the most important issue we face
political_apathy There should be less polarization; people should stop talking about politics so much
middle_class The government should focus more attention on helping middle class Americans
marriage_equality Same-sex couples should enjoy the same right to marry as heterosexual couples
roe_v_wade A woman’s right to an abortion is under attack and people should defend Roe v. Wade
lgbt More should be done to prevent discrimination against the LGBT community
women_in_politics Women need a stronger voice in government

The clustering algorithm found three groups of respondents whose issue attitudes hang together in a recognizable pattern. Here is how each segment is characterized:

Code
tibble::tribble(
  ~Segment, ~N, ~`Defining Attitudes`, ~`Skeptical Of`,
  "Moderates",     "192",     "Free speech, women's rights broadly construed", "Climate and identity-focused issues",
  "Progressives",  "272",     "Reproductive rights, marriage equality, gender equality", "Political disengagement",
  "Conservatives", "131",  "Free speech, middle class economic concerns", "LGBT rights, abortion access"
) |>
  gt() |>
  cols_align(align = "left", columns = everything()) |>
  cols_width(
    Segment ~ px(160),
    N ~ px(60),
    `Defining Attitudes` ~ px(220),
    `Skeptical Of` ~ px(200)
  )
Segment N Defining Attitudes Skeptical Of
Moderates 192 Free speech, women's rights broadly construed Climate and identity-focused issues
Progressives 272 Reproductive rights, marriage equality, gender equality Political disengagement
Conservatives 131 Free speech, middle class economic concerns LGBT rights, abortion access

These segments emerged from the data. The labels are shorthand. Keep in mind that each segment contains real variation; not every Conservative scored low on every progressive issue, and not every Progressive is uniformly activated on all of them. The segment names describe the center of gravity, not every individual in the group.

What ANOVA Does

ANOVA — Analysis of Variance — compares two kinds of variation in your data.

The first is variation between groups: how far apart are the group means from one another? The second is variation within groups: how much do individuals within the same group differ from each other?

If the differences between groups are large relative to the noise inside each group, ANOVA gives you evidence that something real is going on. If the between-group differences are small relative to within-group noise, those differences could easily be explained by chance.
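A toy example makes the two kinds of variation concrete. The nine scores and three groups below are invented purely for illustration: sum the squared distances of group means from the grand mean (between), and of individuals from their own group mean (within).

```r
# Nine invented scores in three groups of three
scores <- c(2, 3, 4,   4, 5, 6,   1, 2, 3)
group  <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(scores)                 # 3.33
group_means <- tapply(scores, group, mean)  # A = 3, B = 5, C = 2

# Between-group variation: group means vs. the grand mean, weighted by group size
ss_between <- sum(3 * (group_means - grand_mean)^2)   # 14

# Within-group variation: individuals vs. their own group mean
ss_within <- sum((scores - group_means[group])^2)     # 6

# F is the ratio of the two, each first divided by its degrees of freedom
# (groups - 1 = 2 between; observations - groups = 6 within)
f_stat <- (ss_between / 2) / (ss_within / 6)          # 7
```

Running summary(aov(scores ~ group)) on the same toy data reproduces F = 7, which is the calculation the next section formalizes.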

Code
df |>
  ggplot(aes(x = segment, y = ch_favorability)) +
  geom_boxplot(alpha = 0.9, width = 0.6, show.legend = FALSE) +
  geom_point(position = position_jitter(width = 0.27), alpha = 0.15) +
  stat_summary(geom = "point", size = 5, shape = 19, fun = "mean") +
  stat_summary(
    geom = "label", aes(label = round(after_stat(y), 2)),
    size = 5, fun = "mean", hjust = -0.3, fill = "white", alpha = 1
  ) +
  labs(
    title = "Favorability Toward Chelsea Handler by Audience Segment",
    subtitle = "Are these differences large enough to be real?",
    x = NULL,
    y = "Favorability (1–5)"
  ) +
  theme(
    plot.title.position = "plot",
    plot.title = element_text(margin = margin(b=0, unit = "pt")),
    plot.subtitle = element_text(size = 14, margin = margin(t=3, b=14, unit = "pt")),
    plot.margin = margin(t=12, unit = "pt"),
    axis.text.x = element_text(size = 14),
    axis.title.y = element_text(face = "bold", size = 12)
    )

The F-Statistic and the P-Value

In order to understand ANOVA, we are going to have to show some formulas and discuss high-level statistics. Take a deep breath. This won’t hurt. I promise.

ANOVA produces two numbers that work together. The first is the F-statistic, which measures how large the differences between groups are relative to the variation within them. When F is well above 1, the between-group differences are outpacing the within-group noise. When it’s close to 1 or below, they’re not. A very simple way of thinking about how it is calculated looks like this:

\[F = \frac{\text{Variance Between Groups}}{\text{Variance Within Groups}}\]

The second is the p-value, which is calculated directly from F and gives us the yes-or-no verdict on significance. Think of F as measuring the size of the effect, and the p-value as telling us how surprised we should be to see an effect that large if nothing were actually going on.

By convention, a p-value below 0.05 is considered statistically significant. It basically means that if there were truly no differences, we’d see results this extreme less than 5% of the time by chance alone. Keep in mind that in the real world of business, statisticians sometimes allow for a p-value of 0.1 or lower, which is the same as saying that if there were truly no differences between the groups, we might see an F value this large one time in ten by chance alone.

Here’s what the ANOVA data for the favorability scores between groups in the Chelsea Handler study looks like:

Code
model <- aov(ch_favorability ~ segment, data = df)

f_val <- summary(model)[[1]][["F value"]][1]
p_val <- summary(model)[[1]][["Pr(>F)"]][1]

summary(model)
             Df Sum Sq Mean Sq F value   Pr(>F)    
segment       2   56.8  28.384   19.15 1.28e-08 ***
Residuals   350  518.8   1.482                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
242 observations deleted due to missingness

Here’s how to read the output:

Column What It Means
Df Degrees of freedom — number of groups minus one (segment row); number of respondents minus number of groups (Residuals row)
Sum Sq Sum of squares — the variation attributable to between-group (segment) vs. within-group (Residuals) differences
F value Ratio of between-group to within-group variance
Pr(>F) The p-value — the number you’ll report
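You can sanity-check the output by hand: divide the segment row’s Mean Sq by the Residuals row’s Mean Sq to recover F, then ask pf() how rare an F that large would be at these degrees of freedom if the null were true.

```r
ms_between <- 28.384   # Mean Sq, segment row of the summary table
ms_within  <- 1.482    # Mean Sq, Residuals row

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within                              # about 19.15

# p-value: upper-tail area of the F distribution with df 2 and 350
p_val <- pf(f_stat, df1 = 2, df2 = 350, lower.tail = FALSE)   # about 1.3e-08
```

The result matches the Pr(&gt;F) column, which is all that column is: the tail probability of the F distribution beyond the observed statistic.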

With an F-value of 19.15 and a p-value of about 1.3 × 10⁻⁸ (effectively zero), the group means genuinely differ, and we can be very confident it isn’t random chance.

Which Groups Are Actually Different?

A significant ANOVA tells you the groups are not all the same. It doesn’t tell you which pairs of groups differ from each other. For that, you need a post-hoc test. The most common is the Tukey HSD (Fun Fact: HSD stands for “Honestly Significant Difference”) test, which compares every possible pair and controls for the fact that you’re making multiple comparisons.

Code
tidy(TukeyHSD(model)) |>
  select(-null.value) |> 
  mutate(across(where(is.numeric), ~round(., 3))) |>
  gt() |>
  cols_label(
    term = "Comparison",
    contrast = "Groups",
    estimate = "Difference",
    conf.low = "Lower CI",
    conf.high = "Upper CI",
    adj.p.value = "Adjusted p-value"
  ) |>
  cols_align(align = "left", columns = c(term, contrast)) |> 
  cols_width(
    adj.p.value ~ px(70)
  )
Comparison Groups Difference Lower CI Upper CI Adjusted p-value
segment Progressives-Moderates 0.352 0.004 0.699 0.046
segment Conservatives-Moderates -0.733 -1.179 -0.288 0.000
segment Conservatives-Progressives -1.085 -1.499 -0.672 0.000

Look at the Adjusted p-value column. Any pair with a value below 0.05 represents a statistically meaningful difference. All three contrasts clear that bar, though the Progressives-Moderates gap only just does (p = 0.046). Now look at the Difference column, which tells you how much one group differs from another. Conservatives’ favorability toward Chelsea is substantially lower than that of either the Moderates or the Progressives.

A means plot with confidence intervals puts the pairwise story on the page a little more cleanly:

Code
df |>
  group_by(segment) |>
  summarise(
    mean_fav = mean(ch_favorability, na.rm = TRUE),
    se = sd(ch_favorability, na.rm = TRUE) / sqrt(n()),
    .groups = "drop"
  ) |>
  ggplot(aes(x = fct_reorder(segment, mean_fav), y = mean_fav)) +
  geom_pointrange(
    aes(ymin = mean_fav - 1.96 * se, ymax = mean_fav + 1.96 * se),
    linewidth = 1.2,
    size = 1.1
  ) +
  coord_flip() +
  scale_y_continuous(limits = c(1,5)) +
  labs(
    title = "Mean Favorability by Audience Segment",
    subtitle = "Mean ± 95% confidence interval",
    x = NULL,
    y = "Mean Favorability (1–5)"
  ) +
  theme(plot.title.position = "plot")

Statistical Significance Is Not the Same as Importance

A significant p-value answers one question: is this real? It doesn’t answer: does this matter?

With a large enough sample, even trivially small differences will produce a significant result. This is why good researchers (like the ones I coach in my classes 😊) report effect size alongside significance. Effect size is a measure of how large the differences actually are in practical terms. You might be wondering how this is different from the Tukey HSD test we just ran. Tukey tells you which groups are significantly different from each other. It’s still basically a yes/no significance test, just applied to pairs instead of the whole model. Effect size is a different kind of question entirely. It doesn’t ask whether a difference is real. It asks whether a difference is large enough to care about. A finding can pass the Tukey test (meaning the difference between two groups is statistically real) and still represent a gap so small it wouldn’t change a single business decision.

For ANOVA, the standard effect size measure is eta-squared (η²): the proportion of total variance in your outcome that is explained by group membership.

Code
eta_squared(model)
# Effect Size for ANOVA

Parameter | Eta2 |       95% CI
-------------------------------
segment   | 0.10 | [0.05, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
η² Value Conventional Interpretation
~0.01 Small effect
~0.06 Medium effect
~0.14 or above Large effect

Here, η² is 0.10: audience segment explains roughly 10% of the variance in favorability. By the conventional benchmarks above, that is a medium-to-large effect, meaningful enough to inform a strategy, but a reminder that plenty of individual variation exists within each segment.
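η² can also be read straight off the ANOVA summary table, which demystifies what eta_squared() is doing: divide the between-group sum of squares by the total sum of squares.

```r
ss_between <- 56.8    # Sum Sq, segment row of the summary table
ss_within  <- 518.8   # Sum Sq, Residuals row

# Proportion of total variation attributable to segment membership
eta_sq <- ss_between / (ss_between + ss_within)   # about 0.10
```

The hand calculation reproduces the 0.10 reported by eta_squared(model).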

So, what do we do with all of this?

What This Means in Practice

The ANOVA result here is more than a statistics exercise. One audience has a real appeal problem with Chelsea. Conservatives rate her as largely irrelevant. If a network or streaming platform is trying to build broad audience reach, that gap matters. But we already knew this. What matters is how it connects to her plan for a tour.

Notice also that ch_relevance follows an even sharper pattern than favorability. Running the same ANOVA on relevance as your outcome variable is a useful extension and produces an instructive comparison. Sometimes the metric that seems most important (do people like her?) is less diagnostically powerful than one that speaks to strategic fit (is she relevant to the audience you’re trying to reach?). But there was more to this assignment than just measuring relevance and favorability. These have just been offered because they allowed us to play with statistical approaches to compare groups. See what I did there?

A Few Cautions

ANOVA assumes the observations are independent, that variances are roughly equal across groups, and that the outcome is approximately normally distributed within each group. With sample sizes like these, ANOVA is fairly robust to violations of the normality assumption. In R, you can check the equal-variance assumption with leveneTest() from the car package.
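As a sketch of what an equal-variance check looks like: base R’s bartlett.test() asks the same question as leveneTest() without requiring the car package (Levene’s test is the more common recommendation because it is less sensitive to non-normality). The data below are simulated to roughly mimic the three segments’ sizes and spreads, purely for illustration.

```r
set.seed(1)
# Simulated stand-ins for the three segments' favorability scores
fav <- c(rnorm(190, 3.3, 1.2), rnorm(270, 3.7, 1.1), rnorm(130, 2.6, 1.5))
seg <- factor(rep(c("Moderates", "Progressives", "Conservatives"),
                  times = c(190, 270, 130)))

bartlett.test(fav ~ seg)  # a small p-value flags unequal variances
# With the real data, the equivalent check is:
# car::leveneTest(ch_favorability ~ segment, data = df)
```

If the check flags unequal variances, oneway.test() with var.equal = FALSE (Welch’s ANOVA) is a standard base-R fallback.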

Ok. That’s enough statistics for one class. Onward.

Further Reading

Field, A. (2024). Discovering Statistics Using R and RStudio (2nd ed.). SAGE. Chapter 12 covers one-way ANOVA with the most accessible writing you’ll find in a statistics textbook.

Gravetter, F. J., & Wallnau, L. B. (2022). Statistics for the Behavioral Sciences (11th ed.). Cengage. The standard reference for ANOVA in social science survey contexts, with worked examples throughout.

AI Exploration Prompts

  • “I ran ANOVA on this talent survey data and got a significant F-statistic. Walk me through how to read the summary output and run a Tukey post-hoc test in R.”
  • “My ANOVA is significant but eta-squared is only 0.06. Does that mean the finding isn’t important enough to report to a client?”
  • “I want to run the same analysis with ch_relevance as the outcome instead of favorability. What would I change in the code, and what would I look for differently in the results?”
  • “Suppose a Tukey test showed that Segments 1 and 2 don’t differ from each other, but both differ from Segment 3. What would that tell me about a talent’s audience problem?”