Cluster Analysis: Finding Segments in Survey Data

Teaching Note

Author: Larry Vincent

Published: February 16, 2026

You’ve just fielded a customer survey for a regional coffee chain. You have 1,342 respondents. For each, you have behavioral data from the company’s CRM (how many stores they’ve visited, how long they’ve been a customer, whether they’re a loyalty member) alongside survey responses rating the importance of attributes like coffee quality, store comfort, and WiFi speed. Your client wants to know, “Who are the different types of customers in our market, and what do they care about?”

This is one of the most common—and most consequential—questions in marketing research. And the technique most often used to answer it is cluster analysis.

Clustering is an unsupervised technique, which means you aren’t predicting anything. There’s no dependent variable. Instead, you’re asking the algorithm to examine the patterns in your data and group respondents who look similar to one another. The result is a set of clusters—groups of people whose responses are statistically closer to each other than they are to people in other groups.

The most familiar application is market segmentation, but the logic extends well beyond that. You can cluster respondents in a satisfaction study to find distinct experience profiles. You can cluster students in an exit survey to understand different motivational types. Anywhere you suspect your sample contains meaningfully different groups of people, clustering can help you find them.

The Practice Dataset

The data accompanying this note (coffee-customer-survey.csv) contains 1,342 customers of a coffee chain. Each row combines CRM behavioral data with survey responses.

Data Dictionary
Variable	Type	Description
customer_id	ID	Unique customer identifier
locations_visited	Numeric	Number of distinct store locations visited
days_since_last_visit	Numeric	Recency: days since last transaction
days_since_first_visit	Numeric	Tenure: days since first transaction
pct_food_purchases	Numeric	Proportion of transactions that included food
total_transactions	Numeric	Lifetime transaction count
profit_per_transaction	Numeric	Average profit contribution per transaction
loyalty_member	Binary	Enrolled in unlimited drink subscription (1 = yes)
buys_whole_bean	Binary	Purchases whole bean coffee (1 = yes)
uses_wifi	Binary	Uses in-store WiFi (1 = yes)
imp_social_mission	Likert 1–5	Importance of the company's social mission
imp_coffee_quality	Likert 1–5	Importance of coffee quality
imp_store_comfort	Likert 1–5	Importance of store comfort and atmosphere
imp_food_variety	Likert 1–5	Importance of food variety
imp_wifi_speed	Likert 1–5	Importance of WiFi speed and reliability
is_local_resident	Binary	Lives near primary store (1 = yes)
follows_on_social	Binary	Follows the brand on social media (1 = yes)
nps	Numeric	Net Promoter Score (0–10)

Notice the range of variable types: continuous behavioral metrics that span from 1 to 784, Likert scales with only 5 levels, and binary indicators. This is typical of real survey data. As you will see below, this is why pre-processing your data can be so important.

Why You Can’t Just Eyeball It

With two or three variables, you can sometimes spot groups visually. Plot respondents on a scatterplot and the clusters jump out. But this dataset has 17 variables, and the groupings exist in a high-dimensional space that no scatterplot can capture. That’s where algorithms come in. They do mathematically what your eyes can’t. They compute the distances between every pair of respondents across all variables simultaneously, and find the groupings that minimize the variance within clusters while maximizing the variance between them.

This is also why clustering is best done in statistical software rather than in Excel. While it is technically possible to compute Euclidean distances and iterate through centroid assignments in a spreadsheet, the process is tedious, error-prone, and effectively impossible once you have more than a handful of variables or respondents. The code examples in this note use R and Python. If you’re working in either language, you may be surprised to learn that clustering is usually only a few lines of code. If you’re working in Excel, this is a good reason to make the jump.

Pre-Processing: Get Your Data Ready

Before you hand your data to a clustering algorithm, you need to deal with the possibility (and usual reality) that your variables aren’t on the same scale. In this dataset, days_since_first_visit ranges from 1 to 784 while imp_coffee_quality ranges from 1 to 5. If you cluster on the raw values, the algorithm will treat tenure as overwhelmingly more important than any survey item—not because it is more important, but because its distances are numerically larger.

The fix is standardization: for each variable, subtract the mean and divide by the standard deviation, so that every variable ends up with a mean of 0 and a standard deviation of 1.

R version first, then the Python version:

library(tidyverse)
library(tidymodels)

df <- read_csv("coffee-customer-survey.csv")

# Select clustering variables and standardize
CLUSTER_VARS <- c(
  "locations_visited", "days_since_last_visit", "days_since_first_visit",
  "pct_food_purchases", "total_transactions", "profit_per_transaction",
  "imp_social_mission", "imp_coffee_quality", "imp_store_comfort",
  "imp_food_variety", "imp_wifi_speed"
)

df_scaled <- df |>
  select(all_of(CLUSTER_VARS)) |>
  mutate(across(everything(), \(x) as.numeric(scale(x))))

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("coffee-customer-survey.csv")

# Select clustering variables and standardize
CLUSTER_VARS = [
    "locations_visited", "days_since_last_visit", "days_since_first_visit",
    "pct_food_purchases", "total_transactions", "profit_per_transaction",
    "imp_social_mission", "imp_coffee_quality", "imp_store_comfort",
    "imp_food_variety", "imp_wifi_speed"
]

scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df[CLUSTER_VARS]),
    columns=CLUSTER_VARS
)

Skipping this step is one of the most common mistakes in cluster analysis, and it produces distorted results: the variables with the largest numeric ranges end up driving the clusters.
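To make the scale problem concrete, here is a minimal sketch using three made-up customers and two of the clustering variables. The values are invented for illustration and are not drawn from the dataset. It compares Euclidean distances before and after standardizing.

```python
import numpy as np

# Three hypothetical customers: [days_since_first_visit, imp_coffee_quality]
# Values are invented for illustration only.
X = np.array([
    [700.0, 5.0],   # A: long tenure, cares a lot about coffee quality
    [690.0, 1.0],   # B: long tenure, does not care about coffee quality
    [400.0, 5.0],   # C: shorter tenure, cares a lot about coffee quality
])

def euclidean(a, b):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw scale: differences in tenure swamp the 1-5 rating
raw_ab = euclidean(X[0], X[1])
raw_ac = euclidean(X[0], X[2])

# Standardize each column: subtract the mean, divide by the standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_ab = euclidean(Z[0], Z[1])
std_ac = euclidean(Z[0], Z[2])

print(f"Raw:          A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"Standardized: A-B = {std_ab:.2f}, A-C = {std_ac:.2f}")
```

On the raw values, A looks far closer to B than to C, even though A and B sit at opposite ends of the coffee-quality scale. After standardizing, the two distances are of similar size, so both variables get a say.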

Notice that we selected a subset of variables for clustering (the behavioral and attitudinal variables) and left out customer_id, the binary indicators, and nps. These holdout variables become valuable later when we profile the clusters. Choosing what to cluster on is an analytical decision, not a default. You should cluster on the variables that represent the differences you care about, then use everything else to describe the groups you find.

This is also where the psychological aspect we discuss so often in class comes into play. Some students throw every variable into the algorithm. Why not? More data might lead to better clusters. Except that, in some cases, the result will make no sense at all. Select your clustering variables carefully. They should be tied to the theoretical differences you expect in customer attitudes, motivations, behaviors, and preferences.

One important caveat: standardize for clustering, but profile on original scales. Once you’ve assigned respondents to clusters, switch back to the original data for profiling and presentation. A vice president doesn’t want to hear that Segment A scored 1.3 standard deviations above the mean on coffee quality. They want to hear that Segment A rated coffee quality a 4.6 out of 5. Scaled values are essential for the algorithm. Original values are essential for the audience.
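One way to move between the two scales is sketched below. It uses synthetic data as a stand-in for two clustering variables (the numbers are invented) and assumes scikit-learn’s StandardScaler and KMeans. The cluster centroids come out of the algorithm as z-scores; inverse_transform maps them back to original units, which matches simply averaging the unscaled data by cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two clustering variables (not the real survey data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "total_transactions": np.concatenate(
        [rng.normal(20, 5, 100), rng.normal(200, 20, 100)]
    ),
    "imp_coffee_quality": np.concatenate(
        [rng.normal(2.5, 0.3, 100), rng.normal(4.5, 0.3, 100)]
    ),
})

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)   # z-scores, for the algorithm

km = KMeans(n_clusters=2, n_init=25, random_state=42).fit(toy_scaled)

# Centroids live in z-score space; inverse_transform maps them back
centroids_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=toy.columns,
)

# Equivalent route: average the unscaled data within each cluster
group_means = toy.groupby(km.labels_).mean()

print(centroids_original.round(1))
print(group_means.round(1))
```

Either table is suitable for an audience; the scaled centroids are not.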

Advanced: Pre-Processing with PCA

If you have many clustering variables, the distances between respondents become harder for any algorithm to parse cleanly—a problem sometimes called the curse of dimensionality. One solution is to first reduce your variables using Principal Components Analysis (PCA), then cluster on the resulting components rather than the raw variables.

PCA transforms your correlated variables into a smaller set of uncorrelated components. The first component explains the most variance, the second explains the next most, and so on. If the first two components explain enough of the variance in your data, you can plot every respondent on a two-dimensional scatterplot. This is one of the great benefits of this pre-processing approach, and because the components are uncorrelated by construction, the two axes carry non-overlapping information. The result is a powerful way to visualize your clusters on a two-dimensional plane, something that’s often impossible with the raw variables. In my own work, I often run PCA first just to get the visualization, even when I could cluster directly on the scaled data.

R version first, then the Python version:

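# Run PCA on the scaled data (scale. = TRUE is harmless here because
# df_scaled is already standardized)
pca_result <- prcomp(df_scaled, scale. = TRUE)

# Add the PCA scores (.fittedPC1, .fittedPC2, ...) to our data frame
df <- augment(pca_result, df)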

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_scores = pd.DataFrame(
    pca.fit_transform(df_scaled),
    columns=["PC1", "PC2"]
)

print(f"Variance explained by PC1 + PC2: {pca.explained_variance_ratio_.sum():.1%}")

If the first two components explain less than about 40% of the variance, the resulting plot will be a noisy approximation. You can still cluster on more components, but you lose the clean 2D visualization.
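If two components are not enough, a cumulative variance table can help you decide how many to keep. The sketch below uses synthetic data (invented, not the coffee survey), and the 80% cutoff is only an illustrative assumption, not a rule from this note.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 500 respondents, 11 variables driven by 3 hidden
# factors plus noise (invented for illustration)
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 3))
weights = rng.normal(size=(3, 11))
X = factors @ weights + rng.normal(scale=0.5, size=(500, 11))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column

pca_full = PCA().fit(X)                    # keep every component
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.80                           # illustrative cutoff, not a rule
n_keep = int(np.argmax(cumulative >= threshold)) + 1

for i, share in enumerate(cumulative, start=1):
    print(f"PC1 through PC{i}: {share:.1%}")
print(f"Components needed to reach {threshold:.0%}: {n_keep}")
```

You would then cluster on the first n_keep columns of the component scores, accepting that anything beyond two components cannot be shown on a single scatterplot.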

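Sometimes I find it useful to plot the clustering variables themselves on the first two components. The R code here draws each variable’s loadings as an arrow from the origin.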

library(ggrepel)  # provides geom_label_repel()

# Loadings: one row per variable per component
pca_tidy <- pca_result |>
  tidy(matrix = "rotation")

pca_tidy |>
  pivot_wider(
    names_from = "PC",
    names_prefix = "PC",
    values_from = "value"
  ) |> 
  ggplot(aes(x=PC1, y=PC2)) +
  geom_segment(
    xend = 0,
    yend = 0,
    arrow = arrow(ends = "first", length = unit(6, "pt"))
  ) +
  geom_label_repel(aes(label = column), size=5) +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    plot.title.position = "plot"
  ) +
  labs(
    title = "PCA Plot of Scaled Variables",
    subtitle = "Coffee Customer Dataset"
  )

This visual inspection can help us see how the variables group together before we even begin clustering. Notice the five variables that gather together in the southern part of this plot. Notice also the two variables, days_since_first_visit and total_transactions, that seem to be traveling together in the northwestern quadrant. And then there is days_since_last_visit, which seems to be doing its own thing.
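The loadings plot above is shown in R only. For readers working in Python, a rough equivalent is to tabulate the loadings from scikit-learn’s components_ attribute. The sketch below uses a small synthetic data frame (invented values, with column names borrowed from the data dictionary) in which two pairs of variables are built to move together.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in with four named columns (invented; not the survey data).
# Two pairs of variables are constructed to move together.
rng = np.random.default_rng(1)
tenure = rng.normal(size=300)
attitude = rng.normal(size=300)
toy = pd.DataFrame({
    "days_since_first_visit": tenure + rng.normal(scale=0.3, size=300),
    "total_transactions":     tenure + rng.normal(scale=0.3, size=300),
    "imp_coffee_quality":     attitude + rng.normal(scale=0.3, size=300),
    "imp_store_comfort":      attitude + rng.normal(scale=0.3, size=300),
})
toy_scaled = (toy - toy.mean()) / toy.std()

pca = PCA(n_components=2).fit(toy_scaled)

# components_ has one row per component; transpose so each row is a variable
loadings = pd.DataFrame(
    pca.components_.T,
    index=toy.columns,
    columns=["PC1", "PC2"],
)
print(loadings.round(2))
```

Variables with similar rows in this table would appear as arrows pointing in similar directions on the plot.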

A further note on mixed data: this dataset combines continuous variables with Likert scales and binary indicators. A technique called Factor Analysis of Mixed Data (FAMD) handles this by running PCA on the numeric variables and Multiple Correspondence Analysis on the categorical ones, then combining the results. It’s more principled but also more complex, and beyond the scope of this note.

K-Means: The Workhorse

K-means is the most widely used clustering algorithm in marketing research, and for good reason—it’s fast, intuitive, and works well with the kind of rectangular survey data you’ll encounter in practice.

The algorithm is simple in concept. You tell it how many clusters you want (K). It randomly assigns K starting points (called centroids), assigns each respondent to the nearest centroid, then recalculates the centroids based on the respondents assigned to them. It repeats this process until the assignments stabilize. The result is K groups of respondents, each defined by its centroid—the average profile of the cluster.
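The loop just described can be sketched in a few lines of NumPy. This is a bare-bones illustration on synthetic data, not a replacement for kmeans() in R or KMeans in scikit-learn, which add multiple restarts and smarter initialization. The function name kmeans_sketch is hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Bare-bones K-means (Lloyd's algorithm). Illustration only."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct respondents as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign each respondent to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned respondents
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data: two well-separated groups of 50 respondents each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[-3, -3], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans_sketch(X, k=2)
print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", centroids.round(2))
```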

But you have to tell K-means how many clusters to look for. It won’t figure that out on its own. You can, of course, use a scree plot or a silhouette plot to estimate the optimal number of clusters. With statistical software, we can easily apply an iterative approach.

How Many Clusters? The Iterative Approach

The right number of clusters isn’t something you calculate once. You run the algorithm multiple times (with K=2, K=3, K=4, and so on) and evaluate each solution. The code below runs K-means for eight values of K, from 1 through 8, on our coffee dataset. K=1 is included only as a baseline for the scree plot.

R version first, then the Python version:

set.seed(42)

kclusts <- tibble(k = 1:8) |>
  mutate(
    clust     = map(k, ~ kmeans(df_scaled, centers = .x, nstart = 25)),
    tidied    = map(clust, tidy),
    glanced   = map(clust, glance),
    augmented = map(clust, augment, df)
  )

# Scree plot
kclusts |>
  unnest(glanced) |>
  ggplot(aes(x = k, y = tot.withinss)) +
  geom_line(group = 1) +
  geom_point(size = 4) +
  # Hand-placed annotations: these coordinates are specific to this dataset's plot
  annotate("point", x = 3, y = 10200, size = 12, shape = 1, stroke = 1.5, color = "red") +
  annotate("point", x = 4, y = 9400, size = 12, shape = 1, stroke = 1.5, color = "red") +
  annotate("text", x= 3.8, y=12000, label="Where is the elbow?", color = "red", size=5) +
  annotate("segment", x=3.8, xend=3.35, y=11700, yend=10800, arrow = arrow(ends = "last", length = unit(12, "pt")), color="red") +
  annotate("segment", x=3.8, xend=3.9, y=11700, yend=10200, arrow = arrow(ends = "last", length = unit(12, "pt")), color="red") +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    plot.title.position = "plot"
  ) +
  labs(
    title = "Scree Plot",
    subtitle = "How many clusters is optimal?",
    x = "Number of Clusters (K)",
    y = "Total Within-Cluster Sum of Squares"
  )

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 9)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=25, random_state=42)
    km.fit(df_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, marker="o")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Total Within-Cluster Sum of Squares")
plt.title("Scree Plot")
plt.show()

A few things are worth unpacking. In the R version, you’re creating a tibble with one row per value of K. For each K, you run kmeans() with nstart = 25, which means the algorithm runs 25 times with different random starting points and keeps the best result. This is important because K-means is sensitive to where its centroids start; a single run can settle on a poor solution. The Python equivalent achieves the same thing with n_init=25. Then the R code tidies the output three ways: tidy() gives you the cluster centroids, glance() gives you the overall fit statistics, and augment() attaches the cluster assignments back to your original data. This is the pattern I use in all of my segmentation work.

The y-axis of the scree plot shows the total within-cluster sum of squares—a measure of how tightly packed each cluster is. As K increases, this number always goes down. What you’re looking for is the elbow: the point where adding another cluster stops producing a meaningful improvement. If the curve bends sharply at K=3 and then flattens, three clusters may be your starting point.
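If the bend is hard to judge by eye, one option is to print how much each additional cluster reduces the within-cluster sum of squares. The sketch below does this on synthetic data built with three groups (invented, not the coffee survey), so the drop-off after K=3 is there by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: three well-separated groups in 3 dimensions
rng = np.random.default_rng(5)
centers = np.array([[0, 0, 0], [8, 0, 0], [0, 8, 8]])
X = np.vstack([rng.normal(loc=c, size=(150, 3)) for c in centers])

# Total within-cluster sum of squares for K = 1..6
inertias = {}
for k in range(1, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=25, random_state=42).fit(X).inertia_

# Share of the remaining within-cluster variation removed by each extra cluster
drops = {k: 1 - inertias[k] / inertias[k - 1] for k in range(2, 7)}
for k, drop in drops.items():
    print(f"K={k - 1} -> K={k}: within-cluster SS falls by {drop:.0%}")
```

A large drop followed by a run of small ones is the numeric version of an elbow. Real survey data rarely produce a pattern this clean.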

But the scree plot is only a guide, not a verdict. A clear elbow is the exception, not the rule. In practice, you’ll often find yourself choosing between two or three plausible solutions. When that happens, the deciding factor should be interpretability: which solution produces clusters that make sense for your research question and can be described in a way that a decision-maker would find actionable?
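A second numeric check you can set beside the scree plot is the mean silhouette score, which compares how close each respondent is to its own cluster versus the nearest other cluster. The sketch below assumes scikit-learn’s silhouette_score and uses synthetic data with three built-in groups. It is one more input to your judgment, not a replacement for interpretability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated groups in 4 dimensions
rng = np.random.default_rng(7)
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 6]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 4)) for c in centers])

scores = {}
for k in range(2, 7):   # the silhouette is undefined for K=1
    labels = KMeans(n_clusters=k, n_init=25, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: mean silhouette = {s:.3f}")
print(f"Highest mean silhouette at K={best_k}")
```

Higher is better, with a maximum of 1. When two values of K score about the same, you are back to asking which solution you can actually explain to a decision-maker.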

If you pre-processed with PCA and retained two components, this is where the 2D visualization pays off:

R version first, then the Python version:

# Visualizing clusters on PCA dimensions
kclusts |>
  unnest(augmented) |>
  filter(k == 3) |>
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = .cluster)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_viridis_d() +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    plot.title.position = "plot",
    legend.position = "top",
    legend.justification = "left"
  ) +
  labs(title = "Three-Cluster Solution on PCA Dimensions")

# Same plot for the four-cluster solution
kclusts |>
  unnest(augmented) |>
  filter(k == 4) |>
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = .cluster)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_viridis_d() +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    plot.title.position = "plot",
    legend.position = "top",
    legend.justification = "left"
  ) +
  labs(title = "Four-Cluster Solution on PCA Dimensions")

km_final = KMeans(n_clusters=4, n_init=25, random_state=42)
pca_scores["cluster"] = km_final.fit_predict(df_scaled).astype(str)

fig, ax = plt.subplots(figsize=(8, 6))
for cl, group in pca_scores.groupby("cluster"):
    ax.scatter(group["PC1"], group["PC2"], alpha=0.4, s=15, label=f"Cluster {cl}")
ax.legend()
ax.set_title("Four-Cluster Solution on PCA Dimensions")
plt.show()

Notice how much the clusters overlap in the four-cluster solution. Let’s go with three clusters for the remainder of this analysis.

Profiling: Making Clusters Meaningful

Once you’ve chosen a K and assigned respondents to clusters, the real analytical work begins. Cluster assignments by themselves are just numbers. Your job is to profile the clusters—describe them in terms that illuminate who these people are and how they differ.

Start by comparing the cluster means on the variables you used for clustering. Remember: your best practice is to profile on the original, unscaled data. When you tell a product manager that Segment B rated coffee quality a 4.6 out of 5, they know exactly what that means. When you tell them the z-score was 1.2, you’ve lost them.

For visualization, however, scaling remains useful. A line chart of scaled cluster means across variables makes the relative peaks and valleys pop—it shows you the shape of each segment’s profile:

R version first (with its output table), then the Python version:

# Using my favorite color palette here
c8 <- c(
  "#C0292D", "#463752", "#609B71", "#DBA940", "#D4663A", "#9E4B6C", "#7A6880", "#251E2B"
)


k3 <- kclusts |>
  unnest(augmented) |>
  filter(k == 3)

# Profile chart: scaled data for VISUALIZATION
k3 |>
  select(.cluster, all_of(CLUSTER_VARS)) |>
  mutate(across(where(is.numeric), \(x) as.numeric(scale(x)))) |>
  pivot_longer(-.cluster, names_to = "variable", values_to = "score") |>
  group_by(.cluster, variable) |>
  summarise(mu = mean(score), .groups = "drop") |>
  ggplot(aes(x = variable, y = mu, color = .cluster, group = .cluster)) +
  geom_line(linewidth = 1.2, show.legend = FALSE) +
  geom_point(size = 3) +
  scale_color_manual(values = c8[1:3]) +
  scale_x_discrete(expand = expansion(mult = c(0.15, 0.1))) +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    plot.title.position = "plot",
    legend.position = "top",
    legend.justification = "left",
    legend.text = element_text(size = 14),
    legend.title = element_text(size = 14),
    legend.key.height = unit(24, "pt"),
    axis.text.x = element_text(size = 12, face = "bold", angle = 25, hjust = 1)
  ) +
  labs(
    title = "Cluster Profiles",
    subtitle = "Mean scores on scaled variables",
    x = "", y = ""
  )

# Summary table: original data for REPORTING
library(gt)  # provides gt(), tab_options(), fmt_number(), opt_row_striping()
k3 |>
  group_by(Cluster = .cluster) |>
  summarise(
    N = n(),
    `Avg Profit` = mean(profit_per_transaction),
    `Avg Transactions` = mean(total_transactions),
    `Coffee Quality` = mean(imp_coffee_quality),
    `Store Comfort` = mean(imp_store_comfort),
    NPS = mean(nps),
    .groups = "drop"
  ) |>
  gt() |>
  tab_options(
    table.align = "left"
  ) |> 
  fmt_number(columns = `Avg Profit`:`NPS`, decimals = 1) |> 
  opt_row_striping(row_striping = TRUE)

Cluster	N	Avg Profit	Avg Transactions	Coffee Quality	Store Comfort	NPS
1	450	−0.1	8.8	2.5	3.3	8.4
2	295	7.6	223.8	2.5	3.7	8.3
3	597	6.7	83.6	3.5	3.2	8.0

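# Refit with K=3 to match the solution chosen above
# (km_final from the earlier plot is the four-cluster model)
km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(df_scaled)

k3 = df.copy()
k3["cluster"] = km3.labels_.astype(str)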

# Profile chart: scaled data for VISUALIZATION
# Average the respondent-level z-scores within each cluster
# (mirrors the R version, which scales first and then takes cluster means)
scaled_profiles = df_scaled.groupby(k3["cluster"]).mean()

scaled_profiles.T.plot(marker="o", figsize=(10, 6))
plt.title("Cluster Profiles (Scaled)")
plt.ylabel("Standardized Mean")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# Summary table: original data for REPORTING
print(
    k3.groupby("cluster")
    .agg(
        N=("customer_id", "count"),
        avg_profit=("profit_per_transaction", "mean"),
        avg_transactions=("total_transactions", "mean"),
        coffee_quality=("imp_coffee_quality", "mean"),
        store_comfort=("imp_store_comfort", "mean"),
        nps=("nps", "mean")
    )
    .round(1)
)

Then extend the profiling beyond the clustering variables. Remember the variables we held out: loyalty_member, buys_whole_bean, uses_wifi, is_local_resident, follows_on_social, and nps? Cross-tabulate them by cluster. This is where the segments come alive. You’re not just saying “these people rated attributes differently.” You’re saying “these people rated attributes differently and they’re mostly loyalty members who live nearby and follow the brand on social media.”
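A minimal sketch of that cross-tabulation follows, using a tiny invented data frame so the arithmetic is easy to check by hand. With the real data you would substitute the data frame that carries your cluster labels; HOLDOUT_VARS is a hypothetical name.

```python
import pandas as pd

# Tiny invented example: cluster labels plus two holdout indicators
profile = pd.DataFrame({
    "cluster":           ["1", "1", "1", "1", "2", "2", "2", "2"],
    "loyalty_member":    [0, 0, 1, 0, 1, 1, 1, 0],
    "is_local_resident": [1, 0, 0, 0, 1, 1, 0, 1],
})

HOLDOUT_VARS = ["loyalty_member", "is_local_resident"]

# For 0/1 indicators, the mean within a cluster is the share answering "yes"
holdout_shares = profile.groupby("cluster")[HOLDOUT_VARS].mean()
print((holdout_shares * 100).round(0))

# Row-percentage cross-tab for a single indicator
loyalty_xtab = pd.crosstab(
    profile["cluster"], profile["loyalty_member"], normalize="index"
)
print(loyalty_xtab.round(2))
```

In this toy example, 25% of cluster 1 and 75% of cluster 2 are loyalty members, which is the kind of contrast that turns a numbered cluster into a describable segment.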

A Profiling Checklist
Step	Purpose	What to Look For
Size comparison	How big is each cluster?	Are any clusters too small to act on?
Clustering variables	What defines each cluster?	Peaks and valleys in the profile chart
Demographics	Who are these people?	Age, gender, income, geography
Behavioral data	What do they do?	Purchase frequency, tenure, channel preference
Satisfaction / NPS	How do they feel?	Differences in loyalty and advocacy
Open-ends (if available)	What do they say in their own words?	Qualitative texture to the quantitative profile

A Few Cautions

Clustering is powerful, but it’s easy to over-interpret. A few things to keep in mind.

K-means will always produce clusters, even if the data doesn’t really contain natural groupings. The algorithm doesn’t tell you whether the clusters are meaningful—that’s your job. Always ask whether the differences between clusters are substantively large enough to matter, not just statistically present.
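You can see this for yourself by clustering pure noise. The sketch below fits three clusters to random data with no groups by construction, and to data with three real groups, then compares mean silhouette scores. Both datasets are synthetic, and the silhouette is just one convenient yardstick assumed here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)

# Case 1: pure noise, 11 independent variables, no groups by construction
noise = rng.normal(size=(600, 11))

# Case 2: three groups whose means differ on every variable
group_means = np.array([-3.0, 0.0, 3.0])
structured = np.vstack([rng.normal(loc=m, size=(200, 11)) for m in group_means])

def fit_three(X):
    """Fit a three-cluster K-means and return labels plus mean silhouette."""
    labels = KMeans(n_clusters=3, n_init=25, random_state=42).fit_predict(X)
    return labels, silhouette_score(X, labels)

noise_labels, noise_sil = fit_three(noise)
struct_labels, struct_sil = fit_three(structured)

print("Cluster sizes on noise:     ", np.bincount(noise_labels))
print("Cluster sizes on structured:", np.bincount(struct_labels))
print(f"Mean silhouette: noise = {noise_sil:.2f}, structured = {struct_sil:.2f}")
```

The algorithm returns three populated clusters in both cases. Only the separation measure, and your own profiling, reveal that the first set is arbitrary.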

Cluster solutions can also be unstable. Small changes to the data (removing a few respondents, adding a variable) can shift the results, and so can the random starting points. The second problem is why nstart matters (in R) and n_init matters (in Python). The first is why it’s good practice to run your analysis on a random split of the data and check whether the same basic structure emerges in both halves.
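One way to run that split-half check is sketched below. It fits K-means separately on two random halves, assigns every respondent using each half’s model, and compares the two sets of assignments with the adjusted Rand index. The data are synthetic, and the choice of the adjusted Rand index is an assumption of this sketch rather than something prescribed in this note.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: three groups in 5 dimensions (invented data)
rng = np.random.default_rng(11)
X = np.vstack([rng.normal(loc=m, size=(200, 5)) for m in (-4.0, 0.0, 4.0)])

# Randomly split respondents into two halves
idx = rng.permutation(len(X))
half_a, half_b = idx[: len(X) // 2], idx[len(X) // 2 :]

km_a = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X[half_a])
km_b = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X[half_b])

# Assign everyone with both models and compare the two groupings.
# The adjusted Rand index ignores how clusters are numbered: 1.0 means
# identical groupings, values near 0 mean agreement no better than chance.
labels_from_a = km_a.predict(X)
labels_from_b = km_b.predict(X)
stability = adjusted_rand_score(labels_from_a, labels_from_b)

print(f"Split-half agreement (adjusted Rand index): {stability:.2f}")
```

With real survey data, expect agreement well below the near-perfect score this tidy example produces. A low value is a signal to revisit K or the choice of clustering variables.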

Finally, remember that clustering is exploratory. It generates hypotheses about the structure of your market. It doesn’t confirm them. If your segmentation identifies a promising cluster of high-value customers, the next step is to validate that segment with additional research—not to build an entire marketing strategy on one cluster solution from one survey.

AI Exploration Prompts

“I loaded the coffee customer dataset and want to run K-means in R [or Python]. Walk me through scaling, running multiple K values, and producing a scree plot.”
“My scree plot doesn’t show a clear elbow—K=3 and K=4 both look reasonable. What other criteria should I use to decide?”
“I’ve chosen a four-cluster solution. Help me write profiling code that compares cluster means on the original (unscaled) data and produces a summary table suitable for a presentation.”
“I want to try PCA before clustering. My first two components explain 38% of the variance. Is that enough, or should I include a third component?”

Note: AI is useful for debugging code and brainstorming analytical approaches, but the interpretation of clusters—deciding what they mean for the business—requires your judgment, not the model’s.