Statistical Checks: The Data Literacy Skill Nobody Talks About - Part 2

The midweek playbook for turning book smarts into career-making influence.


Why this issue matters:

Statistical tests are powerful and often misunderstood.


Choosing and defending the right test builds trust. Choosing poorly can misdirect a project and erode credibility.

This week, data scientist Aparna Joseph shares a practical framework for selecting tests under pressure, so you can stop hoping your numbers are not challenged and start defending them with authority.

Good day to you one and all!

Last week, Aparna Joseph, a data scientist completing her master’s, broke down one of the most mystical yet critical parts of analytics work: statistical checks.

She bridges two worlds: the technical detail data scientists live in and the practical reality every other role in a project needs to understand.

Her point was simple: checks aren’t about how to crunch the numbers, they’re about why your analysis holds up when it’s under fire. Skip them, and even the most polished deck can quietly steer a project off a cliff.

Here’s Aparna with the second in her short series: a piece that should be required reading for anyone who touches data.

..

How to choose the right statistical test when someone's watching (and your credibility is on the line)

by Aparna Joseph (LinkedIn)

You've been in that meeting. (Yes - I know - we all hate meetings - they are a fact of life and especially so in a Data and Analytics career - so get used to it!!!)

The slides are polished. The charts are neat. The story sounds obvious.

And yet, something in the back of your mind says: "Do we really know this is true?"

Last week, I introduced article 1 of a multi-part series breaking down why statistical checks aren’t just for data scientists - they are for everyone who touches a data project.

In fact, they’re part of the basic data literacy every modern professional should have. Because if you can understand why they matter, you can ask better questions, spot shaky analysis, and protect your organisation from expensive mistakes.

Today, I'm continuing with article 2: how to choose the right statistical test when someone asks a clear question and the clock is ticking.

By the end, you’ll know:

  • Why test choice is the safety net you didn’t know you needed

  • Where it fits in the process so you can spot problems early

  • How it protects your credibility when the numbers are on the line

Let's go!

Why statistical checks aren’t optional - and how to spot when they’re missing, Part 2 (The 2-minute test picker leaders trust)

Being right is not enough.

You need to be provably right.

Picture this: The slide shows a lift. The room leans in. "Are we sure?" You smile, because you can show the test, the assumptions, and the range of likely outcomes.

Too many teams ship on vibes.

Your career compounds when leaders trust your numbers.

The Problem: Most analysts freeze when someone challenges their test choice. They know correlation isn't causation, but when pressed on why they picked a t-test over Mann-Whitney, they mumble about assumptions and hope the meeting moves on.

The Damage: Every stumble in these moments costs you credibility. Leaders remember who can defend their work under pressure and who can't. The analyst who says "I need to double-check that" loses influence. The one who explains the reasoning gets promoted.

"You do not need every test in a textbook. You need a routine you can run when someone asks a clear question and the clock is ticking."

How Test Choice Protects You

Sometimes numbers tell a story you want to believe. Let’s say sales jump right after a product change. Without the right test, you can’t know if the change really caused the jump, or if it was seasonal demand, a competitor’s slip-up, or just plain luck. Test choice separates the signal from noise so you’re not building strategy on coincidence.

It also guards against the “beautifully wrong” analysis. One that produces impressive results for all the wrong reasons. A test can look like it’s confirming your hypothesis brilliantly, but if it’s the wrong fit for your data, those results will fall apart when the context changes.

Statistical test choice comes with assumptions. Even the simplest t-test expects that data is roughly normal, that variances are similar, and that observations are independent.

If any of these break, your results can quietly drift from reality. Without choosing wisely, you might never notice (until the business starts making decisions based on bad numbers).

Test choice also cuts through “question clutter”. It’s common to have business questions that look similar but need different tests. Worse, some questions may require non-parametric alternatives if your data is skewed. Choosing correctly keeps the work lean and focused.

And then there’s the problem of overconfidence.

Results always come with uncertainty, yet many reports present a single, definitive answer. The right test lets you express that uncertainty honestly .. “We expect between 1,150 and 1,250 sign-ups, with 95% confidence”.

Which is far more credible and useful than pretending you know exactly what will happen.

Last week we found out when to apply statistical tests ..

So .. how do we go about picking the right test?

The Solution: A 2-minute picker that maps any business question to the right test, checks the assumptions that matter, and gives you language that sounds authoritative because it is.

Pick the right test, defend it in the room, and move decisions forward

Step 1: Name the outcome before you touch a tool

Say what you are measuring:

  • Numerical: values you can add or average, like revenue, time on page, or ratings

  • Categorical: labels or groups, like device type, region, or plan tier

  • Binary: yes or no results, like clicked or not, converted or not

Naming the outcome keeps test choice honest later.

Step 2: Match the question to the right family of tests

Most business questions fall into five patterns. Each pattern has a specific logic for why certain tests work better than others.

Use the smallest valid test that answers the question.

A) Compare two independent groups on a numeric outcome

This is the most common scenario. You have two separate groups and want to know if their averages differ meaningfully.

Use a t-test when the outcome is roughly normal. The t-test compares means directly, which gives you the cleanest business interpretation. It assumes your data follows a bell curve and has similar spread in both groups. When these conditions hold, the t-test has more statistical power, meaning it's better at detecting real differences when they exist.

Use Mann-Whitney U when the outcome is skewed or has outliers. This test compares the ranks of values instead of the raw numbers, making it immune to extreme values that would distort a t-test. It's asking "does one group tend to have higher values than the other?" rather than "are the averages different?" Use this when your data is heavily skewed, has clear outliers, or when you're dealing with ordinal data like satisfaction ratings.

Example: Are average purchases different between mobile and desktop users?
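
To make this concrete, here is a minimal sketch of both tests in Python using SciPy. The group names and purchase values are synthetic, illustrative assumptions, not real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Illustrative synthetic purchase values for two independent groups
mobile = rng.normal(52.0, 12.0, size=200)
desktop = rng.normal(55.0, 12.0, size=200)

# Welch's t-test: compares the means without assuming equal variances
t_stat, p_t = stats.ttest_ind(mobile, desktop, equal_var=False)

# Mann-Whitney U: rank-based alternative for skewed data or outliers
u_stat, p_u = stats.mannwhitneyu(mobile, desktop, alternative="two-sided")

print(f"Welch t-test:   t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.0f}, p = {p_u:.4f}")
```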

B) Compare three or more groups on a numeric outcome

When you have multiple groups, you need tests designed to handle the complexity of multiple comparisons.

Use ANOVA when normality is reasonable. ANOVA asks whether the means across all groups are equal, or if at least one group is systematically different. It controls the overall error rate when making multiple comparisons, something that breaks down if you run multiple t-tests. ANOVA assumes roughly normal data within each group and similar variances across groups.

Use Kruskal-Wallis when it is not. This is the non-parametric version of ANOVA. It converts all values to ranks and tests whether the average ranks differ across groups. Use this when your data is skewed, has outliers, or when you have ordinal outcomes. The trade-off is that you're testing for differences in rank order rather than actual mean values.

Example: Do customer satisfaction scores differ across regions?
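
A minimal sketch of both options, again with SciPy; the region names and satisfaction scores are synthetic assumptions for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Illustrative synthetic satisfaction scores (1-10 scale) for three regions
north = rng.normal(7.2, 1.1, size=120)
south = rng.normal(7.0, 1.1, size=120)
west = rng.normal(7.6, 1.1, size=120)

# One-way ANOVA: are all group means equal, or is at least one different?
f_stat, p_anova = stats.f_oneway(north, south, west)

# Kruskal-Wallis: rank-based alternative when normality is doubtful
h_stat, p_kw = stats.kruskal(north, south, west)

print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```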

C) Compare the same people before and after a change

This is about measuring change within individuals, not between groups. The key insight is that you're analysing the differences between paired measurements.

Use a paired t-test for roughly normal data. This test looks at the difference between each person's before and after measurements, then asks if the average of those differences is significantly different from zero. Because it accounts for individual baseline differences, paired tests are more powerful than comparing independent groups.

Use Wilcoxon signed-rank when it is not. This test ranks the absolute differences between paired measurements, then considers whether positive or negative changes dominate. It requires that the differences are symmetrically distributed around the median, but doesn't need normal distribution. Use this for skewed differences or ordinal outcomes.

Example: Did page load time improve after a backend update?
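
Here is a minimal sketch of the paired versions in SciPy. The load times and the size of the improvement are assumed, purely to show the mechanics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Illustrative synthetic load times (ms) for the same 150 pages, before and after a change
before = rng.lognormal(mean=6.0, sigma=0.3, size=150)
after = before * rng.uniform(0.6, 0.9, size=150)  # assumed improvement, for illustration only

# Paired t-test: is the mean of the per-page differences different from zero?
t_stat, p_paired = stats.ttest_rel(before, after)

# Wilcoxon signed-rank: rank-based alternative for skewed differences
w_stat, p_wilcoxon = stats.wilcoxon(before, after)

print(f"Paired t-test:        t = {t_stat:.2f}, p = {p_paired:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.0f}, p = {p_wilcoxon:.4f}")
```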

D) Check if two variables are related

This is about association, not causation. You're asking whether knowing one variable helps predict another.

Two numeric variables → Correlation. Use Pearson correlation when the relationship is linear. As one variable increases, the other increases (or decreases) at a consistent rate. Use Spearman correlation when the relationship is monotonic but not necessarily linear. One consistently goes up as the other goes up, but the rate of change varies.

Two categorical variables → Chi-square test of independence. This test compares the observed pattern of how categories coincide with the pattern you'd expect if the variables were unrelated. It requires that expected frequencies in each cell are at least 5, and that observations are independent.

Examples: Is time on site related to total spend? Are device type and subscription level related?
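
A minimal sketch covering both cases with SciPy; the customer data and the device-by-plan table below are hypothetical numbers, not results from any real analysis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Illustrative synthetic data: time on site (minutes) and total spend for 300 customers
time_on_site = rng.gamma(shape=2.0, scale=5.0, size=300)
spend = 4.0 * time_on_site + rng.normal(0, 10, size=300)

# Pearson: strength of a linear relationship; Spearman: any monotonic relationship
r_pearson, p_pearson = stats.pearsonr(time_on_site, spend)
r_spearman, p_spearman = stats.spearmanr(time_on_site, spend)

# Chi-square test of independence on an assumed device-type x plan-tier table
observed = np.array([[120, 80, 40],   # mobile: basic, plus, premium
                     [60, 90, 70]])   # desktop: basic, plus, premium
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

print(f"Pearson r = {r_pearson:.2f} (p = {p_pearson:.4f}), Spearman rho = {r_spearman:.2f}")
print(f"Chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4f}")
```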

E) Test the effect of a feature on a binary outcome

This covers yes/no outcomes like conversion, click-through, or success rates.

Two rates → z-test for proportions. This directly compares the success rates between two groups. It assumes large enough sample sizes that the sampling distribution is approximately normal. This is the cleanest test when you simply want to know if two conversion rates are different.

Multiple features or controls → Logistic regression. When you need to control for other variables or test multiple features simultaneously, logistic regression provides the framework. It can handle complex designs while giving you odds ratios that translate cleanly to business impact. Use this when simple proportion tests aren't sufficient for your decision context.

Example: Do users who saw the new design convert more than users who did not?
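
A minimal sketch of both approaches using statsmodels. The conversion counts, the "saw new design" and "is mobile" variables, and the effect sizes baked into the simulation are all assumptions for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B counts: conversions and visitors for new design vs control
conversions = np.array([420, 360])
visitors = np.array([10_000, 10_000])

# Two-proportion z-test: is the gap between the two conversion rates real?
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Logistic regression: same question, with room to control for other factors
rng = np.random.default_rng(5)
saw_new_design = rng.integers(0, 2, size=2000)
is_mobile = rng.integers(0, 2, size=2000)
true_logit = -3.2 + 0.18 * saw_new_design + 0.10 * is_mobile  # assumed effects
converted = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(np.column_stack([saw_new_design, is_mobile]))
result = sm.Logit(converted, X).fit(disp=False)
print(f"Odds ratio for new design, controlling for device: {np.exp(result.params[1]):.2f}")
```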

The key principle: match the complexity of your test to the complexity of your question. Don't use a complicated test when a simple one answers what leadership needs to know. Don't use a simple test when the business question requires you to control for confounding factors.

Step 3: Check assumptions so your result is trustworthy

Parametric tests assume approximate normality and similar variance across groups. All tests assume independence.

  • Normality: run Shapiro-Wilk and also look at a histogram or Q-Q plot

  • Equal variance: check with Levene's test

  • Independence: confirm the design so the same user is not counted twice

Important nuance: on very large datasets, formal normality tests can flag tiny, irrelevant deviations. Pair them with visual checks and judgment. If assumptions clearly fail, either choose the non-parametric alternative or use a simple transformation, like a log, to tame heavy right-skew.

The goal is not to force normality. The goal is a result people can trust.
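
Here is a minimal sketch of those checks in SciPy, on deliberately skewed synthetic data. Independence is a design question rather than a test, so it does not appear in the code; the group values and the log-transform fallback are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
# Illustrative synthetic revenue samples for two groups (deliberately right-skewed)
group_a = rng.lognormal(mean=3.0, sigma=0.8, size=250)
group_b = rng.lognormal(mean=3.1, sigma=0.8, size=250)

# Normality: Shapiro-Wilk per group (pair with a histogram or Q-Q plot on large samples)
for name, sample in [("A", group_a), ("B", group_b)]:
    w, p = stats.shapiro(sample)
    print(f"Shapiro-Wilk group {name}: W = {w:.3f}, p = {p:.4f}")

# Equal variance: Levene's test, which is robust to non-normality
lev_stat, lev_p = stats.levene(group_a, group_b)
print(f"Levene: W = {lev_stat:.2f}, p = {lev_p:.4f}")

# If assumptions clearly fail: log-transform to tame right-skew, or go non-parametric
t_log, p_log = stats.ttest_ind(np.log(group_a), np.log(group_b))
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"t-test on log scale: p = {p_log:.4f}; Mann-Whitney U: p = {p_u:.4f}")
```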

Step 4: Report like a decision maker

Statistical significance is only part of the story. Decision quality improves when you add two lenses:

  • Effect size: how big the difference is, in units leaders care about

  • Confidence interval: the plausible range for the true effect so risk can be priced

Example: a shift from 3.00% to 3.01% may be statistically real yet commercially trivial. Say that plainly, then recommend more data or a different lever.
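
To make the two lenses concrete, here is a minimal sketch of reporting an absolute lift with a 95% confidence interval, using hypothetical counts and a simple normal approximation (the numbers are assumptions, not real results):

```python
import numpy as np
from scipy import stats

# Hypothetical counts for a variant and a control group
conv_variant, n_variant = 690, 15_000   # about 4.6%
conv_control, n_control = 600, 15_000   # about 4.0%

p1, p2 = conv_variant / n_variant, conv_control / n_control

# Effect size in units leaders care about: absolute lift in percentage points
lift = p1 - p2

# 95% confidence interval for the difference in proportions (normal approximation)
se = np.sqrt(p1 * (1 - p1) / n_variant + p2 * (1 - p2) / n_control)
z = stats.norm.ppf(0.975)
low, high = lift - z * se, lift + z * se

print(f"Lift: {lift * 100:.2f} points, 95% CI: {low * 100:.2f} to {high * 100:.2f} points")
```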

What to ask in any review, even if you did not run the test

  • Are we using the right test for this data and question?

  • Did we check the assumptions, and what did we see?

  • How confident are we that this effect is real rather than noise, stated with an effect size and a confidence interval in business terms?

The right test depends on the outcome type, the number of groups, and the exact question. If in doubt, involve a data scientist early, before a decision rests on shaky numbers.

Statistical testing is not red tape. It is how you turn guesswork into evidence.

Worked examples

Marketing A/B
Outcome: binary. Test: z-test for proportions.

"Variant B converted 4.2% versus 3.6% for control, a lift of 0.6 points. The 95% confidence interval is 0.2 to 1.0. At current traffic that is about 180 additional signups per month. Risk that the true lift is near zero is low. I recommend we ship B and monitor for one week."

Of course - you would never ever say those words to any executive or stakeholder - something like this would be more suitable:

"Variant B had a signup rate of 4.2%, compared to 3.6% for the control group. Our data suggests this improvement is unlikely to be due to chance. It means about 180 more signups per month if the trend holds. I recommend rolling out Variant B and monitoring the results for a week to confirm sustained performance."

Product latency
Outcome: numeric, same endpoints before and after, skewed distribution. Test: Wilcoxon signed-rank.

"Median p95 latency improved by 160 ms. Wilcoxon p = 0.01. The 95% interval is 90 to 230 ms. This enables Feature X and is likely to nudge checkout completion by about 0.4 points. Recommend we proceed and recheck in two weeks."

Again - you would never ever say those words to any executive or stakeholder - something like this would be more suitable:

"The slowest 5% of page loads got 160 milliseconds faster after the fix. This change is very likely real, but as with any change, there is some uncertainty. It should help the checkout process work smoother and may increase sales a little. I suggest we keep an eye on it for two weeks to make sure it lasts."

Statistical tests aren't about being technically correct; they're about building the confidence for a decision. If your analysis doesn't give leadership the confidence to act, it's just expensive trivia.

Use this routine, modify it as you gain more confidence and practice, pair it with effect sizes and confidence intervals, and become the person leadership relies on.

I’m learning a lot as I grow in this field, and sharing what’s helped me think more clearly.
Thanks for reading, I hope this gave you something useful to take with you.
- Aparna

P.S. Forward this to one analytics teammate who needs to sound more confident in review meetings and help them climb the Ladder.

Want to support our work? Click on the ad, it keeps this newsletter free for all subscribers, and helps us to continue delivering the maximum value we can for you.

$10,000 free ad credits to test catalog ads on TV

LIMITED TIME ONLY.

Marpipe, the leader in catalog advertising, partnered with Universal Ads to launch catalog ads on streaming TV for the first time ever.

This isn’t your typical “brand awareness” TV play - this is pure performance marketing on premium streaming inventory.

If you’re running catalog ads on Meta or Google, you know they work. Higher ROAS, better performance, consistent results.

The problem? You’ve been stuck on small screens.

Not anymore.

Turn your existing product feed into high-performing video ads and launch them on TV as effortlessly as you do on social.

We’re so confident it’ll be your next big growth channel that we’re offering $10,000 in free ad credits with no strings attached to test it out. Be an early mover on a massive wave before everyone else catches on.

Ready to claim your $10,000?

Know one teammate who’s drowning in rework or worried AI is eating their job? Forward this to them—you’ll help them climb and unlock the new referral reward: the Delta Teams Playbook, your crisis-mode toolkit when the wheels come off.

Not on The Analytics Ladder yet? You’re missing the brand-new 90-Day Analytics Leadership Action Kit. It’s free the moment you join—your step-by-step playbook to win trust in 14 days, build a system by day 45, and prove dollar impact by day 90.

Disclaimer: Some of the articles and excerpts referenced in this issue may be copyrighted material. They are included here strictly for review, commentary and educational purposes. We believe this constitutes fair use (or “fair dealing” in some jurisdictions) under applicable copyright laws. If you wish to use any copyrighted material from this newsletter for purposes beyond your personal use, please obtain permission from the copyright owner.

The information in this newsletter is provided for general educational purposes only. It does not constitute professional, financial, or legal advice. You use this material entirely at your own risk. No guarantees, warranties, or representations are made about accuracy, completeness, or fitness for purpose. Always observe all laws, statutory obligations, and regulatory requirements in your jurisdiction. Neither the author nor EchelonIQ Pty Ltd accepts any liability for loss, damage, or consequences arising from reliance on this content.

https://www.echeloniq.ai

Visit our website to see who we are, what we do.

https://echeloniq.ai/echelonedge

Our blog covering the big issues in deploying analytics at scale in enterprises.