Target Knew a Teenager Was Pregnant Before Her Father Did
In 2012, The New York Times reported that Target's data science team had built a "pregnancy prediction" model from the purchase histories of millions of customers. They identified roughly 25 products that, when bought together, signaled a customer was likely pregnant: unscented lotion, calcium, magnesium, and zinc supplements, extra-large bags of cotton balls, hand sanitizer, and washcloths. The model assigned each customer a "pregnancy score" and a predicted due date, and Target mailed coupons for baby products accordingly. One day, a man stormed into a Target outside Minneapolis and demanded to see the manager: "My daughter got this in the mail. She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?" The manager apologized profusely. A few days later, the manager called to follow up, and this time the father apologized: "It turns out there's been some activities in my house I haven't been completely aware of. She's due in August."
This story is creepy - and it demonstrates exactly what data science does. It finds patterns in data that humans cannot see and turns those patterns into actionable decisions. The discipline sits at the intersection of statistics, programming, and domain expertise. It is not machine learning (though it often uses ML). It is not data engineering (though it depends on clean data). It is the full pipeline from question to answer to action. Every tech company runs on it. Most non-tech companies that win in their markets run on it too. Here is how the process actually works, from raw numbers to real decisions.
What Data Science Actually Is
Data science is not just machine learning with a different name. It is a complete methodology for extracting knowledge from data and translating that knowledge into decisions. The full pipeline has seven stages - define the question, collect the data, clean it, explore it, model it, communicate the results, and act on the findings - and skipping any of them produces unreliable results.
The classic Venn diagram of the field - statistics, programming, and domain expertise - is not decorative; it captures a genuine insight about why the field is hard. Someone with strong math but no programming cannot process real datasets at scale. Someone who can code but lacks statistics will build models that appear to work but produce meaningless results. Someone with both skills but no domain expertise will ask the wrong questions and draw conclusions that subject matter experts immediately recognize as nonsensical. Data science requires all three.
The most underrated part of data science is the first step: asking the right question. "Use data to improve our business" is not a question. "Which customers are likely to cancel in the next 30 days, and what retention offers would be cost-effective for each risk segment?" is. The specificity of the question determines the entire analysis. Vague questions produce vague answers. Specific questions produce actionable insights. Most failed data science projects fail not because the model was wrong but because the question was poorly defined.
Data Cleaning: The Unglamorous 80%
The glamorous image of data science is a genius building brilliant models. The reality is an exhausted person at 11 PM staring at a CSV file trying to figure out why there are 15 different spellings of "United States" in the country column. Data cleaning is the unglamorous core of the profession, and it consumes roughly 80% of project time. "Garbage in, garbage out" is the most accurate statement in all of data science.
Here is what real-world data looks like before cleaning, using a sales dataset as an example:
Missing values. 12% of records have no customer email. 3% have no purchase amount. 0.5% have no date. Each requires a different strategy: impute the missing value (fill with the median or a prediction), drop the record, or flag it for manual review.
Duplicates. The same sale appears twice because the system retried after a timeout. Or the same customer appears with two IDs because they created a second account. Deduplication requires fuzzy matching because "John Smith, 123 Main St" and "Jon Smith, 123 Main Street" are probably the same person.
Inconsistent formats. Dates stored as "03/15/2025," "2025-03-15," "March 15, 2025," and "15-Mar-25" in the same column. Country names as "USA," "United States," "US," "U.S.A.," "United States of America," and "united states." Phone numbers with and without country codes, dashes, and parentheses. Each must be standardized to a single format.
Outliers. One record shows a purchase amount of $999,999. Is this a legitimate corporate purchase or a data entry error? Another shows an age of 150. A third shows a negative price. Outliers require domain knowledge to handle correctly - automatically removing them can discard the most interesting data points.
Encoding issues. Names with accented characters (Francois vs. François) rendered incorrectly because the file was saved in one encoding and read in another. Emoji in text fields that crash parsers expecting ASCII. Currency symbols that break number parsing.
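A minimal pandas sketch of a few of these fixes, using a made-up five-row dataset (the column names, lookup table, and strategies are illustrative, not from any real system):

```python
import pandas as pd
from dateutil import parser

# Hypothetical messy sales data illustrating the problems above.
raw = pd.DataFrame({
    "country": ["USA", "United States", "u.s.a.", "US", "usa"],
    "date": ["03/15/2025", "2025-03-15", "March 15, 2025", "15-Mar-25", "2025-03-16"],
    "amount": [19.99, None, 42.50, 42.50, -5.00],
})

# Standardize country names via a lookup table (one of many possible strategies).
country_map = {"usa": "US", "u.s.a.": "US", "us": "US", "united states": "US"}
raw["country"] = raw["country"].str.lower().str.strip().map(country_map)

# Parse heterogeneous date formats; dateutil infers each row's format individually.
raw["date"] = raw["date"].apply(parser.parse)

# Impute missing amounts with the median, and flag impossible negatives for review
# rather than silently dropping them.
raw["amount"] = raw["amount"].fillna(raw["amount"].median())
raw["needs_review"] = raw["amount"] < 0
```

Note the negative price is flagged, not deleted - as the outliers point above says, whether it is an error or a refund is a domain question, not a code question.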
A healthcare data science team at a major hospital spent 6 months cleaning clinical data before building a single model. The electronic health records contained free-text doctor notes with inconsistent abbreviations ("pt" vs "patient"), measurements in different units (kg and lbs mixed in the same column), lab results entered in wrong fields, and dates that were impossible (February 30th). The cleaning phase revealed that 8% of all records had at least one clinically significant error. Fixing those errors improved model accuracy more than any algorithmic change they later made. This is the norm, not the exception.
Exploratory Data Analysis: Look Before You Model
Before building any model, data scientists explore the data visually and statistically. Exploratory Data Analysis (EDA) answers the question: what does this data actually look like, and what stories does it tell?
EDA involves histograms (what is the distribution of customer ages?), scatter plots (is there a relationship between price and quantity sold?), correlation matrices (which variables move together?), and summary statistics (mean, median, standard deviation). These simple tools reveal patterns, outliers, and potential problems that no amount of sophisticated modeling can fix.
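These checks take only a few lines with pandas. A sketch on synthetic data - the price-quantity relationship below is constructed, not from any real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic dataset: quantity sold falls as price rises, plus noise.
price = rng.uniform(5, 50, size=500)
quantity = 200 - 3 * price + rng.normal(0, 10, size=500)
df = pd.DataFrame({"price": price, "quantity": quantity,
                   "age": rng.normal(40, 12, size=500)})

# Summary statistics: mean, std, quartiles for every numeric column.
print(df.describe())

# Correlation matrix: which variables move together?
corr = df.corr()
print(corr.loc["price", "quantity"])  # strongly negative by construction
```

A histogram (`df["age"].hist()`) and a scatter plot (`df.plot.scatter(x="price", y="quantity")`) complete the basic EDA toolkit.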
The most famous cautionary tale in data exploration is Simpson's Paradox: a trend that appears in aggregate data can disappear or reverse when the data is segmented.
The Berkeley case is real. In 1973, UC Berkeley admitted 44% of male applicants and 35% of female applicants overall. This looked like clear discrimination. But when researchers examined each department individually, women were admitted at equal or higher rates in most departments. The paradox: women applied disproportionately to competitive departments with low admission rates (like English), while men applied more to departments with high admission rates (like engineering). The aggregate statistic was technically accurate but deeply misleading. The "bias" was in application patterns, not admissions decisions.
Simpson's Paradox is not a statistical curiosity. It shows up constantly in business data. A drug may appear ineffective overall but work for every subgroup. A website redesign may increase total revenue but decrease revenue per customer (because it attracted more lower-value customers). Without segmented analysis, you draw the wrong conclusions.
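The paradox is easy to reproduce with a few rows of made-up admissions data, constructed so the aggregate and segmented views disagree:

```python
import pandas as pd

# Invented admissions data: two departments, two applicant groups.
df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "group":    ["men", "women", "men", "women"],
    "applied":  [80, 20, 20, 80],
    "admitted": [48, 14, 4, 24],
})

# Aggregate rates: men look favored overall.
agg = df.groupby("group")[["applied", "admitted"]].sum()
agg["rate"] = agg["admitted"] / agg["applied"]
print(agg["rate"])  # men 0.52, women 0.38

# Per-department rates: women are favored in *every* department.
df["rate"] = df["admitted"] / df["applied"]
print(df.pivot(index="dept", columns="group", values="rate"))
```

The reversal happens because women disproportionately applied to the "Hard" department - exactly the Berkeley pattern.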
Data Visualization: Choosing the Right Chart
The purpose of data visualization is not to make things pretty. It is to make patterns visible that are invisible in raw numbers. But the wrong chart can distort the data as badly as no chart at all.
Edward Tufte, the foremost authority on data visualization, established a core principle: maximize the data-ink ratio. Every drop of ink on the page should represent data. Remove gridlines, decorative elements, redundant labels, and 3D effects - they add visual noise without adding information. A clean bar chart with clear labels communicates faster and more accurately than a flashy infographic with gradients and drop shadows.
Four common ways charts mislead:
Truncated y-axis: Starting the y-axis at 95% instead of 0% makes a 2% increase look like a 40% increase
Cherry-picked dates: Showing stock price from its lowest point to today to maximize perceived growth
Dual y-axes: Plotting two unrelated variables on the same chart with independent scales forces a visual correlation that may not exist
Pie charts: Humans are bad at comparing slice angles. Use bar charts instead.
And the corresponding fixes:
Start y-axis at zero (or clearly label the break if truncating for good reason)
Show the full time range relevant to the question being asked
Use a single y-axis or clearly separate dual-axis charts into two panels
Use bar charts for comparisons - humans accurately compare lengths, not angles
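A quick matplotlib sketch of the truncated-axis problem, using invented quarterly numbers - the same data looks dramatic with a truncated axis and modest with a zero-based one:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders without a display
import matplotlib.pyplot as plt

# Invented quarterly conversion rates (percent).
quarters = ["Q1", "Q2", "Q3", "Q4"]
rates = [96.1, 96.8, 97.2, 98.0]

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(8, 3))

# Misleading: a truncated y-axis makes a ~2-point rise look dramatic.
ax_bad.bar(quarters, rates)
ax_bad.set_ylim(95, 99)
ax_bad.set_title("Truncated axis (misleading)")

# Honest: the axis starts at zero, so the change appears in proportion.
ax_good.bar(quarters, rates)
ax_good.set_ylim(0, 100)
ax_good.set_title("Zero-based axis")

fig.savefig("conversion.png")
```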
Statistical Thinking: The Foundation of Sound Analysis
Statistics is the mathematical framework that protects you from drawing conclusions that the data does not support. Without it, data science is just storytelling with numbers.
Correlation does not imply causation. Ice cream sales correlate with drowning deaths. Both are caused by summer heat - ice cream does not cause drowning. A website redesign correlated with increased sales, but it launched the same week as a competitor's outage. Without controlled experiments, you cannot distinguish coincidence from causation. This is the single most abused concept in data-driven decision making.
A/B testing is how tech companies establish causation. Show version A of a web page to 50% of users and version B to the other 50%, randomly assigned. Measure which version produces a higher conversion rate. Because the groups are randomly assigned, any difference in outcomes is attributable to the page design, not to underlying differences between users. Google reportedly runs on the order of 10,000 experiments per year. Every color, button placement, and font size on Google Search has been optimized through controlled experiments.
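A simulated A/B test fits in a dozen lines of standard-library Python. The conversion rates below are invented; the test itself is the standard two-proportion z-test:

```python
import math
import random

random.seed(0)

# Simulated A/B test: assume version A converts at 10%, version B at 12%.
n = 20000
conv_a = sum(random.random() < 0.10 for _ in range(n))
conv_b = sum(random.random() < 0.12 for _ in range(n))

# Two-proportion z-test: is the observed difference larger than chance explains?
p_a, p_b = conv_a / n, conv_b / n
p_pool = (conv_a + conv_b) / (2 * n)
se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
z = (p_b - p_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the normal tail

print(f"A: {p_a:.3f}  B: {p_b:.3f}  z = {z:.2f}  p = {p_value:.4f}")
```

With 20,000 users per group, a true 2-percentage-point difference produces an unambiguous result; with 500 per group, the same simulation frequently fails to reach significance.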
P-values are the most misunderstood concept in statistics. A p-value does not mean "the probability that the result is true." It means "the probability of observing this result (or something more extreme) if the null hypothesis is true." A p-value of 0.03 means: "If there is actually no difference between A and B, there is a 3% chance we would see a difference this large by random chance." The conventional threshold of p < 0.05 is not a law of nature - it is a convention Ronald Fisher popularized in the 1920s, and some researchers have proposed a stricter p < 0.005 threshold for new discoveries.
Sample size determines what you can detect. To detect a 5% relative difference in conversion rate (e.g., 10% vs. 10.5%) at the conventional 5% significance level with 80% power, you need on the order of 55,000-60,000 users per group. Most A/B tests that report "significant" results with 500 users per group are statistically unreliable.
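The standard power calculation behind such figures can be sketched with the standard library, using the usual normal-approximation formula for a two-proportion test:

```python
from statistics import NormalDist
import math

def n_per_group(p_base, p_new, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided significance
    z_beta = z.inv_cdf(power)           # critical value for the desired power
    p_bar = (p_base + p_new) / 2
    delta = abs(p_new - p_base)
    return math.ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2)

# Detecting 10% vs. 10.5% conversion (a 5% relative lift):
print(n_per_group(0.10, 0.105))
```

Halving the detectable difference quadruples the required sample - the delta appears squared in the denominator.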
Confidence intervals are more informative than point estimates. Saying "our conversion rate is 12%" gives no sense of uncertainty. Saying "our conversion rate is 12% with a 95% confidence interval of [10.8%, 13.2%]" tells you the range of plausible values. If the confidence interval for the difference between A and B includes zero, the difference is not statistically significant - even if one number looks bigger than the other.
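Under the usual normal approximation, an interval like that corresponds to a sample of roughly 2,800 visitors (the counts below are invented to match):

```python
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * (p * (1 - p) / n) ** 0.5
    return p - half_width, p + half_width

# Hypothetical: 338 conversions out of 2,817 visitors, i.e. about 12%.
low, high = proportion_ci(338, 2817)
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # roughly [0.108, 0.132]
```

The width shrinks with the square root of n: quadrupling traffic only halves the uncertainty.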
The Data Science Toolkit
The tools data scientists use have standardized around a core stack, with specialized tools for specific tasks.
Python is the dominant language, with three indispensable libraries. pandas provides DataFrames - think of them as programmable spreadsheets. You load a CSV, filter rows, group by categories, calculate statistics, and merge datasets, all in a few lines of code. NumPy handles numerical computation on arrays and matrices at speeds approaching C. matplotlib and seaborn create visualizations, from simple bar charts to complex statistical plots.
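A taste of those few lines, on an invented five-row sales table (the column names are illustrative):

```python
import pandas as pd

# A tiny hypothetical sales table to show the core pandas verbs.
sales = pd.DataFrame({
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["A", "A", "B", "B", "A"],
    "amount":  [120.0, 80.0, 200.0, 150.0, 95.0],
})

# Filter rows, then group and aggregate - the bread and butter of analysis.
big_sales = sales[sales["amount"] > 100]
by_region = sales.groupby("region")["amount"].sum()

# Merge with another table, just like a SQL join.
targets = pd.DataFrame({"region": ["East", "West"], "target": [400.0, 250.0]})
report = by_region.reset_index().merge(targets, on="region")
report["pct_of_target"] = report["amount"] / report["target"]
print(report)
```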
SQL is the language of databases and the single most important skill for a data analyst. Every company stores its data in relational databases. Being able to write SQL queries to extract, filter, join, aggregate, and transform data is non-negotiable. A data scientist who cannot write SQL is like a chef who cannot use a knife.
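SQL's core verbs - filter, aggregate, group, sort - fit in one query. A self-contained sketch using Python's built-in sqlite3 module with an invented orders table:

```python
import sqlite3

# In-memory SQLite database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, product TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 'A', 120.0), ('bob', 'A', 80.0),
        ('alice', 'B', 200.0), ('carol', 'B', 150.0);
""")

# Filter (WHERE), aggregate (SUM), group (GROUP BY), filter groups (HAVING), sort.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 50
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY total DESC;
"""
for customer, total in conn.execute(query):
    print(customer, total)
```

The same verbs work on any relational database; only the connection line changes.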
Jupyter notebooks are the standard environment for exploratory analysis. They let you write code in cells, run each cell independently, and see output (including visualizations) inline. Notebooks are terrible for production code (no version control, no testing, no modularity) but excellent for exploration, prototyping, and communication.
Tableau and Power BI are business intelligence tools that create interactive dashboards without code. Drag and drop data onto a canvas, choose chart types, filter by dimensions. These tools are designed for business analysts and managers who need to explore data without programming.
Excel remains the most widely used data tool in the world, with a user base estimated in the hundreds of millions. For datasets under 100,000 rows and analyses that do not require machine learning, Excel is often faster than writing code. Pivot tables, VLOOKUP, and conditional formatting solve a remarkable range of business data problems. The limitation is scale and reproducibility - Excel analyses are hard to audit, version, and automate.
Data Science Careers: Four Distinct Paths
The "data science" umbrella covers at least four distinct roles with different skills, responsibilities, and salary ranges.
Data Analyst (entry-level, $65K-$90K median). Primary tools: SQL, Excel, Tableau. Writes queries to extract data, creates dashboards and reports, answers business questions like "which products are selling well in Q3?" and "what is our customer retention rate by segment?" This is the most accessible entry point into data careers - strong SQL and visualization skills can get you hired.
Data Scientist ($110K-$160K median). Primary tools: Python, pandas, scikit-learn, statistics. Builds predictive models, runs A/B tests, performs statistical analyses. Answers questions like "which customers will churn in the next 30 days?" and "what factors predict whether a loan will default?" Requires stronger math and programming skills than analyst roles.
Data Engineer ($120K-$170K median). Primary tools: Python, SQL, Spark, Airflow, cloud platforms. Builds the infrastructure that makes data science possible: data pipelines (extracting data from sources, transforming it, loading it into warehouses), data warehouses (structured storage for analysis), and data quality systems. Data engineers do not build models - they build the plumbing that delivers clean data to the scientists who do.
ML Engineer ($130K-$200K median). Primary tools: Python, PyTorch/TensorFlow, Docker, Kubernetes, cloud ML services. Takes models from a data scientist's Jupyter notebook and puts them into production: building APIs, optimizing inference latency, monitoring model performance, handling versioning, and setting up retraining pipelines. The gap between "works in a notebook" and "works in production at scale" is the ML engineer's entire job.
Answers to Questions People Actually Ask
Is data science overhyped? The job title "data scientist" was overhyped in the years after Harvard Business Review dubbed it "the sexiest job of the 21st century" in 2012. What was overhyped was the expectation that a single person with a Python course could solve all of a company's data problems. What was not overhyped is the underlying need: companies that make data-driven decisions outperform those that do not, and the skills required to turn raw data into reliable insights are real and valuable. The title may be less fashionable, but the work is more important than ever.
Do I need a math degree? For a data analyst role, no - solid SQL and business intuition are enough. For a data scientist role, you need working knowledge of probability, statistics, linear algebra, and calculus - but you do not need to prove theorems. You need to understand what a p-value means, when to use logistic regression vs. random forest, and why your sample size matters. A quantitative undergraduate degree (math, statistics, economics, physics, engineering) provides this foundation. Self-study from resources like Khan Academy, StatQuest, and textbooks like "An Introduction to Statistical Learning" can substitute for formal education if you are disciplined.
Python or R? Python. R has a devoted user base in academia and biostatistics, and its visualization libraries (ggplot2) are excellent. But Python dominates in industry because it serves dual duty: data analysis (pandas, scikit-learn) and production engineering (web servers, APIs, automation). If you learn Python for data science, you can also build the production systems around your models. If you learn R, you need a separate language for everything else. The exception: if your entire team uses R and your work stays in analysis (no production deployment), R is fine.
Is Excel really still relevant? Yes. Hundreds of millions of people use it. Most business decisions are made in Excel, not Jupyter notebooks. A marketing manager who needs to analyze campaign results will use a pivot table, not write a Python script. Knowing Excel well is more immediately valuable in most business environments than knowing Python. The smart approach is: learn Excel for quick, everyday analysis. Learn SQL for data extraction. Learn Python for anything that requires automation, scale, or machine learning. Each tool has its zone.
How do I start learning data science? The most effective path for a self-learner: (1) Learn SQL well enough to extract and manipulate data from a database - this is immediately employable. (2) Learn Python basics and the pandas library for data manipulation. (3) Study statistics - focus on probability distributions, hypothesis testing, and regression. (4) Build projects with real (messy) data, not clean textbook datasets. Kaggle competitions, public government datasets, and your own data (spending habits, fitness data, local weather) are all good starting points. (5) Communicate your findings clearly - a data scientist who cannot explain their results to a non-technical audience has done only half the job.
The Bigger Picture: Data as a Competitive Advantage
Data science is not just a job title. It is a decision-making philosophy: replace intuition with evidence, replace arguments with experiments, replace anecdotes with analysis. Companies that have internalized this philosophy - Google, Amazon, Netflix, Spotify - make thousands of data-informed decisions daily, from which products to recommend to how many warehouse workers to schedule next Tuesday.
But data science is also a mirror of the data it runs on, which means it inherits all of the data's problems. Biased data produces biased conclusions. Incomplete data produces incomplete conclusions. And the most dangerous conclusions are the ones that look authoritative - wrapped in charts and p-values - but rest on fundamentally flawed data or flawed assumptions. This is why statistical literacy, skepticism, and domain expertise are not optional add-ons to data science. They are the core of it.
The Target pregnancy prediction story is instructive not because the data science was impressive (it was) but because it raises the question nobody on the data team asked: should we? Having the technical capability to predict pregnancy from purchase patterns does not mean it is wise to act on those predictions by mailing baby product coupons to people who have not disclosed a pregnancy. The data science was correct. The decision to deploy it without ethical consideration was not. That gap - between what data can tell you and what you should do with that knowledge - is the most important territory in the entire field.
The takeaway: Data science is the complete pipeline from question to decision: define the problem, collect data, clean it (80% of the work), explore it, model it, communicate the results, and act on the findings. It requires statistics (to avoid drawing false conclusions), programming (to work at scale), and domain expertise (to ask the right questions). The tools - Python, SQL, Jupyter, Tableau - are learnable. The harder skills are statistical thinking (correlation is not causation, Simpson's Paradox, sample size requirements) and the judgment to know when the data supports a conclusion and when it merely suggests one. Master those, and you have a skill set that applies to every industry, every organization, and every decision that involves numbers - which is nearly all of them.
