AP Statistics

Summary: A unit-by-unit review of the core ideas and formulas in AP Statistics

Unit 1: Exploring One-Variable Data

Core Idea: Stats starts with describing data — what it looks like, how it spreads, and where it centers.
Center: mean (average), median (middle value)
Spread: range, interquartile range (IQR = Q3 – Q1), standard deviation
Shape: symmetric, skewed left/right, uniform, bimodal
Outliers: values far from the norm; rule of thumb: < Q1 – 1.5×IQR or > Q3 + 1.5×IQR
Visuals: histograms, dotplots, boxplots
Z-score: z = (x - mean) / standard deviation → how many standard deviations from the mean
Use context — stats is meaningless without interpretation
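
The summary measures above can be sketched in a few lines of Python. This is only an illustration: the data set and the helper names describe and z_score are made up, and the quartiles come from Python's statistics.quantiles, whose default method can differ slightly from a textbook or calculator on some data sets.

```python
import statistics

def describe(data):
    """Summarize center, spread, and 1.5 x IQR outliers for a list of numbers."""
    # statistics.quantiles sorts the data and returns Q1, median, Q3 for n=4.
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(data),
        "median": median,
        "range": max(data) - min(data),
        "iqr": iqr,
        "stdev": statistics.stdev(data),  # sample standard deviation (divides by n - 1)
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

def z_score(x, mean, stdev):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / stdev

scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up data
summary = describe(scores)
z_30 = z_score(30, summary["mean"], summary["stdev"])
print(summary["outliers"], round(z_30, 2))
```

The value 30 sits above the upper fence, so the rule of thumb flags it; whether it is an error or a real observation still depends on context.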

Unit 2: Exploring Two-Variable Data

Core Idea: Stats gets interesting when you look at relationships between variables.
Scatterplots: Show form (linear/nonlinear), direction (positive/negative), strength
Correlation (r): Measures the strength and direction of a linear relationship; -1 ≤ r ≤ 1
Least Squares Regression Line (LSRL):
- ŷ = a + bx, where b = slope, a = y-intercept
- Slope: b = r * (sy / sx)
- Residual: residual = actual - predicted = y - ŷ
- Coefficient of determination: r² → proportion of the variation in y explained by the linear model
Correlation ≠ causation
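
A rough sketch of the regression formulas above, using made-up data and a hypothetical helper named lsrl. The intercept formula a = ȳ - b·x̄ is not listed above; it is added here on the assumption that the line passes through the point (x̄, ȳ), which holds for any least squares line.

```python
import statistics

def lsrl(xs, ys):
    """Return (a, b, r) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # r is the average product of z-scores, dividing by n - 1.
    r = sum(((x - x_bar) / sx) * ((y - y_bar) / sy) for x, y in zip(xs, ys)) / (n - 1)
    b = r * sy / sx          # slope
    a = y_bar - b * x_bar    # intercept (assumption: line passes through the means)
    return a, b, r

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
a, b, r = lsrl(xs, ys)

predicted = a + b * 3     # y-hat at x = 3
residual = 5 - predicted  # actual - predicted
r_squared = r ** 2
print(a, b, r_squared, residual)
```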

Unit 3: Collecting Data

Core Idea: Good data comes from good design; how you collect data shapes your conclusions.
Types of studies:
- Observational (no treatment imposed), Experimental (researcher imposes treatments)
Sampling methods:
- Simple Random Sample (SRS), Stratified, Cluster, Systematic
Biases:
- Voluntary response, undercoverage, nonresponse, response bias
Experimental design:
- Random assignment, control, replication, comparison
- Blocking: group similar units by a variable known to affect the response, then randomize within each block
Inference: Random sampling → generalize to population. Random assignment → cause-and-effect.
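
One way to picture the difference between random sampling and random assignment is a short simulation. The roster, the strata, and all three function names below are hypothetical; this is a sketch of the ideas, not a standard procedure.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every group of n individuals is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a separate SRS inside each stratum (dict: stratum name -> members)."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

def randomly_assign(units, rng):
    """Shuffle the units and split them into two treatment groups."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
roster = [f"student_{i}" for i in range(20)]  # made-up population

srs = simple_random_sample(roster, 5, rng)
strata = {"juniors": roster[:10], "seniors": roster[10:]}
strat = stratified_sample(strata, 3, rng)
treatment, control = randomly_assign(roster, rng)
print(srs, strat, treatment, control, sep="\n")
```

The first two functions choose who is in the study (which supports generalizing); the third decides who gets which treatment (which supports cause-and-effect conclusions).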

Unit 4: Probability, Random Variables, and Probability Distributions

Core Idea: Probability models randomness and lets us make predictions about long-run outcomes.
Probability rules:
- 0 ≤ P(A) ≤ 1
- P(A or B) = P(A) + P(B) - P(A and B)
- Complement rule: P(not A) = 1 - P(A)
Conditional probability: P(A | B) = P(A and B) / P(B)
Independence: If P(A | B) = P(A), A and B are independent
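
The rules above can be checked on a small two-way table. The counts below are invented for illustration, and exact fractions are used so the independence check is not thrown off by rounding.

```python
from fractions import Fraction

# Made-up counts for 100 individuals, cross-classified by events A and B.
counts = {("A", "B"): 20, ("A", "not B"): 30,
          ("not A", "B"): 20, ("not A", "not B"): 30}
total = sum(counts.values())

p_a = Fraction(counts[("A", "B")] + counts[("A", "not B")], total)
p_b = Fraction(counts[("A", "B")] + counts[("not A", "B")], total)
p_a_and_b = Fraction(counts[("A", "B")], total)

p_a_or_b = p_a + p_b - p_a_and_b   # general addition rule
p_not_a = 1 - p_a                  # complement rule
p_a_given_b = p_a_and_b / p_b      # conditional probability
independent = p_a_given_b == p_a   # independence check

print(p_a_or_b, p_not_a, p_a_given_b, independent)
```

Here P(A | B) equals P(A), so knowing B occurred does not change the probability of A.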
Random Variables:
- Discrete: countable values
- Continuous: any value in an interval
Expected value (mean): E(X) = Σ [x * P(x)]
Standard deviation of X: σ = √( Σ [(x - μ)² * P(x)] ), where μ = E(X)
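
A minimal sketch of the two formulas for a discrete random variable, assuming the distribution is stored as a dictionary of value → probability. The example distribution (number of heads in two fair coin flips) and the function names are illustrative.

```python
import math

def expected_value(dist):
    """E(X) = sum of x * P(x) over all values x."""
    return sum(x * p for x, p in dist.items())

def std_dev(dist):
    """sigma = sqrt of the sum of (x - mu)^2 * P(x)."""
    mu = expected_value(dist)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

# X = number of heads in two fair coin flips.
heads = {0: 0.25, 1: 0.5, 2: 0.25}
mu = expected_value(heads)
sigma = std_dev(heads)
print(mu, sigma)
```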

Unit 5: Sampling Distributions

Core Idea: A statistic (like a sample mean) varies from sample to sample, and this variability is predictable.
Sampling distribution: Distribution of a statistic over all possible samples of the same size
Central Limit Theorem: If n is large (rule of thumb: n ≥ 30), the sampling distribution of the sample mean is approximately normal, regardless of the population's shape
For proportions (p̂):
- Mean: μ_p̂ = p
- Standard deviation: σ_p̂ = √[p(1-p)/n]
For means (x̄):
- Mean: μ_x̄ = μ
- Standard deviation: σ_x̄ = σ / √n
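Conditions: Random; 10% (n ≤ 10% of the population when sampling without replacement); Normality: Large Counts (np ≥ 10, n(1-p) ≥ 10) for proportions, or a normal population or n ≥ 30 for means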
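
A simulation can make the sampling distribution of p̂ concrete. The sketch below assumes a population proportion of 0.3 and samples of size 100 (both made up), draws many samples, and compares the simulated center and spread with the formulas above. The function names are hypothetical.

```python
import math
import random
import statistics

def simulate_p_hats(p, n, reps, rng):
    """Draw `reps` samples of size n and return each sample's proportion p-hat."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

def large_counts_ok(p, n):
    """Large Counts condition: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

p, n = 0.3, 100          # assumed population proportion and sample size
rng = random.Random(1)   # fixed seed so the sketch is repeatable
p_hats = simulate_p_hats(p, n, 5000, rng)

sim_mean = statistics.mean(p_hats)
sim_sd = statistics.stdev(p_hats)
formula_sd = math.sqrt(p * (1 - p) / n)
print(sim_mean, sim_sd, formula_sd, large_counts_ok(p, n))
```

The simulated mean should land close to p and the simulated standard deviation close to √[p(1-p)/n]; the match improves as the number of repetitions grows.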

Unit 6: Inference for Categorical Data: Proportions

Core Idea: Use sample proportions to estimate or test claims about population proportions.
Confidence interval for p:
- p̂ ± z* √[p̂(1 - p̂)/n]
Significance test for p:
- Null: H₀: p = p₀; Alternative: Hₐ: p ≠ p₀, p < p₀, or p > p₀
- Test statistic: z = (p̂ - p₀) / √[p₀(1 - p₀)/n]
- Find the p-value from z and compare it to α: reject H₀ if p-value < α
Interpret confidence level (95%): “In repeated samples, about 95% of the intervals built this way will contain the true proportion.”
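
The interval and test above can be sketched with Python's statistics.NormalDist, which supplies z* and the normal tail areas. The sample (60 successes in 100 trials, tested against p₀ = 0.5) is made up, and the function names are illustrative; conditions still have to be checked separately.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p-value); the standard error uses p0, not p-hat."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf(z)
    if alternative == "less":
        p_value = cdf
    elif alternative == "greater":
        p_value = 1 - cdf
    else:  # two-sided
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value

low, high = one_prop_z_interval(60, 100)
z, p_value = one_prop_z_test(60, 100, p0=0.5)
print(low, high, z, p_value)
```

Note the two different standard errors: the interval uses p̂ because p is unknown, while the test uses p₀ because H₀ is assumed true.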

Unit 7: Inference for Quantitative Data: Means

Core Idea: Same logic as with proportions, just with means and t-distributions instead of z.
Confidence interval for μ:
- x̄ ± t* (s / √n)
Significance test for μ:
- t = (x̄ - μ₀) / (s / √n)
- Use t-distribution with df = n - 1
Still requires: random sample, 10% condition, and a normal population or large n (n ≥ 30)
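
A small sketch of the t procedures using summary statistics. Python's standard library has no t-distribution, so this version assumes t* is looked up in a table and passed in; the summary numbers (x̄ = 52, s = 10, n = 25, μ₀ = 50) are made up, and 2.064 is the table value of t* for 95% confidence with df = 24.

```python
import math

def one_sample_t(x_bar, s, n, mu0, t_star):
    """Return (t statistic, df, confidence interval) from summary statistics."""
    se = s / math.sqrt(n)           # standard error of the sample mean
    t = (x_bar - mu0) / se
    interval = (x_bar - t_star * se, x_bar + t_star * se)
    return t, n - 1, interval

# t_star comes from a t table (assumption: 95% confidence, df = 24).
t, df, (low, high) = one_sample_t(x_bar=52, s=10, n=25, mu0=50, t_star=2.064)
print(t, df, low, high)
```

The p-value would then come from a t table or calculator using the returned t and df.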

Unit 8: Inference for Categorical Data: Chi-Square

Core Idea: Use chi-square tests when you have counts and want to test for relationships in categories.
Types of tests:
- Goodness-of-Fit: one variable, compare to expected distribution
- Homogeneity: multiple populations, same variable
- Independence: one population, two variables
Test statistic:
- χ² = Σ [(observed - expected)² / expected]
Degrees of freedom:
- Goodness-of-fit: df = categories - 1
- Two-way tables: df = (rows - 1)(columns - 1)
Conditions: Random, 10%, all expected counts ≥ 5
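
The chi-square statistic is easy to compute by hand or in code. The sketch below uses made-up counts and hypothetical function names. For two-way tables it assumes the usual expected-count formula, (row total × column total) / grand total, which is not listed above. It stops at the statistic and df; the p-value would come from a chi-square table or calculator.

```python
def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def two_way_chi_square(table):
    """Return (chi-square, df, whether all expected counts are at least 5)."""
    expected = expected_counts(table)
    flat_obs = [o for row in table for o in row]
    flat_exp = [e for row in expected for e in row]
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square_stat(flat_obs, flat_exp), df, min(flat_exp) >= 5

# Goodness-of-fit: made-up observed counts vs. equal expected counts.
gof = chi_square_stat([18, 22, 20], [20, 20, 20])
gof_df = 3 - 1

# Two-way table (homogeneity or independence): made-up 2 x 2 counts.
chi2, df, counts_ok = two_way_chi_square([[30, 20], [20, 30]])
print(gof, gof_df, chi2, df, counts_ok)
```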

Unit 9: Inference for Quantitative Data: Slopes

Core Idea: Use inference to test whether a relationship between variables (in regression) is real or just sample noise.
Model: ŷ = a + bx
Standard error of slope: SE_b (usually given in computer output)
Test statistic:
- t = (b - 0) / SE_b
Degrees of freedom: df = n - 2
Confidence interval for slope:
- b ± t* × SE_b
Interpret in context: Do the data support a real linear relationship between x and y?
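
A short sketch of slope inference, assuming b and SE_b are read from regression output and t* from a table. The numbers (b = 0.6, SE_b = 0.2, n = 12) are made up, the function name is hypothetical, and 2.228 is the table value of t* for 95% confidence with df = 10.

```python
def slope_inference(b, se_b, n, t_star):
    """Return (t statistic for H0: slope = 0, df, confidence interval for the slope)."""
    t = (b - 0) / se_b
    df = n - 2
    interval = (b - t_star * se_b, b + t_star * se_b)
    return t, df, interval

# t_star comes from a t table (assumption: 95% confidence, df = 10).
t, df, (low, high) = slope_inference(b=0.6, se_b=0.2, n=12, t_star=2.228)
slope_could_be_zero = low <= 0 <= high
print(t, df, low, high, slope_could_be_zero)
```

If the interval excludes 0, the data give evidence of a linear relationship at that confidence level.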