
Normal (Gaussian) Distribution

Definition / Introduction

  • The Normal distribution, also known as the Gaussian distribution or the "bell curve," is arguably the most important Continuous Probability Distribution in statistics and many scientific fields.
  • It describes data that cluster around a central mean value (\(\mu\)), with probabilities tapering off symmetrically as values move further away from the mean.
  • Many natural phenomena (e.g., heights, blood pressure), measurement errors, and sums/averages of random variables (due to the Central Limit Theorem) tend to follow a Normal distribution.

Key Concepts

1. The Normal Random Variable

  • A continuous random variable \(X\) follows a Normal distribution with mean \(\mu\) and variance \(\sigma^2\).
  • Notation: \(X \sim \text{Normal}(\mu, \sigma^2)\) or \(X \sim N(\mu, \sigma^2)\).

2. Probability Density Function (PDF)

  • The PDF formula defines the characteristic symmetrical bell shape: $$ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(x-\mu)^2}{2\sigma^2} \right) $$
  • Where:
    • \(x\) is the value of the random variable (\(x \in \mathbb{R}\)).
    • \(\mu\) (mu) is the mean (center) of the distribution (\(\mu \in \mathbb{R}\)).
    • \(\sigma^2\) (sigma-squared) is the variance (spread) of the distribution (\(\sigma^2 > 0\)). \(\sigma = \sqrt{\sigma^2}\) is the standard deviation.
    • \(\pi\) (pi) is the mathematical constant (\(\pi \approx 3.14159...\)).
    • \(e\) is Euler's number (\(e \approx 2.71828...\)); \(\exp(y)\) is equivalent to \(e^y\).
  • The curve is symmetric around \(x = \mu\).
  • The total area under the curve is 1: \(\int_{-\infty}^{\infty} f(x | \mu, \sigma^2) \, dx = 1\).
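The two properties above can be checked numerically. A minimal sketch in Python (standard library only; the function name `normal_pdf` is ours, not from any library) that evaluates the PDF formula, confirms symmetry around \(\mu\), and approximates the total area with a Riemann sum:

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """PDF of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Symmetry around the mean: f(mu - d) == f(mu + d)
assert abs(normal_pdf(3.0, mu=5.0, sigma2=4.0) - normal_pdf(7.0, mu=5.0, sigma2=4.0)) < 1e-12

# Total area under the curve: Riemann sum over mu +/- 10 sigma (tails beyond are negligible)
mu, sigma = 5.0, 2.0
n, lo = 100_000, mu - 10 * sigma
h = (20 * sigma) / n
area = sum(normal_pdf(lo + i * h, mu, sigma ** 2) for i in range(n + 1)) * h
print(round(area, 6))  # close to 1.0
```

Truncating the integral at \(\mu \pm 10\sigma\) is safe here because the PDF decays faster than exponentially in the tails.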

3. Parameters

  • The Normal distribution is completely defined by two parameters:
    • \(\mu\) (Mean): Determines the location (center) of the peak.
    • \(\sigma^2\) (Variance): Determines the spread or width of the bell. (\(\sigma\) is Standard Deviation).

4. Cumulative Distribution Function (CDF)

  • The CDF \(F(x) = P(X \le x)\) is the integral of the PDF from \(-\infty\) to \(x\): $$ F(x | \mu, \sigma^2) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(t-\mu)^2}{2\sigma^2} \right) \, dt $$
  • There is no simple closed-form expression for this integral in terms of elementary functions. Probabilities are typically found using:
    • Statistical software or calculators (using error function erf or standard normal CDF \(\Phi\)).
    • Standard Normal (Z) tables after standardizing the variable.
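In code, the error-function route mentioned above is the usual one, via the identity \(\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)\). A sketch using only the standard library (the function name `normal_cdf` is ours):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# By symmetry, the CDF evaluated at the mean is exactly 0.5
print(normal_cdf(100.0, mu=100.0, sigma=15.0))  # 0.5
```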

5. The Standard Normal Distribution (Z-distribution)

  • A crucial special case where the mean is 0 and the variance (and standard deviation) is 1: \(Z \sim N(0, 1)\).
  • Its PDF is often denoted \(\phi(z)\): $$ \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} $$
  • Its CDF is often denoted \(\Phi(z)\): $$ \Phi(z) = \int_{-\infty}^{z} \phi(t) dt $$
  • Standardization (Z-score): Any Normal random variable \(X \sim N(\mu, \sigma^2)\) can be transformed into a standard normal variable \(Z\) using the formula: $$ Z = \frac{X - \mu}{\sigma} $$ The Z-score measures how many standard deviations \(\sigma\) the value \(X\) is away from its mean \(\mu\).
  • This allows us to calculate probabilities for any Normal distribution using a single table or function for the Standard Normal distribution: \(P(X \le x) = P\left(Z \le \frac{x-\mu}{\sigma}\right) = \Phi\left(\frac{x-\mu}{\sigma}\right)\).
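A worked sketch of this standardization in Python, on a hypothetical example (IQ scores modeled as \(X \sim N(100, 15^2)\); the numbers are illustrative, and the helper `phi` is our own name for \(\Phi\)):

```python
import math

def phi(z):
    """Standard normal CDF Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: X ~ N(100, 15^2). What is P(X <= 130)?
mu, sigma, x = 100.0, 15.0, 130.0
z = (x - mu) / sigma   # z = 2.0: x lies 2 standard deviations above the mean
p = phi(z)             # P(X <= 130) = Phi(2) ~ 0.9772
print(round(z, 2), round(p, 4))
```

The same `phi` function answers probability questions for *any* \(N(\mu, \sigma^2)\) once the value is converted to a Z-score, which is exactly the point of standardization.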

6. The Empirical Rule (68-95-99.7 Rule)

For any Normal distribution:

  • Approximately 68% of the data falls within 1 standard deviation of the mean: \(P(\mu-\sigma \le X \le \mu+\sigma) \approx 0.68\).
  • Approximately 95% of the data falls within 2 standard deviations of the mean: \(P(\mu-2\sigma \le X \le \mu+2\sigma) \approx 0.95\).
  • Approximately 99.7% of the data falls within 3 standard deviations of the mean: \(P(\mu-3\sigma \le X \le \mu+3\sigma) \approx 0.997\).
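These three percentages can be recovered from the standard normal CDF, since \(P(\mu - k\sigma \le X \le \mu + k\sigma) = \Phi(k) - \Phi(-k)\) for any \(N(\mu, \sigma^2)\). A quick check in Python (the helper `phi` is our own name for \(\Phi\)):

```python
import math

def phi(z):
    """Standard normal CDF Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(mu - k*sigma <= X <= mu + k*sigma) = Phi(k) - Phi(-k)
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))  # ~0.6827, ~0.9545, ~0.9973
```

Note the rule's round numbers are approximations; the exact values are closer to 68.27%, 95.45%, and 99.73%.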

7. Expected Value (Mean) & Variance

  • By definition, the parameters directly give the mean and variance: $$ E[X] = \mu $$ $$ Var(X) = \sigma^2 $$
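A Monte Carlo sketch of this fact: draw many samples from \(N(\mu, \sigma^2)\) and confirm that the sample mean and sample variance land near the parameters (the specific values \(\mu = 10\), \(\sigma = 3\) are illustrative):

```python
import random

random.seed(0)  # reproducible illustration
mu, sigma, n = 10.0, 3.0, 200_000

# random.gauss takes the standard deviation sigma, not the variance
xs = [random.gauss(mu, sigma) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
print(round(mean, 2), round(var, 2))  # close to mu = 10 and sigma^2 = 9
```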

Connections to Other Topics

  • Central Limit Theorem (CLT): States that the sum (or average) of a large number of independent, identically distributed random variables will be approximately Normally distributed, regardless of the original distribution. This is why the Normal distribution appears so often.
  • Approximation to Binomial: For large \(n\), the Binomial distribution \(B(n, p)\) is well approximated by \(N(np, np(1-p))\), using a continuity correction. A common rule of thumb is that both \(np \ge 5\) and \(n(1-p) \ge 5\).
  • Approximation to Poisson: The Normal distribution approximates the Poisson distribution \(\text{Poi}(\lambda)\) for large \(\lambda\).
  • Foundation for many Inferential Statistics methods: t-tests, ANOVA, Hypothesis testing for means, confidence intervals, Linear Regression assumptions often involve normality of errors.
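The Binomial approximation above can be illustrated numerically: compare the exact Binomial CDF with \(\Phi\big((k + 0.5 - np)/\sqrt{np(1-p)}\big)\), where the \(+0.5\) is the continuity correction. A sketch (the parameter values are illustrative, and `phi` / `binom_cdf` are our own helper names):

```python
import math

def phi(z):
    """Standard normal CDF Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Approximate Binomial(100, 0.5) by N(np, np(1-p)) with continuity correction
n, p, k = 100, 0.5, 55
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
exact = binom_cdf(k, n, p)
approx = phi((k + 0.5 - mu) / sigma)   # +0.5 is the continuity correction
print(round(exact, 4), round(approx, 4))  # the two values agree closely
```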

Summary

  • The most common continuous distribution, the "bell curve".
  • Symmetric around the mean \(\mu\).
  • Parameters: \(\mu\) (mean, location), \(\sigma^2\) (variance, spread).
  • PDF: \(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\). No simple CDF formula.
  • Standard Normal (Z): \(N(0, 1)\). Use Z-scores \(Z = \frac{X - \mu}{\sigma}\) to find probabilities for any \(N(\mu, \sigma^2)\).
  • Empirical Rule: 68% within \(\mu \pm 1\sigma\), 95% within \(\mu \pm 2\sigma\), 99.7% within \(\mu \pm 3\sigma\).
  • Central Limit Theorem explains its prevalence. Foundational for much of statistics.
