Think of it as a probability distribution over probabilities. It is often used to model uncertainty about a parameter that is itself a probability, such as the bias of a coin \(p\) or the conversion rate of a webpage.
Its shape is very flexible, controlled by two positive shape parameters, allowing it to model various forms of belief about a probability.
A continuous random variable \(X\) follows a Beta distribution with positive shape parameters \(\alpha\) and \(\beta\) if its Probability Density Function (PDF) on the interval \((0, 1)\) is:
$$ f(x | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1} (1-x)^{\beta-1} \quad \text{for } 0 < x < 1 $$
Where:
\(x\) is a value between 0 and 1 (representing a probability or proportion).
\(\alpha > 0\) is the first shape parameter.
\(\beta > 0\) is the second shape parameter.
\(B(\alpha, \beta)\) is the Beta function, which acts as the normalization constant.
It is defined in terms of the [[10_Gamma_Distribution|Gamma function]] \(\Gamma\):
$$ B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} $$
Dividing by \(B(\alpha, \beta)\) ensures that the Beta PDF integrates to 1 over \((0, 1)\).
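As a quick numerical check of the definition above, here is a minimal sketch that evaluates the PDF using only Python's standard library. The function names (`beta_function`, `beta_pdf`, `integrate_midpoint`) and the example parameters \(\alpha = 2, \beta = 5\) are illustrative choices, not part of any particular library.

```python
import math

def beta_function(alpha, beta):
    """B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta).

    Computed through log-Gamma to avoid overflow for large parameters.
    """
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp(log_b)

def beta_pdf(x, alpha, beta):
    """Beta density at x; zero outside the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_function(alpha, beta)

def integrate_midpoint(f, n=100_000):
    """Approximate the integral of f over (0, 1) with the midpoint rule."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

# Illustrative parameters: alpha = 2, beta = 5.
area = integrate_midpoint(lambda x: beta_pdf(x, 2.0, 5.0))
print(f"B(2, 5) = {beta_function(2.0, 5.0):.6f}")
print(f"area under PDF = {area:.6f}")
```

The area should come out very close to 1, which is exactly the job of the normalization constant \(B(\alpha, \beta)\).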
2. Parameters \(\alpha\) and \(\beta\) (Shape Parameters)
These parameters control the shape of the distribution:
\(\alpha = \beta = 1\): Reduces to the Uniform(0, 1) distribution (all probabilities equally likely).
\(\alpha > 1, \beta > 1\): Unimodal distribution (single peak between 0 and 1), with the peak at \(\frac{\alpha - 1}{\alpha + \beta - 2}\). If \(\alpha = \beta\), the peak is at \(0.5\). If \(\alpha > \beta\), the peak is closer to 1. If \(\beta > \alpha\), the peak is closer to 0.
\(\alpha < 1, \beta < 1\): U-shaped distribution (probabilities near 0 and 1 are more likely).
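The shape cases listed above can be explored numerically. The sketch below evaluates the density at three points for a few hand-picked parameter pairs and computes the interior mode \(\frac{\alpha - 1}{\alpha + \beta - 2}\) for the unimodal case. The helper names and the parameter pairs are assumptions made for illustration.

```python
import math

def beta_pdf(x, a, b):
    """Beta density for 0 < x < 1, computed in log space."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)

def beta_mode(a, b):
    """Interior peak of the density; only defined here for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("interior mode requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

grid = [0.05, 0.5, 0.95]  # near 0, the middle, near 1
cases = [(1, 1), (5, 5), (5, 2), (2, 5), (0.5, 0.5)]
for a, b in cases:
    densities = [round(beta_pdf(x, a, b), 3) for x in grid]
    print(f"Beta({a}, {b}): density at {grid} = {densities}")

print("mode of Beta(5, 2):", beta_mode(5, 2))
print("mode of Beta(2, 5):", beta_mode(2, 5))
```

For Beta(1, 1) the density is flat, for Beta(5, 5) it is highest in the middle, and for Beta(0.5, 0.5) it is higher near the edges than at 0.5, matching the uniform, unimodal, and U-shaped cases.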
The expected value (average probability) is:
$$ E[X] = \frac{\alpha}{\alpha + \beta} $$
Intuition: \(\alpha\) can be thought of as related to the count of "successes" and \(\beta\) to the count of "failures". The mean is like the proportion of successes.
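One way to sanity-check the mean formula is to compare it with the average of simulated draws. The sketch below uses `random.betavariate` from Python's standard library. The function names, the seed, the sample size, and the parameters \(\alpha = 2, \beta = 5\) are arbitrary illustrative choices.

```python
import random

def beta_mean(alpha, beta):
    """Closed-form expected value alpha / (alpha + beta)."""
    return alpha / (alpha + beta)

def simulated_mean(alpha, beta, n_samples=100_000, seed=0):
    """Average of n_samples draws from Beta(alpha, beta), with a fixed seed."""
    rng = random.Random(seed)
    total = sum(rng.betavariate(alpha, beta) for _ in range(n_samples))
    return total / n_samples

alpha, beta = 2.0, 5.0
print(f"formula:    {beta_mean(alpha, beta):.4f}")
print(f"simulation: {simulated_mean(alpha, beta):.4f}")
```

The two numbers should agree to roughly two decimal places; the simulated value fluctuates slightly with the seed and sample size.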
The Beta distribution is the conjugate prior for the Bernoulli, Binomial, and [[08_Geometric_Distribution|Geometric]] distributions (in terms of the success probability parameter \(p\)).
What this means: If your prior belief about a probability parameter \(p\) can be represented by a Beta(\(\alpha_{prior}, \beta_{prior}\)) distribution, and you observe new data (e.g., \(k\) successes in \(n\) Binomial trials), then your updated belief (the posterior distribution) for \(p\) is also a Beta distribution:
$$ p | \text{data} \sim \text{Beta}(\alpha_{prior} + k, \beta_{prior} + n - k) $$
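To see why the posterior stays in the Beta family, here is a short sketch of the derivation for the Binomial case. Factors that do not depend on \(p\) (the binomial coefficient and \(1 / B(\alpha_{prior}, \beta_{prior})\)) are dropped, since they only affect normalization:

$$ f(p | \text{data}) \propto \underbrace{p^{k} (1-p)^{n-k}}_{\text{likelihood}} \cdot \underbrace{p^{\alpha_{prior}-1} (1-p)^{\beta_{prior}-1}}_{\text{prior}} = p^{(\alpha_{prior} + k) - 1} (1-p)^{(\beta_{prior} + n - k) - 1} $$

The right-hand side has the same form as the Beta PDF with parameters \(\alpha_{prior} + k\) and \(\beta_{prior} + n - k\), so normalizing it gives exactly that Beta distribution.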
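This makes Bayesian updates mathematically convenient: the posterior is obtained by adding the observed counts to the prior parameters. Relative to a uniform Beta(1, 1) prior, \(\alpha\) acts like a "prior count of successes + 1" and \(\beta\) like a "prior count of failures + 1".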
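The update rule is simple enough to sketch in a few lines of Python. The function `update_beta` and the example data (7 successes in 10 trials, starting from a uniform Beta(1, 1) prior) are hypothetical and chosen only for illustration.

```python
def update_beta(alpha_prior, beta_prior, successes, trials):
    """Return the posterior (alpha, beta) after observing Binomial data."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    failures = trials - successes
    return alpha_prior + successes, beta_prior + failures

def beta_mean(alpha, beta):
    """Expected value alpha / (alpha + beta)."""
    return alpha / (alpha + beta)

# Hypothetical example: uniform prior, then 7 successes in 10 trials.
alpha_post, beta_post = update_beta(1, 1, successes=7, trials=10)
print(f"posterior: Beta({alpha_post}, {beta_post})")  # Beta(8, 4)
print(f"posterior mean: {beta_mean(alpha_post, beta_post):.4f}")

# Updating in two batches gives the same posterior as one combined batch.
step1 = update_beta(1, 1, successes=3, trials=4)
step2 = update_beta(*step1, successes=4, trials=6)
print(step2 == (alpha_post, beta_post))
```

Because the update only adds counts, data can be processed all at once or one batch at a time with the same result, which is part of what makes the conjugate form convenient.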