CS Wiki | Cedric Schwyter

Probability Theory

Definitions and Basic Terms

💡 A discrete probability space is defined by the sample space (possibility space) $\Omega = \{\omega_1, \omega_2, \dots\}$ of elementary events. An (elementary) probability $\Pr[\omega_i]$ is assigned to every elementary event $\omega_i$, where we require $0 \le \Pr[\omega_i] \le 1$ and

$$\sum_{\omega \in \Omega} \Pr[\omega] = 1.$$

A set $E \subseteq \Omega$ is called an event. The probability $\Pr[E]$ of an event $E$ is defined by

$$\Pr[E] := \sum_{\omega \in E} \Pr[\omega].$$

If $E$ is an event, then we define $\bar{E} := \Omega \setminus E$ as the complementary event of $E$.

📌 For events $A, B$ the following hold:

  1. If $A \subseteq B$, then it follows that $\Pr[A] \le \Pr[B]$.

📖 (Addition theorem).

If the events $A_1, \dots, A_n$ are pairwise disjoint (i.e., if it holds for all pairs $i \ne j$ that $A_i \cap A_j = \emptyset$), then it holds that

$$\Pr\left[\bigcup_{i=1}^{n} A_i\right] = \sum_{i=1}^{n} \Pr[A_i].$$

For an infinite set of pairwise disjoint events $A_1, A_2, \dots$ it holds analogously that

$$\Pr\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \Pr[A_i].$$

📖 (The sieve formula, inclusion-exclusion principle).

For events $A_1, \dots, A_n$ the following holds:

$$\Pr\left[\bigcup_{i=1}^{n} A_i\right] = \sum_{\ell=1}^{n} (-1)^{\ell+1} \sum_{1 \le i_1 < \dots < i_\ell \le n} \Pr[A_{i_1} \cap \dots \cap A_{i_\ell}].$$

📎 (Boole’s Inequality, Union Bound).

For events $A_1, \dots, A_n$ the following holds:

$$\Pr\left[\bigcup_{i=1}^{n} A_i\right] \le \sum_{i=1}^{n} \Pr[A_i].$$

Analogously, the following holds for an infinite sequence of events $A_1, A_2, \dots$:

$$\Pr\left[\bigcup_{i=1}^{\infty} A_i\right] \le \sum_{i=1}^{\infty} \Pr[A_i].$$

💡 (Principle of Laplace).

If nothing indicates otherwise, we assume that all elementary events are equally likely.

Therefore, $\Pr[\omega] = \frac{1}{|\Omega|}$ for all elementary events $\omega \in \Omega$.

It immediately follows for an arbitrary event $E$ that

$$\Pr[E] = \frac{|E|}{|\Omega|}.$$

We say the experiment modeled on $\Omega$ is uniformly distributed or equally distributed.

💡 In an information-theoretical sense, such a probability space ($\Pr[\omega] = \frac{1}{|\Omega|}$ for all $\omega \in \Omega$) has the largest possible entropy. Every deviation from the uniform distribution requires that we put more information into the model (and thereby decrease entropy).
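
As a small sanity check of $\Pr[E] = |E|/|\Omega|$, here is a sketch in Python; the two-dice sample space and the event chosen are illustrative assumptions, not from the original text:

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair dice: 36 equally likely elementary events
# (principle of Laplace).
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    """Pr[E] = |E| / |Omega| for a uniformly distributed space."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

# Event "the two dice sum to 7": 6 of the 36 outcomes.
p_seven = pr(lambda w: w[0] + w[1] == 7)
print(p_seven)  # 1/6
```

Exact `Fraction` arithmetic avoids floating-point noise when comparing probabilities.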

Conditional Probability

💡 Let $A$ and $B$ be events with $\Pr[B] > 0$. The conditional probability $\Pr[A \mid B]$ of $A$ given $B$ is defined by

$$\Pr[A \mid B] := \frac{\Pr[A \cap B]}{\Pr[B]}.$$

💡 The conditional probabilities of the form $\Pr[\,\cdot \mid B]$ form a new probability space over $\Omega$ for an arbitrary event $B \subseteq \Omega$ with $\Pr[B] > 0$.

The probabilities of the elementary events $\omega_i$ are calculated through $\Pr[\omega_i \mid B]$. Then

$$\sum_{\omega \in \Omega} \Pr[\omega \mid B] = 1,$$

so the definition of a discrete probability space is still fulfilled, and therefore all rules for probabilities also apply to conditional probabilities.
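
A minimal sketch of the definition $\Pr[A \mid B] = \Pr[A \cap B] / \Pr[B]$, again on the assumed two-dice space:

```python
from fractions import Fraction
from itertools import product

# Two fair dice, uniformly distributed sample space.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def pr_given(a, b):
    """Conditional probability Pr[A | B] = Pr[A and B] / Pr[B]; needs Pr[B] > 0."""
    return pr(lambda w: a(w) and b(w)) / pr(b)

# Pr[sum = 8 | first die shows 6]: of the 6 outcomes with first die 6,
# exactly one, namely (6, 2), has sum 8.
p = pr_given(lambda w: w[0] + w[1] == 8, lambda w: w[0] == 6)
print(p)  # 1/6
```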

📖 (Multiplication theorem).

Let the events $A_1, \dots, A_n$ be given. If $\Pr[A_1 \cap \dots \cap A_n] > 0$, then the following holds:

$$\Pr[A_1 \cap \dots \cap A_n] = \Pr[A_1] \cdot \Pr[A_2 \mid A_1] \cdot \Pr[A_3 \mid A_1 \cap A_2] \cdots \Pr[A_n \mid A_1 \cap \dots \cap A_{n-1}].$$

📖 (Law of total probability).

Let the events $A_1, \dots, A_n$ be pairwise disjoint and let $B \subseteq A_1 \cup \dots \cup A_n$. Then it follows that

$$\Pr[B] = \sum_{i=1}^{n} \Pr[B \mid A_i] \cdot \Pr[A_i].$$

Analogously, for pairwise disjoint events $A_1, A_2, \dots$ with $B \subseteq \bigcup_{i=1}^{\infty} A_i$ it follows that

$$\Pr[B] = \sum_{i=1}^{\infty} \Pr[B \mid A_i] \cdot \Pr[A_i].$$

📖 (Bayes’ theorem).

Let the events $A_1, \dots, A_n$ be pairwise disjoint with $\Pr[A_j] > 0$ for all $j$. Furthermore, let $B \subseteq A_1 \cup \dots \cup A_n$ be an event with $\Pr[B] > 0$. Then it holds for an arbitrary $i = 1, \dots, n$ that

$$\Pr[A_i \mid B] = \frac{\Pr[A_i \cap B]}{\Pr[B]} = \frac{\Pr[B \mid A_i] \cdot \Pr[A_i]}{\sum_{j=1}^{n} \Pr[B \mid A_j] \cdot \Pr[A_j]}.$$

Analogously, for pairwise disjoint events $A_1, A_2, \dots$ with $B \subseteq \bigcup_{i=1}^{\infty} A_i$ it holds that

$$\Pr[A_i \mid B] = \frac{\Pr[B \mid A_i] \cdot \Pr[A_i]}{\sum_{j=1}^{\infty} \Pr[B \mid A_j] \cdot \Pr[A_j]}.$$
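
A numeric sketch of the law of total probability and Bayes’ theorem; the disease/test numbers below are invented purely for illustration:

```python
from fractions import Fraction

# Hypothetical setting: disease D with prevalence 1 %, a test T with
# sensitivity Pr[T | D] = 99 % and false-positive rate Pr[T | not D] = 5 %.
p_d = Fraction(1, 100)
p_t_d = Fraction(99, 100)
p_t_nd = Fraction(5, 100)

# Law of total probability with the partition {D, not D}:
# Pr[T] = Pr[T | D] Pr[D] + Pr[T | not D] Pr[not D]
p_t = p_t_d * p_d + p_t_nd * (1 - p_d)

# Bayes' theorem: Pr[D | T] = Pr[T | D] Pr[D] / Pr[T]
p_d_t = p_t_d * p_d / p_t
print(p_d_t)  # 1/6
```

Despite the accurate test, a positive result only implies the disease with probability $1/6$, because the disease is rare.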

Independence

💡 The events $A$ and $B$ are called independent if

$$\Pr[A \cap B] = \Pr[A] \cdot \Pr[B].$$

💡 The events $A_1, \dots, A_n$ are called independent if it holds for all subsets $I \subseteq \{1, \dots, n\}$ with $I = \{i_1, \dots, i_k\}$ that

$$\Pr[A_{i_1} \cap \dots \cap A_{i_k}] = \Pr[A_{i_1}] \cdots \Pr[A_{i_k}].$$

An infinite family of events $A_i$ with $i \in \mathbb{N}$ is called independent if the above condition is met for all finite subsets $I \subseteq \mathbb{N}$.

📌 The events $A_1, \dots, A_n$ are independent if and only if it holds for all $(s_1, \dots, s_n) \in \{0, 1\}^n$ that

$$\Pr[A_1^{s_1} \cap \dots \cap A_n^{s_n}] = \Pr[A_1^{s_1}] \cdots \Pr[A_n^{s_n}],$$

where $A_i^0 = \bar{A_i}$ and $A_i^1 = A_i$.

📌 Let $A$, $B$ and $C$ be independent events. Then $A \cap B$ and $C$, respectively $A \cup B$ and $C$, are also independent.
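
The definition $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$ can be checked exhaustively on a small space; the two-dice events below are illustrative assumptions:

```python
from fractions import Fraction
from itertools import product

# Two fair dice; test the independence condition directly.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

a = lambda w: w[0] % 2 == 0   # first die is even
b = lambda w: w[1] <= 2       # second die shows 1 or 2

# Events that only constrain different dice are independent:
independent = pr(lambda w: a(w) and b(w)) == pr(a) * pr(b)

c = lambda w: w[0] + w[1] == 6  # the two dice sum to 6

# A and C are dependent: Pr[A and C] = 1/18 but Pr[A] * Pr[C] = 5/72.
dependent = pr(lambda w: a(w) and c(w)) != pr(a) * pr(c)
print(independent, dependent)  # True True
```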

Random Variables

💡 A random variable is a mapping $X : \Omega \to \mathbb{R}$, where $\Omega$ is the possibility space of a probability space.

💡 In discrete probability spaces the codomain of a random variable

$$W_X := X(\Omega) = \{x \in \mathbb{R} \mid \exists \omega \in \Omega : X(\omega) = x\}$$

is in all cases finite or countably infinite, depending on whether $\Omega$ is finite or countably infinite.

💡 When studying a random variable $X$ one is interested in the probabilities with which $X$ assumes a specific value. For $x \in W_X$ we regard the event $A_x := \{\omega \in \Omega \mid X(\omega) = x\}$ and abbreviate $\Pr[A_x]$ as $\Pr[X = x]$; events such as $X \le x$ are treated analogously.

Therefore, we can define

$$\Pr[X = x] = \sum_{\omega \in \Omega : X(\omega) = x} \Pr[\omega].$$

💡 The function

$$f_X : \mathbb{R} \to [0, 1], \quad x \mapsto \Pr[X = x]$$

is called the density (function) of $X$.

💡 The function

$$F_X : \mathbb{R} \to [0, 1], \quad x \mapsto \Pr[X \le x] = \sum_{x' \in W_X : x' \le x} \Pr[X = x']$$

is called the distribution (function) of $X$.

Expected Value

💡 For a random variable $X$ we define the expected value $\mathbb{E}[X]$ as

$$\mathbb{E}[X] := \sum_{x \in W_X} x \cdot \Pr[X = x],$$

if that sum converges absolutely. Otherwise the expected value is said to be undefined.

📌 If $X$ is a random variable, then the following holds:

$$\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega].$$

📖 Let $X$ be a random variable with $W_X \subseteq \mathbb{N}_0$. Then it holds that

$$\mathbb{E}[X] = \sum_{i=1}^{\infty} \Pr[X \ge i].$$
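
The tail-sum formula can be verified exactly for a small example; a fair die is assumed here for illustration:

```python
from fractions import Fraction

# Fair six-sided die X with W_X = {1, ..., 6}, a subset of the naturals.
density = {k: Fraction(1, 6) for k in range(1, 7)}

# Direct definition: E[X] = sum over x of x * Pr[X = x]
e_direct = sum(k * p for k, p in density.items())

# Tail-sum formula: E[X] = sum over i >= 1 of Pr[X >= i]
e_tail = sum(sum(p for k, p in density.items() if k >= i) for i in range(1, 7))

assert e_direct == e_tail == Fraction(7, 2)
```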

💡 Conditional Random Variables

Let $X$ be a random variable and $A$ an event with $\Pr[A] > 0$. By $X \mid A$ we denote that we calculate the probabilities with which the random variable $X$ assumes specific values with respect to the probabilities conditioned on $A$. It thus holds that

$$\mathbb{E}[X \mid A] = \sum_{x \in W_X} x \cdot \Pr[X = x \mid A].$$

📖 Let $X$ be a random variable. For pairwise disjoint events $A_1, \dots, A_n$ with $A_1 \cup \dots \cup A_n = \Omega$ and $\Pr[A_i] > 0$ for all $i$ it holds that

$$\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X \mid A_i] \cdot \Pr[A_i].$$

For pairwise disjoint events $A_1, A_2, \dots$ with $\bigcup_{i=1}^{\infty} A_i = \Omega$ and $\Pr[A_i] > 0$ for all $i$ it holds analogously that

$$\mathbb{E}[X] = \sum_{i=1}^{\infty} \mathbb{E}[X \mid A_i] \cdot \Pr[A_i].$$

💡 Linearity of the Expected Value

Assume we have defined $n$ random variables $X_1, \dots, X_n$ over the same probability space.

For an $\omega \in \Omega$ we thus receive $n$ real numbers $X_1(\omega), \dots, X_n(\omega)$. When we define a function $f : \mathbb{R}^n \to \mathbb{R}$, we immediately see that the composition $X := f(X_1, \dots, X_n)$ in turn is also a random variable, for it holds that:

$$X(\omega) = f(X_1(\omega), \dots, X_n(\omega)).$$

This holds for arbitrary functions $f$, in particular for affine linear functions

$$f(x_1, \dots, x_n) = a_1 x_1 + \dots + a_n x_n + b,$$

where $a_1, \dots, a_n, b$ are arbitrary real numbers. In this case we usually denote the random variable explicitly as

$$X = a_1 X_1 + \dots + a_n X_n + b.$$

📖 (Linearity of the Expected Value).

For random variables $X_1, \dots, X_n$ and $X := a_1 X_1 + \dots + a_n X_n + b$ with $a_1, \dots, a_n, b \in \mathbb{R}$ it holds that

$$\mathbb{E}[X] = a_1 \mathbb{E}[X_1] + \dots + a_n \mathbb{E}[X_n] + b.$$

💡 For an event $A$ the corresponding indicator variable $I_A$ is defined by:

$$I_A(\omega) := \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{otherwise}. \end{cases}$$

For the expected value of $I_A$ it holds that: $\mathbb{E}[I_A] = 1 \cdot \Pr[A] + 0 \cdot \Pr[\bar{A}] = \Pr[A]$.
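
Indicator variables plus linearity are the standard counting trick; a small sketch, where the fixed-point example is an illustration and not from the original text:

```python
from fractions import Fraction
from itertools import permutations

# X = number of fixed points of a uniformly random permutation of n elements.
# With indicators I_i = [position i is a fixed point], E[I_i] = 1/n, so by
# linearity E[X] = n * (1/n) = 1, independent of n. Verified by enumeration:
n = 4
perms = list(permutations(range(n)))
e_fixed = Fraction(sum(sum(1 for i in range(n) if p[i] == i) for p in perms),
                   len(perms))
assert e_fixed == 1
```

Note that the indicators here are not independent; linearity holds regardless.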

Variance

💡 For a random variable $X$ with $\mu := \mathbb{E}[X]$ we define the variance $\mathrm{Var}[X]$ as

$$\mathrm{Var}[X] := \mathbb{E}[(X - \mu)^2] = \sum_{x \in W_X} (x - \mu)^2 \cdot \Pr[X = x].$$

The quantity $\sigma := \sqrt{\mathrm{Var}[X]}$ is called the standard deviation of $X$.

📖 For an arbitrary random variable $X$ it holds that

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2.$$

📖 For an arbitrary random variable $X$ and $a, b \in \mathbb{R}$ it holds that

$$\mathrm{Var}[a \cdot X + b] = a^2 \cdot \mathrm{Var}[X].$$

💡 For a random variable $X$ we call $\mathbb{E}[X^k]$ the $k$-th moment and $\mathbb{E}[(X - \mathbb{E}[X])^k]$ the $k$-th central moment.

The expected value is therefore identical to the first moment, and the variance is identical to the second central moment.
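
The identity $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ can be checked exactly; a fair die is assumed for illustration:

```python
from fractions import Fraction

# Fair die: compare the defining formula for the variance with the
# moment identity Var[X] = E[X^2] - E[X]^2.
density = {k: Fraction(1, 6) for k in range(1, 7)}
e1 = sum(k * p for k, p in density.items())                  # first moment E[X]
e2 = sum(k**2 * p for k, p in density.items())               # second moment E[X^2]
var_def = sum((k - e1)**2 * p for k, p in density.items())   # E[(X - E[X])^2]
assert var_def == e2 - e1**2 == Fraction(35, 12)
```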

Important Discrete Probability Distributions

💡 Recall the probability density function $f_X$ and the probability distribution function $F_X$:

$$f_X(x) = \Pr[X = x], \qquad F_X(x) = \Pr[X \le x].$$

Bernoulli Distribution

💡 A random variable $X$ with $W_X = \{0, 1\}$ and density

$$f_X(x) = \begin{cases} p & \text{for } x = 1, \\ 1 - p & \text{for } x = 0 \end{cases}$$

is called Bernoulli distributed. The parameter $p$ is called the probability of success of the Bernoulli distribution.

If a random variable $X$ is Bernoulli distributed with parameter $p$, then it is denoted by

$$X \sim \mathrm{Bernoulli}(p).$$

For a Bernoulli distributed random variable $X$ the following hold:

$$\mathbb{E}[X] = p \qquad \text{and} \qquad \mathrm{Var}[X] = p(1 - p).$$

Binomial Distribution

💡 A random variable $X$ with $W_X = \{0, 1, \dots, n\}$ and density

$$f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x \in W_X$$

is called binomially distributed. The parameter $n$ is called the number of trials, the parameter $p$ is called the probability of success of the binomial distribution.

If a random variable $X$ is binomially distributed with parameters $n$ and $p$, then it is denoted by

$$X \sim \mathrm{Bin}(n, p).$$

For a binomially distributed random variable $X$ the following hold:

$$\mathbb{E}[X] = np \qquad \text{and} \qquad \mathrm{Var}[X] = np(1 - p).$$

Geometric Distribution

💡 A random variable $X$ with $W_X = \mathbb{N}$ and density

$$f_X(i) = p (1 - p)^{i - 1} \quad \text{for } i \in \mathbb{N}$$

is called geometrically distributed. The parameter $p$ is called the probability of success of the geometric distribution.

If a random variable $X$ is geometrically distributed with parameter $p$, then it is denoted by

$$X \sim \mathrm{Geo}(p).$$

For a geometrically distributed random variable $X$ the following hold:

$$\mathbb{E}[X] = \frac{1}{p} \qquad \text{and} \qquad \mathrm{Var}[X] = \frac{1 - p}{p^2}.$$

📖 (Memorylessness). If $X \sim \mathrm{Geo}(p)$, then for all $x, y \in \mathbb{N}$ the following holds:

$$\Pr[X > y + x \mid X > x] = \Pr[X > y].$$
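
Memorylessness follows from $\Pr[X > k] = (1 - p)^k$ ($k$ initial failures); a symbolic sketch with an assumed $p = 1/3$:

```python
from fractions import Fraction

# X ~ Geo(p): the event X > k means k initial failures, so Pr[X > k] = (1-p)^k.
p = Fraction(1, 3)

def tail(k):
    return (1 - p) ** k

# Memorylessness: Pr[X > y + x | X > x] = Pr[X > y + x] / Pr[X > x] = Pr[X > y]
x, y = 4, 7
assert tail(y + x) / tail(x) == tail(y)
```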

Waiting for the $n$-th Success - Negative Binomial Distribution

💡 Let $Z$ be the random variable that counts how often we have to repeat an experiment with probability of success $p$ until the $n$-th success. For $n = 1$, $Z \sim \mathrm{Geo}(p)$. For general $n$, $Z$ is called negatively binomially distributed with order $n$.

The density of $Z$ is

$$f_Z(z) = \binom{z - 1}{n - 1} p^n (1 - p)^{z - n} \quad \text{for } z \ge n.$$

Let $X_i$ denote the number of experiments strictly after the $(i-1)$-st success up until (and including) the $i$-th success. Then each of the $X_i$ is geometrically distributed with parameter $p$, and $Z = X_1 + \dots + X_n$.

If a random variable $Z$ is negatively binomially distributed with parameters $n$ and $p$, then it is denoted by

$$Z \sim \mathrm{NegBin}(n, p).$$

Thus, by linearity of the expected value, it holds for the expected value that

$$\mathbb{E}[Z] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{n}{p}.$$

Application: Coupon-Collector Problem

Coupon-Collector Problem
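
The classical result for the coupon-collector problem, $\mathbb{E}[X] = n \cdot H_n$ for $n$ distinct coupons, follows from the same phase decomposition into geometric waiting times; a symbolic sketch with an assumed $n = 10$:

```python
from fractions import Fraction

# Coupon collector with n distinct coupons: while i coupons are still missing,
# each draw succeeds with probability i/n, so that phase is Geo(i/n) with
# expectation n/i. Linearity sums the phases: E[X] = sum_i n/i = n * H_n.
n = 10
e_collect = sum(Fraction(n, i) for i in range(1, n + 1))
h_n = sum(Fraction(1, i) for i in range(1, n + 1))
assert e_collect == n * h_n
print(e_collect)  # 7381/252
```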

Poisson Distribution

💡 A random variable $X$ with $W_X = \mathbb{N}_0$ and density

$$f_X(i) = \frac{e^{-\lambda} \lambda^i}{i!} \quad \text{for } i \in \mathbb{N}_0$$

is called Poisson distributed. The parameter $\lambda$ is equal to both the mean and the variance of the Poisson distribution.

If a random variable $X$ is Poisson distributed with parameter $\lambda$, then it is denoted by

$$X \sim \mathrm{Po}(\lambda).$$

For a Poisson distributed random variable $X$ the following hold:

$$\mathbb{E}[X] = \lambda \qquad \text{and} \qquad \mathrm{Var}[X] = \lambda.$$

Poisson Distribution as Limit of Binomial Distribution

💡 The binomial distribution $\mathrm{Bin}(n, \frac{\lambda}{n})$ converges towards the Poisson distribution $\mathrm{Po}(\lambda)$ for $n \to \infty$.
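
The pointwise convergence can be observed numerically; the values $\lambda = 2$, $k = 3$ and $n = 10000$ are illustrative assumptions:

```python
import math

# Compare the pmf of Bin(n, lambda/n) at a fixed k with the Po(lambda) pmf
# for a large n: the two should be close.
lam, k = 2.0, 3

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

poisson_pmf = math.exp(-lam) * lam**k / math.factorial(k)
approx = binom_pmf(10_000, lam / 10_000, k)
assert abs(approx - poisson_pmf) < 1e-3
```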

Multiple Random Variables

💡 We are interested in random variables $X$ and $Y$ and probabilities of the form

$$\Pr[X = x, Y = y] = \Pr[\{\omega \in \Omega \mid X(\omega) = x \text{ and } Y(\omega) = y\}].$$

💡 The function

$$f_{X,Y}(x, y) := \Pr[X = x, Y = y]$$

is called the joint density of the random variables $X$ and $Y$.

If the joint density is given, one can extract the densities of the individual random variables using

$$f_X(x) = \sum_{y \in W_Y} f_{X,Y}(x, y) \qquad \text{and} \qquad f_Y(y) = \sum_{x \in W_X} f_{X,Y}(x, y).$$

The functions $f_X$ and $f_Y$ are called marginal densities.

💡 The function

$$F_{X,Y}(x, y) := \Pr[X \le x, Y \le y] = \sum_{x' \le x} \sum_{y' \le y} f_{X,Y}(x', y')$$

is called the joint distribution of the random variables $X$ and $Y$.

If the joint distribution is given, one can extract the distributions of the individual random variables using

$$F_X(x) = \sum_{x' \le x} \sum_{y \in W_Y} f_{X,Y}(x', y) \qquad \text{and} \qquad F_Y(y) = \sum_{y' \le y} \sum_{x \in W_X} f_{X,Y}(x, y').$$

The functions $F_X$ and $F_Y$ are called marginal distributions.

Independence of Random Variables

💡 Random variables $X_1, \dots, X_n$ are called independent, if and only if it holds for all $(x_1, \dots, x_n) \in W_{X_1} \times \dots \times W_{X_n}$ that

$$\Pr[X_1 = x_1, \dots, X_n = x_n] = \Pr[X_1 = x_1] \cdots \Pr[X_n = x_n].$$

📌 For independent random variables $X_1, \dots, X_n$ and arbitrary sets $S_1, \dots, S_n$ with $S_i \subseteq W_{X_i}$ it holds that

$$\Pr[X_1 \in S_1, \dots, X_n \in S_n] = \Pr[X_1 \in S_1] \cdots \Pr[X_n \in S_n].$$

📎 For independent random variables $X_1, \dots, X_n$ and an index set $\{i_1, \dots, i_k\} \subseteq \{1, \dots, n\}$, the random variables $X_{i_1}, \dots, X_{i_k}$ are also independent.

📖 Let $f_1, \dots, f_n$ be real-valued functions ($f_i : \mathbb{R} \to \mathbb{R}$ for $i = 1, \dots, n$). If the random variables $X_1, \dots, X_n$ are independent, then so are $f_1(X_1), \dots, f_n(X_n)$.

Composite Random Variables

📖 For two independent random variables $X$ and $Y$ let $Z := X + Y$. It holds that

$$f_Z(z) = \sum_{x \in W_X} f_X(x) \cdot f_Y(z - x).$$

💡 The expression $f_Z = f_X * f_Y$ is called convolution, analogously to the corresponding operation on the coefficients of power series.
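
The convolution formula can be evaluated directly; the two-dice example is an illustrative assumption:

```python
from fractions import Fraction

# Convolution of two fair-die densities: f_Z(z) = sum_x f_X(x) * f_Y(z - x).
fx = {k: Fraction(1, 6) for k in range(1, 7)}

fz = {}
for x, px in fx.items():
    for y, py in fx.items():
        fz[x + y] = fz.get(x + y, Fraction(0)) + px * py

# The density of the sum of two dice is triangular, peaking at 7.
assert fz[7] == Fraction(1, 6)
assert sum(fz.values()) == 1
```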

Moments of Composite Random Variables

📖 (Linearity of the Expected Value).

For random variables $X_1, \dots, X_n$ and $X := a_1 X_1 + \dots + a_n X_n$ with $a_1, \dots, a_n \in \mathbb{R}$ it holds that

$$\mathbb{E}[X] = a_1 \mathbb{E}[X_1] + \dots + a_n \mathbb{E}[X_n].$$

📖 (Multiplicativity of the Expected Value).

For independent random variables $X_1, \dots, X_n$ it holds that

$$\mathbb{E}[X_1 \cdots X_n] = \mathbb{E}[X_1] \cdots \mathbb{E}[X_n].$$

📖 For independent random variables $X_1, \dots, X_n$ and $X := X_1 + \dots + X_n$ it holds that

$$\mathrm{Var}[X] = \mathrm{Var}[X_1] + \dots + \mathrm{Var}[X_n].$$

Wald’s Identity

📖 (Wald’s Identity).

Let $N$ and $X$ be two independent random variables, where $W_N \subseteq \mathbb{N}_0$ holds for the codomain of $N$. Furthermore, let

$$Z := \sum_{i=1}^{N} X_i,$$

where $X_1, X_2, \dots$ are independent copies of $X$. Then the following holds:

$$\mathbb{E}[Z] = \mathbb{E}[N] \cdot \mathbb{E}[X].$$
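
Wald's identity can be verified exactly via the law of total expectation; the concrete choices $N \sim$ fair die and $X_i \sim \mathrm{Bernoulli}(1/2)$ are illustrative assumptions:

```python
from fractions import Fraction

# Wald's identity E[Z] = E[N] * E[X] for Z = X_1 + ... + X_N,
# with N independent of the i.i.d. X_i. Here: N ~ fair die, X_i ~ Bernoulli(1/2).
fn = {n: Fraction(1, 6) for n in range(1, 7)}
e_x = Fraction(1, 2)

# Condition on N = n (law of total expectation); since N is independent of the
# X_i, linearity gives E[Z | N = n] = n * E[X].
e_z = sum(p * n * e_x for n, p in fn.items())
e_n = sum(n * p for n, p in fn.items())
assert e_z == e_n * e_x == Fraction(7, 4)
```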

Estimating Probabilities

Inequalities of Markov and Chebyshev

📖 (Markov’s Inequality).

Let $X$ be a random variable that only assumes non-negative values. Then it holds for all $t \in \mathbb{R}$ with $t > 0$ that

$$\Pr[X \ge t] \le \frac{\mathbb{E}[X]}{t}.$$

Or equivalently: $\Pr[X \ge t \cdot \mathbb{E}[X]] \le \frac{1}{t}$.

📖 (Chebyshev’s Inequality).

Let $X$ be a random variable and $t \in \mathbb{R}$ with $t > 0$. It then holds that

$$\Pr[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathrm{Var}[X]}{t^2}.$$

Or equivalently:

$$\Pr[|X - \mathbb{E}[X]| \ge t \cdot \sqrt{\mathrm{Var}[X]}] \le \frac{1}{t^2}.$$
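
Both bounds can be compared against exact tail probabilities on a small example; the fair die and the thresholds are illustrative assumptions:

```python
from fractions import Fraction

# Fair die: evaluate both tails exactly and compare against the bounds.
density = {k: Fraction(1, 6) for k in range(1, 7)}
e = sum(k * p for k, p in density.items())                 # E[X] = 7/2
var = sum((k - e)**2 * p for k, p in density.items())      # Var[X] = 35/12

# Markov (X is non-negative): Pr[X >= t] <= E[X] / t
t = 5
pr_ge = sum(p for k, p in density.items() if k >= t)       # exact tail = 1/3
assert pr_ge <= e / t                                      # 1/3 <= 7/10

# Chebyshev: Pr[|X - E[X]| >= t] <= Var[X] / t^2
t = 2
pr_dev = sum(p for k, p in density.items() if abs(k - e) >= t)  # exact = 1/3
assert pr_dev <= var / t**2                                # 1/3 <= 35/48
```

Both bounds hold but are loose here; Chernoff bounds below are much sharper for sums of many independent variables.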

Chernoff’s Inequality

📖 (Chernoff Bounds).

Let $X_1, \dots, X_n$ be independent Bernoulli distributed random variables with $\Pr[X_i = 1] = p_i$ and $\Pr[X_i = 0] = 1 - p_i$.

For $X := \sum_{i=1}^{n} X_i$ and $\mu := \mathbb{E}[X] = \sum_{i=1}^{n} p_i$ the following hold:

  1. $\Pr[X \ge (1 + \delta)\mu] \le e^{-\frac{\mu \delta^2}{3}}$ for all $0 < \delta \le 1$,
  2. $\Pr[X \le (1 - \delta)\mu] \le e^{-\frac{\mu \delta^2}{2}}$ for all $0 < \delta \le 1$,
  3. $\Pr[X \ge t] \le 2^{-t}$ for $t \ge 2e\mu$.
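
The upper-tail bound can be checked against an exact binomial tail; the parameters $n = 20$, $p = 1/2$, $\delta = 1/2$ are illustrative assumptions:

```python
import math

# X = X_1 + ... + X_n with X_i ~ Bernoulli(1/2) independent, so mu = n/2.
# Compare Pr[X >= (1 + d) mu] against the bound exp(-mu d^2 / 3).
n, p = 20, 0.5
mu = n * p
d = 0.5  # (1 + d) mu = 15

# Exact binomial tail Pr[X >= 15]:
tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(15, n + 1))
bound = math.exp(-mu * d**2 / 3)
assert tail <= bound
```

The exact tail (about $0.02$) is well below the bound (about $0.43$); the exponential decay in $\mu$ is what makes Chernoff bounds so much stronger than Markov or Chebyshev for such sums.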

This project is maintained by D3PSI