EECS 126 - Probability and Random Processes - J. Walrand
Probability Theory is a mathematical model of uncertainty. In these lectures, we introduce examples of uncertainty and we explain how the theory models them.
It is important to appreciate the difference between uncertainty in the physical world and the models of Probability Theory. That difference is similar to that between laws of theoretical physics and the real world: even though mathematicians view the theory as standing on its own, when engineers use it, they see it as a model of the physical world.
Consider flipping a fair coin repeatedly. Designate by 0 and 1 the two possible outcomes of a coin flip (say 0 for heads and 1 for tails). This experiment takes place in the physical world. The outcomes are uncertain. This week, we try to appreciate the probability model of this experiment and to relate it to the physical reality.
In my twenty years of teaching probability models, I have always found that what is most subtle is the interpretation of the models, not the calculations. In particular, this introductory course uses mostly elementary algebra and some simple calculus. However, understanding the meaning of the models, what one is trying to calculate, requires becoming familiar with some new and nontrivial ideas.
When I was a junior in college, one of my professors liked to repeat that “definitions do not require interpretation.” Well, I beg to differ. It is perfectly true that, as a logical edifice, the theory needs no interpretation. However, to develop some intuition about the theory, to be able to anticipate theorems and results, and to relate these developments to the physical reality, it is important to have some interpretation of the definitions and of the basic axioms of the theory. We will attempt to develop such interpretations as we go along, using physical examples and pictures.
One idea is that the uncertainty in the world is fully contained in the selection of some hidden variable. If this variable were known, then nothing would be uncertain anymore. I like to think of this variable as being picked by nature at the big bang. Many choices were possible, but one particular choice was made and everything derives from it. [In most cases, it is easier to think of nature’s choice only as it affects a specific experiment, but we worry about this type of detail later.] In other words, everything that is uncertain is a function of that hidden variable. By function, we mean that if we know the hidden variable, then we know everything else.
Let us denote the hidden variable by ω. Take one uncertain thing, such as the outcome of the fifth coin flip. This outcome is a function of ω. If we designate the outcome of the fifth coin flip by X, then we conclude that X is a function of ω. We can denote that function by X(ω). Another uncertain thing could be the outcome of the twelfth coin flip. We can denote it by Y(ω). The key point here is that X and Y are functions of the same ω. Remember, there is only one ω (picked by nature at the big bang).
Summing up, everything that is random is some function X of some hidden variable ω. This is a model. To make this model more precise, we need to explain how ω is selected and what these functions X(ω) are like. These ideas will keep us busy for a while!
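The idea that every random quantity is a function of one hidden variable can be illustrated with a small Python sketch. This is only an illustration under assumptions that are not part of the model above: the hidden variable is represented as a finite tuple of coin flips, and the names `draw_omega`, `X`, and `Y` are chosen here for convenience.

```python
import random

def draw_omega(num_flips=20, seed=None):
    """Nature's single choice, modeled here (an assumption) as a tuple of
    coin flips, with 0 for heads and 1 for tails."""
    rng = random.Random(seed)
    return tuple(rng.randint(0, 1) for _ in range(num_flips))

def X(omega):
    """Outcome of the fifth coin flip, read off from the hidden variable."""
    return omega[4]

def Y(omega):
    """Outcome of the twelfth coin flip, read off from the same hidden variable."""
    return omega[11]

# The hidden variable is picked once; X and Y are then fully determined.
omega = draw_omega(seed=126)
print("X =", X(omega), " Y =", Y(omega))
```

Once the hidden variable is fixed, calling `X` or `Y` again always returns the same value: all the uncertainty sits in the single call to `draw_omega`.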
Legendre (Adrien Marie, 1752-1833). Best use of inaccurate measurements: Method of Least Squares.

To start our exploration of “uncertainty,” I propose to review very briefly the various attempts at making use of inaccurate measurements. (I condense this historical account from the very nice book by S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Belknap Harvard, 1999. For ease of exposition, I simplify the examples and the notation.)
Say that an amplifier has some gain A that we would like to measure. We observe the input X and the output Y and we know that Y = AX. If we could measure X and Y precisely, then we could determine A by a simple division. However, assume that we cannot measure these quantities precisely. Instead we make two sets of measurements: (X, Y) and (X’, Y’). We would like to find A so that Y = AX and Y’ = AX’. For concreteness, say that (X, Y) = (2, 5) and (X’, Y’) = (4, 7). No value of A works exactly for both sets of measurements. The problem is that we did not measure the input and the output accurately enough, but that may be unavoidable. What should we do?
One approach is to average the measurements, say by taking the arithmetic means: ((X + X’)/2, (Y + Y’)/2) = (3, 6) and to find the gain A so that 6 = 3A, so that A = 2. This approach was commonly used in astronomy before 1750.
A second approach is to solve for A for each pair of measurements: For (X, Y), we find A = 2.5 and for (X’, Y’), we find A = 1.75. We can average these values and decide that A should be close to (2.5 + 1.75)/2 = 2.125.
I will skip over many variations proposed by Mayer, Euler, and Laplace.
Another approach is to try to find A so as to minimize the sum of the squares of the errors between Y and AX and between Y’ and AX’. That is, we look for A that minimizes (Y - AX)^2 + (Y’ - AX’)^2. In our example, we need to find A that minimizes
(5 - 2A)^2 + (7 - 4A)^2 = 74 - 76A + 20A^2. Setting the derivative with respect to A equal to 0, we find -76 + 40A = 0, or A = 1.9. This is the solution proposed by Legendre in 1805. He called this approach the “method of least squares.”
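The three ways of estimating the gain can be compared side by side with a short Python sketch. The function names are chosen here for illustration. The least-squares formula comes from setting the derivative of the sum of squared errors to zero, which gives A equal to the sum of the products XY divided by the sum of the squares of X.

```python
# The two inaccurate measurements (input, output) from the example.
measurements = [(2.0, 5.0), (4.0, 7.0)]

def gain_from_means(pairs):
    """Average the inputs and the outputs first, then divide."""
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    return mean_y / mean_x

def mean_of_gains(pairs):
    """Solve for the gain of each pair, then average the gains."""
    return sum(y / x for x, y in pairs) / len(pairs)

def least_squares_gain(pairs):
    """Minimize sum (y - A x)^2: the derivative vanishes at
    A = sum(x * y) / sum(x * x)."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

print("gain from averaged measurements:", gain_from_means(measurements))
print("average of the two gains:       ", mean_of_gains(measurements))
print("least squares gain:             ", least_squares_gain(measurements))
```

With the measurements (2, 5) and (4, 7), the least-squares formula evaluates to 38/20 = 1.9, in agreement with the derivative calculation.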
The method of least squares is one that produces the “best” prediction of the output based on the input, under rather general conditions. However, to understand this notion, we need to make a detour on the characterization of uncertainty.
Bernoulli (Jacob, 1654 – 1705). Making sense of uncertainty and chance: Law of Large Numbers.

If an urn contains 5 red balls and 7 blue balls, then the odds of picking “at random” a red ball from the urn are 5 out of 12. One can view the likelihood of a complex event as being the ratio of the number of favorable cases divided by the total number of “equally likely” cases (yes, this is a somewhat circular definition, but not completely). However, in most situations, one cannot determine – let alone count – the equally likely cases nor the favorable cases. (Consider for instance the odds of having a sunny Memorial Day in Berkeley.) Jacob Bernoulli (one of twelve Bernoullis who contributed to Mathematics, Physics, and Probability) showed the following result. If we pick a ball from an urn with r red balls and b blue balls a large number N of times (always replacing the ball before the next attempt), then the fraction of times that we pick a red ball approaches r/(r + b). More precisely, he showed that the probability that this fraction differs from r/(r + b) by more than any given ε > 0 goes to 0 as N increases. We will learn this result as the weak law of large numbers.
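Bernoulli's result can be observed numerically. The following Python sketch simulates drawing with replacement from the urn with 5 red and 7 blue balls. The function name `red_fraction`, the seed, and the sample sizes are arbitrary choices made for this illustration.

```python
import random

def red_fraction(r, b, num_draws, rng):
    """Draw num_draws balls with replacement from an urn with r red and
    b blue balls; return the fraction of draws that came up red."""
    p_red = r / (r + b)
    reds = sum(1 for _ in range(num_draws) if rng.random() < p_red)
    return reds / num_draws

rng = random.Random(0)  # fixed seed so that a run can be repeated
print("r/(r + b) =", 5 / 12)
for n in (10, 100, 10_000, 100_000):
    print(n, "draws: fraction of red =", red_fraction(5, 7, n, rng))
```

For small N the fraction can be far from 5/12, and it typically settles closer to that value as N grows. A simulation only illustrates the statement; it does not prove it.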
de Moivre (Abraham, 1667 – 1754). Bounding the probability of deviation: Normal distribution.
De Moivre found a useful approximation of the probability that preoccupied Jacob Bernoulli. For N large and ε small, he derived the normal approximation to the probability that the fraction of red balls differs from r/(r + b) by more than ε. This is the first appearance of the normal distribution and an early example of the Central Limit Theorem.
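To see how good such an approximation can be, the following Python sketch compares the exact probability of a deviation, computed from the binomial distribution, with a normal approximation. This is a modern restatement rather than de Moivre's own derivation: it approximates the fraction of red balls by a normal variable with mean p and variance p(1 - p)/N, without any continuity correction. The function names and the numerical values are chosen here for illustration.

```python
import math

def exact_deviation_prob(n, p, eps):
    """P(|K/n - p| > eps) for K ~ Binomial(n, p), summed term by term."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > eps:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

def normal_deviation_prob(n, p, eps):
    """Same probability with K/n treated as normal with mean p and
    variance p(1 - p)/n, so that P(|Z| > z) = erfc(z / sqrt(2))."""
    z = eps * math.sqrt(n / (p * (1 - p)))
    return math.erfc(z / math.sqrt(2))

n, p, eps = 1000, 5 / 12, 0.03
print("exact binomial probability:", exact_deviation_prob(n, p, eps))
print("normal approximation:      ", normal_deviation_prob(n, p, eps))
```

The exact sum is practical only for moderate N, which is one reason an approximation of this kind is useful.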
Simpson (Thomas, 1710 – 1761). A first attempt at posterior probability.
Looking again at Bernoulli’s and de Moivre’s problem, we see that they assumed p = r/(r + b) known and worried about the probability that the fraction of N balls selected from the urn differs from p by more than a fixed ε > 0. Bernoulli showed that this probability goes to zero (he also got some conservative estimates of N needed for that probability to be a given small number). De Moivre improved on these estimates.
Simpson (a heavy drinker) worried about the “reverse” question. Assume we do not know p and that we observe the fraction q of a large number N of balls being red. We believe that p should be close to q, but how close can we be confident that it is? Simpson proposed a naïve answer by making arbitrary assumptions on the likelihood of the values of p.
Bayes (Thomas, 1701? – 1761). The importance of the prior distribution: Bayes’ rule.

Bayes understood Simpson’s error. To appreciate Bayes’ argument, assume that p = 0.6 and that we have made 100 experiments. What are the odds that q ∈ [0.55, 0.65]? If I tell you that q = 0.5, then these odds are 0. However, if I tell you that the urn was chosen such that q = 0.5 or q = 1, with equal probabilities, then the odds that q ∈ [0.55, 0.65] are now close to 1.
Bayes understood how to include systematically the information about the prior distribution in the calculation of the posterior distribution. He discovered what we know today as Bayes’ rule, a simple but very useful identity.
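A small numerical sketch can show how the prior enters the calculation. The setting below is an assumption made for this illustration and is not taken from Bayes' own argument: the unknown p is restricted to a few candidate values, each with a prior probability, and we observe a given number of red balls in N draws with replacement. Bayes' rule says that the posterior probability of each candidate is proportional to its prior probability times the likelihood of the observation.

```python
import math

def posterior(prior, num_red, num_draws):
    """Bayes' rule on a finite set of candidate values of p.

    prior maps each candidate p to its prior probability. The likelihood of
    seeing num_red red balls in num_draws draws is binomial. The observation
    must be possible under at least one candidate, otherwise the
    normalization divides by zero.
    """
    weights = {
        p: prior_p
        * math.comb(num_draws, num_red)
        * p**num_red
        * (1 - p) ** (num_draws - num_red)
        for p, prior_p in prior.items()
    }
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

# Same observation (60 red out of 100), two different priors.
print(posterior({0.5: 0.5, 0.6: 0.5}, 60, 100))
print(posterior({0.5: 0.99, 0.6: 0.01}, 60, 100))
```

The observation is identical in both calls, yet the posterior differs markedly. That is the point about the importance of the prior distribution.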
Laplace (Pierre Simon, 1749 – 1827). Posterior distribution: Analytical methods.

Laplace introduced the transform methods to evaluate probabilities. He provided derivations of the central limit theorem and various approximation results for integrals (based on what is known as Laplace’s method).
Gauss (Carl Friedrich, 1777 – 1855). Least Squares Estimation with Gaussian errors.


Jean Walrand, January 2000