
Bayes’ in-sanity || Cheenta Probability Series

Bayesian inference is one of the most controversial approaches to statistics, and this post deals mainly with the fundamental objections to Bayesian methods and the Bayesian school of thinking. Fisher put forward a vehement objection to Bayesian inference, describing it as “fallacious rubbish”.

However, ironically enough, it’s interesting to note that Fisher’s greatest statistical failure, fiducialism, was essentially an attempt to “enjoy the Bayesian omelette without breaking any Bayesian eggs” !

Ronald Fisher

Inductive Logic

An inductive logic is a logic of evidential support. In a deductive logic, the premises of a valid deductive argument logically entail the conclusion, where logical entailment means that every logically possible state of affairs that makes the premises true must make the conclusion true as well. Thus, the premises of a valid deductive argument provide total support for the conclusion. An inductive logic extends this idea to weaker arguments. In a good inductive argument, the truth of the premises provides some degree of support for the truth of the conclusion, where this degree of support might be measured via some numerical scale.

If a logic of good inductive arguments is to be of any real value, the measure of support it articulates should be up to the task. Presumably, the logic should at least satisfy the following condition:

Criterion of Adequacy (CoA):
The logic should make it likely (as a matter of logic) that as evidence accumulates, the total body of true evidence claims will eventually come to indicate, via the logic’s measure of support, that false hypotheses are probably false and that true hypotheses are probably true.

One practical example of an easy inductive inference is the following:

“Every bird in a random sample of 3200 birds is black. This strongly supports the following conclusion: All birds are black.”

This kind of argument is often called an induction by enumeration. It is closely related to the technique of statistical estimation.

Critique of Inductive Logic

Non-trivial calculi of inductive inference are shown to be incomplete. That is, it is impossible for a calculus of inductive inference to capture all inductive truths in some domain, no matter how large, without resorting to inductive content drawn from outside that domain. Hence inductive inference cannot be characterized merely as inference that conforms with some specified calculus.
A probabilistic logic of induction is unable to separate cleanly neutral support from disfavoring evidence (or ignorance from disbelief). Thus, the use of probabilistic representations may introduce spurious results stemming from its expressive inadequacy. That such spurious results arise in the Bayesian “doomsday argument” is shown by a re-analysis that employs fragments of inductive logic able to represent evidential neutrality. Further, the improper introduction of inductive probabilities is illustrated with the “self-sampling assumption.”

Objections to Bayesian Statistics

While Bayesian analysis has enjoyed notable success with many particular problems of inductive inference, it is not the one true and universal logic of induction. Some of the reasons arise at the global level through the existence of competing systems of inductive logic. Others emerge through an examination of the individual assumptions that, when combined, form the Bayesian system: that there is a real valued magnitude that expresses evidential support, that it is additive and that its treatment of logical conjunction is such that Bayes’ theorem ensues.

The fundamental objections to Bayesian methods are twofold: on one hand, Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience. The second objection to Bayes’ comes from the opposite direction and addresses the subjective strand of Bayesian inference.

Andrew Gelman, a staunch Bayesian, pens down an interesting criticism of the Bayesian ideology in the voice of a hypothetical anti-Bayesian statistician.

Here is the list of objections from a hypothetical or paradigmatic non-Bayesian; and I quote:

“Bayesian inference is a coherent mathematical theory but I don’t trust it in scientific applications. Subjective prior distributions don’t transfer well from person to person, and there’s no good objective principle for choosing a non-informative prior (even if that concept were mathematically defined, which it’s not). Where do prior distributions come from, anyway? I don’t trust them and I see no reason to recommend that other people do, just so that I can have the warm feeling of philosophical coherence. To put it another way, why should I believe your subjective prior? If I really believed it, then I could just feed you some data and ask you for your subjective posterior. That would save me a lot of effort!”

Andrew Gelman

In 1986, a statistician as prominent as Brad Efron restated these concerns mathematically:

“I like unbiased estimates and I like confidence intervals that really have their advertised confidence coverage. I know that these aren’t always going to be possible, but I think the right way forward is to get as close to these goals as possible and to develop robust methods that work with minimal assumptions. The Bayesian approach—to give up even trying to approximate unbiasedness and to instead rely on stronger and stronger assumptions—that seems like the wrong way to go. The priors I see in practice are typically just convenient conjugate forms. What a coincidence that, of all the infinite variety of priors that could be chosen, it always seems to be the normal, gamma, beta, etc., that turn out to be the right choices?”

Well that really sums up every frequentist’s rant about Bayes’ 😀 !

And the torrent of complaints never ceases….

Some frequentists believe that in the old days, Bayesian methods at least had the virtue of being mathematically clean. Nowadays, they all seem to be computed using Markov chain Monte Carlo, which means that, not only can you not realistically evaluate the statistical properties of the method, you can’t even be sure it’s converged, just adding one more item to the list of unverifiable (and unverified) assumptions in Bayesian belief.

As the applied statistician Andrew Ehrenberg wrote:

“Bayesianism assumes:

(a) Either a weak or uniform prior, in which case why bother?,

(b) Or a strong prior, in which case why collect new data?,

(c) Or more realistically, something in between, in which case Bayesianism always seems to duck the issue.”

Many are skeptical about the newfound empirical approach of Bayesians, which always seems to rely on the assumption of “exchangeability”, which is almost impossible to obtain in practical scenarios.

Finally Peace!!!

No doubt, some of these are strong arguments worthy enough to be taken seriously.

There is an extensive literature, which sometimes seems to overwhelm that of Bayesian inference itself, on the advantages and disadvantages of Bayesian approaches. Bayesians’ contributions to this discussion have included defense (explaining how our methods reduce to classical methods as special cases, so that we can be as inoffensive as anybody if needed).

Obviously, Bayesian methods have filled many loopholes in classical statistical theory.

And always remember that you are subjected to mass-criticism only when you have done something truly remarkable walking against the tide of popular opinion.

Hence: “All Hail the iconoclasts of Statistical Theory: the Bayesians!”

N.B. The above quote is mine XD

Wait for our next dose of Bayesian glorification!

Till then ,

Stay safe and cheers!

References

1. “Critique of Bayesianism” – John D. Norton

2. “Bayesian Informal Logic and Fallacy” – Kevin Korb

3. “Bayesian Analysis” – Gelman

4. “Statistical Rethinking” – Richard McElreath



Nonconglomerability and the Law of Total Probability || Cheenta Probability Series

This post explores an unsung corner of probability, “nonconglomerability”, and its effects on conditional probability. It also emphasizes how important countable additivity, i.e. extending finite additivity to infinite sets, really is.

**10 min read**

“I believe that we do not know anything for certain, but everything probably.”~ Christiaan Huygens

One week into conditional probability, it’s time to get our hands dirty with the Law of Total Probability and the paradoxes which have emerged out of it. Let’s be formal enough to state the law first.

The Law of Total Probability

Adorably called LOTP, it is one of the cardinal results in Conditional Probability.

Suppose the events \(A_1, A_2, \ldots, A_k \) form a partition (mutually exclusive and exhaustive) of the event space, and let \(H\) be any arbitrary event. Then the law states that \(P(H)=P(H|A_1)P(A_1)+P(H|A_2)P(A_2)+\cdots+P(H|A_k)P(A_k) \).

Figure: a partition of the event space with k = 6

The day-to-day picture I always relate to while recalling this law is the following. Suppose you have a thin glass block placed on a table and some water has accidentally been spilled on it. A part of this water gets trapped between the surface of the table and the glass. If you look at this from above, you will see an almost circular puddle of water trapped within the rectangular block. This puddle is our arbitrary event \(H\), and the block is the event space. How can you get the partitions? Any wild guesses? Well, drop a hard stone on the glass and it cracks; or, if you have strong arms and like fantasizing about hurting your knuckles, you can do it yourself :P. The cracks partition the sample space into various segments, and there is water trapped in each of them. There you go!
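To see the law in action, here is a minimal R sketch (the die example below is just an arbitrary illustration) that computes \(P(H)\) both directly and via LOTP on a three-part partition.

# A minimal sketch verifying LOTP on a toy example (a fair six-sided die).
# Partition: A1 = {1,2}, A2 = {3,4}, A3 = {5,6}; event H = "the roll is even".
outcomes <- 1:6
A <- list(c(1, 2), c(3, 4), c(5, 6))
H <- c(2, 4, 6)

# Direct computation of P(H)
p_H_direct <- length(H) / length(outcomes)

# LOTP: sum over the partition of P(H | A_i) * P(A_i)
p_H_lotp <- sum(sapply(A, function(Ai) {
  p_Ai      <- length(Ai) / length(outcomes)          # P(A_i)
  p_H_given <- length(intersect(H, Ai)) / length(Ai)  # P(H | A_i)
  p_H_given * p_Ai
}))

c(direct = p_H_direct, lotp = p_H_lotp)  # both equal 0.5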

As we have stressed already, from a false proposition, or from a fallacious argument that leads to a false proposition, all propositions, true and false, may be deduced. But this is just the danger; if fallacious reasoning always led to absurd conclusions, it would be found out at once and corrected. But once an easy, shortcut mode of reasoning has led to a few correct results, almost everybody accepts it; those who try to warn against it are generally not listened to.

When a fallacy reaches this stage, it takes on a life of its own and develops very effective defenses for self-preservation in the face of all criticisms. Here is one such instance.

Nonconglomerability

If \( (C_1, C_2, \ldots, C_n) \) denote a finite set of mutually exclusive, exhaustive propositions on prior information \(I\), then for any proposition \(A\), we have:

\( P(A|I) = \sum_{i=1}^{n} P(AC_i|I) = \sum_{i=1}^{n} P(A|C_i I)P(C_i|I) \)

As you have all seen in the previous blog post, the prior probability \(P(A|I)\) is written as a weighted average of the conditional probabilities \( P(A|C_i I) \).

Now, it is an elementary result that the weighted mean of a set of real numbers cannot lie outside the range spanned by those numbers, i.e. if \( L \le P(A|C_i I) \le U \) ; then necessarily \( L \le P(A|I) \le U \).

De Finetti (1972) called this property “conglomerability” of the partition \( \{C_i\}\).

Obviously, non-conglomerability cannot arise from a correct application of the rules of probability theory on finite sets. It cannot, therefore, occur in an infinite set which is approached as a well-defined limit of a sequence of finite sets.

Yet nonconglomerability has become a minor industry, with a large and growing literature. There are writers who believe that it is a real phenomenon, and that they are proving theorems about the circumstances in which it occurs, which are important for the foundations of probability theory. Nonconglomerability has become, quite literally, institutionalized in our literature and taught as truth.

Let us examine a case where “nonconglomerability” has been claimed to be true.

Rectangular Array

This particular example is due to the famous trio Kadane, Schervish and Seidenfeld (1986).

We start from a two-dimensional \( (M \times N) \) set of probabilities \( p(i,j) \), \( 1 \le i \le M \), \( 1 \le j \le N \). The sample space is a rectangular array of \( MN \) points in the first quadrant. It will suffice to take some prior information \( I \) for which these probabilities are uniform: \( p(i,j) = \frac{1}{MN} \). Let us define the event \(A : i<j \).

Therefore, \(P(A|I)\) can be found by direct counting; in fact it is given by:

\( P(A|I) =
\begin{cases}
\frac{(2N-M-1)}{2N}, & M \le N \\
\frac{(N-1)}{2M}, & N \le M \\
\end{cases}
\)

Now let us resolve this conditionally, using the partition \(\{C_1,C_2,\ldots,C_M\}\), where \(C_i\) is the proposition that we are in the \(i^{th}\) column. We have \(P(C_i |I)=\frac{1}{M} \).

So, we get \( P(A|C_i I) =
\begin{cases}
\frac{N-i}{N}, & 1 \le i \le M \le N \\
\frac{N-i}{N}, & 1 \le i \le N \le M \\
0, & N \le i \le M
\end{cases}
\)

These conditional probabilities reach the upper and lower bounds \( U= \frac{N-1}{N} \) for all \( M,N \), and \(L=1-R \) if \(M \le N\), \( L = 0 \) otherwise, where \( R=\frac{M}{N} \).

Now, if we check the conglomerability criteria using these \(L,U\) , then it seems to work fine with no ambiguity. So, where can one possibly create a non-conglomerability out of this?
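As a sanity check, here is a small R sketch that verifies conglomerability numerically on a finite array (the particular values of \(M\) and \(N\) below are arbitrary choices): the prior probability equals the weighted average of the conditionals and lies between their extremes.

# A small numeric check of conglomerability on the finite M x N array.
M <- 7; N <- 10                        # any finite choices work; these are arbitrary
grid <- expand.grid(i = 1:M, j = 1:N)  # the MN equally likely points
A <- with(grid, i < j)

p_A_direct   <- mean(A)                          # P(A | I) by direct counting
p_A_given_Ci <- tapply(A, grid$i, mean)          # P(A | C_i I) for each column i
p_A_weighted <- sum(p_A_given_Ci * (1 / M))      # weighted average, P(C_i | I) = 1/M

c(direct = p_A_direct, weighted = p_A_weighted,
  lower = min(p_A_given_Ci), upper = max(p_A_given_Ci))

For \(M = 7, N = 10\) both the direct and the weighted answers come out as \(0.6 = \frac{2N-M-1}{2N}\), safely inside the interval \([0.3, 0.9]\): everything is consistent, as it must be on a finite set.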

Just take \( M \rightarrow \infty, N \rightarrow \infty \) and look at the probabilities \( P(A|C_i I) \). We try to evaluate these probabilities directly on the infinite set.

Then it is argued that, for any given \(i\), there are an infinite number of points where \( A \) is true and only a finite number where it is false. Thus, the conditional probability \(P(A|C_i I) =1 \) for all \(i\); yet \(P(A|I) <1 \).

Now, consider the set of propositions \( \{D_1,D_2,\ldots,D_N \} \), where \( D_j\) is the statement that we are on the \(j^{th}\) row of the array, counting from the bottom. Now, by the same argument, for any given \( j \), there are an infinite number of points where \(A\) is false, and only a finite number where \(A\) is true.

Then, the conditional probability \(P(A|D_j I)=0 \) for all \(j\), yet \(P(A|I) >0 \). By this reasoning, we have produced two nonconglomerabilities, in opposite directions, from the same model!

But wait wait wait… aren’t we missing something? I don’t think this is a genuine paradox at all. Let’s think of this elementary problem in analysis:

Consider the following well-known fallacious argument that any given infinite series \(S=\sum_{i} a_i \) converges to any real number \(x\) of your choice.

Suppose you define the partial sums \(s_n =a_1 + a_2 + … +a_n \). Define \(s_0=0\).

Write \( a_n = (s_n - x) - (s_{n-1} - x) \), so our series becomes:

\( S= (s_1 - x)+(s_2 - x)+ (s_3 - x)+\cdots-(s_0 - x)-(s_1 - x)-(s_2 - x)-\cdots \)

The terms \( (s_1-x), (s_2-x),\ldots\) cancel out and BOOM!! Your sum is \( S=-(s_0 -x)=x \) (since \(s_0 = 0\)).

Pause for a moment and taste the BLUNDER 😛

Hadn’t a great man once said:

Apply the ordinary processes of arithmetic and analysis only to expressions with a finite
number n of terms. Then after the calculation is done, observe how the resulting finite
expressions behave as the parameter n increases indefinitely.

Yes, exactly! Even stalwarts like Gauss, Weierstrass, Abel and many of their contemporaries did not follow this advice meticulously and in many cases reached wrong conclusions. If you can understand the fallacy of this “proof”, you can pretty well appreciate the forgery in the rectangular array problem.

Once one has understood the fallacy in the analysis problem, then whenever someone claims to have proved some result by carrying out arithmetic or analytical operations directly on an infinite set, it is hard to shake off a feeling that he could have proved the opposite just as easily and by an equally sound argument, had he wished to. Thus there is no reason to be surprised by what we have just found.

Nonconglomerability on a rectangular array, far from being a phenomenon of probability theory, is only an artifact of failure to obey the rules of probability theory.

Bourbaki were among the first to point out such fallacies.

Borel-Kolmogorov Paradox

Another abuse of conglomerability, this one has its name written down in the history books.

Suppose a random variable \(X\) has a uniform distribution on the unit sphere. We can pick such a point at random as follows:

1. Choose the longitude \( \lambda \) uniformly from \([-\pi,\pi] \).

2. Choose the latitude \( \phi \) from \( [-\frac{\pi}{2},\frac{\pi}{2}] \) with density \( \frac{1}{2} \cos \phi \).

The problem asks us to find the conditional distribution of \(X\) on a great circle.

Because of the symmetry of the sphere, one might expect that the distribution is uniform and independent of the choice of coordinates. However, two analyses give contradictory results. First, note that choosing a point uniformly on the sphere is equivalent to choosing the longitude \( \lambda \)  uniformly from \( [-\pi, \pi] \) and choosing the latitude \( \phi \) from \([-\frac{\pi}{2}, \frac{\pi}{2}] \).

Figure: a sphere with lines of latitude

For the great circle of constant longitude \( \lambda =0 \) (a meridian), \(f(\phi| \lambda=0) = \frac{1}{2} \cos \phi \).

Whereas, for the great circle of constant latitude \( \phi=0 \) (the equator), \( f(\lambda| \phi=0) = \frac{1}{2 \pi} \).

One is uniform on the circle, while the other is not! Yet, by the symmetry of the sphere, both are great circles and should look exactly alike :O
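To make the clash visible, here is a rough R simulation sketch: sample points uniformly on the sphere and condition on lying in a thin band around each of the two great circles (the band half-width eps is an arbitrary tuning choice).

# A quick R sketch of the Borel-Kolmogorov effect: condition a uniform point on
# the sphere on lying in a thin band around each of the two great circles.
set.seed(42)
n      <- 2e6
lambda <- runif(n, -pi, pi)          # longitude, uniform
phi    <- asin(runif(n, -1, 1))      # latitude, with density (1/2) cos(phi)
eps    <- 0.01                       # half-width of the conditioning band

par(mfrow = c(1, 2))
# Near the meridian lambda = 0: latitude looks like (1/2) cos(phi), not uniform
hist(phi[abs(lambda) < eps], breaks = 40, freq = FALSE,
     main = "phi given |lambda| < eps", xlab = expression(phi))
curve(0.5 * cos(x), add = TRUE, lwd = 2)
# Near the equator phi = 0: longitude looks uniform on (-pi, pi)
hist(lambda[abs(phi) < eps], breaks = 40, freq = FALSE,
     main = "lambda given |phi| < eps", xlab = expression(lambda))
abline(h = 1 / (2 * pi), lwd = 2)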

Now, I am not going to geek out on the resolution of this paradox, as it requires greater knowledge of probability theory which we will surely cover in future posts. But interested readers can go through E.T. Jaynes’ explanation of the same.

We will be back with a few more posts in conditional probability and will try our best to enamor you with the various angles and spheres which are less trodden in probability.

Till then stay safe.

Have a great weekend!

References:

1. “Conglomerability and Finite Partitions” – Alan Zame

2. “The Extent of Non-Conglomerability of Finitely Additive Probabilities” – Kadane, Schervish, Seidenfeld

3. “Probability Theory: The Logic of Science” – E.T. Jaynes



Probability From A Frequentist’s Perspective || Cheenta Probability Series

This post discusses the history of frequentism and how it was an unperturbed concept till the advent of Bayes. It sheds some light on the ongoing debate between frequentist and Bayesian thinking.

***10 min read***

“The probable is that which for the most part happens” – “Rhetoric”,Aristotle

Frequentism

Hopefully this example will be able to explicate the true sense of the word:

Suppose, I have misplaced my phone somewhere in my home. I can use the phone locator on the base of the instrument to locate the phone and when I press the phone locator the phone starts beeping.

Now the question is which area of my home should I search?

Apparently, there are two clear approaches to this problem:

Approach 1:

I can hear the phone beeping. I also have a mental model which helps me identify the area from which the sound is coming. Therefore, upon hearing the beep, I infer the area of my home I must search to locate the phone.

Approach 2:

I can hear the phone beeping. Now, apart from a mental model which helps me identify the area from which the sound is coming from, I also know the locations where I have misplaced the phone in the past. So, I combine my inferences using the beeps and my prior information about the locations I have misplaced the phone in the past to identify an area I must search to locate the phone.

The first approach, which is probably the more conventional way out, defines frequentism, or reflects a person’s frequentist ideas or beliefs.

The second approach surely, as you have guessed already is how a Bayesian thinks.

Now, I am not going to get into the debate on which approach is better, frequentist or Bayesian. Being a fan of Bayesian thought processing myself, I hear that many contemporary statisticians feel that the Bayesians have lobbied their way into the limelight with their “crooked” mindset of unnecessarily complicating things. Rather, I would like to elucidate Approach 1, its history, and the present scenario of frequentist statistics.

A Brief History of Frequentism

The term ‘frequentist’ was first used by M.G. Kendall in 1949; however, the belief had already emerged centuries before. It was a behemoth of a belief which was unquestionably the most dominant in all sciences.

What is the relation of frequency to probability?

Bernoulli (also Borel) gave part of an answer in terms of “Laws of Large Numbers”. Probabilities of single cases are essential in stating and proving the laws. Venn provided a thorough exposition of his staunch frequentist views in his treatise “The Logic of Chance: An Essay on the Foundations and Province of the Theory of Probability“. Von Mises postulated kinds of infinite sequences that would typically (with probability 1) be produced by independent and identically distributed trials.

As carefully put by Diaconis, frequentism radically restricts the range of probability theory, and it must be judged inadequate as a general account of probability. So, whether Persi Diaconis is a frequentist or not is left to the reader as an exercise to judge 😛. But a deep question has been uncovered in the development of the frequentist view: what is the nature of a random sequence?

Furthermore, what is the relation of idealization to reality? In frequentism, the question arose because the development of the theory demanded more than actual frequencies in the world. It required limiting relative frequencies in idealized infinite sequences.

There is another question that the frequency view leaves hanging, that is, how probability can be a guide to life. To be more precise, how can probability as frequency inform degrees of belief and rational decision?

Venn answers this roughly: as the degree of belief in a single event, we should take the corresponding relative frequency in a series of like events.

So, again we trace back to our earlier question of equi-probable events or likely events. How can we define them without circularity?

Venn’s flawed idealization of “likely” events

Venn claimed that if one’s degrees of belief agree with the relative frequencies, then an infinite series of fair bets would have a payoff in the limit that is fair, and that, in general, expected value could be identified with limiting average value.

He has no right to claim this. Consider an idealized sequence of coin flips with limiting relative frequency of heads = \( \frac{1}{2} \). Any bet at even odds is a fair bet. Consider an idealized agent A who bets on every toss and always loses. You might want to argue that this cannot happen, but there is in fact nothing a frequentist can do to preclude it.

However Venn was modest enough to see holes in his own theory. 😉

John Venn

Bernoulli and the WLLN

Jacob Bernoulli was well aware of the limitations of relying on intuitively equi-probable cases. He looked to frequency evidence to inform probability judgements, as practical men had always done informally.

He proved the first law of large numbers. With arbitrarily high probability, the relative frequency of heads can be made to approximate the probability of heads as closely as you please by choosing a long enough series of trials.

He aimed at a determination of the number of draws with replacement from an urn that would be required for the relative frequencies to be within specified bounds of the chances with a specified high probability. He remarks that if it were a question of frequencies exactly equaling the chances, a long series of trials would just make things worse. At this point, frequencies and chances are clearly treated as two distinct and separate things.

Bernoulli derived (with a tacit assumption of independence) an upper bound on the required number of trials. This was what he called his golden theorem. The law of large numbers follows.

Nevertheless, his bound was not very good and conjures up very large numbers of trials. Ars Conjectandi presents an instance:

The chance is \( \frac{3}{5} \), the desired interval for the relative frequency is between \( \frac{29}{50} \) and \( \frac{31}{50} \), and the desired probability that the frequency falls within that interval is \( \frac{1000}{1001} \). Bernoulli’s bound says that this is achieved if the number of trials is at least 25,550. Datasets of this magnitude were not available in Bernoulli’s time.
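With modern software, it takes a few lines of R to see how conservative Bernoulli's bound is; the exact binomial computation below (a quick sketch) shows that 25,550 trials deliver far more than the required moral certainty of 1000/1001.

# A quick R check of Bernoulli's example with the exact binomial distribution:
# with n = 25,550 trials and chance 3/5, how likely is the relative frequency
# to fall between 29/50 and 31/50?
n <- 25550
p <- 3 / 5
lo <- ceiling(n * 29 / 50)
hi <- floor(n * 31 / 50)
pbinom(hi, n, p) - pbinom(lo - 1, n, p)   # essentially 1, far above 1000/1001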

Jacob Bernoulli

Bernoulli’s Swindle

Bernoulli was quite invested in the determination of chance from empirical data. He was well convinced that in many areas it was impossible to determine chances by counting symmetric cases. An idea flashed through his mind: what we are not given to derive a priori, we can at least obtain a posteriori; that is, we can extract it from repeated observation of the results of similar examples.

The question is, given the data, the number of trials and the relative frequency of success in those trials, what is the probability that the chances fall within a certain interval? It is evident that this is not the problem which Bernoulli solved. He called this problem the “inverse problem“.

But here comes the funny part: Bernoulli somehow convinced himself that he had solved the “inverse problem“. How? By a vague argument using the concept of moral certainty. Bernoulli used this term to refer to a probability so close to 1 that for all intents and purposes one may treat it as a certainty. He argued that he had shown that, with a large enough number of trials, it would be morally certain that relative frequency would be approximately equal to chance. But if frequency equals chance, then chance equals frequency.

What a slipup! Really motivational for those who make silly mistakes … At least they don’t publish them :P.

Thomas Bayes solved the “inverse problem” and this time there was no cheating.

Frequentism in recent times

Major contributors to frequentist statistics in the early 20th century included Fisher, Neyman and Pearson. Fisher contributed to most of statistics and made significance testing the core of experimental science; Neyman formulated confidence intervals and contributed heavily to sampling theory; Neyman and Pearson paired in the creation of hypothesis testing. All valued objectivity, so the best interpretation of probability available to them was frequentist. Fisher said, “…the theory of inverse probability is founded upon an error, and must be wholly rejected.” (from his Statistical Methods for Research Workers). Neyman was closer to a pure frequentist, while Fisher’s views of probability were quite his own; both held nuanced views of probability.

Lindley’s Paradox

This is a major checkpoint in statistical history where the frequentists started to doubt themselves.

Lindley’s paradox is in fact a difficulty reconciling two paradigms — Bayesian and frequentist statistics. There is no mathematical inconsistency.

Let’s look at an example: in a certain city, 49,581 boys and 48,870 girls have been born over a certain time period. The observed proportion \(x \) of male births is thus \( \frac{49,581}{98,451} \approx 0.5036 \). We assume the number of male births is a binomial variable with parameter \( \theta \). We are interested in testing whether \( \theta \) is 0.5 or some other value. That is, our null hypothesis is \( H_0 : \theta = 0.5 \) and the alternative is \( H_1: \theta \neq 0.5 \).

A frequentist will approach the problem by computing a quantity called the p-value.

Making the quite reasonable approximation that, as the number of births is large, the number of male births \(X\) is approximately normal, \(X \sim N(\mu,\sigma^2) \), where \( \mu=n \theta \) and \( \sigma^2 = n \theta (1- \theta) \).

We then calculate the quantity \( P(X \ge x) \), taking the value of \( \mu \) in accordance with the null hypothesis.

We find that the p-value so obtained is lower than \( \alpha=0.05 \). Thus, we reject \(H_0\). [Treating 0.05 as a cutoff is just an empirical rule, and quite a questionable one!]

A Bayesian, on the other hand, will use Bayes’ Theorem:

Assuming no reason to favor one hypothesis over the other, the Bayesian approach would be to assign prior probabilities  \( \pi(H_0)= \pi(H_1)=0.5 \).

Then he calculates the posterior probability \( P(H_0 \mid \text{data}) \), using a uniform prior on \( \theta \) under \(H_1\), and finds it to be quite high (about \( 0.95 \)). This strongly favours \(H_0\) over \(H_1\).
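Here is a minimal R sketch of both computations for this example, assuming, as is standard in presentations of Lindley's paradox, a uniform prior on \( \theta \) under \(H_1\).

# A minimal R sketch of Lindley's paradox for the birth-ratio example, assuming
# (as is standard in this example) a uniform prior on theta under H1.
n <- 98451
x <- 49581

# Frequentist side: two-sided p-value from the normal approximation under H0
mu    <- n * 0.5
sigma <- sqrt(n * 0.5 * 0.5)
p_value <- 2 * pnorm(abs(x - mu) / sigma, lower.tail = FALSE)

# Bayesian side: P(H0 | x) with pi(H0) = pi(H1) = 0.5 and theta ~ Unif(0,1) under H1.
# The marginal likelihood under H1 is the integral of dbinom(x, n, theta), which is 1/(n+1).
m0 <- dbinom(x, n, 0.5)
m1 <- 1 / (n + 1)
posterior_H0 <- m0 / (m0 + m1)

c(p_value = p_value, posterior_H0 = posterior_H0)  # roughly 0.023 and 0.95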

Now, you decide: which of the two is the more realistic approach to this inference problem?

Can you show whether this disagreement between the two approaches becomes negligible as the sample size increases?

This is the acid test of your inclination to either of the two approaches 🙂 .

Also, let me know if you are convinced whether p-values are misused everywhere or not. This is also a raging debate and it seems Bayesians tend to hate p-values too much :D.

More insight into the Bayesian approach will be provided by Uttaran in the subsequent blog post.

Till then stay safe and enjoy learning!

References

1. Ten Great Ideas About Chance – Skyrms, Diaconis

2. A Treatise on Probability – John Maynard Keynes


Some Classical Problems And Paradoxes In Geometric Probability || Cheenta Probability Series

This is the 6th post in our ongoing probability series. In this post, we deliberate on the famous Bertrand’s Paradox, Buffon’s Needle Problem and geometric probability through barycentres.

**(10 min read)**

“Geometry is not true, it is advantageous.” ~ Henri Poincare

Yes, exactly… it’s time to merge the two stalwarts together: “Geometry” and “Probability”.

The Probability Measure of Geometrical Elements

In probability theory one is usually concerned with random variables which are quantities, or sets of quantities, taking values in some set of possibilities on which there is defined a non-negative measure, satisfying certain required conditions which enable us to interpret it as a probability. In the theory of geometrical probabilities the random elements are not quantities but geometrical objects such as points, lines and rotations. Since the ascription of a measure to such elements is not quite an obvious procedure, a number of “paradoxes” can be produced by failure to distinguish the reference set. These are all based on a simple confusion of ideas but may be useful in illustrating the way in which geometric probabilities should be defined.

Bertrand’s Paradox

We consider one paradox due to J. Bertrand (1907).

The problem of interest is precisely: Determine the probability that a random chord of a circle of unit radius has a length greater than the square root of 3, the side of an inscribed equilateral triangle.

Context of the problem

The development of the Theory of Probability has not been smooth at all. The first attempts to formalize the calculus of probability were due to Marquis de Laplace (1749-1827), who proposed to define the probability \( P(A) \) of an outcome A as the ratio of the number of events that result in the outcome A to the total number of possible events. This is of course only meaningful if the number of all possible events is finite and, in addition, all the events are equi-probable, a notion which Laplace also defined. However, in our first blog post, we addressed the fact that the definition is, in a sense, circular: a notion of equi-probable is defined prior to the introduction of probable.

Thus, at the time, the field did not seem to have a sound foundation. Attempts to extend the definition to the case of an infinite number of events led to even greater difficulties. Bertrand’s Paradox is one such discovery that made mathematicians wary of the whole notion of probability.

Apparently, this problem has more than one solution, meaning as the perspective of the reader changes, the solution also changes! Worthy of a paradox right?!

Some of the most discussed solutions

What about a probability of \( \frac{1}{3} \)?

Yeah, this is correct! Provided your thought process follows the same lines as this proof:

Any chord of the circle intersects it in two points, and we may suppose these to be independently distributed in probability distributions which are uniform over the circumference of the circle. Without loss of generality, we can suppose one of the two points to be at a vertex of an inscribed equilateral triangle. There is then just \(\frac{1}{3} \) of the circumference in which the other point can lie in order that the resulting chord has length greater than \( \sqrt{3} \) so that the probability is \( \frac{1}{3} \).

Do you see what the favourable outcomes are?

What about \( \frac{1}{4} \) ?

Umm… sure! Why not? Any chord is uniquely defined by the foot of the perpendicular on it from the centre. If this point is distributed uniformly over the circle, the probability of it lying in any region of area \( A \) is \( A \pi^{-1} \), since the total area of the circle is \(\pi\). For the chord to have length greater than \( \sqrt{3} \), the foot of the perpendicular must lie inside a circle of radius \( \frac{1}{2} \), and hence the probability is \( \frac{1}{4} \).

But but.. it is also \( \frac{1}{2} \) !?

Try to think of a proof why this probability is also \( \frac{1}{2} \) .

Based on constructing a random chord in a circle, the paradox involves a single mathematical problem with three reasonable but different solutions. It’s less a paradox and more a cautionary tale. It boils down to the same old question: “What do you mean by random?”
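A Monte Carlo sketch in R of the three recipes makes the point concrete; the three sampling schemes below are just the usual textbook constructions, and each produces its own answer.

# A Monte Carlo sketch of the three classical "random chord" recipes.
set.seed(7)
n <- 1e6

# 1. Random endpoints on the circumference -> 1/3
a <- runif(n, 0, 2 * pi); b <- runif(n, 0, 2 * pi)
len1 <- 2 * sin(abs(a - b) / 2)

# 2. Random radius, then a random point on it as the chord's midpoint -> 1/2
d2   <- runif(n, 0, 1)                 # distance of the midpoint from the centre
len2 <- 2 * sqrt(1 - d2^2)

# 3. Midpoint uniform over the disc -> 1/4
d3   <- sqrt(runif(n, 0, 1))           # radius of a uniform point in the disc
len3 <- 2 * sqrt(1 - d3^2)

c(endpoints = mean(len1 > sqrt(3)),
  radius    = mean(len2 > sqrt(3)),
  midpoint  = mean(len3 > sqrt(3)))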

Let’s hand it over to Buffon

The “Buffon needle problem” which many of us encountered in our college days has now been with us for 200 years. One major aspect of its appeal is that its solution has been tied to the value of \( \pi \), which can then be estimated by physical or computer simulation today. It is interesting that in the original development, Buffon (1777) extols geometry as a companion tool to the calculus in establishing a science of probability and suggests that chance is amenable to the methods of geometry as well as those of the calculus. Buffon indicates that the human mind, because of prior mathematics, preferred numbers to measures of area but that the invention of games revolving around area and ratios of areas could rectify this. To highlight this point he investigated a game already in practice in the 18th century known as “clean tile”.

Clean Tile Problem

In a room tiled or paved with equal tiles, of any shape, a coin is thrown upwards; one of the players bets that after its fall the coin will rest cleanly, i.e., on one tile only; the second bets that the coin will rest on two tiles, i.e., that it will cover one of the cracks which separate them; a third player bets the coin will rest over 3, 4, or 6 cracks: it is required to find the chances for each of these players.

This problem is regarded as the precursor of the famous “needle problem” .

Buffon, in his own words, states: “I assume that in a room, the floor of which is merely divided by parallel lines, a stick is thrown upwards and one of the players bets the stick will not intersect any of the parallels on the floor, whereas on the contrary the other one bets the stick will intersect some one of these lines; it is required to find the chances of the two players. It is possible to play this game with a sewing needle or a headless pin.”

Buffon’s Needle

Suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?

Uspensky’s Proof:

Let the parallel lines be separated \( d \) units apart. The length of the needle is given by \( l \),with the assumption \( l \le d \).Uspensky (1937) provides a proof that the probability of an intersection is \( p = \frac{2l}{\pi d} \) . He develops this by considering a finite number of possible positions for the needle as equally likely outcomes and then treats the limiting case as a representation of the problem. This includes a definition of randomness for the distance \(x\) of the needle’s midpoint to the nearest line and the acute angle \( \phi \) if formed by the needle and a perpendicular from the midpoint to the line. The solution is obtained by computing the ratio of favorable outcomes to the total set
of outcomes and passing to the limit.

A measure of the set of total outcomes is given by:

\( \int_{0}^{\frac{\pi}{2}} \int_{0}^{\frac{d}{2}} dx d \phi = \frac{\pi d}{4} \)

An intersection occurs precisely when \( x \le \frac{l}{2} \cos \phi \), so the measure of the set of intersections is:

\( \int_{0}^{\frac{\pi}{2}} \int_{0}^{\frac{l}{2} \cos \phi } dx d \phi = \frac{l}{2} \)

Therefore \( p = \frac{l/2}{(\pi d)/4} = \frac{2l}{\pi d} \).
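Here is a short Monte Carlo sketch in R of the needle experiment (the choices \(l = 1\) and \(d = 2\) are arbitrary, subject to \(l \le d\)); inverting the estimated crossing probability gives a crude estimate of \( \pi \).

# A Monte Carlo sketch of Buffon's needle: estimate the crossing probability
# and, from it, the value of pi.
set.seed(123)
n <- 1e6
l <- 1; d <- 2
x   <- runif(n, 0, d / 2)        # distance of the midpoint to the nearest line
phi <- runif(n, 0, pi / 2)       # acute angle with the perpendicular to the lines
crossed <- x <= (l / 2) * cos(phi)

p_hat  <- mean(crossed)          # should be close to 2*l/(pi*d) = 1/pi here
pi_hat <- 2 * l / (p_hat * d)
c(p_hat = p_hat, pi_estimate = pi_hat)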

To learn about further generalizations of this problem, you can go through Laplace’s extension and the long-needle case.

Now, let’s explore something new!

Barycentric Coordinates take guard!

What are barycentric coordinates?

For a precise definition and illustration of barycentres, you can go through Particle Geometry and Triangles.

The barycentric coordinates are generally defined in the context of triangles, but they can be set up in the more general space \(\mathbb{R}^n \). For n=1, one takes 2 distinct points \(A \) and \(B \) and two coordinates \(u\) and \(v\). Every point \(K\) on the real line is uniquely represented as \( K = uA+vB \), where \( u+v=1 \). More generally, to define barycentric coordinates in \( \mathbb{R}^n \), one needs \( n+1 \) points that do not lie in a space of lesser dimension.

Now, let’s look at a problem:

Choose \(n\) points at random on a segment of length 1. What is the probability that an \((n+1)\)-gon (a polygon with \((n+1)\) sides) can be constructed from the \((n+1)\) thus obtained segments? 

This is a generalization of this famous problem: “Two points are selected at random on a straight line segment of length 1. What is the probability that a triangle can be constructed out of thus obtained three segments?”

The above problem has a simple geometric proof and does not require heavy machinery , which will probably be addressed in our future posts. But the generalization makes the problem tempting enough to use barycentres.

First of all, for the \( (n+1) \)-gon to exist, the segment lengths \( x_0, x_1, \ldots, x_n \) must satisfy:

\(x_i < x_0 + x_1 +\cdots+x_n - x_i = 1- x_i \quad \forall i \in \{ 0,1,\ldots,n\} \)

Thus, \( x_i < \frac{1}{2} \).

The region defined by this set of inequalities is obtained from the basic simplex by removing smaller simplexes, one at every vertex. Each of these \( (n+1) \) smaller simplexes has hypervolume equal to \( (\frac{1}{2})^n \) times the hypervolume of the basic simplex.

Thus, the probability that the barycentric co-ordinates satisfy this set of inequalities is \(p= 1- \frac{n+1}{2^n} \).

See that this probability goes to 1 as \(n\) grows larger and larger, reflecting the fact that it is easier to find segments to construct a many-sided polygon than it is to find the sides of a triangle, which is rather natural.
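A quick R simulation sketch can be used to check the formula \( p = 1 - \frac{n+1}{2^n} \) against brute force (the number of repetitions below is an arbitrary choice).

# A short simulation checking p = 1 - (n+1)/2^n for the (n+1)-gon problem.
polygon_prob <- function(n, reps = 1e5) {
  mean(replicate(reps, {
    cuts <- sort(runif(n))                 # n random cut points on (0, 1)
    pieces <- diff(c(0, cuts, 1))          # the n + 1 resulting segment lengths
    all(pieces < 1 / 2)                    # polygon exists iff every piece < 1/2
  }))
}
sapply(2:6, function(n) c(simulated = polygon_prob(n),
                          exact     = 1 - (n + 1) / 2^n))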

Food For Thought

Today’s section does not contain any problem; rather, I would like to share a popular research problem regarding the extension of Buffon’s needle problem to three dimensions. There have been many attempts to incorporate another dimension into the traditional needle problem: for instance, if the needle were released in a weightless environment, it wouldn’t drop down to the plane, it would float. This introduces another dimension into the problem. I suggest some research articles in the bibliography which discuss this problem in detail. You can go through them if this problem excites you enough to dig deep into it.

Till then , stay safe!

Ciao!

References:

1. “The Buffon Needle Problem Extended” – James McCarry & Firooz Khosraviyani

2. “The Buffon–Laplace needle problem in three dimensions” – Zachary E Dell and Scott V Franklin

3. “Fifty Challenging Problems in Probability with Solutions” – Mosteller

4. “Geometric Probability” – Kendall, Moran


Understanding Statistical Regularity Through Random Walks | Cheenta Probability Series

This is another blog post in the Cheenta Probability Series. Here we take a look at statistical regularity through simple random walks.

**10 min read**

“The Law of Statistical Regularity formulated in the mathematical theory of probability lays down that a moderately large number of items chosen at random from a very large group are almost sure to have the characteristics of the large group.”  ~ W.I.King

Starting off with statistical regularity

So, let’s give a formal definition of this to bring some seriousness into account. If a sequence of independent experiments is held under the same specified conditions, the proportion of occurrences of a given event stabilizes as the number of experiments becomes larger. This is essentially what is known as statistical regularity.

It is an umbrella term that covers the law of large numbers, all central limit theorems and ergodic theorems.

But keeping in mind that we mainly cover material for undergraduates, we will not get into the aforementioned topics.

Richard Von Mises first mathematically demonstrated the idea of statistical regularity by pinpointing that no method for forming a subsequence of a random sequence (an infinite sequence of 0’s and 1’s) improves the odds for a specific event.

For instance, a sequence of fair coin tosses produces equal and independent 50/50 chances for heads and tails. A simple system of betting on heads every 3rd, 7th, or 21st toss, etc., does not change the odds of winning in the long run.

This is famously known as the “Impossibility of a gambling system“.

The Random Walk

This is in itself a topic in advanced probability theory and stochastic processes , but I will try to keep it simple here. Let’s consider a game  in which a player starts at the point \( x=0 \) and at each move, is required to take a step either forward (toward  \(+x\) ) or backward (toward \(−x\)). The choice is to be made randomly (maybe by tossing a coin). How shall we describe the resulting motion? In general, this problem is closely related to the coin tossing problem.

First let us look at  a few examples of a random walk. We may characterize the walker’s progress by the net distance \(D_N \) traveled in N steps. We illustrate the graph of the random walker in three instances below:

The progress made in a random walk

 What can we say about such a motion? We might first ask: “How far does he get on the average?” We must expect that his average progress will be zero, since he is equally likely to go either forward or backward. But we have the feeling that as \( N \) increases, he is more likely to have strayed farther from the starting point. We might, therefore, ask what is his average distance travelled in absolute value, that is, what is the average of \(|D|\). It is, however, more convenient to deal with another measure of “progress,” the square of the distance:\( D^2 \) is positive for either positive or negative motion, and is therefore a reasonable measure of such random wandering.

Now, till now we have not defined expected value of a quantity or a variable, which we sure will in the upcoming blog posts. For now , by “expected value” we mean the probable value (our best guess), which we can think of as the expected average behavior in many repeated sequences.

We represent such an expected value by \( ⟨D_N ^2⟩ \), and may refer to it also as the “mean square distance.” After one step, \( D^2 \)  is always \( +1 \), so we have certainly \( ⟨D_1 ^2⟩=1 \).

Now , we have an obvious recursion between \(D_N\) and \(D_{N-1} \). More specifically, \(D_N = D_{N-1} +1 \) or \(D_N=D_{N-1}-1 \).

Thus, squaring, \(D_N ^2 = D_{N-1} ^2 + 2 D_{N-1}+1 \) or, \(D_N^2 = D_{N-1} ^2 – 2 D_{N-1}+1 \)

In a number of independent sequences, we expect to obtain each value one-half of the time, so our average expectation is just the average of the two possible values. The expected value of \(D_N ^2 \) is  \(D_{N−1} ^2+1 \) . In general, we should expect for \(D_{N−1} ^2 \) its “expected value” \( ⟨D_{N−1} ^2 ⟩ \) (by definition!). So \( ⟨D_N ^2⟩=⟨D_{N−1} ^2 ⟩+1 \).

We have already seen that \(⟨D_1 ^2⟩=1\); it follows then that \( ⟨D_N ^2 ⟩=N \).

Damn, that was easy. Pretty simple right?

Now let’s draw an analogy of this game with a simple coin tossing experiment ( which many authors use as the prototype for demonstrating regularity !). Yeah, we were thinking slightly in an unorthodox manner ;).

To appropriately represent drifting away from the origin, in a random walk, we can use the Root Mean Square distance:

\( D_R = \sqrt{ ⟨D ^2⟩ } =\sqrt{N} \).
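If you want to see the \( \sqrt{N} \) scaling emerge, here is a small R sketch that averages \( D_N^2 \) over many simulated walks (the numbers of walks and steps are arbitrary choices).

# A quick check that the mean square distance grows like N.
set.seed(2020)
n_steps <- 100
n_walks <- 20000
steps <- matrix(sample(c(-1, 1), n_steps * n_walks, replace = TRUE),
                nrow = n_walks)
D <- t(apply(steps, 1, cumsum))          # D[w, N] = position of walk w after N steps
mean_sq <- colMeans(D^2)                 # estimate of <D_N^2> for N = 1, ..., 100
round(mean_sq[c(1, 10, 50, 100)])        # roughly 1, 10, 50, 100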

If we imagine the direction of each step to be in correspondence with the appearance of heads or tails in a coin toss, then \( D = N_H−N_T \) , the difference in the number of heads and tails. Since \( N_H+N_T=N \), the total number of steps (and tosses), we have \( D=2N_H−N \).

Now, it’s time to merge our intuition into reality. If the coin is honest or fair, what do you expect?

In \( N \) tosses, you should get \(\frac{N}{2} \) heads right?

So, let’s observe the difference \( N_H - \frac{N}{2} =\frac{D}{2} \).

The RMS deviation is given by \( (N_H- \frac{N}{2})_{\text{RMS}}=\frac{\sqrt{N}}{2} \).

Thus, an actual \( N_H \) deviates from \( \frac{N}{2} \) by about \( \frac{\sqrt{N}}{2} \); as a fraction of \(N\), the deviation is about \( \frac{1}{N} \cdot \frac{\sqrt{N}}{2} = \frac{1}{2\sqrt{N}} \).

So, the larger the \(N\), the closer we expect the fraction \( \frac{N_H}{N} \) to be to \( \frac{1}{2} \).

Sounds familiar right? we circled back to statistical regularity again!

The fraction of tosses that gave heads in a particular sequence of tosses

 Unfortunately, for any given run or combination of runs there is no guarantee that the observed deviation will be even near the expected deviation. There is always the finite chance that a large fluctuation—a long string of heads or tails—will give an arbitrarily large deviation. All we can say is that if the deviation is near the expected \( \frac{1}{2 \sqrt{N}} \) , we have no reason to suspect the honesty of the coin. If it is much larger, we may be suspicious, but cannot prove, that the coin is loaded (or that the tosser is clever!).

If you still suspect the fairness of the coin, you should probably learn the Physics of Coin Tossing which Uttaran Chatterjee would address in the next blog.

Another Simple Game of Chance

This is mainly for our reader friends who like to visualize through coding.

Let’s consider a simple game of chance using a spinner, but one that is somewhat kind to the gambler. In our game, the payoff in each of several repeated plays is determined by spinning the spinner. We pay an entry fee for each play of the game and then receive the payoff indicated by the spinner. Let the payoff on the spinner be distributed uniformly around the circle; i.e., if the angle after the spin is \( \theta \), the player receives \( \frac{\theta}{2 \pi} \) rupees. Thus, our payoff on one play is \(U\) rupees, where \(U\) is a random number taking values in \([0,1]\). Clearly, this is gambling for the poor :P.

Let us simulate the game to see what cumulative payoffs the gambler might receive, not counting the entry fees obviously, if he plays the game repeatedly.

Construct partial sums \( S_k = U_1+U_2+…+U_k, 1 \le k \le n \).

The successive partial sums form a random walk, with \(U_n\) being the \(n^{th} \) step and \(S_n\) being the position after \(n\) steps.

R Codes and plots

walk <- function(j)
{
  uniforms <- runif(10^j)            # payoffs U_1, ..., U_n with n = 10^j

  sums <- c(0, cumsum(uniforms))     # partial sums S_0 = 0, S_1, ..., S_n

  index <- 0:10^j                    # number of plays so far

  plot(index, sums, main = "Random walk of the partial sums",
       xlab = "Number of trials", ylab = "Winnings", col = j)
}
par(mfrow=c(2,2))

walk(1)

walk(2)

walk(3)

walk(4)

We now present the four plots for the values \(N = 10, 100, 1000, 10000 \).

Possible realizations of the first \(10^j \) steps

For small \(n\), that is for \(n=10\), we see irregularly spaced points increasing to the right, but as \(n\) increases, the spacing between the points becomes blurred and regularity emerges: the plots approach a straight line with slope equal to \( \frac{1}{2} \), the mean of a single step \(U_k\). If we look from a macroscopic point of view, ignoring the units on the axes, we see that the plots become independent of \(n\) as \(n\) increases. This is what regularity signifies.

Finally, here comes Number Theory!!

Trust me, I just can’t get enough of the Prime Number Theorem. Here is a short problem to think about for future researchers in this field.

Suppose a random walker starts at the point \( S_0 = 2 \), and walks according to the following rules:

1. If the walker is on the \( n^{th} \) prime number \( p_n \), he moves to either \( p_n + 1 \) or \( p_{n+1} \) with equal probability.

2. If the walker is on a composite number \(x \), he moves to one of the prime factors of \(x \), each with probability \( \frac{1}{\omega(x) }\), where \(\omega(n) \) denotes the number of distinct prime factors of \(n\).

The random walk is given by the sequence of moves \(S_n\).

What can you say about the quantity \( \mathbb{P}(\sup_{n \ge 0} S_n = \infty) \) ?

Give it a try.
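If you would like to play with the walk numerically before attacking the question, here is an exploratory R sketch (a toy simulation only; it proves nothing about \( \sup_n S_n \), and the helper functions are naive trial-division routines written just for illustration).

# An exploratory simulation of the prime/composite walk described above.
is_prime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)
  if (x %% 2 == 0) return(FALSE)
  d <- 3
  while (d * d <= x) { if (x %% d == 0) return(FALSE); d <- d + 2 }
  TRUE
}
next_prime <- function(x) { y <- x + 1; while (!is_prime(y)) y <- y + 1; y }
prime_factors <- function(x) {           # distinct prime factors of x
  f <- integer(0); d <- 2
  while (d * d <= x) {
    if (x %% d == 0) { f <- c(f, d); while (x %% d == 0) x <- x %/% d }
    d <- d + 1
  }
  if (x > 1) f <- c(f, x)
  f
}
step <- function(s) {
  if (is_prime(s)) { if (runif(1) < 0.5) s + 1 else next_prime(s) }
  else { f <- prime_factors(s); f[sample.int(length(f), 1)] }
}
set.seed(11)
s <- 2; path <- s
for (k in 1:500) { s <- step(s); path <- c(path, s) }
c(final = s, maximum = max(path))        # how far has the walk strayed so far?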

Food For Thought

For our gambling game described above, the expected payoff is \( \frac{1}{2} \) rupees per play, so the game is fair if the fee to play is \( \frac{1}{2} \) rupees.

Make a minor modification of this simulation by repeating the experiment after subtracting the mean \( \frac{1}{2} \) from each step of the random walk.

Now, plot the centered random walk (i.e. the centered partial sums \(S_k - \frac{k}{2} \)) for the same values of \(n\) as before.

Do you observe the same plots?
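If you want a head start, here is one possible R sketch of the modification (a hypothetical centered_walk function patterned on the walk function above).

# A minimal modification of walk() for the centered sums S_k - k/2.
centered_walk <- function(j) {
  u     <- runif(10^j)
  sums  <- c(0, cumsum(u) - (1:10^j) / 2)    # centered partial sums S_k - k/2
  index <- 0:10^j
  plot(index, sums, main = "Centered random walk",
       xlab = "Number of trials", ylab = "Centered winnings", col = j)
}
par(mfrow = c(2, 2))
for (j in 1:4) centered_walk(j)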

References

1. Experiencing Statistical Regularity – Ward Whitt

2. The Problem of the Random Walk – K. Pearson

3. An Introduction to Probability Theory and Its Applications – W. Feller

Stay tuned and keep following this series for getting more interesting perspectives on standard probability results and terms. I bet you too will see probability theory in a different light!

Keep learning, keep thinking!

Cheers..


Measuring Chances and Coincidences | Cheenta Probability Series

Let’s explore coincidence statistics!

(10 min read)

Coincidence Statistics – A non-formal approach to probability !

As Maxwell puts it straight : “The true logic of this world is in the calculus of probabilities.”

When we talk literature, we mean literature only. No need to bring up bookish terms to define probability! To put this in a simpler way, think of the following scenario:

Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewellery store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewellery. The policeman doesn’t hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion?

Ok, now you get it, right?? Probability is the science of this intuition or in more stricter terms deductive and plausible reasoning.

Time to get formal now, how can we measure probability??

It was as late as the 16th and 17th centuries that scientists came to realize that chance (probability) can be measured. The Greeks even had a goddess of chance, Tyche. Several contemporary philosophers tried to give an empirical measurement of chance, but it appears that there was no quantitative theory of chance in those times.

Consider the analogous problem for another metric, say length. How would you measure length? Of course, it is silly to ask this question now, but this was a valid problem in the 16th century.

You find a standard of equal length, apply it repeatedly and count. The standard might be your foot, as you pace off a distance. One such approach was made in 1522, etched eternally in history books, known as the “lawful rood“.

The idea was to line up the feet of 16 people as they emerged from church, as shown in the figure below:

Determination of the lawful rood

The various folks have different foot lengths, but an implicit averaging effect was accepted by a group, even though the explicit term “average” was non-existent at that time.

However, there is an objection. There is a kind of circularity involved in the process. We are defining length, but we are already assuming that our standard remains the same length as we step off the distance.

Eventually, we refine our notion of length. Your foot may change length; so may the rod; so may the standard meter stick, at a fine enough precision. We refine the measurement of length using physics. The circularity is real, but it indicates a path for refinement rather than a fatal objection.

The same thing happens with probability. To measure probability, we first find or make equally probable cases and then count them.

We speak of probability only for observations that we contemplate being made in the future. By the “probability” of a particular outcome of an observation we mean our estimate for the most likely fraction of a number of repeated observations that will yield that particular outcome. If we imagine repeating an observation—such as looking at a freshly tossed coin— \( N \) times, and if we call \( N_{A} \) our estimate of the most likely number of our observations that will give some specified result A, say the result “heads,” then by P(A), the probability of observing A, we mean \(P(A)=\frac{N_A}{N}\)

First of all, we may speak of a probability of something happening only if the occurrence is a possible outcome of some repeatable observation.

We should emphasize that \( N \) and \(N_A \)  are not intended to represent numbers based on actual observations. \(N_A \)  is our best estimate of what would occur in \( N \)  imagined observations. Probability depends, therefore, on our knowledge and on our ability to make estimates. In effect, on our common sense! Fortunately, there is a certain amount of agreement in the common sense of many things, so that different people will make the same estimate. Probabilities need not, however, be “absolute” numbers. Since they depend on our ignorance, they may become different if our knowledge changes.

The Birthday Paradox

Reiterating Diaconis’ words, coincidences occur to all of us. The birthday paradox has emerged as a useful tool for setting standards of surprise.

In a room of just 23 people, did you know there’s a 50-50 chance of at least two people having the same birthday?

This amazingly counter-intuitive result is actually called a paradox because our brain hates processing compounding exponents. We expect probabilities to be linear and only consider the scenarios we’re involved in.

Let’s investigate the case of \( N \) people and \(C \) categories. (We consider days as categories here, i.e. C = 365.)

We make a few assumptions before tackling this. First of all, we assume the births of all individuals to be independent of each other (i.e. we exclude the case of twins) and, for simplicity, we assume a non-leap year, which means we only deal with 365 days.

So, we basically have \( N \) people, each independently and uniformly distributed in \( \{1,2,..,C\} \).

Now, if \(N=C+1 \), is there a chance that all these numbers are distinct? Of course not . (Simply by the Pigeon Hole Principle)

Call the chance that all these numbers are distinct \(P(C,N)\).

So, see that \( P(C,N)=1(1-\frac{1}{C})(1-\frac{2}{C})…(1-\frac{N-1}{C}) \).

As a simple rule of thumb, for \(N \approx 1.2 \sqrt{C} \), the chance of at least one match is close to \( \frac{1}{2} \).

The basis of this approximation is the following proposition: when \(N\) and \(C\) are large and \( \frac{N}{C^{2/3}} \) is small, the chance of no match is \(P(C,N) \sim e^{-\frac{N(N-1)}{2C}} \).

The proof is quite simple, just see that \(\log(1-x) \sim -x \) when \(x\) is small.

Thus, \(P(C,N)=e^{\sum_{i=1}^{N-1} \log(1-\frac{i}{C})} \sim e^{-(\sum_{i=1}^{N-1} \frac{i}{C})}=e^{-\frac{N(N-1)}{2C}} \)

So, guess what if \(N=75 \) and \( C=365 \)?

Voila! there’s almost a 99.9% chance of at least two people matching.
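Here is a small R sketch comparing the exact product formula for \(P(C,N)\) with the exponential approximation (the values of \(N\) below are arbitrary choices).

# Compare the exact no-match probability P(C, N) with exp(-N(N-1)/(2C)) for C = 365.
no_match_exact  <- function(N, C = 365) prod(1 - (0:(N - 1)) / C)
no_match_approx <- function(N, C = 365) exp(-N * (N - 1) / (2 * C))

sapply(c(23, 40, 75), function(N)
  c(N = N,
    exact  = 1 - no_match_exact(N),      # P(at least one shared birthday)
    approx = 1 - no_match_approx(N)))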

Exponential decrease of P(no match) with increasing N

The birthday paradox has important consequences in cryptography, and in a more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not \(2^N\), but rather only about \(2^{\frac{N}{2}}\). This is exploited by birthday attacks on hash functions.

Food For Thought

Let’s think of an extension of the Birthday Problem. Suppose there are two types of people, say \( m \) men and \( n \) women. Find the probability that no birthday is shared by a man and a woman (birthdays shared within the same sex do not count).

A possible hint: you will require Stirling numbers of the second kind.

You may also use the Birthday Problem to study multiple coincidences. For example, how large should \( N \) be to have approximately even odds of a triple birthday match? (A simulation sketch is given below.)
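
For the triple-match question, a quick Monte Carlo sketch is often the fastest way to build intuition before attempting an analytic approximation. The candidate room sizes and the number of trials below are arbitrary choices.

```python
import random
from collections import Counter

def p_triple_match(n_people, C=365, trials=20000):
    """Monte Carlo estimate of P(some birthday is shared by at least 3 of n_people people)."""
    hits = 0
    for _ in range(trials):
        counts = Counter(random.randrange(C) for _ in range(n_people))
        if max(counts.values()) >= 3:
            hits += 1
    return hits / trials

for n in [60, 80, 88, 100]:
    print(f"N = {n:>3}: estimated P(triple match) ~ {p_triple_match(n):.3f}")
```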

Till then, keep learning, keep thinking and keep innovating!

Stay tuned for the next post by Uttaran Chatterjee; I guarantee it will be a treat!

References:

1. Ten Great Ideas About Chance – Persi Diaconis, Brian Skyrms

2. Feynman’s Lectures on Physics – the chapter on Probability

3. Mathematical Recreations and Essays – Ball & Coxeter

Check out another blog from this Probability Series: An Unexpected Correspondence and some unfinished games.


ISI MSTAT PSB 2011 Problem 4 | Digging deep into Multivariate Normal

This is an interesting problem from ISI MSTAT PSB 2011 (Problem 4) that tests how a student visualizes the normal distribution in higher dimensions.

The Problem: ISI MSTAT PSB 2011 Problem 4

Suppose that \( X_1,X_2,… \) are independent and identically distributed \( d \)-dimensional normal random vectors. Consider a fixed \( x_0 \in \mathbb{R}^d \) and for \( i=1,2,…, \) define \( D_i = \| X_i - x_0 \| \), the Euclidean distance between \( X_i \) and \( x_0 \). Show that for every \( \epsilon > 0 \), \( P[\min_{1 \le i \le n} D_i > \epsilon] \rightarrow 0 \) as \( n \rightarrow \infty \).

Prerequisites:

  1. Finding the distribution of the minimum order statistic
  2. Multivariate Gaussian properties

Solution:

First of all, see that \( P(\min_{1 \le i \le n} D_i > \epsilon)=\prod_{i=1}^{n} P(D_i > \epsilon)=P(D_1 > \epsilon)^n \), by independence and identical distribution (verify yourself!).

But we are actually more interested in the event \( \{D_i < \epsilon \} \).

Let me elaborate why this makes sense!

Let \( \phi \) denote the \( d \)-dimensional Gaussian density, and let \( B(x_0, \epsilon) \) be the Euclidean ball of radius \( \epsilon \) around \( x_0 \). Note that \( \{D_i < \epsilon\} \) is the event that the Gaussian vector \( X_i \) lands in this Euclidean ball.

So, if we can show that this event has positive probability for the given \( x_0 \) and \( \epsilon \), we will be done, since in the limit we would be raising a number strictly less than 1 to a power that grows without bound.

In particular, we have \( P(D_i < \epsilon)= \int_{B(x_0, \epsilon)} \phi(x) dx \geq |B(x_0, \epsilon)| \inf_{x \in B(x_0, \epsilon)} \phi(x) \). By rotational symmetry, and since the Gaussian density decays as we move away from the centre, this infimum is positive and equals \( \phi(x_0 + \epsilon \frac{x_0}{||x_0||}) \), the density at the point of the ball farthest from the origin (assuming \( x_0 \neq 0 \)). (To see that this is indeed a lower bound, note that \( B(x_0, \epsilon) \subset B(0, \epsilon + ||x_0||) \).)

So, basically what we have shown here is that there exists a \( \delta > 0 \) such that \( P(D_i < \epsilon ) > \delta \).

Since \( \delta \) is a positive lower bound on a probability, we get \( P(D_i > \epsilon) \le 1-\delta < 1 \).

Thus, we have \( \lim_{n \rightarrow \infty} P(D_1 > \epsilon)^n \leq \lim_{n \rightarrow \infty} (1-\delta)^n = 0 \).

Hence we are done.
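
A quick numerical sanity check is reassuring here. The sketch below assumes, for concreteness, standard Gaussian vectors in \( d=3 \) with arbitrary illustrative choices of \( x_0 \) and \( \epsilon \); the estimated probability \( P[\min_{1 \le i \le n} D_i > \epsilon] \) visibly drops towards 0 as \( n \) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, x0, eps = 3, np.array([2.0, -1.0, 0.5]), 0.5   # arbitrary illustrative choices
trials = 2000

for n in [10, 100, 1000, 5000]:
    count = 0
    for _ in range(trials):
        X = rng.standard_normal((n, d))            # n iid N(0, I_d) vectors
        D = np.linalg.norm(X - x0, axis=1)         # distances D_i = ||X_i - x_0||
        count += (D.min() > eps)
    print(f"n = {n:>4}: estimated P(min D_i > eps) ~ {count / trials:.3f}")
```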

Food for thought:

There is a substantial amount of statistical literature on the equi-density contours of the multivariate Gaussian distribution.

Try to visualize them separately for a non-singular and a singular Gaussian distribution. They are covered extensively in the books of Kotz and Anderson. Do give them a read!



ISI MStat PSB 2015 Question 8 | MLE & Stochastic Regression

This is a problem involving BLUE for regression coefficients and MLE of a regression coefficient for a particular case of the regressors. This problem is from ISI MStat PSB 2015 Question 8.

The Problem- ISI MStat PSB 2015 Question 8

Consider the regression model:

\( y_i = bx_i + e_i \), \(1 \le i \le n \)

where the \( x_i \)’s are fixed non-zero real numbers and the \( e_i \)’s are independent random variables with mean 0 and equal variance.

(a) Consider estimators of the form \( \sum_{i=1}^{n} a_i y_i \) (where \(a_i \)’s are non random real numbers) that are unbiased for \(b \). Show that the least squares estimator of \(b\) has the minimum variance in this class of estimators.

(b) Suppose that \( x_i \) ‘s take values \( -1 \) or \(+1 \) and \( e_i \)’s have density \( f(t)=\frac{1}{2} e^{-|t|} , t \in \mathbb{R} \).

Find the maximum likelihood estimator of \( b \).

Pre-requisites:

1. Linear estimation

2. Minimum Variance Unbiased Estimation

3. Principle of Least Squares

4. Finding the MLE

Solution:

Part (a) is the well-known Gauss-Markov result that the least squares estimator is the BLUE (best linear unbiased estimator) of the regression coefficients.

You can look up its proof on the internet or in any standard text on linear regression.

Part(b) is worth caring about.

Here the \( x_i \)’s take values \( +1 \) or \( -1 \), but the approach remains the same.

Let’s look at the likelihood function of \(b\) :

\(L(b) = L(b,y_i,x_i)=\frac{1}{2^n} e^{-\sum_{i=1}^{n} |y_i-bx_i|} \)

so \( \ln{L} = c- \sum_{i=1}^{n} |y_i -bx_i| \), where \( c \) is a constant that does not depend on \( b \).

Maximizing \( \ln{L} \) with respect to \( b \) is the same as minimizing \( \sum_{i=1}^{n} |y_i - bx_i| \) with respect to \( b \).

Note that \( |x_i|=1 \). Let us define \( t_i =\frac{y_i}{x_i} \).

Here’s the catch: since \( |x_i|=1 \), we have \( \sum_{i=1}^{n} |y_i-bx_i|= \sum_{i=1}^{n} |y_i-bx_i| \cdot \frac{1}{|x_i|} = \sum_{i=1}^{n} |\frac{y_i}{x_i}-b| =\sum_{i=1}^{n} |t_i - b| \).

Now recall, from the days when you took your first baby steps in statistics, the result that the mean absolute deviation about the median is the least.

So, \( \sum_{i=1}^{n} |t_i - b| \) is minimized at \( b= \) median\( (t_i) \).

Thus, MLE of \(b \) is the median of \( \{ \frac{y_1}{x_1},\frac{y_2}{x_2},…,\frac{y_n}{x_n} \} \).
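
A small simulation makes this concrete. The true coefficient, sample size and error scale below are arbitrary choices; the errors are drawn from the double exponential density of the problem, and the median of the \( \frac{y_i}{x_i} \)’s is compared with the least squares estimate from part (a).

```python
import numpy as np

rng = np.random.default_rng(1)
n, b_true = 500, 2.5                           # arbitrary sample size and true coefficient
x = rng.choice([-1.0, 1.0], size=n)            # regressors taking values +1 or -1
e = rng.laplace(loc=0.0, scale=1.0, size=n)    # double exponential (Laplace) errors
y = b_true * x + e

b_mle = np.median(y / x)                       # MLE: median of the y_i / x_i
b_ls = np.sum(x * y) / np.sum(x * x)           # least squares estimate from part (a)
print(f"true b = {b_true}, MLE (median) = {b_mle:.3f}, least squares = {b_ls:.3f}")
```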

Food For Thought:

In classical regression models we assume the \( X_i \)’s are non-stochastic. But is that always valid? Not at all.

In the case of stochastic \( X_i \)’s, there is a separate branch of regression, called stochastic regression, which involves a somewhat different analysis and different estimators.

I urge interested readers to go through this topic in any suitable book or paper.

You may refer to Montgomery, or to Draper & Smith, among others.



ISI MStat 2016 Problem 10 | PSB Sample | It’s a piece of cake!

This is a sample problem from ISI MStat 2016 Problem 10, which tests the student’s ability to write a model and then test the equality of parameters in it using appropriate statistics.

ISI MStat 2016 Problem 10:

A cake weighing one kilogram is cut into two pieces, and each piece is weighed separately. Denote the measured weights of the two pieces by \( X \) and \( Y \) . Assume that the errors in obtaining \( X \) and \(Y \) are independent and normally distributed with mean zero and the same (unknown) variance. Devise a test for the hypothesis that the true weights of the two pieces are equal.

Prerequisites:

1. Testing of hypotheses

2. Model formulation

3. Idea of RSS (residual sum of squares)

Solution:

Let us write the two cases in the form of a model:

\( X= \mu_1 + \epsilon_1 \)

\(Y = \mu_2 + \epsilon_2 \)

where, \( \mu_1,\mu_2 \) are the true weights of the two slices and \( \epsilon_1 , \epsilon_2 \sim N(0, \sigma^2) \) (independently).

So, you get \( X \sim N(\mu_1,\sigma^2) \) and \( Y \sim N(\mu_2, \sigma^2 ) \).

Also, see that \( X,Y \) are independent.

So, we need to test \( H_0: \mu_1=\mu_2 =\frac{1}{2} \) (since the true weights add up to 1, equality forces both to be \( \frac{1}{2} \)) against \( H_1: \mu_1 \neq \mu_2 \).

See that, under \( H_0 \), \( X-Y \sim N(0,2 \sigma^2) \)

So, \( \frac{X-Y}{\sqrt{2} \sigma} \sim N(0,1) \).

But have you noticed that \( \sigma \) is unknown? So this isn’t a statistic after all.

Can you replace \( \sigma \) by an appropriate quantity so that you can conduct the test?

Hint: What do you know about RSS? Does it estimate something?
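
For readers who want to check their answer numerically, here is one possible completion (not necessarily the only route): since the true weights add up to 1, \( X+Y-1=\epsilon_1+\epsilon_2 \sim N(0,2\sigma^2) \) whatever the weights are, and it is uncorrelated with, hence (being jointly normal) independent of, \( X-Y \); so under \( H_0 \) the statistic \( T=\frac{X-Y}{|X+Y-1|} \) follows a \( t \) distribution with 1 degree of freedom, and one may reject \( H_0 \) when \( |T| \) is large. The simulation sketch below, with an arbitrary error standard deviation, checks the size of this test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, trials = 0.03, 100000              # arbitrary error sd and number of replications
mu1 = mu2 = 0.5                           # under H_0 the true weights are equal

e1 = rng.normal(0, sigma, trials)
e2 = rng.normal(0, sigma, trials)
X, Y = mu1 + e1, mu2 + e2
T = (X - Y) / np.abs(X + Y - 1)           # proposed test statistic

# Under H_0, T behaves like a t_1 (standard Cauchy) variable.
crit = stats.t.ppf(0.975, df=1)           # two-sided 5% critical value
print(f"empirical rejection rate under H_0: {np.mean(np.abs(T) > crit):.4f} (target 0.05)")
```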

Food For Thought:

Okay, let’s move from cakes to doughnuts!!

Yeah, I know this is off topic and has nothing to do with statistics, but it’s good for the brain to switch cuisines once in a while!

This is the famous doughnut slicing problem:

What is the largest number of pieces you can slice a doughnut into using only 3 cuts? (Note that you can only make planar cuts and you are not allowed to rearrange the pieces between the cuts)

I would request you to try this on your own without looking up solutions directly.


CLT and Confidence Limits | ISI MStat 2016 PSB Problem 8

This is a problem from ISI MStat Examination 2016. This primarily tests the student’s knowledge in finding confidence intervals and using the Central Limit Theorem as a useful approximation tool.

The Problem:

Let \( (X_1,Y_1),(X_2,Y_2),…,(X_n,Y_n) \) be independent and identically distributed pairs of random variables with \( E(X_1)=E(Y_1) \) , \(\text{Var}(X_1)=\text{Var}(Y_1)=1 \), and \( \text{Cov}(X_1,Y_1)= \rho \in (-1,1) \)

(a) Show that there exists a function \( c(\rho) \) such that :

\(\lim_{n \rightarrow \infty} P(\sqrt{n}(\bar{X}-\bar{Y}) \le c(\rho) )=\Phi(1)\) , where \( \Phi \) is the cdf of \( N(0,1) \)

(b)Given \( \alpha>0 \) , obtain a statistic \(L_n \) which is a function of \( (X_1,Y_1),(X_2,Y_2),..,(X_n,Y_n) \) such that
\(\lim_{n \rightarrow \infty} P(L_n<\rho<1)=\alpha \) .

Prerequisites:

(a) The Central Limit Theorem and convergence of sequences of random variables

(b) The idea of pivots and how to obtain confidence intervals

Solution:

(a) See that \( E(\bar{X}-\bar{Y})=0 \).

See that \( \text{Var}(\bar{X}-\bar{Y})=\frac{\sum_{i=1}^{n} \text{Var}(X_i-Y_i)}{n^2} =\frac{n\{\text{Var}(X_1)+\text{Var}(Y_1)-2\text{Cov}(X_1,Y_1)\}}{n^2}=\frac{2(1-\rho)}{n} \).

Take \( c(\rho)=\sqrt{2(1-\rho)} \).

By the Central Limit Theorem,

see that \( P(\frac{\sqrt{n}(\bar{X}-\bar{Y})}{\sqrt{2(1-\rho)}} \le 1) \rightarrow \Phi(1) \) as \( n \rightarrow \infty \), which is exactly \( P(\sqrt{n}(\bar{X}-\bar{Y}) \le \sqrt{2(1-\rho)}) \rightarrow \Phi(1) \).

(b) Let \( Z_i = X_i – Y_i \)

\(Z_i\)’s are iid with \( E(Z_i)=0 \) and \(V(Z_i)=2(1- \rho) \)

Again, by CLT, \( \frac{\sqrt{n} \bar{Z}}{\sqrt{2-2\rho}} \stackrel{L}\longrightarrow N(0,1) \)

Use this as the pivot to obtain an asymptotic confidence interval for \( \rho \).

See that \( P(|\frac{\sqrt{n} \bar{Z}}{\sqrt{2-2\rho}}| \ge \tau_{\frac{\alpha}{2}}) \rightarrow \alpha \), where \( \tau_{\frac{\alpha}{2}} \) is the upper \( \frac{\alpha}{2} \) point of \( N(0,1) \). (We work with the two-sided event because squaring in the next step cannot keep track of the sign of \( \bar{Z} \).)

Equivalently, you can write \( P( \rho \ge 1- \frac{n \bar{Z}^2 }{2 \tau_{\frac{\alpha}{2}}^2} ) \rightarrow \alpha \) as \( n \rightarrow \infty \).

Thus, \( L_n=1- \frac{n \bar{Z}^2 }{2 \tau_{\frac{\alpha}{2}}^2} \) works, since \( \rho < 1 \) holds automatically.
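
A quick coverage check (assuming, just for simulation purposes, bivariate normal pairs with an arbitrary \( \rho \), sample size and number of replications) confirms that \( P(L_n < \rho < 1) \) is close to the target \( \alpha \) for large \( n \).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rho, n, alpha, trials = 0.4, 1000, 0.1, 4000     # arbitrary illustrative choices
tau = stats.norm.ppf(1 - alpha / 2)              # upper alpha/2 point of N(0,1)
cov = [[1.0, rho], [rho, 1.0]]

hits = 0
for _ in range(trials):
    XY = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    Z_bar = np.mean(XY[:, 0] - XY[:, 1])          # sample mean of Z_i = X_i - Y_i
    L_n = 1 - n * Z_bar**2 / (2 * tau**2)
    hits += (L_n < rho)                           # rho < 1 holds automatically
print(f"empirical P(L_n < rho < 1) ~ {hits / trials:.3f} (target alpha = {alpha})")
```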

Food For Thought:

Suppose \(X_1,..,X_n \) are equi-correlated with correlation coefficient \( \rho \).

Given \( E(X_i)= \mu \) and \( V(X_i) = \sigma_i ^2 \) for \( i=1,2,..,n \).

Can you see that \( \rho \ge -\frac{1}{\frac{(\sum \sigma_i)^2}{\sum \sigma_i ^2} -1 } \) ?

It’s pretty easy, right? Yeah, I know 😛. I am definitely looking forward to posting inequalities more often.

This may look trivial but it is a very important result which can be used in problems where you need a non-trivial lower bound for \( \rho \) .

Well, well, suppose now \( Y_i=X_i - \bar{X} \).

Can you show that a necessary and sufficient condition for the \( X_i \), \( i=1,2,..,n \), to have equal variances is that each \( Y_i \) is uncorrelated with \( \bar{X} \)?

Till then Bye!

Stay Safe.