Skews and Tails in Histograms and Normal Probability Plots or QQ-plots






NOTE. Here “normal plot” is short for normal probability plot or normal qq-plot. Comparison relative to “normal” means relative to the normal distribution or to a normal random variable.


One focus of section meetings today (Stat 104, 2000-10-25) was the interpretation of normal probability plots in terms of right/positive and left/negative skew, or thin/light and fat/heavy tails. My chalkboard normal plots showed some misleading scatter. A normal plot should never decrease from left to right, or bottom to top, because the data are sorted before plotting.
Standard Normal “Data”

The first set of figures may review some important points. Top to bottom they feature “data” that is 10, 100, or 1000 standard normal random variables N(0,1) generated by computer. (Note the changes in scale between rows and changes in orientation between columns. Left and right in histograms correspond to bottom and top in normal plots because the vertical axis represents the data values with normal z-scores on the horizontal axis.) The set of 10 normals has a right skew distribution that you should recognize in both the histogram and the normal plot. Why not a more normal distribution? Bad luck. A sample of ten is too small to reveal the normal pattern reliably.
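The construction of a normal plot can be sketched in a few lines. This is only an illustration, not the software used for the figures: the plotting positions (i - 0.5)/n are one common convention assumed here, and the function name `normal_plot_points` and the seed are made up for the example.

```python
import random
from statistics import NormalDist

def normal_plot_points(data):
    """Return (z, x) pairs: normal z-scores (horizontal) and sorted data (vertical)."""
    xs = sorted(data)                  # sorting is why the plot can never decrease
    n = len(xs)
    std_normal = NormalDist()          # the standard normal N(0,1)
    # Assumed convention: the i-th smallest value is paired with the
    # normal quantile at probability (i - 0.5) / n.
    zs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(zs, xs))

rng = random.Random(104)               # arbitrary seed, only for reproducibility
samples = {n: [rng.gauss(0, 1) for _ in range(n)] for n in (10, 100, 1000)}
plots = {n: normal_plot_points(data) for n, data in samples.items()}
```

Plotting the pairs for n = 10, 100, and 1000 should show the points settling onto a straight line as the sample grows, with the smallest sample the most likely to look skewed by chance.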





Random digits 0 to 9

The next set of figures features old friends: random digits 0 to 9 with equal probability 1/10, such as the last digits of phone numbers. One random digit has a uniform distribution on {0,1,2,3,4,5,6,7,8,9}, of course (top row). Even so, in 1000 trials the most frequent digit (the mode) occurred more than 130 times. The sums of two and three random digits (middle and bottom) have “mound-shaped” distributions (probabilities first increasing, then decreasing) with light tails relative to normal. The S-shape of the normal plot indicates light tails. Although the domain of the normal distribution is the entire real line and the sums of random digits are bounded, even three random digits are enough that their sum (with range 0 to 27) has a distribution with tails only slightly lighter than normal.
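The claim about sums of digits can be checked exactly rather than by simulation. The sketch below builds the distribution of the sum of k digits by convolution and reports its excess kurtosis. Using excess kurtosis as a tail measure is an assumption of this example, not something the handout relies on: negative values indicate tails lighter than normal, and zero matches the normal.

```python
from fractions import Fraction

def digit_sum_pmf(k):
    """Exact pmf of the sum of k independent random digits 0..9, by convolution."""
    pmf = {0: Fraction(1)}
    for _ in range(k):
        nxt = {}
        for s, p in pmf.items():
            for d in range(10):
                nxt[s + d] = nxt.get(s + d, Fraction(0)) + p / 10
        pmf = nxt
    return pmf

def excess_kurtosis(pmf):
    """Fourth standardized moment minus 3 (zero for a normal distribution)."""
    mean = sum(s * p for s, p in pmf.items())
    var = sum((s - mean) ** 2 * p for s, p in pmf.items())
    m4 = sum((s - mean) ** 4 * p for s, p in pmf.items())
    return float(m4 / var ** 2) - 3.0

for k in (1, 2, 3):
    pmf = digit_sum_pmf(k)
    print(k, min(pmf), max(pmf), round(excess_kurtosis(pmf), 3))
```

The excess kurtosis is negative in all three cases and moves toward zero as digits are added, which agrees with tails that are light but approaching normal.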



NOTE. There are too many points in the last bins of the histograms, rows two and three, because the “binning” is asymmetric. For example, the first bin in row two is {0,1} and the last is {16,17,18}.


Skews and tails: classic cases


The standard uniform random variable U[0,1] is similar to the random digit, but it is continuous rather than discrete; its value is any real number between 0 and 1. It doesn’t matter whether the endpoints are included because they never occur: the probability of every particular real number is zero! The S-shape of the normal plot indicates two light tails. Every uniform distribution has light tails: no tails at all beyond its bounds. Relative to normal, the data values are “piled up” rather than “spread out” at both low and high magnitudes (bottom and top on the vertical axis).
Student’s t distribution t(df) is used in estimating confidence intervals for one-variable means and for linear regressions (coming soon in Stat 104). As “degrees of freedom” df increases, t approaches normal. Probably t is the most important distribution with heavy tails, indicated by the inverted-S shape of the normal plot.
A classic skew distribution includes a light tail in one direction and a heavy tail in the other, the direction of the skew. The standard exponential random variable Expo(1) has right skew distribution. Like the uniform, the exponential has no left tail at all. Relative to normal the data values are “piled up” at low magnitudes (left, bottom), spread out at high magnitudes (right, top). The exponential is single-peaked or unimodal but strictly decreasing rather than “mound-shaped”, not remotely normal.

Now for some distributions that may not come up much outside of computer-generated illustrations or chalkboard mathematics.
The Laplace distribution is exponential on both sides of zero. Take the exponential and toss a coin on each trial to make the value positive or negative. The distribution has two moderately heavy tails, indicated by the moderate inverted-S shape of the normal plot.
“Exponential squared on both sides of zero” is a bimodal distribution and it has two moderately light tails indicated by moderate S-shape in the normal plot.
The Cauchy distribution includes some exceptionally large negative and positive magnitudes relative to normal, so many and so large that its standard deviation is infinite! (Note the scale. In 1000 trials this Cauchy random variable surpassed -50 and surpassed +40 four times, whereas a normal random variable practically never surpasses 4.) Cauchy has extremely heavy tails and a strong inverted-S shape in the normal plot.
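The light/heavy classification of these cases can be illustrated without any random sampling by working from the quantile functions. The tail ratio below is a hypothetical index invented for this sketch (it is not a standard statistic and is not used in the handout): the distance from the median to the 99% quantile divided by the distance to the 75% quantile, computed on each side and compared with the same ratio for the normal. The “exponential squared” case is left out because its exact definition is not given here.

```python
import math
from statistics import NormalDist

# Quantile functions (inverse CDFs) of the standard forms of each distribution.
QUANTILE = {
    "normal": NormalDist().inv_cdf,
    "uniform": lambda p: p,
    "exponential": lambda p: -math.log(1.0 - p),
    "laplace": lambda p: math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p)),
    "cauchy": lambda p: math.tan(math.pi * (p - 0.5)),
}

def tail_ratios(q, far=0.99, near=0.75):
    """Hypothetical tail index: (median to far quantile) / (median to near quantile),
    returned as (left, right)."""
    med = q(0.5)
    right = (q(far) - med) / (q(near) - med)
    left = (med - q(1 - far)) / (med - q(1 - near))
    return left, right

def label(ratio, reference, tol=1e-9):
    """Compare a tail ratio with the normal reference value."""
    if ratio > reference + tol:
        return "heavy"
    if ratio < reference - tol:
        return "light"
    return "normal"

normal_ratio = tail_ratios(QUANTILE["normal"])[1]
for name, q in QUANTILE.items():
    left, right = tail_ratios(q)
    print(f"{name:12s} left: {label(left, normal_ratio):6s} right: {label(right, normal_ratio)}")
```

Under this index the uniform comes out light on both sides, the exponential light on the left and heavy on the right, and the Laplace and Cauchy heavy on both sides, with the Cauchy ratio far larger than the Laplace ratio.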


Binomial


The binomial random variable Binomial(n, p) takes only integer values that represent the count of “successes” in n independent “success/failure” trials with success probability p (top). Although those values are discrete, the binomial distribution is roughly normal if the expected numbers of successes n*p and failures n*(1-p) are both at least 10 (row 2 illustrating hw5.3, Mendel’s peas). Otherwise the binomial is a right skew distribution like the exponential if “success” is rare (row 3) or a left skew if “failure” is rare (bottom).
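The rule of thumb can be written as a small check. This is a sketch only: the function name `binomial_shape` and the example values of n and p are made up for illustration and are not the values behind the figures. The skewness formula (1 - 2p)/sqrt(np(1-p)) is the standard one for the binomial.

```python
import math

def binomial_shape(n, p, threshold=10):
    """Apply the 'at least 10 expected successes and failures' rule and
    report the exact binomial skewness alongside the verdict."""
    successes, failures = n * p, n * (1 - p)
    skew = (1 - 2 * p) / math.sqrt(n * p * (1 - p))
    if successes >= threshold and failures >= threshold:
        verdict = "roughly normal"
    elif skew > 0:
        verdict = "right skew"        # "success" is rare
    elif skew < 0:
        verdict = "left skew"         # "failure" is rare
    else:
        verdict = "symmetric, but too few trials for the rule"
    return verdict, skew

# Illustrative values only.
for n, p in [(100, 0.5), (100, 0.02), (100, 0.98)]:
    print(n, p, binomial_shape(n, p))
```

Values of n and p that put an expected count exactly at the threshold may be affected by floating-point rounding, so borderline cases are best checked by hand.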



T or Student’s t

The most important heavy-tail distribution must be the t distribution t(df) which is used in estimating confidence intervals for one-variable means and linear regressions. T approaches normal as its “degrees of freedom” df increases. These figures show df=2 to 5 to 10 from top to bottom; only the vertical scales differ from left to right. The normal plot for a normal distribution would follow the “45-degree” lines (dotted) and DataDesk would add the regression lines (dashed). For the top figure at left and all three figures at right, some of the simulated data is out of bounds above or below.


When degrees of freedom df increase to infinity, the limit of Student’s t distribution t(df) is normal. The approach to normal is rapid with t(5) already closer to normal than to t(2) and with t(30) practically indistinguishable from normal.
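One way to see the approach to normal numerically is to compare the chance of a value beyond 3 in magnitude under t(2), t(5), t(30), and the standard normal. The sketch below uses only the density formulas and Simpson's rule. The cut-off of 3 and the step count are arbitrary choices for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal N(0,1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def two_sided_tail(pdf, cut=3.0, steps=2000):
    """P(|X| > cut), computed as 1 minus a Simpson's-rule integral over [-cut, cut].
    steps must be even."""
    h = 2 * cut / steps
    total = pdf(-cut) + pdf(cut)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(-cut + i * h)
    return 1 - total * h / 3

for df in (2, 5, 30):
    print(f"t({df}): {two_sided_tail(lambda x, d=df: t_pdf(x, d)):.4f}")
print(f"normal: {two_sided_tail(normal_pdf):.4f}")
```

The tail probability shrinks steadily as df grows and is bounded below by the normal value.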


Extreme values are common in the “heavy tail” t distribution relative to the standard normal (black). Extreme values are also relatively common in a normal distribution with twice the variance (dashed).



Heavy tails, light tails

A “heavy tail” random variable takes extreme values, both low and high in a classic case, more frequently than does a normal random variable. Heavy tails imply high variance but a heavy tail distribution is not normal with high variance; beyond some point its extreme values are more common even than in a normal distribution with the same variance. (The next figure shows T and Normal distributions with mean 0 and variance 2. “Beyond some point” is beyond about 3.7.)
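The “about 3.7” figure can be reproduced under one assumption. The degrees of freedom are not stated, but a t distribution has variance df/(df - 2), so variance 2 suggests df = 4. The sketch below assumes that value and finds, by bisection, the point where the t(4) density rises above the density of a normal with mean 0 and variance 2.

```python
import math

DF = 4            # assumption: Var[t(df)] = df / (df - 2) = 2 gives df = 4
VARIANCE = 2.0    # the normal comparison has mean 0 and this variance

def t_pdf(x, df=DF):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x, var=VARIANCE):
    """Density of a normal with mean 0 and variance var."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def crossover(lo=2.0, hi=6.0, iters=60):
    """Bisection on t_pdf - normal_pdf, which is negative at lo and positive at hi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if t_pdf(mid) - normal_pdf(mid) < 0:
            lo = mid      # normal density still higher: the crossing lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

x_star = crossover()
print(round(x_star, 2))
```

Under these assumptions the crossing falls between 3.7 and 3.8, in line with the figure quoted in the text. By symmetry the same crossing occurs on the negative side.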


Similarly a “light tail” random variable such as a uniform (pages 2-3) takes extreme low and high values less frequently than does a normal random variable. Light tails imply low variance but “light tail” extreme values are less common even than in a normal distribution with the same variance. (Uniform random variables are bounded. Beyond some point they have no tails at all.)


