Thursday, October 14, 2010

Inferential Statistics

Inferential statistics is a method of describing a population without measuring the entire population itself. Instead of observing each member of the population, a sample is drawn. Based upon the theory of probability, a measurement from the sample can then be used to describe the population. The measurement taken from the sample is called an estimator. Two critical things must be remembered: (1) the sample must be randomly drawn - that is, every member of the population has an equal chance of being picked - and (2) the estimator does not exactly match the true population value, so the chance for error must be accounted for.

Z-tests, t-tests, and confidence intervals are classic, common types of inferential statistics.

Hypothesis Testing

Hypothesis testing compares the data being studied to an observed characteristic of the population from which the data are sampled. This is a method of inferential statistics, and the data must be properly sampled in order for the results of testing to be valid.

The researcher has a proposed hypothesis about a population characteristic and conducts a study to discover whether it is reasonable. The proposed hypothesis is called the alternative hypothesis and is labelled Ha.

The observed characteristic is a value, such as a mean, proportion, or variance, that is accepted as true. This value is called a parameter. The null hypothesis states what the parameter is, and is labelled Ho.

The alternative hypothesis claims that the population characteristic is different from the observed parameter: the characteristic has either increased, decreased, or possibly changed in either direction.

The standard notation for these hypotheses is:

Ho: ε ≤ #
Ha: ε > # (an increase)

Ho: ε ≥ #
Ha: ε < # (a decrease)

Ho: ε = #
Ha: ε ≠ # (either increase or decrease)

- where ε represents the symbol for the parameter.
For example, a study of the mean value would show a μ symbol:
Ho: μ = #
Ha: μ ≠ #

The researcher will measure the sample's characteristic and use it to calculate a test statistic. There are a number of different test statistic formulas, depending upon what data are used and which parameter is being tested.

The test is based upon an assumed distribution of the population. The null hypothesis is stated under this assumption. The test statistic will have a certain likelihood of occurring, according to the distribution being used. When this likelihood is small, it indicates that either the sample is an unusual one, or the distribution of the population actually is different than assumed. If the sample is properly drawn, the risk that it is merely unusual is small, so it is safer to conclude that the assumed distribution is wrong and that the alternative hypothesis should be accepted instead. This conclusion leads the researcher to "reject" the null hypothesis.

The likelihood that is small "enough" to reject the null is a subjective threshold that the researcher chooses before the test is conducted. This likelihood is called alpha (α). Common practice sets alpha to .01, .05, or .10. Alpha also defines the rejection region - the corresponding area on the graph of the distribution.

An alternative approach to using alpha is to calculate a p-value, which is thought to bring more flexibility to the conclusion.

Rejecting the null hypothesis in error - that is, when the sample data are merely unusual and the underlying population actually fits the assumed distribution - is called Type I error. The probability of making this error is equal to the value of alpha (or to the p-value, whichever was used to draw the erroneous conclusion).
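As a minimal sketch of how a test statistic and its likelihood work together, here is a one-sample z-test in Python. The function name and example numbers are illustrative, not from the text, and the standard normal CDF is computed with the standard library's math.erf:

```python
import math

def z_test_p_value(sample_mean, mu0, sigma, n, tail="two"):
    """One-sample z-test: p-value for observing sample_mean under Ho: mu = mu0.

    Assumes the population standard deviation (sigma) is known and the
    sampling distribution of the mean is normal.
    """
    std_err = sigma / math.sqrt(n)          # standard error of the mean
    z = (sample_mean - mu0) / std_err       # the test statistic
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    if tail == "right":                     # Ha: mu > mu0 (an increase)
        return 1 - cdf
    if tail == "left":                      # Ha: mu < mu0 (a decrease)
        return cdf
    return 2 * (1 - cdf) if z > 0 else 2 * cdf  # Ha: mu != mu0

# Illustrative numbers: if the p-value falls below the chosen alpha,
# reject the null hypothesis.
p = z_test_p_value(sample_mean=52, mu0=50, sigma=5, n=25, tail="right")
```

Here z = (52 - 50)/(5/√25) = 2, and the right-tail p-value is about .023, which would lead to rejection at alpha = .05 but not at alpha = .01.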

Tuesday, September 28, 2010

Rejection Region

The rejection region is used in hypothesis testing. When a sample measurement, called a test statistic, falls into this area, the null hypothesis will be rejected. Please also see hypothesis testing.

The rejection region is also called the "tail" of the graph since the area is small and extends away from the main body of the graph.

See also: alpha.

Friday, September 24, 2010

Interquartile Range IQR

IQR stands for Interquartile Range. It's the difference (range) between the first quartile and the third. It is sometimes called the middle fifty. One-fourth of the values fall beneath the first quartile (Q1) and one-fourth lie above the third quartile (Q3). The numbers must be arranged in order before the IQR can be found.

Picture a line cut into four equal parts. There are three cuts to make four parts. The first cut is the first quartile and the third cut is the third quartile. The IQR will be Q3 - Q1.

For example: find the IQR of 145, 149, 158, 146, 159, 156, 154, 149, 158, 148

Step 1.
Order the numbers smallest to greatest.
>>> 145, 146, 148, 149, 149, 154, 156, 158, 158, 159

Step 2.
Find Q1's location using this equation:
(1/4)(n+1)
n is the number of values in the list. Here, n=10.
If you are imagining the numbers as a line, n+1 gives you the "length"
>>>2.75

Step 3.
Get Q1 using the answer found in step 2. There are three cases:
Case 1. If the answer to step 2 is a whole number, then Q1 is the number found at (1/4)(n+1) position in the ordered list. For example, if n=7, then (1/4)(n+1) = 2, therefore Q1 is the number found at position 2.
Case 2. If the answer to step 2 is a fractional half, then Q1 is the average of the two values found by rounding the position both down and up. For example: if n=29, then (1/4)(n+1) =7.5, therefore take the average of the values found at positions 7 and 8.
Case 3. If it’s neither a whole number nor a fractional half, then round to the nearest integer. For example: if n=14, then (1/4)(n+1)=3.75, therefore Q1 is the value found at position 4.
>>>148

Step 4.
Find Q3's location:
(3/4)(n+1)
>>>8.25

Step 5.
Using the answer to step 4, locate the value of Q3. Follow the same cases as described in step 3.
>>>158

Step 6.
Calculate the range:
Q3 - Q1
>>>10
>>>The IQR of 145, 149, 158, 146, 159, 156, 154, 149, 158, 148 is 10.
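The six steps above can be sketched in Python. The function names are illustrative, and the three quartile cases follow the rules given in step 3:

```python
def quartile_position(n, q):
    """Position of quartile q (1 or 3) in an ordered list of n values."""
    return (q / 4) * (n + 1)

def quartile(ordered, q):
    pos = quartile_position(len(ordered), q)
    lower = int(pos)                # position rounded down
    frac = pos - lower
    if frac == 0:                   # Case 1: whole number
        return ordered[lower - 1]   # positions in the text are 1-based
    if frac == 0.5:                 # Case 2: fractional half -> average neighbors
        return (ordered[lower - 1] + ordered[lower]) / 2
    return ordered[round(pos) - 1]  # Case 3: round to the nearest position

def iqr(values):
    ordered = sorted(values)        # Step 1: order the numbers
    return quartile(ordered, 3) - quartile(ordered, 1)

data = [145, 149, 158, 146, 159, 156, 154, 149, 158, 148]
```

With this data, Q1 lands at position 2.75 → 3 (value 148) and Q3 at 8.25 → 8 (value 158), matching the steps above.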

Thursday, September 16, 2010

Median

Median is the middle point in the data set. An equal number of items lie below and above this value.

The dataset must be ordered before the median can be determined.

The number of items (n) will determine where the median is located.

(n+1)/2 = median rank

For example:

1 , 1 , 2 , 4 , 6 , 2 , 9 , 3 , 7 , 5 , 2 , 5 , 9 , 6

Ordered:
1 , 1 , 2 , 2 , 2 , 3 , 4 , 5 , 5 , 6 , 6 , 7 , 9 , 9

Total count:
n=14

Median rank = (14+1)/2=7.5

Since this dataset has an even number of items (n=14), the median rank of 7.5 falls between the 7th and 8th positions. The 7th value is 4, and the 8th value is 5. The value in between is (4+5)/2=4.5. The median is 4.5. There are 7 items above and 7 below this value.

For datasets with an odd number of items, the median falls exactly on the median rank.

To illustrate:

1 , 1 , 2 , 4 , 6 , 2 , 9 , 3 , 7 , 5 , 2 , 5 , 9 , 6, 4

Ordered:
1 , 1 , 2 , 2 , 2 , 3 , 4 , 4 , 5 , 5 , 6 , 6 , 7 , 9 , 9

Median rank = (15+1)/2 = 8

The value in the 8th position is 4, so the median is 4. There are 7 items above and below it.
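The median-rank rule can be sketched in Python (the function name is illustrative):

```python
def median(values):
    """Median via the median-rank rule: position (n + 1) / 2 in the ordered list."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2              # 1-based position of the median
    if rank == int(rank):           # odd n: the rank lands on a value
        return ordered[int(rank) - 1]
    lower = int(rank)               # even n: average the two middle values
    return (ordered[lower - 1] + ordered[lower]) / 2

even = [1, 1, 2, 4, 6, 2, 9, 3, 7, 5, 2, 5, 9, 6]
odd = even + [4]
```

For the even list the rank is 7.5 and the result is 4.5; for the odd list the rank is 8 and the result is 4, matching both examples above.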

The mean, median, and mode are all measures of central tendency. The skew can be determined by comparing these three measures.

Mean

The mean is the average value in the dataset.

It is calculated by adding up the data values (x), then dividing by the number of items (n).

The mean of a sample is traditionally labelled x-bar. The mean of a population is labelled µ (mu).

sum(x)/n = x-bar

For example, find the mean of the following sample dataset:

10
12
1
16
10
11
13
6
15
6

sum(x) = 10+12+1+16+10+11+13+6+15+6 =100

n=10

x-bar = 100/10 = 10

The mean is 10.

It is also the "center" of the data - in the sense that the differences of each value from the mean will sum to zero. This is because the positive differences balance out the negative ones.

Check this, using the above example:

10 - 10 = 0
12 - 10 = 2
1 - 10 = -9
16 - 10 = 6
10 - 10 = 0
11 - 10 = 1
13 - 10 = 3
6 - 10 = -4
15 - 10 = 5
6 - 10 = -4


0 + 2 + -9 + 6 + 0 + 1 + 3 + -4 + 5 + -4 = 0
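The calculation, and the zero-sum check, can be sketched in Python:

```python
def mean(values):
    """Sample mean (x-bar): sum of the values divided by the count."""
    return sum(values) / len(values)

data = [10, 12, 1, 16, 10, 11, 13, 6, 15, 6]
x_bar = mean(data)                          # 100 / 10 = 10
deviations = [x - x_bar for x in data]      # differences from the mean
# The deviations sum to zero: the mean is the balance point of the data.
```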

The mean, median, and mode are all measures of central tendency. The skew can be determined by comparing these three measures.

Frequency

Frequency is the number of times an item is counted in a dataset.

For example:

In the following sample, the frequency of 2 is 3.

1 , 1 , 2 , 5 , 6 , 2 , 9 , 5 , 7 , 5 , 2 , 5 , 9 , 5

Count the number of times 2 appears.

Frequency is important for presenting data in tables or charts, and in calculating probability (relative frequency).

The value that occurs with the most frequency is the mode.

Tuesday, September 14, 2010

Mode

Mode is the value in a dataset that appears the most frequently.

For example:

In the following sample, the mode is 5

1
1
2
5
6
2
9
5
7
5
2
5
9
5

Count the number of times 5 appears. It appears the most, so it is the mode.

Some datasets have more than one mode.

If there is a single mode, the term 'unimodal' is used. The example above is unimodal: there are five 5's. Had there also been five 2's, the example would no longer be unimodal; both 5 and 2 would then be called modes.
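Counting frequencies and finding the mode (or modes) can be sketched in Python with the standard library's collections.Counter (the function name is illustrative):

```python
from collections import Counter

def modes(values):
    """All values tied for the highest frequency (one mode, or several)."""
    counts = Counter(values)               # frequency of each value
    top = max(counts.values())             # the highest frequency
    return sorted(v for v, c in counts.items() if c == top)

sample = [1, 1, 2, 5, 6, 2, 9, 5, 7, 5, 2, 5, 9, 5]
```

For this sample the result is [5]; for a dataset like [1, 1, 2, 2, 3] it would be [1, 2], the bimodal case described above.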

The mean, median, and mode are all measures of central tendency. The skew can be determined by comparing these three measures.

Sample Size

The size of a sample influences the cost of a study, as well as the usefulness of the results. A sample that is too small can exclude information. One that is too large is costly and cumbersome.

Often, researchers need to know the smallest sample that can be taken and yet still have estimates that are accurate.

Decision-makers first agree to the amount of error they will tolerate from the results. This is called the margin of error (E).

Along with margin of error, researchers also assign a critical value (C.V.) that is based upon the probability for extreme values in the population.

These two factors are combined with knowledge about the population's standard deviation (sigma) to reach a recommended sample size.

n = [(C.V. * sigma) / E]^2

In order to apply the Central Limit Theorem, the common rule of thumb is a minimum sample size of 30. However, if the population is bell-shaped, it can be smaller.

Friday, July 23, 2010

Margin of Error

Margin of Error (E) is the error that can be tolerated when estimating a value.

For confidence intervals, it is calculated as the critical value multiplied by the standard error -

E = Crit Val * Std Err

First, you look up the critical value from the probability table (t or z), then you calculate the standard error. Multiply these together.
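The calculation can be sketched in Python. The critical value, standard deviation, and sample size below are illustrative assumptions, not values from the text:

```python
import math

def margin_of_error(critical_value, std_dev, n):
    """E = critical value * standard error, where std err = sd / sqrt(n)."""
    std_err = std_dev / math.sqrt(n)
    return critical_value * std_err

# Hypothetical example: sd = 20, n = 100, 95% confidence (z = 1.96).
E = margin_of_error(1.96, 20, 100)   # 1.96 * 2 = 3.92
```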

Margin of Error tells you how much 'cushion' to place on your estimated value.

This cushion will be larger or smaller depending on the critical value that the researcher has chosen.

However, to determine sample size (n), the margin of error is chosen, not calculated.

For example, a buyer wants to know the sample size needed to estimate the average cost of shoes. He needs the estimate to be within ten dollars of the true population mean.

In this case, you will use E=10 in the formula for solving sample size.
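Plugging E=10 into the sample-size formula can be sketched in Python. The critical value of 1.96 (95% confidence) and a sigma of 40 dollars are illustrative assumptions, not from the text:

```python
import math

def sample_size(critical_value, sigma, E):
    """n = [(C.V. * sigma) / E]^2, rounded up to a whole item."""
    return math.ceil(((critical_value * sigma) / E) ** 2)

# The buyer's example: within ten dollars of the true mean, so E = 10.
n = sample_size(1.96, 40, 10)
```

Under these assumptions, ((1.96 * 40) / 10)^2 ≈ 61.5, so the buyer would sample 62 pairs of shoes.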

Alpha

Alpha is chosen by the researcher and represents the level of error that can be tolerated. Alpha is the probability of rejecting a correct null hypothesis. Its corresponding area on the graph is also referred to as the rejection region.

Alpha corresponds with a critical value. It is graphically defined as a 'tail' region - that is, the diminishing area under a bell-shaped curve, that extends either left of a negative critical value, or right of a positive critical value. See an image at: rejection region.

Assuming that a hypothesis is true, sample measurements are not expected to fall in this tail region, since it is a small area. When such a sample measurement does occur, it is unlikely, and therefore indicates that the hypothesis could be wrong. Researchers will reject a hypothesis if the measurement falls into this alpha region.

However, these unlikely values do occur. When the hypothesis is rejected due to an unlikely sample measurement, when in fact the hypothesis is true, this is called "Type I error."

Popular alpha values are .01, .05, and .10.

If an alpha value of .10 is used, then there is a 10% chance of committing a Type I error when the null hypothesis is actually true.

The terms type I error and alpha are sometimes used synonymously, depending on context.

Critical Value

A critical value (C.V.) is a number that is used to make estimates and test hypotheses. Critical values always correspond to a probability.

This number represents a distance from the center of a bell-shaped graph, either the z or t distribution. The area between the center and the C.V. represents the probability that corresponds to the C.V.

For example, using the z distribution, 47.5% of the area lies between the center and 1.96. When you also include the area between -1.96 and the center, the likelihood is doubled to 95%.
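This can be checked numerically in Python using the standard normal CDF, computed here with the standard library's math.erf:

```python
import math

def area_center_to(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * math.erf(z / math.sqrt(2))

# Between the center and 1.96 lies about 47.5% of the area;
# including -1.96 to the center doubles this to about 95%.
one_side = area_center_to(1.96)
both_sides = 2 * one_side
```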

Alpha and Confidence Level are probabilities that correspond to critical values.

Thursday, July 22, 2010

Variance

Variance represents how spread out the data are. It is the average of the squared differences from the mean.

The distances from the mean are calculated by subtracting each x from the mean. These distances are squared and then averaged to arrive at the variance.

Because the differences are squared, the result is in squared units - for example, if the measurements are "miles," then the variance is "miles^2". Therefore, the variance value does not intuitively describe the data. To overcome this, the square-root of the variance is taken. The square-root of variance is called standard deviation.

Here is an example data set:

miles driven (x): 43, 70, 27, 36
n = 4
mean = 44 miles

differences
43 - 44 = -1
70 - 44 = 26
27 -44 = -17
36 - 44 = -8

differences^2
(-1)^2 = 1
(26)^2 = 676
(-17)^2 = 289
(-8)^2 = 64

The average of the differences^2:
(1+676+289+64)/4 = 257.5

The variance is:
257.5 miles^2
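The example can be sketched in Python. Note that this divides by n, the population form used in this entry; a sample variance would divide by n - 1:

```python
def variance(values):
    """Population variance: the average of the squared differences from the mean."""
    m = sum(values) / len(values)                       # the mean (44 here)
    return sum((x - m) ** 2 for x in values) / len(values)

miles = [43, 70, 27, 36]
```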

Coefficient of variation

The coefficient of variation (c.v.) is a measure of dispersion that allows comparison between groups that are measured in different units. It is calculated as the ratio of the standard deviation (SD) to the mean and is always expressed as a percent.

c.v. = (SD/mean)*100%

The c.v. can only be used when the data collected are ratio variables.
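A sketch in Python, reusing the miles example from the variance entry (the population form of the standard deviation is assumed):

```python
import math

def coefficient_of_variation(values):
    """c.v. = (standard deviation / mean) * 100%."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / len(values))
    return (sd / m) * 100

miles = [43, 70, 27, 36]   # sd ~= 16.05, mean = 44, c.v. ~= 36.5%
```

Because the result is a percent, datasets measured in miles and dollars, say, can be compared directly.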

Standard Deviation

The standard deviation is a measure of dispersion, or how spread out the data are. Each value in the data lies a distance from the sample mean (x minus x-bar). These distances are averaged in order to give a general sense of how the values tend to vary.

The sample mean is the center of the data, where the differences below the mean balance those above. So the sum of the differences will equal zero. Therefore, the differences must be squared before they are averaged; squaring removes the negative signs. This average of the squared differences is called variance.

Variance is then square-rooted. This result is the standard deviation.

Degrees of Freedom

Degrees of freedom (df) equals the sample size minus the number of estimated parameters. Commonly only one parameter is estimated - the mean - as when computing the sample standard deviation.

df = n - k, where k is the number of parameters being estimated.

df = n - 1, with k = 1, for a single population where the standard deviation is computed using the estimated mean.