# statistics

showing a relationship: observational studies and surveys can show that a relationship exists between variables

causation: a controlled experiment is required to show causation; a survey is used to analyze the construct

• Median is robust; mode is another measure of the center.
• Cut the tails (lower 25% and upper 25%) to get the IQR; max − min = range.
• Boxplots show the IQR and how spread out the data are.
• Use a small bin size in a histogram to show as much detail as possible.

probability density function: probabilities only quantify likelihood – we’re never 100% sure

central limit theorem – the distribution of sample means is approximately normal.

95% confidence interval for the mean

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.

If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.
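The repeated-sampling interpretation above can be checked with a quick simulation sketch; the population parameters below are hypothetical and only Python's standard library is used:

```python
import random
from statistics import NormalDist, mean

# Hypothetical population: mean 0, sigma 1 (sigma treated as known).
rng = random.Random(0)           # fixed seed so the run is reproducible
z = NormalDist().inv_cdf(0.975)  # two-tailed 95% multiplier, ~1.96

mu, sigma, n = 0.0, 1.0, 30
trials = 2000
covered = 0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    m = mean(sample)
    half_width = z * sigma / n ** 0.5
    # does this interval contain the true population mean?
    if m - half_width <= mu <= m + half_width:
        covered += 1

coverage = covered / trials  # close to 0.95 by construction
```

Roughly 95% of the 2000 intervals should cover the true mean, which is exactly what the confidence level promises.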

The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter (see precision). A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter.

Confidence intervals are more informative than the simple results of hypothesis tests (where we decide “reject H0” or “don’t reject H0”) since they provide a range of plausible values for the unknown parameter.

A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. These intervals may be calculated by, for example, a producer who wishes to estimate his mean daily output; a medical researcher who wishes to estimate the mean response by patients to a new drug; etc.

margin of error

95% of sample means fall within 1.96 standard errors from the population mean.

98% of sample means fall within 2.33 standard errors from the population mean.
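Both multipliers can be recovered from the standard normal distribution, and the corresponding interval computed, with Python's standard library (the sample numbers below are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

# z multipliers for common two-tailed confidence levels
z95 = NormalDist().inv_cdf(0.975)  # ~1.96
z98 = NormalDist().inv_cdf(0.99)   # ~2.33

def confidence_interval(sample_mean, sigma, n, level=0.95):
    """CI for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sigma / sqrt(n)   # margin of error
    return sample_mean - margin, sample_mean + margin

# hypothetical IQ-style numbers: mean 100, sigma 15, n = 36
lo, hi = confidence_interval(100, 15, 36, level=0.95)
```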

levels of likelihood

critical region

if the sample mean falls into the critical region, we can conclude that most likely we did not get this sample mean by chance.

the critical region defines unlikely values if the null hypothesis is true.

z-critical value

when we do a statistical test, we set up our own criteria to make a decision

two-tailed test

t-test

we reject the null hypothesis when the p-value is less than the α value.

cohen’s d:

standardized mean difference that measures the distance between means in standardized units
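A minimal sketch of Cohen's d using the pooled standard deviation (the two samples below are hypothetical):

```python
from math import sqrt

def cohens_d(sample1, sample2):
    """Standardized mean difference: (m1 - m2) / pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # unbiased sample variances (divide by n - 1)
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    pooled_sd = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

d = cohens_d([5, 6, 7, 8], [3, 4, 5, 6])  # means differ by 2
```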

margin of error

dependent t-test for paired samples:

the same subject takes the test twice.

within-subject:

• two conditions: each subject is assigned both conditions in random order
• pre-test, post-test
• growth over time- longitudinal study
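A dependent t-test works on the per-subject differences. Here is a from-scratch sketch with hypothetical pre/post scores; the result is compared to a t-critical value with n − 1 degrees of freedom:

```python
from math import sqrt

def paired_t(pre, post):
    """t statistic for a dependent (paired-samples) t-test."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # unbiased variance of the differences
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    se = sqrt(var_d / n)            # standard error of the mean difference
    return mean_d / se              # compare to t-critical with n - 1 df

# hypothetical scores for four subjects measured twice
t = paired_t(pre=[10, 12, 9, 11], post=[13, 14, 11, 15])
```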

statistical significance

• reject the null
• results are not likely due to chance – sampling error

“statistically significant” finding

When a statistic is significant, it simply means that you are very sure that the statistic is reliable. It doesn’t mean the finding is important or that it has any decision-making utility.

To say that a significant difference or relationship exists only tells half the story. We might be very sure that a relationship exists, but is it a strong, moderate, or weak relationship?

After finding a significant relationship, it is important to evaluate its strength. Significant relationships can be strong or weak. Significant differences can be large or small. It just depends on your sample size.

One-Tailed and Two-Tailed Significance Tests

When your research hypothesis states the direction of the difference or relationship, then you use a one-tailed probability. For example, a one-tailed test would be used to test this null hypothesis: Females will not score significantly higher than males on an IQ test.

A two-tailed test would be used to test this null hypothesis: There will be no significant difference in IQ scores between males and females.

Procedure Used to Test for Significance

Whenever we perform a significance test, it involves comparing a test value that we have calculated to some critical value for the statistic. It doesn’t matter what type of statistic we are calculating (e.g., a t-statistic, a chi-square statistic, an F-statistic, etc.), the procedure to test for significance is the same.

1. Decide on the critical alpha level you will use (i.e., the error rate you are willing to accept).
2. Conduct the research.
3. Calculate the statistic.
4. Compare the statistic to a critical value obtained from a table.

If your statistic is higher than the critical value from the table:

• You reject the null hypothesis.
• The probability is small that the difference or relationship happened by chance, and p is less than the critical alpha level (p < alpha ).
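Step 4 of the procedure reduces to a simple comparison, sketched here with hypothetical values (1.96 is the two-tailed z-critical at alpha = 0.05):

```python
def significance_decision(test_statistic, critical_value):
    """Compare the calculated statistic to the critical value from a table."""
    if abs(test_statistic) > critical_value:
        return "Reject H0"       # p < alpha
    return "Do not reject H0"    # p >= alpha

decision = significance_decision(test_statistic=2.5, critical_value=1.96)
```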

via http://www.statpac.com/surveys/statistical-significance.htm

The formula for calculating margin of error is written for two-tailed tests: it uses a single t-critical value, not the distance between the positive and negative critical values (t-critical positive minus t-critical negative). So when we calculate a t-critical for a one-tailed test, how can we reuse the same margin-of-error formula and avoid memorizing another one? We pretend the test is two-tailed. This works because a one-tailed test at a given alpha has the same critical value as a two-tailed test at twice that alpha: the critical area split across both tails of the two-tailed test equals the single-tail critical area of the one-tailed test. For example, if we take -1.711 as our t-critical for calculating the margin of error, that is the same as running a two-tailed test whose total alpha is 0.05 × 2 = 0.1 (0.05 in each tail). So, to keep the critical area of interest at 0.05, we treat the one-tailed test as a two-tailed test and plug its t-critical into the usual margin-of-error formula. Does that make sense?
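This tail-area bookkeeping can be checked numerically. The sketch below uses the standard normal distribution from Python's standard library as a stand-in for the t distribution; the same relationship holds for t-critical values at any fixed degrees of freedom:

```python
from statistics import NormalDist

nd = NormalDist()
alpha = 0.05

# one-tailed: all of alpha sits in a single tail
z_one = nd.inv_cdf(1 - alpha)        # ~1.645

# two-tailed: alpha is split across both tails
z_two = nd.inv_cdf(1 - alpha / 2)    # ~1.960

# a two-tailed test at doubled alpha (0.10) puts 0.05 in each tail,
# so its critical value matches the one-tailed test at alpha = 0.05
z_two_doubled = nd.inv_cdf(1 - (2 * alpha) / 2)
```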

when do we use a t-test rather than a z-test?

In practice, a t-test is used when the population standard deviation is unknown and must be estimated from the sample, which matters most for small samples; a z-test assumes the population standard deviation is known (or the sample is large). Beyond that, the z-test and t-test are basically the same: they compare two means to suggest whether both samples come from the same population. There are, however, variations on the theme for the t-test. If you have a sample and wish to compare it with a known mean (e.g. a national average), the single-sample t-test is available. If your two samples are not independent of each other and have some factor in common, i.e. geographical location or before/after treatment, the paired-samples t-test can be applied. There are also two variations on the two-sample t-test: the first uses samples that do not have equal variances and the second uses samples whose variances are equal.
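As a sketch of the single-sample case mentioned above, here is a one-sample t statistic computed from scratch; the sample values and the known mean of 50 are hypothetical:

```python
from math import sqrt

def one_sample_t(sample, known_mean):
    """Single-sample t: compare a sample mean with a known population mean."""
    n = len(sample)
    m = sum(sample) / n
    # unbiased sample variance, since sigma is unknown
    var = sum((x - m) ** 2 for x in sample) / (n - 1)
    return (m - known_mean) / sqrt(var / n)  # compare to t-critical, n - 1 df

t = one_sample_t([48, 51, 49, 50, 47], known_mean=50)
```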

Posted in user study

# hypothesis test

for example, claiming that a new drug is better than the current drug for treatment of the same symptoms.

In each problem considered, the question of interest is simplified into two competing claims / hypotheses between which we have a choice; the null hypothesis, denoted H0, against the alternative hypothesis, denoted H1.

These two competing claims / hypotheses are not however treated on an equal basis: special consideration is given to the null hypothesis.

We have two common situations:

1. The experiment has been carried out in an attempt to disprove or reject a particular hypothesis, the null hypothesis, thus we give that one priority so it cannot be rejected unless the evidence against it is sufficiently strong. For example,
H0: there is no difference in taste between coke and diet coke
against
H1: there is a difference.
2. If one of the two hypotheses is ‘simpler’ we give it priority so that a more ‘complicated’ theory is not adopted unless there is sufficient evidence against the simpler one. For example, it is ‘simpler’ to claim that there is no difference in flavour between coke and diet coke than it is to say that there is a difference.

The hypotheses are often statements about population parameters like expected value and variance; for example H0 might be that the expected value of the height of ten year old boys in the Scottish population is not different from that of ten year old girls. A hypothesis might also be a statement about the distributional form of a characteristic of interest, for example that the height of ten year old boys is normally distributed within the Scottish population.

The outcome of a hypothesis test is “Reject H0 in favour of H1” or “Do not reject H0”.

Null Hypothesis

We give special consideration to the null hypothesis. This is due to the fact that the null hypothesis relates to the statement being tested, whereas the alternative hypothesis relates to the statement to be accepted if / when the null is rejected.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We either “Reject H0 in favour of H1” or “Do not reject H0”; we never conclude “Reject H1”, or even “Accept H1”.

If we conclude “Do not reject H0”, this does not necessarily mean that the null hypothesis is true, it only suggests that there is not sufficient evidence against H0 in favour of H1. Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.

Type I Error

In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected.

Type II Error

In a hypothesis test, a type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e. H0: there is no difference between the two drugs on average.

Critical Value(s)

The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected.

The critical value for any hypothesis test depends on the significance level at which the test is carried out, and whether the test is one-sided or two-sided.

Critical Region

The critical region CR, or rejection region RR, is a set of values of the test statistic for which the null hypothesis is rejected in a hypothesis test. That is, the sample space for the test statistic is partitioned into two regions; one region (the critical region) will lead us to reject the null hypothesis H0, the other will not. So, if the observed value of the test statistic is a member of the critical region, we conclude “Reject H0”; if it is not a member of the critical region then we conclude “Do not reject H0”.

Significance Level

The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0, if it is in fact true.

It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims.

The significance level is usually denoted by α: Significance Level = P(type I error) = α. Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).

P-Value

The probability value (p-value) of a statistical hypothesis test is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis H0, is true.

Equivalently, it is the smallest significance level at which the observed result would lead to rejecting the null hypothesis.

The p-value is compared with the actual significance level of our test and, if it is smaller, the result is significant. That is, if the null hypothesis were to be rejected at the 5% significance level, this would be reported as “p < 0.05”.

Small p-values suggest that the null hypothesis is unlikely to be true. The smaller it is, the more convincing is the rejection of the null hypothesis. It indicates the strength of evidence for, say, rejecting the null hypothesis H0, rather than simply concluding “Reject H0” or “Do not reject H0”.

One-sided Test

A one-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.

In other words, the critical region for a one-sided test is the set of values less than the critical value of the test, or the set of values greater than the critical value of the test.

A one-sided test is also referred to as a one-tailed test of significance.

The choice between a one-sided and a two-sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test.

Example

Suppose we wanted to test a manufacturer’s claim that there are, on average, 50 matches in a box. We could set up the following hypotheses:
H0: µ = 50,
against
H1: µ < 50 or H1: µ > 50
Either of these two alternative hypotheses would lead to a one-sided test. Presumably, we would want to test the null hypothesis against the first alternative hypothesis since it would be useful to know if there are likely to be fewer than 50 matches, on average, in a box (no one would complain if they get the correct number of matches in a box or more).
Yet another alternative hypothesis could be tested against the same null, leading this time to a two-sided test:
H0: µ = 50,
against
H1: µ not equal to 50
Here, nothing specific can be said about the average number of matches in a box; only that, if we could reject the null hypothesis in our test, we would know that the average number of matches in a box is likely to be less than or greater than 50.
Two-Sided Test

A two-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution.

In other words, the critical region for a two-sided test is the set of values less than a first critical value of the test and the set of values greater than a second critical value of the test.

A two-sided test is also referred to as a two-tailed test of significance.

The choice between a one-sided test and a two-sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test.

Example

Consider again the manufacturer’s claim that there are, on average, 50 matches in a box. The two-sided test sets up
H0: µ = 50,
against
H1: µ not equal to 50
Here, nothing specific can be said about the average number of matches in a box; only that, if we could reject the null hypothesis in our test, we would know that the average number of matches in a box is likely to be less than or greater than 50.
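The matches example can be sketched as a two-sided z-test, assuming (hypothetically) that the population standard deviation is known; the sample numbers below are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test for H0: mu = mu0 (population sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p_value, p_value < alpha

# hypothetical data: 36 boxes averaging 49 matches, sigma = 2
z, p, reject = two_sided_z_test(49, 50, 2, 36)
```

Here the sample mean is far enough below 50 that H0: µ = 50 is rejected, without the test saying in advance which direction the departure would take.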

via http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html#h0


# Latin square

A Latin square is an n × n array filled with n different symbols, each occurring exactly once in each row and exactly once in each column. Here is an example:

    A B C
    C A B
    B C A
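The defining property (each symbol exactly once per row and per column) is easy to verify mechanically; here is a small sketch:

```python
def is_latin_square(square):
    """Check that each symbol occurs exactly once per row and per column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all(set(row[j] for row in square) == symbols for j in range(n))
    return rows_ok and cols_ok

square = [["A", "B", "C"],
          ["C", "A", "B"],
          ["B", "C", "A"]]
ok = is_latin_square(square)
```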

For the past three decades, Latin Squares techniques have been widely used in many statistical applications. Much effort has been devoted to Latin Square Design. In this paper, I introduce the mathematical properties of Latin squares and the application of Latin squares in experimental design. Some examples and SAS codes are provided that illustrate these methods.

——Lei Gao, 2005

http://www.mth.msu.edu/~jhall/classes/MTH880-05/Projects/latin.pdf

## The Latin Square design

The Latin square design is used where the researcher desires to control the variation in an experiment that is related to rows and columns in the field.

Field marks:

• Treatments are assigned at random within rows and columns, with each treatment once per row and once per column.
• There are equal numbers of rows, columns, and treatments.
• Useful where the experimenter desires to control variation in two different directions

The square above is just one of many that you could create. In fact, you can make any size square you want, for any number of treatments – it just needs to have the following property: each treatment occurs exactly once in each row and once in each column.

Note that a Latin square is an incomplete design, which means that it does not include observations for all possible combinations of i, j, and k. This is why we use the notation k = d(i, j). Once we know the row and column of the design, the treatment is specified. In other words, if we know i and j, then k is specified by the Latin square design.

This property affects how we calculate means and sums of squares, and for this reason we cannot use the balanced ANOVA command in Minitab even though the design looks perfectly balanced. We will see later that although it has the property of orthogonality, you still cannot use the balanced ANOVA command in Minitab because the design is not complete.

The randomization procedure for assigning treatments that you would like to use when you actually apply a Latin Square, is somewhat restricted to preserve the structure of the Latin Square. The ideal randomization would be to select a square from the set of all possible Latin squares of the specified size.  However, a more practical randomization scheme would be to select a standardized Latin square at random (these are tabulated) and then:

1. randomly permute the columns,
2. randomly permute the rows, and then
3. assign the treatments to the Latin letters in a random fashion.
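The three randomization steps above can be sketched as follows; the treatment names are hypothetical:

```python
import random

def randomize_latin_square(standard_square, treatments, seed=None):
    """Randomize a standard Latin square: permute rows, permute columns,
    then assign treatments to the Latin letters at random."""
    rng = random.Random(seed)
    square = [row[:] for row in standard_square]
    rng.shuffle(square)                              # 1. permute the rows
    cols = list(range(len(square)))
    rng.shuffle(cols)                                # 2. permute the columns
    square = [[row[j] for j in cols] for row in square]
    letters = sorted({cell for row in square for cell in row})
    mapping = dict(zip(letters, rng.sample(treatments, len(letters))))
    return [[mapping[cell] for cell in row] for row in square]  # 3. relabel

standard = [["A", "B", "C"],
            ["B", "C", "A"],
            ["C", "A", "B"]]
randomized = randomize_latin_square(standard, ["drug1", "drug2", "drug3"], seed=1)
```

Permuting rows and columns and relabeling the letters all preserve the Latin-square property, so the result is still a valid design.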

via https://onlinecourses.science.psu.edu/stat503/node/21

Posted in research, user study

# zz – Within-Subjects Designs

A within-subjects design is an experiment in which the same group of subjects serves in more than one treatment. Note that I’m using the word “treatment” to refer to levels of the independent variable, rather than “group”. It’s probably always better to use the word “treatment”, as opposed to “group”, because the term “group” can be very misleading when you are using a within-subjects design: the same “group” of people is often in more than one treatment.

As an example of a within-subjects design, let’s say that we are interested in the effect of different types of exercise on memory. We decide to use two treatments, aerobic exercise and anaerobic exercise. In the aerobic condition we will have participants run in place for five minutes, after which they will take a memory test. In the anaerobic condition we will have them lift weights for five minutes, after which they will take a different memory test of equivalent difficulty. Since we are using a within-subjects design we have all participants begin by running in place and taking the test, after which we have the same group of people lift weights and then take the test. We compare the memory test scores in order to answer the question as to what type of exercise aids memory the most.

Strengths

There are two fundamental advantages of the within-subjects design: a) power and b) reduction in error variance associated with individual differences. A fundamental principle of inferential statistics is that, as the number of subjects increases, statistical power increases, and the probability of beta error decreases (the probability of not finding an effect when one “truly” exists). This is why it is always better to have more subjects, and why, if you look at a significance table, such as the t-table, the t value necessary for statistical significance decreases as the number of subjects increases.

The reason this is so relevant to the within-subjects design is that, by using one, you have in effect increased the number of “subjects” relative to a between-subjects design. For example, in the exercise experiment, since you have the same subjects in both conditions, you will have twice as many “subjects” as you would have had if you had used a between-subjects design. If ten students sign up for the experiment, and you use a between-subjects design with equal-size groups, you will have five subjects in the aerobic condition and five in the anaerobic condition. However, if you use a within-subjects design you will in effect have ten subjects in both conditions. Just as with the term “groups” vs. “treatments”, instead of “subjects” it’s better to speak of “observations”, since the term “subjects” is misleading in a within-subjects design, where the same person may effectively be more than one “subject”.

The reduction in error variance comes from the fact that, in a between-subjects design, even though you randomly assigned subjects to groups, the two groups may differ with regard to important individual-difference factors that affect the dependent variable. With within-subjects designs, the conditions are always exactly equivalent with respect to individual-difference variables, since the participants are the same in the different conditions. So, in our exercise example above, any factor that may affect performance on the dependent variable (memory), such as sleep the night before, intelligence, or memory skill, will be exactly the same for the two conditions, because the exact same group of people serves in both.

Weaknesses

There is also a fundamental disadvantage of the within-subjects design, which can be referred to as “carryover effects”. In general, this means that participation in one condition may affect performance in other conditions, thus creating a confounding extraneous variable that varies with the independent variable. Two basic types of carryover effects are practice and fatigue.

As you read about the hypothetical exercise-and-memory experiment, you may well have recognized that one problem with this experiment would be that participating in one exercise condition first, followed by the memory test, may inadvertently affect performance in the second condition. First of all, participants may very possibly be more tired after running in place and then lifting weights than after just running in place, so that they perform worse on the second memory test. If this is the case, they wouldn’t do worse on the second test because aerobic exercise is better for memory than anaerobic; rather, they would do worse because they were more worn out after exercising for ten minutes total than after only exercising for five. When one within-subjects treatment negatively affects performance on a later treatment, this is referred to as a fatigue effect. On the other hand, in the exercise experiment the second memory test may be very similar to the first, so that by practicing with the first test participants perform much better the second time. Again, the difference between the two conditions would not be due to the independent variable (aerobic vs. anaerobic); rather, it would be due to practice with the test. When a within-subjects treatment positively affects performance on a later treatment, this is referred to as a practice effect.

