MM1F28 Seminar

Week 2

Table of Contents

What are confidence intervals?

How do we calculate confidence intervals?

What is the significance level?

What is the normal distribution?

Exercises

What are confidence intervals?

Let’s start by looking at the difference in salary between associate professors and full professors from a study in the US.

Note that in this study 67 associate professors and 266 full professors took part. That is a sample, not the entire population.

We may be tempted to say that, simply by looking at the graph, we can conclude that on average full professors make more money than associate professors. This makes sense. However, we cannot say this yet. Remember, these are sample statistics, not population parameters, so how can we infer that full professors likely make more than associates at the population level when all we have is a sample of 333 people? This is where confidence intervals come in. Essentially, confidence intervals show us that, if we were to sample ad infinitum from the population, we would end up with a range of possible values for the sample means. We can often (erroneously, but it doesn’t matter for this course) think of this as being a range for the population mean (in other words, that the true population mean is somewhere in that range). I’ll try to illustrate this with another graph. This one looks at the effect of algorithm on tour_lengths (tour_lengths ~ algorithm):

From the graph we see that each point estimate (sample mean) has some error bars around it. We can say that for the FI algorithm, the tour_length based on our sample was ~475, but ad infinitum we could expect it to produce sample means anywhere between around 470 and 480, 95% of the time. I will explain the 95% later; for now, just roll with it.

The way we would interpret this graph is by assuming that since the population mean and the estimated sample means are probably quite close, it’s likely that the true population mean is somewhere between the upper bound and the lower bound. The actual explanation is quite complex and requires you to understand sampling distributions and the central limit theorem in particular. These are shown in the lecture slides, but are not a requirement for this course.
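If you want to see the "ad infinitum" idea in action, here is a minimal simulation sketch in Python (not R-commander, and not something you need for the assignment). Everything in it is made up for illustration: a normally distributed population with a true mean of 475 and a standard deviation of 20, samples of 50, and 2000 repeats. It builds a 95% interval around each sample mean and counts how often that interval contains the true mean.

```python
import random
import statistics
from math import sqrt


def coverage(true_mean=475.0, true_sd=20.0, n=50, repeats=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean.

    All parameter values are hypothetical; they are not taken from the study.
    """
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # z value for a 95% interval
    hits = 0
    for _ in range(repeats):
        # Draw one sample from the (pretend) population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        margin = z * statistics.stdev(sample) / sqrt(n)
        # Does this sample's interval capture the true population mean?
        if mean - margin <= true_mean <= mean + margin:
            hits += 1
    return hits / repeats


result = coverage()
print(result)
```

The printed fraction should land close to 0.95, which is roughly what the "95%" refers to.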

So how does that help us? Well, for one, we can say that we expect the FI algorithm to produce a mean tour length somewhere between 470 and 480. Most importantly, we can say that there is no significant difference between the tour lengths produced by the FI and HI algorithms, but that there is a difference between CI and both FI and HI. Why? Because the FI and HI intervals overlap, while the CI interval overlaps with neither of them. In other words, there are plenty of places where I can draw a horizontal line on this graph that will intersect both the intervals for FI and HI, as shown below.

This implies that the population mean for both the FI and HI algorithms could be the same. In other words, it doesn’t make a difference to the tour_length whether you use FI or HI; there is no effect on tour length. This does not apply to CI, which clearly produced graphs that have a large tour length. We are confident of this; in fact, we are 95% confident.
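The "can I draw a horizontal line through both intervals?" test is easy to write down as a rule. Here is a small Python sketch of it. The FI bounds are the rough 470 to 480 mentioned above; the HI and CI bounds are hypothetical numbers I made up so the example has something to compare, and the function name is my own.

```python
def intervals_overlap(a, b):
    """Return True if the intervals a = (low, high) and b = (low, high) share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


fi = (470.0, 480.0)  # roughly as read off the graph
hi = (476.0, 486.0)  # hypothetical bounds, for illustration only
ci = (520.0, 530.0)  # hypothetical bounds, for illustration only

print(intervals_overlap(fi, hi))  # overlap: no significant difference
print(intervals_overlap(fi, ci))  # no overlap: significant difference
```

With these made-up numbers, FI and HI overlap while FI and CI do not, which matches the reading of the graph described above.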

Back to our original example. Take a look at the graph of professor salaries, this time with error bars:

Now we can confidently (with 95% confidence, in fact) say that there is a significant difference in salary between full professors and associate professors, since the confidence intervals do not overlap. In other words, there is an effect of rank on salary, denoted salary ~ rank.

How do we calculate confidence intervals?

You will not be asked to calculate confidence intervals manually for this course. However, I will tell you which variables are used, as well as show you the formula. We will then look at how to create the intervals using software (R-commander). There are actually two ways (that I know of) to calculate confidence intervals. One is by using a formula, and the other using bootstrapping. We will not cover bootstrapping as this is an advanced topic.

The simplest formula is:

CI = \bar{x} \pm Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}

This assumes that either the sample size n is greater than 30, or the standard deviation s of the population is known. You may be wondering about the Z variable. This requires some more explanation, which is included in the lecture slides but is not really a requirement for the course. Let’s just say that, depending on your sample size, you use either a Z statistic or a t statistic, which refer to two types of bell-curve distributions (one is the standard normal, and the other the t-distribution). The Z statistic is often used when we have a sample size greater than 30, while the t statistic is often used when we have a sample size smaller than that. I won’t explain the α/2 here, but you can ask me in the seminar if you are super interested in what exactly that means.

Let’s look at each part of the formula and how it helps construct the interval.

As we can see, the x-bar is in the middle. This is our sample mean. The rest of the formula constructs the upper and lower parts of the interval. This part is called the margin of error (you’ve probably heard this one before). So in essence, the margin of error is simply half the width of the confidence interval.
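You won’t need to do this by hand, but if it helps to see the pieces fit together, here is a minimal Python sketch. It assumes the usual large-sample formula, x-bar ± Z(α/2) · s / √n, with s taken as the sample standard deviation. The function name and the data are made up for illustration: 40 random values stand in for a real sample.

```python
import random
import statistics
from math import sqrt


def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper, margin) for a z-based confidence interval of the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)           # x-bar, the point estimate
    s = statistics.stdev(sample)              # sample standard deviation
    alpha = 1 - confidence                    # e.g. 0.05 for a 95% interval
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * s / sqrt(n)                  # the margin of error
    return mean - margin, mean + margin, margin


# Hypothetical data: 40 values drawn from a normal distribution.
rng = random.Random(0)
sample = [rng.gauss(100.0, 15.0) for _ in range(40)]

lower, upper, margin = z_confidence_interval(sample)
print(lower, upper, margin)
```

Notice that the interval is just the sample mean minus and plus the same margin, so its total width is twice the margin of error.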

I will show you how to create confidence intervals in the workshop using software (which is what you will be expected to do for the assignment).

What is the significance level?

In short, we can never be 100% certain that differences in means are not due to sampling error, even in extreme cases. No matter how unlikely, when talking about probabilities, we can never say we are 100% sure that X or Y will occur. However, what we can say is that it would be unusual for an event to occur. The question is, when do we accept something as being unusual?

If I told you that there is only a 20% chance of an event occurring, you may say that’s not very unusual. What if I told you 10%? Or 5%? We needed to decide on a cut-off. So, through consensus, we decided that a 1 in 20 chance is unlikely to happen, and is therefore unusual. Hence the 95% confidence level that we kept using.

A 95% confidence level means a 5% error rate (denoted as α). This is the same α you saw in the formula earlier on. What is important to note is that we decide on the confidence level. This is often dictated by the field, or the type of study that we are working on, as the confidence level will affect the type of error we are likely to make. We will discuss errors in more detail next week, but for now, keep in mind that if we want to be more confident in our results, our margin of error will have to increase, as shown below:

Note that the 99% CI has a larger interval. This makes intuitive sense, since we want to be more confident that the population mean is included in this interval, and therefore we want to accept less error. You may wonder why we would ever want to accept more error. This is because if we decrease one type of error, we increase another. I won’t go into this today, but we will discuss it next week.
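To put some numbers on this, here is a short Python sketch that computes the margin of error at three confidence levels. It assumes the z-based margin Z(α/2) · s / √n, and the standard deviation of 10 and sample size of 100 are hypothetical values chosen so that s / √n is exactly 1.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(s, n, confidence):
    """z-based margin of error for a given confidence level."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * s / sqrt(n)


# Hypothetical values: s = 10 and n = 100, so each margin equals the z value.
for level in (0.90, 0.95, 0.99):
    print(level, round(margin_of_error(s=10.0, n=100, confidence=level), 2))
```

The margins come out at roughly 1.64, 1.96 and 2.58, so the interval gets wider as the confidence level goes up.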

What is the normal distribution?

You will be hearing a lot about the normal distribution in this course. The reason for this is simple: mathematically, the normal distribution is easy to work with. It also helps that a lot of continuous data from measurements tends to follow the normal distribution, like people’s heights.

There is a special case of the normal distribution called the standard normal, or z distribution. This is a normal distribution with a mean of 0 and a standard deviation of 1. The distribution is used for mathematical reasons, which I won’t go into, but essentially if you remember the z statistic from our confidence interval formula earlier on, this is where it comes from (values under the z distribution).
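For the curious, here is a small Python sketch of where the z value for a 95% interval comes from, and of how any normally distributed value can be converted to the z scale. The helper name z_score and the height example (an SD of 8 cm) are assumptions of mine for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# The z value that leaves 2.5% in the upper tail (and, by symmetry, 2.5% in the lower tail).
z = standard_normal.inv_cdf(0.975)

# The area under the curve between -z and +z.
central_area = standard_normal.cdf(z) - standard_normal.cdf(-z)


def z_score(x, mean, sd):
    """Convert a raw value to the number of standard deviations from the mean."""
    return (x - mean) / sd


print(round(z, 2), round(central_area, 2))
print(z_score(178.6, mean=170.6, sd=8.0))  # hypothetical height example
```

The z value is about 1.96 and the area between -1.96 and +1.96 is about 0.95, which is why 1.96 turns up in 95% intervals.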

Finally, the last distribution you may hear about is the t-distribution. This looks like a normal distribution but has a lower peak and heavier tails (it is flatter in the middle). It’s used when we have less confidence in our mean due to small sample sizes, but it is also the basis for a lot of the calculations we make when determining whether there is an effect of some IV on a DV. Don’t worry too much about this now. Here is the t-distribution (it only uses one parameter, degrees of freedom, which is denoted as ν):

Note that as ν increases, the distribution gets taller in the middle and thinner in the tails, approaching the standard normal.
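If you’d like to check how the shape changes with the degrees of freedom, here is a minimal Python sketch. It uses the standard density formula for the t-distribution and compares the curve’s height at the centre (x = 0) and out in the tail (x = 3) with the standard normal. The function name t_pdf and the chosen degrees of freedom are my own choices for illustration.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom at x."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


normal = NormalDist()  # standard normal, for comparison

for df in (1, 5, 30, 100):
    print(df, round(t_pdf(0.0, df), 4), round(t_pdf(3.0, df), 4))
print("normal", round(normal.pdf(0.0), 4), round(normal.pdf(3.0), 4))
```

As the degrees of freedom go up, the height at the centre rises towards the normal curve and the height in the tail falls towards it.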

Exercises

As I said before, the assignments will test your knowledge from the seminars and workshops, not from the lectures. So, make sure you can answer all these questions.

Question 1

What is a confidence interval? How does it help us infer differences in the means of measurements?

Question 2

Consider the following plot. Is there an effect of IV on DV?

Question 3

When I asked in Q2 whether there is an effect of IV on DV, what did I actually mean? Can you ask this in a different way?

Question 4

What is the confidence level? How does it relate to confidence intervals?

Question 5

What happens to the confidence interval when we increase the confidence level?

Question 6

What is the margin of error?

Question 7

What is the point estimate?

Question 8

Sally wants to check whether there is a significant difference in the mean time senior HR employees spend browsing the Internet, compared to junior HR employees, during work hours. If we assume that IT logs this information and hands it over to Sally, given everything you’ve learned so far, can you describe a way for Sally to use this data in order to understand the differences?

Question 9

What do I mean by DV ~ IV?

Question 10

What is the normal distribution?

Question 11

You collect 500 people’s heights and find that (a) your data follows the normal distribution, and (b) the mean height is 170.6cm. What are the median and mode?

Question 12

What happens to the t-distribution as you increase the degrees of freedom?

Question 13

What proportion of observations do we expect to fall within one SD of the mean in normally distributed data?