Lecture 1
ECON 2100, FALL 2024

Overview
• Population vs Sample
• Methods of Sampling
• Types of Variables
• Data Visualization
• Descriptive Statistics
• Population Parameters

Population vs. Sample
• POPULATION: A population contains all of the items or individuals of interest that we seek to study, that is, all the items or individuals about which we want to draw conclusion(s).
• SAMPLE: A sample contains only a portion of a population of interest.
Example: say we wish to find out the fraction of students who speak Spanish at RPI. The entire student body at RPI is the population, while the Math majors alone are a sample.

Lucio wants to know whether the food he serves in his restaurant is within a safe range of temperatures. He randomly selects 70 entrees and measures their temperatures just before he serves them to his customers. Identify the population and the sample:
a. The population is all of the hot entrees Lucio serves; the sample is the entrees that are a safe temperature.
b. The population is the 70 selected entrees; the sample is the entrees that are a safe temperature.
c. The population is all of the entrees Lucio serves; the sample is the 70 selected entrees.
Answer: c.

Probability Sample: Simple Random Sample
• Every individual or item from the frame (the list of items that make up the population) has an equal chance of being selected.
• Selection may be with replacement (selected individual is returned to frame for possible reselection) or without replacement (selected individual isn’t returned to the frame).
• Samples are obtained from a table of random numbers or from computer random number generators.
  - We will see how to do this in R.
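The slide leaves the actual drawing to R. Purely as an illustration, here is a minimal Python sketch of the two selection schemes described above. The frame of item numbers 1 to 850, the sample size of 5, and the seed are assumptions chosen for the example, not part of the lecture.

```python
import random

# Hypothetical sampling frame: item numbers 1..850 (an assumption for this sketch).
frame = list(range(1, 851))

# A seeded generator so the draw can be reproduced; the seed value is arbitrary.
rng = random.Random(2100)

# Without replacement: a selected item is not returned to the frame,
# so no item can appear twice in the sample.
without_replacement = rng.sample(frame, k=5)

# With replacement: a selected item is returned to the frame,
# so the same item may be drawn again.
with_replacement = rng.choices(frame, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

In both cases every item in the frame has the same chance of being picked on each draw, which is the defining property of a simple random sample.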
Selecting a Simple Random Sample Using Random Number Table

Sampling Frame For Population With 850 Items
Item Name    Item #
Bev R.       001
Ulan X.      002
.            .
.            .
Joann P.     849
Paul F.      850

Portion Of Random Number Table
49280 88924 35779 00283 81163 07275
11100 02340 12860 74697 96644 89439
09893 23997 20048 49420 88872 08401

The first 5 items in a simple random sample (reading the digits of the first row three at a time):
Item # 492
Item # 808
Item # 892 -- does not exist so ignore
Item # 435
Item # 779
Item # 002

Probability Sample: Systematic Sample
• Decide on sample size: n
• Divide frame of N individuals into groups of k individuals: k = N/n
• Randomly select one individual from the 1st group.
• Select every kth individual thereafter.
Example: N = 40, n = 4, k = 10.
[Figure: frame of 40 individuals split into 4 groups of 10, with the first group labeled.]

Probability Sample: Stratified Sample
• Divide population into two or more subgroups (called strata) according to
some common characteristic.
• A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes.
• Samples from subgroups are combined into one.
• This is a common technique when sampling a population of voters, stratifying across racial or socio-economic lines.
[Figure: population divided into 4 strata.]

Probability Sample: Cluster Sample
• Population is divided into several “clusters,” each representative of the population.
• A simple random sample of clusters is selected.
• All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
• A common application of cluster sampling involves election exit polls, where certain election districts are selected and sampled.
[Figure: population divided into 16 clusters, with randomly selected clusters forming the sample.]

Probability Sample: Comparing Sampling Methods
Simple random sample and Systematic sample:
◦ Simple to use.
◦ May not be a good representation of the population’s
underlying characteristics.
Stratified sample:
◦ Ensures representation of individuals across the entire population.
Cluster sample:
◦ More cost effective.
◦ Less efficient (need larger sample to acquire the same level of precision).

Stratified Sampling vs. Cluster Sampling
Stratified Sampling:
◦ Researcher decides the criterion for division.
◦ Homogeneity within subgroups and heterogeneity between subgroups.
◦ Ex. Students at RPI are divided based on year/major and then individuals are sampled from each subgroup.
Cluster Sampling:
◦ Natural division.
◦ Heterogeneity within subgroups and homogeneity between subgroups.
◦ Ex. Determine the proportion of students in the Capital Region who are science majors. Divide into clusters based on schools. Then, randomly sample schools.

Quiz
For each question choose one of: a. Random Sampling b. Cluster Sampling c. Systematic Sampling d. Stratified Sampling
1. Interview every 10th student who enters the school in the morning.
2. Assign each car in a dealership a number and then use a random-number table to select the cars to be inspected.
3. A teacher wants to know how well her students are doing on a topic. She randomly picks one class to survey.
Answers: 1. c; 2. a; 3. b.

Classifying Variables By Type
• Categorical (qualitative) variables take categories as their values such
as “yes”, “no”, or “blue”, “brown”, “green”.
• Numerical (quantitative) variables have values that represent a counted or measured quantity.
  ⸰ Discrete variables arise from a counting process.
  ⸰ Continuous variables arise from a measuring process.

Examples of Types of Variables
Question: Do you have an Instagram profile?
  Responses: Yes or No
  Variable Type: Categorical
Question: How many text messages have you sent in the past three days?
  Responses: ---------------
  Variable Type: Numerical (discrete)
Question: How long did the mobile app update take to download?
  Responses: ---------------
  Variable Type: Numerical (continuous)

Types of Variables
• Categorical (Defined Categories). Examples: Marital Status, Political Party, Eye Color
  ⸰ Nominal
  ⸰ Ordinal (Ordered Categories). Examples: Ratings: Good, Better, Best; Low, Med, High
• Numerical
  ⸰ Discrete (Counted Items). Examples: Number of Children, Defects per hour
  ⸰ Continuous (Measured Characteristics). Examples: Weight, Voltage

Exercise
1. List all quantitative variables. Age, Height, LDL, # children
2. List all qualitative variables. Gender, BG, Happy, Smoke, SC
3. List all continuous variables. Height, Age, LDL
4. List all discrete variables. # children
5. List all ordinal variables. Happy, SC
6. List all nominal variables. Gender, BG, Smoke

Visualizing Categorical Data:
The Bar Chart
• The bar chart visualizes a categorical variable as a series of bars. The length of each bar represents either the frequency or percentage of values for each category.

Reason For Shopping Online?           Percent
Better prices                         37%
Avoiding holiday crowds or hassles    29%
Convenience                           18%
Better selection                      13%
Ships directly                        3%

Visualizing Categorical Data:
The Pie Chart
• The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.

Reason For Shopping Online?           Percent
Better prices                         37%
Avoiding holiday crowds or hassles    29%
Convenience                           18%
Better selection                      13%
Ships directly                        3%

Visualizing Numerical Data:
The Histogram

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100

[Figure: Histogram: Age Of Students. Vertical axis: Frequency (0 to 8). Horizontal axis: 5, 15, 25, 35, 45, 55, More.]
(In a percentage histogram the vertical axis would be defined to show the percentage of observations per class.)

Visualizing Two Numerical Variables:
The Scatter Plot
• Scatter plots are used for numerical data consisting of paired observations taken from two numerical variables.
• One variable is measured on the vertical axis and the other variable is measured on the horizontal axis.
• Scatter plots are used to examine possible relationships between two numerical variables.

Scatter Plot Example
Volume per day    Cost per day
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200

[Figure: Cost per Day vs. Production Volume. Horizontal axis: Volume per Day (20 to 70). Vertical axis: Cost per Day (0 to 250).]
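As a rough sketch of how the scatter plot example could be reproduced, the Python code below stores the paired observations from the example and defines a plotting helper. The helper name `plot_cost_vs_volume`, the output file name, and the use of matplotlib are assumptions for illustration; the course itself works in R.

```python
# Paired observations from the scatter plot example.
volume_per_day = [23, 26, 29, 33, 38, 42, 50, 55, 60]
cost_per_day = [125, 140, 146, 160, 167, 170, 188, 195, 200]


def plot_cost_vs_volume(path="cost_vs_volume.png"):
    """Draw the scatter plot and save it to `path` (a hypothetical file name)."""
    # matplotlib is imported here so the data above can be used without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(volume_per_day, cost_per_day)
    ax.set_xlabel("Volume per Day")   # horizontal axis
    ax.set_ylabel("Cost per Day")     # vertical axis
    ax.set_title("Cost per Day vs. Production Volume")
    fig.savefig(path)
    plt.close(fig)


# A quick numeric look at the pattern the plot shows:
# does cost rise every time volume rises?
pairs = sorted(zip(volume_per_day, cost_per_day))
cost_always_rises = all(c1 < c2 for (_, c1), (_, c2) in zip(pairs, pairs[1:]))
print(cost_always_rises)
```

Calling `plot_cost_vs_volume()` writes the figure to disk. The final check is only a crude stand-in for reading the plot: it confirms that in this data set a higher volume always comes with a higher cost.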
Summary Definitions
• The central tendency is the extent to which the values of a numerical variable group around a typical or central value.
• The variation is the amount of dispersion or scattering away from a central value that the values of a numerical variable show.
• The shape is the pattern of the distribution of values from the lowest value to the highest value.

Measures of Central Tendency: The Mean
• The arithmetic mean (often just called the “mean”) is the most common measure of central tendency.
◦ For a sample of size n:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
Where \bar{X} (pronounced x-bar) is the sample mean, n is the sample size, and X_i is the ith of the observed values.
• Mean = sum of values divided by the number of values.
• Affected by extreme values (outliers).
Example:
(11 + 12 + 13 + 14 + 15)/5 = 65/5 = 13, so Mean = 13
(11 + 12 + 13 + 14 + 20)/5 = 70/5 = 14, so Mean = 14
[Figure: dot plots of the two data sets on an axis from 11 to 20.]

Measures of Central Tendency: The Median
• In an ordered array, the median is the “middle” number (50% above, 50% below).
• Less sensitive than the mean to extreme values.
[Figure: two dot plots on an axis from 11 to 20; Median = 13 in both.]
Measures of Central Tendency: Locating the Median
• The location of the median when the values are in numerical order (smallest to largest):
\[ \text{Median position} = \frac{n+1}{2} \text{ position in the ordered data} \]
• If the number of values is odd, the median is the middle number.
• If the number of values is even, the median is the average of the two middle numbers.
Note that (n+1)/2 is not the value of the median, only the position of the median in the ranked data.

• n = 7, then median is on which position? Ans: (7+1)/2 = 4th position
• n = 8, then median is on which position? Ans: (8+1)/2 = 4.5th position, i.e., average of the values in the 4th and 5th positions
• Ex 1. Find the median for 1, 4, 5, 9, 21, 22 Ans: (5+9)/2 = 7
• Ex 2. Find the median for 12, 32, 35, 78, 90 Ans: 35

Measures of Central Tendency: The Mode
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
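The three measures of central tendency can be checked with Python's standard `statistics` module. The sketch below reuses the numbers from the mean and median examples above; the eye-color list and the list of distinct values used for the mode are made-up illustrative data, not taken from the lecture.

```python
import statistics

# Data sets from the mean example: the second replaces 15 with the outlier 20.
values = [11, 12, 13, 14, 15]
with_outlier = [11, 12, 13, 14, 20]

mean_a = statistics.mean(values)            # 65 / 5
mean_b = statistics.mean(with_outlier)      # 70 / 5, pulled up by the outlier
median_a = statistics.median(values)        # middle value
median_b = statistics.median(with_outlier)  # unchanged by the outlier

# Even n: the median is the average of the two middle values.
median_even = statistics.median([1, 4, 5, 9, 21, 22])
# Odd n: the median is the middle value.
median_odd = statistics.median([12, 32, 35, 78, 90])

# The mode also works for categorical data (hypothetical eye colors).
modes = statistics.multimode(["blue", "brown", "brown", "green"])

# When no value repeats, multimode returns every value; under the lecture's
# convention such a data set has "no mode".
all_tied = statistics.multimode([1, 2, 3])

print(mean_a, mean_b, median_a, median_b, median_even, median_odd, modes, all_tied)
```

`multimode` is used rather than `mode` so that several modes, or the absence of a repeated value, are visible in the result.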
[Figure: two dot plots, one on an axis from 0 to 14 with Mode = 9, and one on an axis from 0 to 6 with No Mode.]

Measures of Central Tendency: Review Example
House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
Sum: $3,000,000
§ Mean: $3,000,000/5 = $600,000
§ Median: middle value of ranked data = $300,000
§ Mode: most frequent value = $100,000

Measures of Central Tendency: Which Measure to Choose?
• The mean is generally used, unless extreme values (outliers)
exist.
• The median is often used when outliers are present, since it is not sensitive to extreme values. For example, median home prices may be reported for a region for this reason.
• In many situations it makes sense to report both the mean and the median.

Quiz
1. What is the mode of the following numbers? 4, 9, 6, 3, 4, 2
2. What is the median of the following numbers? 3, 5, 6, 7, 9, 6, 8
3. A data set can have more than one median. True or False?
Answers: 1. 4; 2. 6; 3. False.

Measures of Variation
• Measures of variation give information on the spread or variability or dispersion of the data values.
• The measures of variation covered here are the range, the variance, and the standard deviation.
[Figure: two distributions with the same center, different variation.]

Measures of Variation: The Range
• Simplest measure of variation.
• Difference between the largest and the smallest values:
Range = Xlargest – Xsmallest
Example: [Figure: dot plot on an axis from 0 to 14.] Range = 13 - 1 = 12

Measures of Variation: Why the Range Can Be Misleading
• Does not account for how the data are distributed.
• Sensitive to outliers.
[Figure: two dot plots on an axis from 7 to 12 with differently distributed points; both have Range = 12 - 7 = 5.]
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119

Measures of Variation: The Sample Variance
• Average (approximately) of squared deviations of values from the mean.
◦ Sample variance:
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \]
Where \bar{X} = arithmetic mean, n = sample size, X_i = ith value of the variable X.

Measures of Variation: The Sample Standard Deviation
• Most commonly used measure of variation.
• Shows variation about the mean.
• Is the square root of the variance.
• Has the same units as the original data.
◦ Sample standard deviation:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \]

Measures of Variation: Comparing Standard Deviations
[Figure: two distributions, one with a smaller standard deviation and one with a larger standard deviation.]

Locating Extreme Outliers: Z-Score
• To compute the Z-score of a data value, subtract the mean and divide by the standard deviation.
• The Z-score is the number of standard deviations a data value is from the mean.
• A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
• The larger the absolute value of the Z-score, the farther the data value is from the mean.
\[ Z = \frac{X - \bar{X}}{S} \]
Where X represents the data value, \bar{X} is the sample mean, and S is the sample standard deviation.

Locating Extreme Outliers: Z-Score
• Suppose the mean math SAT score is 490, with a standard deviation of 100.
• Compute the Z-score for a test score of 620.
\[ Z = \frac{X - \bar{X}}{S} = \frac{620 - 490}{100} = \frac{130}{100} = 1.3 \]
A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.

Numerical Descriptive
Measures for Population
• Descriptive statistics discussed previously described a sample, not the population.
• Summary measures describing a population, called parameters, are denoted with Greek letters.
• Important population parameters are the population mean, variance, and standard deviation.
• A population parameter is a constant, while a sample statistic varies from sample to sample.

Numerical Descriptive Measures for Population: The Mean µ
• The population mean is the sum of the values in the population divided by the population size, N.
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N} \]
Where μ = population mean, N = population size, X_i = ith value of the variable X.

Numerical Descriptive Measures for Population: The Variance σ²
• Average of squared deviations of values from the mean.
◦ Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]
Where μ = population mean, N = population size, X_i = ith value of the variable X.

Numerical Descriptive Measures for Population: The Standard Deviation σ
• Most commonly used measure of variation.
• Shows variation about the mean.
• Is the square root of the population variance.
• Has the same units as the original data.
◦ Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]

Sample Statistics vs. Population Parameters
Measure               Population Parameter   Sample Statistic
Mean                  µ                      X̄
Variance              σ²                     S²
Standard Deviation    σ                      S

Quartile Measures
• Quartiles split the ranked data into 4 segments with an
equal number of values per segment (25% of the values in each).
⸰ The first quartile, Q1, is the value for which 25% of the values are smaller and 75% are larger.
⸰ Q2 is the same as the median (50% of the values are smaller and 50% are larger).
⸰ Only 25% of the values are greater than the third quartile, Q3.

Quartile Measures: Locating Quartiles
• Find a quartile by determining the value in the appropriate position in the ranked data, where:
First quartile position: Q1 = (n+1)/4 ranked value
Second quartile position: Q2 = (n+1)/2 ranked value
Third quartile position: Q3 = 3(n+1)/4 ranked value
Where n is the number of observed values.

Quartile Measures: Calculation Rules
When calculating the ranked position use the following rules:
◦ If the result is a whole number then it is the ranked position to use.
◦ If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then average the two corresponding data values.
◦ If the result is not a whole number or a fractional half then round the result to the nearest integer to find the ranked position.

Quartile Measures: Calculating The Quartiles: Example (n = 9)
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = (12+13)/2 = 12.5.
Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16.
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = (18+21)/2 = 19.5.

Quartile Measures: Calculating The Quartiles: Example (n = 8)
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21
Q1 is in the (8+1)/4 = 2.25 position of the ranked data, which rounds to the 2nd position, so Q1 = 12.
Q2 is in the (8+1)/2 = 4.5 position of the ranked data, so Q2 = median = (16+16)/2 = 16.
Q3 is in the 3(8+1)/4 = 6.75 position of the ranked data, which rounds to the 7th position, so Q3 = 18.
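The position formulas and the three calculation rules can be written out as a short Python sketch and checked against both worked examples. The function names `ranked_value` and `quartiles` are hypothetical helpers introduced here. Statistical software often uses other quartile conventions, so built-in functions may give slightly different answers.

```python
def ranked_value(ordered, position):
    """Return the value at a 1-based ranked position using the lecture's rules."""
    whole = int(position)
    frac = position - whole
    if frac == 0:
        # Whole number: use that ranked position directly.
        return ordered[whole - 1]
    if frac == 0.5:
        # Fractional half: average the two neighbouring ranked values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise: round the position to the nearest integer.
    nearest = whole + 1 if frac > 0.5 else whole
    return ordered[nearest - 1]


def quartiles(data):
    """Return (Q1, Q2, Q3) at positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    return (
        ranked_value(ordered, (n + 1) / 4),
        ranked_value(ordered, (n + 1) / 2),
        ranked_value(ordered, 3 * (n + 1) / 4),
    )


nine = quartiles([11, 12, 13, 16, 16, 17, 18, 21, 22])
eight = quartiles([11, 12, 13, 16, 16, 17, 18, 21])
print(nine)   # quartiles for the n = 9 example
print(eight)  # quartiles for the n = 8 example
```

The first result should match Q1 = 12.5, Q2 = 16, Q3 = 19.5, and the second Q1 = 12, Q2 = 16, Q3 = 18.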
Q1 and Q3 are measures of non-central location. Q2, the median, is a measure of central tendency.

Quartile Measures: The Interquartile Range (IQR)
• The IQR is Q3 – Q1 and measures the spread in the middle 50% of the data.
• The IQR is also called the midspread because it covers the middle 50% of the data.
• The IQR is a measure of variability that is not influenced by outliers or extreme values.
• Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures.

Calculating the Interquartile Range
Example: Xminimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Xmaximum = 70, with 25% of the data between each pair of neighboring values.
Interquartile range = 57 – 30 = 27

The Five Number Summary
The five numbers that help describe the center, spread and shape of data are:
⸰ Xlargest
⸰ Third Quartile (Q3)
⸰ Median (Q2)
⸰ First Quartile (Q1)
⸰ Xsmallest
Each interval between neighboring values in this list holds 25% of the data.
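As an illustration, the five-number summary and the IQR can be computed for the nine-value quartile example (11, 12, 13, 16, 16, 17, 18, 21, 22) with Python's standard library. This sketch uses `statistics.quantiles` with `method="exclusive"`, which places quartiles at the same (n+1)-based positions as the lecture but interpolates between neighbors instead of rounding. The two approaches agree here because every position is a whole number or a half; for other sample sizes they can differ.

```python
import statistics

# Ordered array from the n = 9 quartile example.
data = [11, 12, 13, 16, 16, 17, 18, 21, 22]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

# Five-number summary, listed from smallest to largest.
five_number_summary = (min(data), q1, q2, q3, max(data))

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1

print(five_number_summary)
print(iqr)
```

These five values are exactly what a boxplot of the same data would draw.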
Five Number Summary and The Boxplot
• The boxplot is a graphical display of the data based on the five-number summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
[Figure: example boxplot with Xsmallest, Q1, Median, Q3 and Xlargest marked.]

Five Number Summary: Shape of Boxplots
• If data are symmetric around the median then the box and central line are centered between the endpoints.
• A boxplot can be shown in either a vertical or horizontal orientation.
[Figure: symmetric boxplot with Xsmallest, Q1, Median, Q3 and Xlargest marked.]

Distribution Shape and The Boxplot
[Figure: three boxplots labeled Left-Skewed, Symmetric, and Right-Skewed, each marked with Q1, Q2, Q3.]

Two Measures of the Relationship
Between Two Numerical Variables
• Scatter plots allow us to examine the relationship between two numerical variables.
• Two quantitative measures of such relationships:
  ⸰ The Covariance
  ⸰ The Coefficient of Correlation

The Covariance
• The covariance measures the strength of the linear relationship between two numerical variables (X & Y).
• The sample covariance:
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \]
• Only concerned with the strength of the relationship.
• No causal effect is implied.

Interpreting Covariance
• Covariance between two variables:
  cov(X,Y) > 0: X and Y tend to move in the same direction.
  cov(X,Y) < 0: X and Y tend to move in opposite directions.
• The covariance has a major flaw:
  ◦ It is not possible to determine the relative strength of the relationship from the size of the covariance.

Coefficient of Correlation
• Measures the relative strength of the linear relationship between two numerical variables.
• Sample coefficient of correlation:
\[ r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} \]
Where,
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}, \quad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}, \quad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}} \]

Features of the Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
  ◦ Unit free.
  ◦ Range between –1 and 1.
  ◦ The closer to –1, the stronger the negative linear relationship.
  ◦ The closer to 1, the stronger the positive linear relationship.
  ◦ The closer to 0, the weaker the linear relationship.

Scatter Plots of
Sample Data with
Various Coefficients of Correlation
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, and r = +1.]

Quiz
1. Suppose we measure the height-weight correlation. The heights in the data set are changed from feet to inches, so all values are multiplied by 12. The correlation coefficient for the new data will be:
a. 12 times the original
b. 144 times larger than the original
c. The same as original
2. The pairs in a data set are exchanged, so the x-coordinates are now the y-coordinates and all values are multiplied by 12. The correlation coefficients for the original data and for the new data are:
a. Opposites
b. Reciprocals
c. The same
3. Let Y be a random variable. Then V(Y) equals:
a. \( E[(Y - \mu_Y)^2] \)
b. \( E[|Y - \mu_Y|] \)
c. \( E[(Y - \mu_Y)^2] \)
d. \( E[(Y - \mu_Y)] \)
4. To infer the political tendencies of the students at your university, you sample 150 of them. Only one of the following is a simple random sample:
a. make sure that the proportion of minorities is the same in your sample as in the entire student body.
b. call every fiftieth person in the student directory at 9 a.m. If the person does not answer the phone, you pick the next name listed, and so on.
c. go to the main dining hall on campus and interview students randomly there.
d. have your statistical package generate 150 random numbers in the range from 1 to the total number of students in your academic institution, and then choose the corresponding names in the student telephone directory.
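Quiz questions 1 and 2 can be explored numerically. The Python sketch below applies the sample covariance and correlation formulas from this lecture to the volume and cost data of the scatter plot example, then rescales and swaps the variables. The helper names `sample_cov`, `sample_std`, and `correlation` are hypothetical, and using the production data in place of heights and weights is an assumption made for illustration.

```python
import math

# Paired observations from the scatter plot example.
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]


def sample_cov(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def sample_std(x):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_cov(x, x))


def correlation(x, y):
    """Sample coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    return sample_cov(x, y) / (sample_std(x) * sample_std(y))


cov_original = sample_cov(volume, cost)
r_original = correlation(volume, cost)

# Question 1 setup: multiply every X value by 12 (like feet to inches).
volume_scaled = [12 * v for v in volume]
cov_scaled = sample_cov(volume_scaled, cost)
r_scaled = correlation(volume_scaled, cost)

# Question 2 setup: swap the coordinates and multiply all values by 12.
r_swapped = correlation([12 * c for c in cost], [12 * v for v in volume])

print(cov_original, cov_scaled)
print(r_original, r_scaled, r_swapped)
```

The covariance changes with the units (it is multiplied by 12 here), which is the flaw noted earlier, while r is unit free and stays the same under both changes.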