Tutoring Case: DATA1001


1. DATA1001 Final Exam Review: Key Concepts
Contents
1. DATA1001 Final Exam Review: Key Concepts
1.1 Module 1 Key Concepts
1.1.1 Comparative Experiments
1.1.2 Qualitative and Quantitative Data
1.2 Module 2 Key Concepts
1.2.1 The Normal Distribution and the Normal Curve
1.2.2 Chance Error and Outliers
1.2.3 Linear Association and Correlation
1.3 Module 3 Key Concepts
1.3.1 Probability
1.3.2 Permutation and Combination
1.3.3 Box Model
1.4 Module 4 Key Concepts
1.4.1 Hypothesis Testing
2. Worked Exercises by Topic

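1.1 Module 1 Key Concepts
1.1.1 Comparative Experiments
Comparative experiments
A study usually asks about a whole group, the population. Because a population is usually too large, in most cases we cannot survey it directly.
- We draw a random subset of the population as a sample, study the sample, and use the results to infer the characteristics of the population.
- One way to study the sample is to run an experiment on it.
Randomised Controlled Trial:
- Treatment group and control group: the subjects are split into two groups. The treatment group receives the intervention (treatment); the control group exists for comparison.
- Placebo: for members of the control group, a placebo is a preparation that looks identical to drug A but has no therapeutic effect. Its purpose is to control for the placebo effect.
- Placebo effect: after receiving an intervention, a subject "thinks" or "believes" that it has worked and reports a subjective response regardless of the objective facts, which harms the accuracy of the experiment.
- A placebo is used because, if only the treatment group receives something and the control group receives nothing, members of the treatment group know they have been treated and may show a placebo effect. With a placebo, members of neither group know whether they received the real intervention or the inert placebo, so the effect of suggestion is similar in both groups and largely cancels out in the comparison.
The experiment described above is a controlled experiment (Controlled Experiment / Randomised Controlled Trial).
Observational Study:
A controlled experiment keeps the treatment and control groups as similar as possible, so it gives the researcher a direct comparison, but in real life it is often impossible to run one.
- "Does smoking cause lung cancer?": investigating this with a controlled experiment would raise serious ethical issues.
- In such cases we use an observational study.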
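1.1.1.2 – Example of observational study

Aim: investigate whether smoking is a factor that causes lung cancer.

Sample selection: the treatment group is 100 people drawn at random from people who already have lung cancer; the control group is 100 people drawn at random from healthy people.

Procedure: using the two pre-formed groups, record how many members of each group have a long-term history of smoking, then compare the number of smokers in the two groups and reason from the difference.
In this study we still have a treatment group and a control group, but the researchers cannot assign subjects to the groups themselves; they can only reason backwards from the outcome.
- The aim of an observational study is to establish an association between a factor and an outcome, not to verify the claim that the factor causes the outcome (causation).
- That is: an observational study may establish association, but it does not establish causation.
- (Refer to Linear model)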
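Common problems in comparative experiments:
Controlled experiments and observational studies are both comparative experiments. Comparative experiments have three main common problems:
1. Confounder
◼ When investigating a possible relationship between an independent variable and a dependent variable, poor experimental design or differences between individuals in the sample can leave some factor other than the independent variable influencing the dependent variable. Such a factor is called a confounder.
◼ A controlled experiment should reduce confounders as far as possible, whereas an observational study almost inevitably contains them.
◼ To reduce the influence of confounders in an observational study, identify the possible confounders as comprehensively as you can and split the subjects into subgroups by the level of each confounder.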
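2. Bias
◼ Bias mostly arises in the design and sampling stages.
◼ Bias in how measurements are taken can produce systematic error.
◼ Common types of bias:
◆ Selection bias: the researchers (consciously or unconsciously) tend to select people with a particular characteristic to take part in the study.
◆ Measurement bias: when the question asked is leading or loaded, participants' answers may be influenced.
◆ Survivor bias: only people with a particular characteristic complete the study.
◆ Adherer bias: people with a particular characteristic are the ones who keep to the assigned treatment, so adherers differ systematically from non-adherers.
◆ Non-adherer bias: the reverse of adherer bias.
◼ How to reduce the effect of bias on a study:
◆ Double-blind experiment: during the experiment, neither the subjects nor the investigators know whether a subject is in the treatment group or the control group; group membership is revealed from the subject codes only after data collection is complete.
3. Ethics
◼ The main consideration is privacy.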
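Simpson’s Paradox:
Simpson’s Paradox is one of the problems a confounder can cause.
- Sometimes several separate groups of data each show a certain trend, yet when the groups are combined the trend reverses and leads to the opposite conclusion.
1.1.1.1 – Example of Simpson’s Paradox

Before the 2016 US election, a media outlet surveyed 4271 citizens, of whom 2513 responded.

The final results were as follows:

In this example, looking at the White and Black groups separately, poorer people appear more inclined to vote for Trump. Yet when the two groups are combined, 46.9% of the wealthy respondents chose Trump, against only 45% of the poor.
The confounder behind the paradox here is most likely race: the wealthy and poor groups have different racial make-ups, and race is itself strongly associated with the vote, so the combined percentages partly reflect who is in each wealth group rather than the effect of wealth itself.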
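The reversal can be reproduced with a small sketch. The counts below are hypothetical, invented only to show the mechanism; they are not the survey's figures, and the names `group_x` and `group_y` are placeholders.

```python
from fractions import Fraction

# Hypothetical (yes_votes, respondents) counts, invented for illustration only.
counts = {
    "group_x": {"poor": (60, 100), "rich": (220, 400)},
    "group_y": {"poor": (40, 400), "rich": (5, 100)},
}

def rate(yes, total):
    """Proportion of 'yes' answers as an exact fraction."""
    return Fraction(yes, total)

def combined_rate(wealth):
    """Pool both groups for one wealth level and recompute the proportion."""
    yes = sum(counts[g][wealth][0] for g in counts)
    total = sum(counts[g][wealth][1] for g in counts)
    return Fraction(yes, total)

# Within each group the poor have the higher rate...
for group, by_wealth in counts.items():
    poor, rich = rate(*by_wealth["poor"]), rate(*by_wealth["rich"])
    print(group, "poor:", float(poor), "rich:", float(rich))

# ...but after pooling, the rich have the higher rate.
print("combined poor:", float(combined_rate("poor")))
print("combined rich:", float(combined_rate("rich")))
```

The reversal comes from the unequal group sizes: most of the hypothetical poor respondents sit in the group with low rates overall, which drags their pooled rate down.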

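1.1.2 Qualitative and Quantitative Data
The data we work with fall into two main types: qualitative and quantitative.
- Qualitative data: also called categorical data. These are data that sort observations into categories. Qualitative data usually consist of words or other non-numeric content, but in some cases numbers can also serve as qualitative data. That is, not all data containing numbers are quantitative data.
◼ Gender
◼ Type
◼ Year (a number, but usually read as qualitative data; it depends on the context)
- Quantitative data: data that can be measured, counted or calculated with; always numerical data.
◼ Money
◼ Age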
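Data type and visualisation:
The course requires four types of plot:
1. Bar plot:
◼ Sometimes called a bar chart or bar graph.
◼ Gives a simple count and summary of qualitative data.

2. Histogram:
◼ Summarises quantitative data.
◼ The total area of a histogram is 100%, and the area of each block is the proportion of the sample that falls in that class.
◼ The horizontal axis of a histogram is divided into class intervals, which need not be of equal width.
◼ Do not confuse it with a bar plot:
◆ The bar width in a bar plot is always fixed and carries no meaning, whereas the width of a histogram block is the range of values its class covers.
◆ A bar plot uses height to show the frequency of each category, whereas a histogram uses area (height x width) to show the frequency of each class.
◆ The classes of a histogram are continuous, whereas a bar plot merely lists separate categories.

3. Box plot:
◼ A box plot is used to compare several groups of data. It shows each group's median and the middle 50% of its data points (first to third quartile) at a glance; the remaining points appear as the whiskers of the box plot or as outliers.
◼ Usually the groups are defined by a qualitative variable, and the variable displayed for each group is quantitative.
◼ Comparing the groups directly shows their spread, the position of their medians, and their normality.

4. Scatter plot:
◼ Shows the relationship between two quantitative variables.
◼ If the scatter plot shows an upward or downward trend, the two variables may be associated; if not, they may be unrelated.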

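Numerical Summary:
- Mean:
◼ The average of a set of quantitative values, and also the balancing point of the data. That is, if you subtract the mean from every value, the differences sum to 0.
◼ Example: the average mark of the students in a class.
◼ Quantitative variable
- Median:
◼ Sort a set of quantitative values from smallest to largest; the middle value is the median. If the number of values is even, the median is the average of the two middle values.
◼ Quantitative variable
- Left skew:
◼ Plot the data as a histogram and trace its shape with a curve. If the tail of the curve points to the left, the data are left-skewed.
◼ In left-skewed data, the mean is less than the median.
- Right skew:
◼ Plot the data as a histogram and trace its shape with a curve. If the tail of the curve points to the right, the data are right-skewed.
◼ In right-skewed data, the mean is greater than the median.
- Root Mean Square (RMS):
◼ Measures the overall size (spread about zero) of a set of values.
◼ Formula: $\mathrm{RMS} = \sqrt{\frac{\sum (\text{value})^2}{n}}$, where n is the total number of values.
- Standard Deviation (SD):
◼ Measures the spread of a set of data.
◼ The average deviation from the mean cannot measure spread directly, because the deviations always sum to 0 (the balancing point).
◼ Formula: $\mathrm{SD} = \sqrt{\frac{\sum (\text{value} - \text{mean})^2}{n}}$, i.e. the RMS of the deviations from the mean.
- Interquartile Range (IQR):
◼ With the data sorted from smallest to largest, the IQR is the range spanned by the middle 50% of the data (third quartile minus first quartile).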
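The summaries above can be checked on a small list. This is a minimal sketch using Python's standard library; the data values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since different software uses slightly different quartile rules.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)
median = statistics.median(data)

# The deviations from the mean balance out: they sum to zero.
deviation_sum = sum(x - mean for x in data)

def rms(values):
    """Root mean square: square, average, then take the square root."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))

# Population SD = RMS of the deviations from the mean.
sd = rms(x - mean for x in data)

# Quartiles with the 'inclusive' method; other conventions can differ slightly.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, deviation_sum, round(rms(data), 3), sd, iqr)
```

Note that `sd` here divides by n, so it agrees with `statistics.pstdev` rather than `statistics.stdev`, which divides by n - 1.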

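1.2 Module 2 Key Concepts
1.2.1 The Normal Distribution and the Normal Curve
The normal curve describes the distribution of many natural phenomena.

Description: symmetric and bell-shaped curve
The normal curve is symmetric about its centre, where it peaks, and falls away to the left and right.
The normal curve is common because its shape matches many natural patterns. When we record a sample, however, we never get a perfectly normal distribution. We usually record the sample data and display them with a histogram, and the normal distribution can then be used to approximate the shape of the histogram.
With the normal curve we can estimate areas under a histogram that is roughly normal. The closer the histogram is to the normal distribution, the more accurate the estimate.
- Central Limit Theorem: take a population with any distribution whatsoever. If we repeatedly draw sufficiently large samples from it and record the sample mean each time, then after enough repetitions the distribution of the recorded sample means will be close to a normal distribution.
◼ Conditions for the central limit theorem:
◆ Each sample is large enough
◆ Enough samples are drawn in total
◼ See: https://seeing-theory.brown.edu/probability-distributions/index.html
Every normal curve follows the "68%-95%-99.7%" rule.

Every normal curve can be divided up as follows:
- The x value at the peak is the mean of the data.
- Going 1 SD to the left and 1 SD to the right of the mean, the area under the normal curve is 68% of the total area.
- Going 2 SDs to the left and 2 SDs to the right of the mean, the area under the normal curve is 95% of the total area.
- Going 3 SDs to the left and 3 SDs to the right of the mean, the area under the normal curve is 99.7% of the total area.
That is: measured from the centre of a normal distribution, 3 SDs cover almost the whole distribution.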
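The three percentages can be recomputed from the standard normal curve. A minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k SDs."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```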
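General Normal and Standard Normal:
What all normal distributions have in common:
- Every normal curve has a mean. The mean is the data value at the peak and also the value on the curve's axis of symmetry.
- Every normal curve uses the SD to measure how wide it is.
The curves differ in the values of their mean and standard deviation (that is, their width).
We usually write $X \sim N(\mu, \sigma^2)$ to say that X follows the normal curve centred at the mean $\mu$ with standard deviation $\sigma$.
Every normal curve can be written this way. In particular, $N(0, 1)$ denotes the Standard Normal: the distribution with mean 0 and SD 1.
Every normal distribution can be converted to the standard normal by standardising. This uses standard units, also called the z score.
Formula for standard units: $z = \frac{\text{value} - \text{mean}}{\text{SD}}$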
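As a short illustration of standardising, the sketch below converts one value to standard units and reads off the area below it. The mean of 65 and SD of 10 are hypothetical numbers chosen for the example.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Convert a data value to standard units."""
    return (value - mean) / sd

mean, sd = 65, 10          # hypothetical mean and SD
z = z_score(80, mean, sd)  # 1.5 SDs above the mean

# Once in standard units, any normal variable can use the standard normal curve.
share_below = NormalDist().cdf(z)
same_share = NormalDist(mu=mean, sigma=sd).cdf(80)
print(z, round(share_below, 4))  # 1.5 0.9332
```

`share_below` and `same_share` agree, which is the point of standardising: one curve serves every normal distribution.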

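1.2.2 Chance Error and Outliers
Chance error:
However carefully we measure, experimental results will very probably differ from the theoretical value. This is due to chance error. To gauge the size of the chance error, we usually repeat the same experiment many times under the same conditions and compute the standard deviation of the results.
Chance error cannot be eliminated entirely (for example, when tossing a coin), but repeating the experiment many times reduces its influence on the result.
Consequences of chance error:
- Outliers: when an experiment is repeated many times, a small number of extreme values may appear in the results. These are called outliers. Usually a value more than 3 standard deviations from the mean of the data is classed as an outlier. (In a normal distribution, 3 SDs cover 99.7% of the data.)
- When outliers are present, the data may look less like a normal distribution, and the mean and standard deviation of the data may also be affected.
Bias:
Unlike chance error, bias is a factor that affects every measurement in the experiment. Bias cannot be removed by repeating the experiment.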
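1.2.3 Linear Association and Correlation
Linear Association:
A scatter plot shows the association between two quantitative variables.
If one variable (the dependent variable) changes in a regular, consistent way as the other (the independent variable) changes, the two may have a linear association. When the points on the scatter plot cluster around a straight line, a linear association is likely.
A linear association can be used to build a linear model, which in turn estimates and predicts unknown values.

Correlation:
The strength of a linear association is expressed by the correlation coefficient, written r. It measures how tightly the data points cluster around the trend line, and gives both the strength and the direction of the linear association.
The correlation coefficient lies between -1 and 1:
- If r is positive: the dependent variable increases as the independent variable increases.
- If r is negative: the dependent variable decreases as the independent variable increases.
- The closer r is to 1 or -1, the more tightly the points cluster around the trend line, and the stronger the association between the two variables.

The correlation coefficient is obtained by converting both variables to standard units, multiplying them pairwise and averaging the products. It expresses how closely the data are related, not the slope of the linear model. r = 0.8 does not mean that 80% of the points lie close to the trend line, nor that the association is twice as strong as with r = 0.4.
- For a non-linear pattern, a correlation coefficient can still be calculated, but the number is meaningless. The correlation coefficient is valid only for a linear association.
Note: when analysing correlation, do not use the trend in a scatter plot to conclude that x causes the changes in y; you can only judge whether the two are associated. Association cannot establish causation.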
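The "average of the products of standard units" recipe can be written out directly. A minimal sketch on hypothetical data; the helper names are made up, and the SD used is the population SD (dividing by n).

```python
import math

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def pop_sd(values):
    """Population SD: RMS of the deviations from the mean."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def standard_units(values):
    m, s = mean(values), pop_sd(values)
    return [(v - m) / s for v in values]

# r = average of the products of the paired standard units.
zx, zy = standard_units(x), standard_units(y)
r = mean([a * b for a, b in zip(zx, zy)])
print(round(r, 4))  # 0.7746
```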
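Regression Line:
The straight line joining the centre point of the scatter plot $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + r \cdot SD_y)$; it is the line that the points of a scatter plot with a linear association cluster around.
Building on the centre point and the SDs, the regression line also takes into account the effect of r (the correlation coefficient) on the prediction.
Once the regression line is calculated, it can be used to predict values. Note, however, that predictions from a linear model must stay between the smallest and largest values of the independent variable in the original data. Predictions outside that range are invalid.
Regression line prediction: from a set of input data, create the scatter plot and the corresponding regression line. Work out the linear model (a linear equation in two variables, y = a + bx) and substitute the value of the independent variable to obtain the predicted value of the dependent variable.
- Linear model: analyse the input data to judge whether there is a linear association. If there is, the corresponding regression line can serve as a linear model. On the basis of the known data, the model gives fairly accurate predictions for new inputs that lie within the range of the original data; outside that range, reliable prediction is difficult. In addition, a linear model requires that the points on the scatter plot show a linear trend; otherwise the correlation and any linear model built from it are meaningless.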
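A sketch of building the regression line from the five summary numbers (the two means, the two SDs and r) and predicting only inside the observed range. The data are hypothetical, and raising an error for out-of-range inputs is one possible way to enforce the "no extrapolation" rule, not a course requirement.

```python
import math

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def pop_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

x_bar, y_bar = mean(x), mean(y)
sd_x, sd_y = pop_sd(x), pop_sd(y)
r = mean([(a - x_bar) / sd_x * (b - y_bar) / sd_y for a, b in zip(x, y)])

# Moving 1 SD in x moves the prediction r SDs in y.
slope = r * sd_y / sd_x
intercept = y_bar - slope * x_bar

def predict(new_x):
    """Predict y, refusing to extrapolate outside the observed x range."""
    if not min(x) <= new_x <= max(x):
        raise ValueError("prediction outside the range of the original data")
    return intercept + slope * new_x

print(round(slope, 3), round(intercept, 3), round(predict(4), 3))  # 0.6 2.2 4.6
```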

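Percentile ranks prediction: given the percentile rank of a value of x, predict the percentile rank of the corresponding y.
- Find the z score (standard units) for x
- From the z score of x, find the z score of y (multiply by r)
- Convert the z score of y to a percentage using the normal distribution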
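The three steps translate into a few lines of Python. This sketch assumes both variables are roughly normal and that the z score of y is predicted as r times the z score of x; the inputs (90th percentile, r = 0.6) are hypothetical.

```python
from statistics import NormalDist

def predict_percentile(x_percentile, r):
    """Predict the percentile rank of y from that of x (both assumed normal)."""
    std = NormalDist()
    z_x = std.inv_cdf(x_percentile)  # step 1: percentile of x -> z score
    z_y = r * z_x                    # step 2: regression in standard units
    return std.cdf(z_y)              # step 3: z score of y -> percentile

y_percentile = predict_percentile(0.90, r=0.6)  # hypothetical inputs
print(round(y_percentile, 2))  # 0.78
```

The predicted rank (about the 78th percentile) is closer to the middle than the 90th percentile of x, because r is less than 1.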

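Residual:
A residual measures the vertical distance from a data point on the scatter plot to the regression line. It is similar to the deviations behind the standard deviation, which measure the distance from a data point to the mean.

A residual is the difference between the actual value and the predicted value: $e_i = y_i - \hat{y}_i$, where $e_i$ is the residual of the i-th point, $y_i$ is the actual value, and $\hat{y}_i$ is the value of y that the linear model predicts at that point's x.
A residual plot graphs the residuals of the data. If the residual plot shows a clear pattern (that is, reading vertically, the spread of the residuals clearly differs from one x to another), the data may not follow a linear relationship (for example, a fanning pattern).
A residual plot can be drawn from the scatter plot. Treat the predicted value at each x (the y value on the regression line) as the baseline 0, then plot each residual against its x to obtain the residual plot.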
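A small sketch of computing residuals. The five points are hypothetical, and the line y = 2.2 + 0.6x is the regression line for exactly these points, entered by hand to keep the example short.

```python
import math

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
intercept, slope = 2.2, 0.6  # regression line fitted to these five points

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

# RMS of the residuals: the typical vertical distance from the line.
rms_residual = math.sqrt(sum(e * e for e in residuals) / len(residuals))

for xi, e in zip(x, residuals):
    print(xi, round(e, 2))
print(round(rms_residual, 3))  # 0.693
```

Plotting each residual against its x, with 0 as the baseline, gives the residual plot described above; the residuals of a regression line always sum to 0.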

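1.2.3.1 Homoscedastic and Heteroscedastic

The RMS tells us whether the data points lie at similar distances from the regression line. Whether a point is above or below the line, the RMS is always greater than or equal to 0. If the y values have roughly the same spread at every x in the scatter plot, the data are called homoscedastic. If instead the spread differs markedly from one x to another, the data are called heteroscedastic.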

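1.3 Module 3 Key Concepts
1.3.1 Probability
Prosecutor’s Fallacy:
A fallacy of reasoning that uses "irrelevant data", or "relevant data whose probabilities have not been weighed correctly", to conclude that "the chance the defendant is innocent is very small".
- It has been used to wrongly conclude that a defendant is "guilty"
- It has also been used by defence lawyers to argue that their client is "not guilty"
Chance:
Chance / probability is how likely a particular event is to occur when a process is repeated many times.
- Range of probability: 0 – 1 (or 0% - 100%)
◼ 0 means the event cannot occur
◼ 1 means the event is certain to occur
- P(Event A) denotes the probability that Event A occurs.
Complement: the event opposite to the current event, so P(complement of A) = 1 - P(Event A).
Conditional Probability
Conditional probability is the probability that one event occurs given that another event has already occurred.
Notation: P(Event 1| Event 2) – the probability that Event 1 occurs given that Event 2 has occurred.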
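1.3.1.1 – Conditional Probability

Take a standard die (6 fair faces, each equally likely).

Define two events:
- Event A: the roll is an even number (2, 4, 6)
- Event B: the roll is a prime number (2, 3, 5)

Taken on their own, Event A and Event B each have probability 1/2.

Now add a condition: after the roll, we know that Event A holds (the roll is one of 2, 4, 6). The probability of Event B then changes from 1/2 to 1/3.

This is because Event B has changed from drawing one of (2, 3, 5) out of (1, 2, 3, 4, 5, 6) to drawing one of (2, 3, 5) out of (2, 4, 6). Since 3 and 5 are not in the new range, the event reduces to drawing 2 out of (2, 4, 6), so the probability becomes 1/3.

Define a third event:
- Event C: the roll is an odd number (1, 3, 5).

What are P(Event C| Event A) and P(Event C| Event B)?
- P(Event C| Event A) = 0
- P(Event C| Event B) = 2/3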
1.3.1.2 – Multiplication Principle

The probability that two events (Event A & Event B) both occur is:

P(Event A & Event B) (=P(Event B & Event A)) = P(Event A) * P(Event B | Event A)

Using 1.3.1.1 as an example:

P(Event A & Event B) = P(Event B) * P(Event A | Event B)
= 1/2 x 1/3
= 1/6

Going only by the definitions of Event A and B:
- Event A: the roll is an even number (2, 4, 6)
- Event B: the roll is a prime number (2, 3, 5)

A and B overlap only at 2, so both are satisfied only when 2 comes up. That has probability 1/6, which matches the calculation.
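Independent and Dependent Events
Take two events A and B. If the probability of B is unaffected by A having occurred, A and B are independent events. That is: P(Event B | Event A) = P(Event B). When A and B are independent, the probability that both occur is: P(Event A & Event B) = P(Event A) x P(Event B).
Conversely, if P(Event B) ≠ P(Event B | Event A), then A and B are dependent.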
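Independence can be tested by comparing P(B | A) with P(B). The sketch below reuses the die; the event `low` (the roll is 1 or 2) is a hypothetical extra event added to show an independent pair.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # fair six-sided die

def prob(event):
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def cond_prob(event, given):
    return Fraction(len(event & given), len(given))

def independent(a, b):
    """A and B are independent when knowing A leaves P(B) unchanged."""
    return cond_prob(b, a) == prob(b)

even = frozenset({2, 4, 6})
prime = frozenset({2, 3, 5})
low = frozenset({1, 2})  # hypothetical extra event: the roll is 1 or 2

print(independent(even, prime))  # False: P(prime | even) = 1/3, P(prime) = 1/2
print(independent(even, low))    # True: P(low | even) = 1/3 = P(low)
print(prob(even & low) == prob(even) * prob(low))  # True
```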
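Mutually Exclusive
Two events, Event A and Event B, are mutually exclusive if they can never occur at the same time.
1.3.1.3 – Mutually Exclusive

Toss a fair coin (heads and tails equally likely).

Define two events:
- Event A: the coin lands heads
- Event B: the coin lands tails

Event A and Event B can never occur together: after a toss the coin shows either heads or tails.
Addition Rule:
- If two events are mutually exclusive, the probability that at least one of them occurs is the sum of their probabilities.
◼ Example: roll a fair die
◆ Event A: the roll is 3
◆ Event B: the roll is even
◆ The probability of rolling 2, 3, 4 or 6 is 4/6 = 1/6 + 1/2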
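A quick check of the addition rule, and of why it needs mutually exclusive events. Events are again sets of die faces; the second comparison (even versus prime) is an added illustration of a pair that overlaps.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # fair six-sided die

def prob(event):
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

three = frozenset({3})
even = frozenset({2, 4, 6})
prime = frozenset({2, 3, 5})

# Mutually exclusive (no shared outcome): the probabilities simply add.
print(prob(three | even) == prob(three) + prob(even))  # True, both are 2/3

# Not mutually exclusive (both contain 2): adding double-counts the overlap.
print(prob(even | prime), prob(even) + prob(prime))    # 5/6 versus 1
```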
1.3.1.4 – Confusion and Rules

Term                Meaning
Mutually Exclusive  Event A and Event B cannot occur at the same time.
Independence        Event A and Event B can occur at the same time; Event A occurring does not affect the probability of Event B, and Event B occurring does not affect the probability of Event A.

Rule                 Probability                   Formula                            Condition
Addition Rule        P(Event A or Event B occurs)  P(Event A) + P(Event B)            Mutually exclusive
Multiplication Rule  P(Event A and Event B occur)  P(Event A) x P(Event B)            Independent
                                                   P(Event A) x P(Event B | Event A)  Dependent

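1.3.2 Permutation and Combination
Factorial:
The formula is $n! = 1 \times 2 \times 3 \times \dots \times n$. It is usually used in permutations to count arrangements, and simple arrangement problems can be answered with a factorial alone. For example, 10 people can be lined up in 10! ways.
Note: 0! = 1
Permutation: $^nP_r = \frac{n!}{(n-r)!}$, usually used to count arrangements where order matters. It is slightly more involved than a plain factorial. For example, choosing 3 people out of 10 and lining them up can be done in 10P3 ways.
Binomial Coefficient:
The coefficient in the binomial model. When we expand the binomial $(x + y)^n$, the coefficients of the terms are the binomial coefficients.

The binomial coefficient can be written as a combination: $^nC_r = \frac{n!}{r!(n-r)!}$, usually used to count selections where order does not matter. For example, choosing 3 representatives out of 10 people can be done in 10C3 ways.
Binomial probability: suppose an experiment has only two outcomes with fixed probabilities (p and 1-p, for outcome A and outcome B). If the experiment is repeated independently n times, the probability of getting exactly x outcomes of type A is the binomial probability. Formula: $^nC_x \times p^x \times (1 - p)^{n-x}$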
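The counting formulas map directly onto functions in Python's `math` module (`math.perm` and `math.comb` need Python 3.8 or later). The helper `binomial_probability` is a made-up name; the last lines apply it to the screw-packaging question from the exercise section.

```python
import math

print(math.factorial(10))  # 3628800 ways to line up 10 people
print(math.perm(10, 3))    # 720 ordered choices of 3 from 10
print(math.comb(10, 3))    # 120 unordered choices of 3 from 10

def binomial_probability(n, x, p):
    """P(exactly x outcomes of type A in n independent trials)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# Exercise 14: 10 screws per pack, each defective with probability 0.01.
# A pack is replaced when it has more than 1 defective screw.
p_replace = 1 - binomial_probability(10, 0, 0.01) - binomial_probability(10, 1, 0.01)
print(round(p_replace, 3))  # 0.004
```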

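1.3.3 Box Model
Box Model:
A box model can be used to model a population: drawing tickets from the box corresponds to drawing a sample from the population. The box model makes it easier to simulate and analyse a complicated population.
Expected Value:
In a box model, the value we expect from a single draw is the expected value of that draw. Over several draws, the expected value of the total is the sum of the expected values of the individual draws.
- Usually, the expected value of a single draw is the mean of the box.
- Each draw may differ from the expected value by some chance error, but as the number of draws increases, the chance error matters less and less relative to the result.
Standard Error:
The standard error (SE) is the standard deviation of the distribution formed by taking a large number of samples from a population and recording the result (for example the sum or the mean) of each sample, i.e. the sampling distribution.
The standard error measures how far a sample result typically lies from its expected value. We usually compare the sample mean with the population mean to see the size of this error. In a box model, the SE of the sum of n draws is $\sqrt{n} \times SD$, where SD is the SD of the box.
There are three ways to calculate the SD of a box model:
- Use the RMS (of the deviations from the box mean)
- In R, use the popsd() function directly (not sd(): sd() returns the sample SD, whereas the box model represents the population distribution)
- For a simple box model (only two kinds of ticket, ticket A and ticket B, where the value of A is larger than that of B):
◼ $SD = (A - B)\sqrt{p_A \times p_B}$, where $p_A$ and $p_B$ are the proportions of A tickets and B tickets in the box
Usually the chance error is about the size of the standard error, and the observed value (the result actually obtained) lies within 2 to 3 SEs of the expected value. Values outside this range are usually regarded as outliers.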
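A sketch of the box-model calculations for a hypothetical 0-1 box. The box contents and the number of draws are invented for illustration; the SE of the mean (box SD divided by the square root of n) is the standard companion to the SE of the sum and is included as an assumption beyond the text above.

```python
import math
import statistics

box = [1, 0, 0, 0]  # hypothetical box: one ticket worth 1, three worth 0
n_draws = 100       # draws with replacement

box_mean = statistics.mean(box)
box_sd = statistics.pstdev(box)  # population SD, as the box is the population

# Shortcut for a two-value box: (big - small) * sqrt(p_big * p_small).
p_big = box.count(1) / len(box)
shortcut_sd = (1 - 0) * math.sqrt(p_big * (1 - p_big))

ev_sum = n_draws * box_mean
se_sum = math.sqrt(n_draws) * box_sd
ev_mean = box_mean
se_mean = box_sd / math.sqrt(n_draws)

print(ev_sum, round(se_sum, 2), ev_mean, round(se_mean, 4))
```

With these numbers, the sum of 100 draws is expected to be about 25, give or take roughly 4.3, so an observed sum outside about 12 to 38 (3 SEs either side) would be unusual.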

Confidence Interval:
A parameter is a numerical fact about a population, while an estimate (or statistic) is a value computed from a sample drawn from the population in order to measure or predict that parameter.
A confidence interval arises because, when we use a sample to investigate a population parameter, we cannot be sure of the parameter's exact value and can only pin it down to an approximate range. That range is our confidence interval.
Draw 100 sufficiently large samples from a population whose mean is unknown and compute a 95% confidence interval from each sample. Of the 100 confidence intervals, about 95 will contain the population mean.
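The "about 95 out of 100" interpretation can be simulated. This sketch makes several assumptions not stated in the text: the population is normal with a hypothetical mean of 50 and SD of 10, each sample has 50 values, and each interval is the sample mean plus or minus 1.96 standard errors (a normal approximation).

```python
import math
import random
import statistics

random.seed(1001)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD = 50, 10   # hypothetical population, normally "unknown"
SAMPLE_SIZE, N_SAMPLES = 50, 100

def confidence_interval_95(sample):
    """Sample mean +/- 1.96 standard errors (normal approximation)."""
    centre = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - 1.96 * se, centre + 1.96 * se

covered = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    low, high = confidence_interval_95(sample)
    if low <= TRUE_MEAN <= high:
        covered += 1

print(f"{covered} of {N_SAMPLES} intervals contain the population mean")
```

The count varies from run to run (change the seed to see this), but it should stay close to 95.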

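1.4 Module 4 Key Concepts
1.4.1 Hypothesis Testing
When we want to investigate some characteristic of a large population, we often take a random sample from it. Provided the sample size is large enough, the pattern in the sample is usually the same as, or close to, that characteristic of the population.
Hypothesis testing makes an assumption about the characteristic (that is, proposes a hypothesis), then uses the pattern in the sample to judge whether the hypothesis holds, and so infers from the sample to the population.
Why use hypothesis testing?
⚫ It measures the significance of the "evidence"
⚫ The significance of the evidence tells us whether the pattern in the sample reflects the population distribution or arose by chance.
Hypothesis testing framework
1. Set up the question and state two hypotheses, H0 and H1. H0 is the null hypothesis, the hypothesis set up in advance for the test (note that the null hypothesis is the statement that "agrees" with the originally proposed distribution). H1 is the alternative hypothesis, the opposite of the null. When the evidence shows the null hypothesis is hard to sustain, we reject the null hypothesis in favour of the alternative hypothesis.
2. Analyse the sample to obtain evidence. Before doing so, check whether the assumptions hold: the assumptions about the population and the sample must be examined before the test is run, and different types of hypothesis test require different assumptions. If the assumptions are not stated, the conclusion is not transparent; if the test does not satisfy its assumptions, the conclusion may be invalid.
3. Calculate the test statistic. The test statistic measures how far the observed data (measured from the sample) deviate from the value expected under the null hypothesis. The formula is $T_0 = \frac{\text{OV} - \text{EV}}{\text{SE}}$, where $T_0$ is the test statistic, OV the observed value, EV its expected value and SE its standard error.

4. After obtaining the test statistic, calculate the P-value. The P-value indicates whether the sample is consistent with the null hypothesis. It is defined as follows: if we repeatedly drew samples from the population and the null hypothesis were true, the P-value is the probability that the test statistic of such a sample would be at least as extreme as the test statistic of the sample we actually have.
The test statistic relies on the central limit theorem: the sampling distribution is treated as a normal distribution, and the test statistic is the standardisation of the observed value within it. The p-value is then the area under the normal curve beyond the test statistic.
In R, the p-value can be found directly. For the example above, the p-value is 0.0027.
Note: the p-value is not the probability that the null hypothesis is true. It is the probability, calculated assuming the null hypothesis is true, that a sample's test statistic is at least as extreme as the one obtained from the given sample.
5. Draw a conclusion. Usually, when the p-value is less than 0.05, we say the test statistic is statistically significant: the evidence shows that the difference between the sample and what the null hypothesis predicts is unlikely to be due to chance error alone. We therefore reject the null hypothesis in favour of the alternative hypothesis.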
P value:
In hypothesis testing there are two kinds of alternative hypothesis:
- 1-sided: H1: p > 0.5 or H1: p < 0.5
- 2-sided: H1: p ≠ 0.5 (equivalent to covering both 1-sided cases)
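The five steps can be traced on a small proportion test. The data below (65 heads in 100 tosses) are hypothetical and are not the example the notes refer to; the sketch assumes the normal approximation and models the tosses as draws from a 0-1 box under H0.

```python
import math
from statistics import NormalDist

# Hypothetical data: 65 heads in 100 tosses. H0: the coin is fair (p = 0.5).
n, observed, p0 = 100, 65, 0.5

expected = n * p0                             # EV of the number of heads under H0
se = math.sqrt(n) * math.sqrt(p0 * (1 - p0))  # sqrt(n) * SD of a 0-1 box
t0 = (observed - expected) / se               # test statistic

std = NormalDist()
p_upper = 1 - std.cdf(t0)                 # 1-sided, H1: p > 0.5
p_two_sided = 2 * (1 - std.cdf(abs(t0)))  # 2-sided, H1: p != 0.5

print(t0, round(p_upper, 5), round(p_two_sided, 4))  # 3.0 0.00135 0.0027
print("reject H0" if p_two_sided < 0.05 else "retain H0")
```

The 2-sided p-value is twice the 1-sided one because it counts results that are extreme in either direction.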

2. Worked Exercises by Topic
1. A survey question was posted on Facebook, asking the question “Which social media
platform is mostly used by you?” Is there bias involved in this question? If so, what type of
bias is involved?

2. What is the meaning of double-blind experiments? Why do we use them?

3. Since 2000, the median wage for high school dropouts, high school graduates with no
college education, people with some college education and people with Bachelor’s or higher
degrees have all fallen, however the overall wages of people in United States have risen. Can
you suggest a possible reason that is able to explain this in terms of Simpson’s Paradox?

4. Which graphical summary would be most suitable for each of the following scenarios?
a. Represent the distribution of students’ score in a subject
b. Comparing the median score of 3 groups of athletes.
c. Explore the relationship between students’ weekly exercise time and their height
d. Represent the total number of books of each category in a book store

5. Over the period 1964 – 70, the suicide rate in England fell by about 1/3. During this period,
a volunteer welfare organisation called ‘the Samaritans’ was expanding rapidly. One
investigator thought that the Samaritans were responsible for the decline in suicide. An
observational study was conducted to prove it. The study was based on 15 pairs of towns. To
control for confounding, the towns in a pair were matched on the variables regarded as
important. One town in each pair had a branch of the Samaritans, the other did not. On the
whole, the towns with the Samaritans had lower suicide rates. So the Samaritans prevented
suicides, or did they?

6. The figure below is a histogram showing the distribution of blood pressure for all 14,148
women in a Drug Study. Use the histogram to answer the following questions:

a) Is the percentage of women with blood pressures above 130mm around 25%, 50%, or 75%?
b) In which interval are there more women: 135-140mm or 140-150mm?
c) On the interval 125-130, height of histogram is about 2.1% per mm. What percentage of
women had blood pressures in this class interval?

7. Below are sketches of histograms for 3 lists.

a) In scrambled order, the averages are 40, 50, 60. Match the histograms with the averages.
b) Match the histogram with the description:
- The median is less than the average
- The median is about equal to the average
- The median is bigger than the average

8. A normal curve is defined by which parameters:
a) Mean and median
b) Mean and interquartile range
c) Median and standard deviation
d) Mean and standard deviation
e) Range and outliers

9. True or false:
a) If you add 7 to each entry on a list, that adds 7 to the average
b) If you add 7 to each entry on the list, that adds 7 to the SD.
c) The median and the average of any list are always close together
d) Half of a list is always below average

10. A study of the IQs of husbands and wives obtained the following results:
For husbands, average IQ = 100, SD = 15
For wives, average IQ = 100, SD = 15
r = 0.6
One of the following is a scatter diagram for the data. Which one?

11. The figure below has 6 scatter diagrams for hypothetical data. The correlation coefficients,
in scrambled order, are:
-0.85, -0.38, -1.00, 0.06, 0.97, 0.62
Match the scatter diagrams with the correlation coefficients

12. The scatter diagram below shows scores on the midterm and final in a certain course.
a) Was the average midterm score around 25, 50 or 75?
b) Was the SD of the midterm scores around 5, 10 or 20?
c) Was the SD of the final scores around 5, 10, or 20?
d) Which exam was harder – the midterm or the final?
e) Was there more spread in the midterm scores, or the final scores?
f) True or false: There was a strong positive association between midterm
scores and final scores.

13. Look at the figure below.
a) Is the SD of y about 0.6, 1.0 or 2.0?
b) Is the SD of the residuals about 0.6, 1.0 or 2.0?
c) Take the points in the scatter diagram whose x-coordinates are between 4.5 and
5.5. Is the SD of their y-coordinates about 0.6, 1.0 or 2.0?

14. It is known that screws produced by a certain company will be defective with probability
0.01, independently of one another. The company sells the screws in packages of 10 and
offers a money-back guarantee that at most 1 of the 10 screws is defective. That is, if the
package contains more than 1 defective screw, it will be replaced by the company. What
proportion of packages sold must the company replace?

15. Suppose that P(A) = 0.2 and P(B) = 0.3. What does P(A|B) equal if A and B are independent?

16. What are the differences between the null and alternative hypotheses?

17. True or false and explain briefly.
a) A difference which is highly significant can still be due to chance
b) A statistically significant number is big and important

Answers:
1. Selection bias: the survey was posted on Facebook, so the respondents are Facebook users and are more likely to name Facebook as their most used platform.
2. A double-blind experiment means that neither the subjects nor the investigators know who belongs to the control group and who to the treatment group until the whole experiment is finished. This prevents knowledge of the grouping from biasing the subjects' responses or the investigators' measurements.
3. The mix of education levels may have changed since 2000: a larger share of workers now fall in the better-educated groups, which earn more. Wages can then fall within every group while the overall figure rises, because more people sit in the higher-paid groups.
4. a) Histogram b) Boxplot c) Scatter plot d) Bar plot
5. However carefully an observational study is done, it is not an experiment. The Samaritans may be associated with the lower suicide rates, but the study cannot establish that they prevented suicides; confounders other than the matched variables may remain.
6. a) 25% b)140-150mm c) 10.5%
7. a) 60, 50, 40
b) i) median > avg ii) median = avg iii) median < avg
8. d
9. a) True b) False c) False d) False
10. d
11. 0.62, -1.00, -0.85, 0.97, 0.06, -0.38
12. a)75 b) 5 c) 10 d) final e) final f) true
13. a) 1.0 b) 0.6 c) 0.6
14. 0.004
15. 0.2
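16. The null hypothesis states that the difference is due to chance, whereas the alternative hypothesis says that the difference is real.
17. a) True. A significant result means that data like these would be unlikely if the null hypothesis were true, but it remains possible that the difference arose by chance. b) False. Statistical significance says the difference is hard to explain by chance; it does not say the difference is large or important.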
