辅导案例-IE 300

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

IE 300 Analysis of Data Spring 2020
Section AL1 N. He
COURSE PROJECT I: DATA ANALYSIS FOR AIRLINE DELAYS
1. Objectives
The purpose of this project is to become familiarize with Python to perform basic data analysis.
There are many ways to assess the quality of an airline. You will learn how to assess airlines based
on historical flight data. In the first part of the project, we will determine which airline is the fastest
based on flight time calculations. In the second project, we will develop a very useful predictive
model, Naive-Bayes Classifier to determine the likelihood of delays and predict whether a flight
will be delayed or not.
The objectives of the course projects are to:
• Use Python to read and analyze data;
• Become familiarize with different data structures such as lists and dictionaries;
• Use conditionals and loops to process data, fix anomalies and perform calculations;
• Perform basic plotting in Python using the package matplotlib;
• Apply the concept of conditional probability and Bayes’ Theorem;
• Develop a simple Naive-Bayes classifier model using m-estimates.
Deliverables.
• You should code with Jupyter notebook, and provide detailed problem description, Python
code, and output for each exercise listed below. You can work on template provided.
• Your should submit a zip folder named IE300-ADx-Groupy-Lastname1-Lastname2-
Lastname3.zip, where x is your session number, y is your group number. The folder
should contain a .ipynb file that contains all of your work, and a .pdf file saved from
Jupyter Notebook, both with the same file names as the zip folder. You can also provide
any supplementary files if needed in the zip folder.
• Only one member of your group should submit the project on Gradescope.
• The deadline for the project is March 27 (Friday), midnight at 11:59pm.
1
2. How do we determine which airline is the “fastest”?
Consider the following scenario. Airline A says they will get you from LGA to DCA in 45
minutes, but took 60 minutes instead. On the same route, Airline B says it will only take 70
minutes but actually took 65 minutes. Which airline do you consider the fastest?
Most people will say Airline A since the trip took 60 minutes as opposed to 65. However,
according to the government, Airline A is “late” since it took longer than 15 minutes to arrive.
Because of this, most airlines pad their schedules by saying flight times are longer than they usually
are. How do we accurately determine which airline is the fastest? An approach using historical
data was proposed in [1]. In this case study, we will reproduce some of the results found in the
study to determine the fastest airlines.
The dataset FlightTime.csv contains 743 observations of flights from ORD to LAX throughout
November 2015. There are 10 variables.
Variable Number Variable Name Description
1 Flight Date Date of the flight (mm/dd/yyyy)
2 Carrier IATA Carrier Identification
3 Flight Number Number assigned to the flight
4 Origin Origin airport
5 Destination Destination airport
6 Departure Time Time of departure (hhmm)
7 Depature Delay Difference between scheduled and actual departure time (minutes)
8 Arrival Time Time of arrival (hhmm)
9 Arrival Delay Difference between scheduled and actual arrival time (minutes)
10 Flight Time Difference between actual departure and arrival time (minutes)
Reference: Bureau of Transportation Statistics
There are four types of time we are interested in:
(1) Flight Time: The difference between the departure time and arrival time.
(2) Average Flight Time: The average time an airline takes to complete the route.
(3) Target Flight Time: An estimate of how long a flight should take based on distance
and direction of travel. This is calculated based on the spherical distance (great circle
distance). There are many ways to compute this but we will use a simplified formula that
does not include the complex variables like windspeed, jetstreams, etc. Let lori and ldes
is the longitude of the origin and destination respectively, and d be the spherical distance
between two points. Assuming a constant velocity and an average time of 20 minutes to
runway time, define TFT = 0.117 ∗ d + .517 ∗ (lori − ldes) + 20.
(4) Typical time: Calculated by the target time plus the average delay associated with the
origin and departure airport.
The measure of an airline’s performance is determined by the difference between average flight time
and typical time which is called time added. The lower the time added, the faster the airline is
on that particular route.
Exercise 1: Using FlightTime.csv, write a Python program to perform the following.
(1) Read the dataset, ignoring any observations without a recorded departure or arrival time.
Some times are recorded incorrectly resulting in incorrect flight time. For any flight time
less than 230 minutes, delete the observations.
2
(2) Calculate the number of observations in your dataset.
(3) Calculate the target flight time with d = 1741.16 mi, lori = −87.90◦ and ldes = −118.41◦.
(4) Calculate the typical time of this route. You will need to get the average of the departure
and arrival delays and add it to the target flight time.
(5) Calculate the time added for each airline and determine which airline have the lowest
time added. Note, you will need to calculate the average flight time for each airline.
(6) Output the results.
Exercise 2: Using matplotlib, create a bar graph for the time added of each airline. Be sure to
label your axes and plot.
3. Modeling ad Predicting Flight Delays
As any frequent flier knows, flight delays are an inevitable part of commercial air travel. In the
previous case study, you were able to determined which airline is the fastest for a particular route.
In doing so, you had to take into account departure and arrival delays of each airline. While the
cause for delays are largely unpredictable, we can make some insights on which flights are more
likely to be delayed without taking weather conditions into consideration.
An obvious predictor of delays is the airline itself. For example, in 2014, only 74.34% of American
Airline’s flights were on-time while Delta had a 85.21% on-time record [2]. How can we use historical
data to make predictions about delays? You may have already learned about several predictive
models such as ordinary least-squares. We will study a very popular model in statistics and machine
learning called Naive-Bayes.
3.1. Naive-Bayes Classifier. Recall that Bayes’ Theorem tells you the probability of an event
based on conditions possibly related to that event. That is, given some event A and B,
(1) P (A|B) = P (A)P (B|A)
P (B)
The Naive-Bayes classifier is based on Bayes’ Theorem with the assumption of independence
between attributes. Suppose we have a vector X = (x1, x2, ..., xn) representing n attributes
(independent variables). We wish to compute
(2) P (Ck|X) = P (Ck)P (X|Ck)
P (X)
where Ck represents possible outcomes (classes). If we are trying to decide whether a particular
flight is likely to be delayed, the outcome would be from the set Ck ∈ {Y,N}, corresponding to the
flight being delayed (Y ) or not (N). The vector X would contain information about the flight (e.g.,
airline, origin airport, destination airport) that could give some insight into which class will occur.
One approach to picking a class is to pick the value Ck that is most probable, given that we know
the values of the attributes. In other words, we pick the Ck that maximizes P (Ck|X). Since the
vector X is known, and P (Ck|X) = P (Ck ∩X)/P (X) by the definition of conditional probability,
3
then P (X) will be the same for all possible classes, and we can instead focus on selecting the class
that maximizes P (Ck ∩X).
Noting that the attribute vector X typically contains values of several attributes, so we can
rewrite P (Ck ∩ X) = P (Ck ∩ x1 ∩ x2 ∩ . . . ∩ xn), where event xi tells us the particular value of
attribute i. In practice, the symbol ∩ is often replaced with a comma when intersecting many
events. By repeatedly applying the definition of conditional probability, we have
(3)
P (Ck ∩X) = P (Ck)P (X|Ck)
= P (Ck)P (x1, ..., xn|Ck)
= P (Ck)P (x1|Ck)P (x2, ..., xn|Ck, x1)
...
= P (Ck)P (x1|Ck)P (x2|Ck, x1) · · ·P (xn|Ck, x1, x2, ..., xn−1)
This decomposes P (Ck ∩ X) into a product of the probability of outcome Ck with conditional
probabilities of each attribute. We assume conditional independence between each attributes
(hence, the “naive” portion of the name). That is, P (xi|Ck, xj) = P (xi|Ck) where i 6= j, which
allows us to discard all conditions except Ck in Equation (2). Hence, this equation simplifies to
(4) P (Ck|X) ∝ P (Ck)
n∏
i=1
P (xi|Ck)
With this definition, we can define a classifier (a decision rule). A common classifier is to pick
the most probable class, as discussed above. This is called the maximum a posterior (MAP)
rule which assigns a label yˆ = Ck for some k:
(5) yˆ = argmax
k
P (Ck)
n∏
i=1
P (xi|Ck)
3.2. M-Estimates. The conditional densities P (xi|Ck) can be defined in different ways (e.g., Nor-
mal, Poisson) depending on the problem. If our dataset is small, we often do not know the under-
lying distributions. An intuitive way to estimate P (xi|Ck) is to define
(6) P (xi|Ck) = nˆ
n
where nˆ is the number of observations which C = Ck and X = xi and n is the number of observations
C = Ck. However, for small data sets we may find that that nˆ = 0 or n = 0, which pose a problem.
A way to avoid fix this problem is to use m-estimates:
(7) P (xi|Ck) = nˆ + mp
n + m
where m is called the equivalent sample size and p is the a priori estimate of P (xi|Ck). A typical
choice for p is p = 1/ (# of possible values for attribute i), which assumes that all attribute values
are equally likely. The idea behind m-estimates is to pretend we have m extra observations with
class C = Ck, with m · p of them having attribute X = xi. Essentially, m says how confident we
are of our prior estimate p since as m increases, we have P (xi|Ck)→ p.
Let us consider an example problem.
4
Example 1: Consider the following table containing information of delays.
Obs Origin Destination Airline Delay
1 DFW ORD Delta Y
2 DFW ORD Delta N
3 DFW ORD Delta Y
4 LAX ORD Delta N
5 LAX ORD United Y
6 LAX LGA United N
7 LAX LGA United Y
8 LAX LGA Delta N
9 DFW LGA United N
10 DFW ORD United Y
Table 1. Flight Delay 1
The class is the variable Delay and the attributes are Origin, Destination and Airline. Classify
DFW-LGA on Delta using Naive-Bayes with MAP and m-estimates. Assume m = 3.
Solution: First, note that DFW-LGA on Delta is not contained in the table. However, we can
compute the probability of delay for this flight using Equation (5). Using an equivalent sample size
of m = 3, we compute the relevant conditional probabilities as follows:
P (DFW |Y ) = 3 + 3(0.5)
5 + 3
= 0.5625
P (DFW |N) = 2 + 3(0.5)
5 + 3
= 0.4375
P (LGA|Y ) = 1 + 3(0.5)
5 + 3
= 0.3125
P (LGA|N) = 3 + 3(0.5)
5 + 3
= 0.5625
P (Delta|Y ) = 2 + 3(0.5)
5 + 3
= 0.4375
P (Delta|N) = 3 + 3(0.5)
5 + 3
= 0.5675
For P (DFW |Y ), we have five observations where Ck = Y , of which three observations have xi =
DFW . So, we have that n = 5, nˆ = 3 and p = 0.5 since there are only two values of the attribute
Origin (DFW and LAX). Since Y and N each occur in five of the ten data points, we compute
P (Y ) = P (N) = 0.5, we apply Equation (5) to obtain
P (Y |DFW,LGA,Delta) ∝ P (Y )P (DFW |Y )P (LGA|Y )P (Delta|Y )
= 0.0385
and
P (N |DFW,LGA,Delta) ∝ P (N)P (DFW |N)P (LGA|N)P (Delta|N)
= 0.0698
Since yˆ = argmax {0.0385, 0.0698} = 0.0698, we classify DFW-LGA on Delta as N .
5
Exercise 3: Using the Table 2, will the flight SEA-ATL on Southwest with good weather be
delayed? Use m-estimates for the conditional probabilities with m = 4. Print your results.
Obs Origin Destination Airline Weather Delay
1 SEA ATL Southwest Poor N
2 SEA BOS American Good Y
3 SEA ATL United Poor Y
4 SFO BOS Southwest Poor Y
5 SFO ATL American Good N
6 SFO BOS United Poor Y
7 SFO BOS United Good N
8 SEA BOS Southwest Poor Y
Table 2. Flight Delay 2
Let us now use real historical flight data to predict flight delays. The dataset FlightDelay.csv
contains 2,346 observations of flights departing from JFK or SFO to ATL, LAS or ORD in November
2015. There are 5 variables.
Variable Number Variable Name Description
1 Carrier IATA Carrier Identification
2 Origin Origin airport
3 Destination Destination airport
4 Depature Delay Difference between scheduled and actual departure time (minutes)
5 Arrival Delay Difference between scheduled and actual arrival time (minutes)
Reference: Bureau of Transportation Statistics
Exercise 4: Using FlightDelay.csv, write a Python program to perform the following.
(1) Read the dataset.
(2) Compute the total delay time for each flight. The total delay time is the sum of departure
delay and arrival delay time.
(3) Assign any flight with total delay time greater than 15 minutes with “’Y” for being delayed.
Otherwise, the flight is not delayed and is assigned “N”.
Exercise 5: Using your Python program from the previous exercise, for each of the following
flights, compute the probabilities of delay and not delay given the route and carrier and then clas-
sify whether the flight is delayed using Naive-Bayes (assume m = 3).
(1) JFK-LAS on American Airlines (AA)
(2) JFK-LAS on JetBlue (B6)
6
(3) SFO-ORD on Virgin Airlines (VX)
(4) SFO-ORD on Southwest Airlines (WN)
Your output should contain P (Delay|Ori,Dest, Carr), P (Notdelay|Ori,Dest, Carr) and a classi-
fication of whether the flight is delayed or not.
References
[1] “How We Found The Fastest Flights.” How We Found The Fastest Flights. 11 Mar. 2015. Web. 1 Aug. 2015.
[2] “The World’s Best and Worst On-Time Airlines ” Skift. 18 Nov. 2014. Web. 5 Aug. 2015.
[3] Mitchell, Tom M. “Machine Learning.” New York: McGraw-Hill. 1997.
7