Midterm #1 (ID:161)
IT270
ALEXANDER PELAEZ

Instructions.

If you use R, please copy and paste your code and output into Word using Courier font (it makes
it easier to read). For each problem that asks for a response (not just a
calculation), be sure to explain and interpret the results. If you aren’t sure… PLEASE ASK.
Please start each question on a new page and clearly label the start of each problem (maybe a
slightly larger font, boldface, underline…): anything that will help me find the problem you are
working on.

As this is a midterm, you are to work on your problems individually. However, you may discuss
techniques and approaches. You may not copy code or answers; copying other students’
code or answers could result in a major penalty or even failure on the exam.

1. Matrix Algebra

Given the following Matrices (10pts)

A =
8 3
5 4

B =
4 1
6 2
3 3

C =
2
5
3

D =
4    1/3
1/3  1/4

E =
9  13  12
4   3  10
4   6  12

∑=
21 34 8
7 15 12
8 12 11

Calculate the following: DO NOT USE R
1) BA
2) B’E
3) Find the determinant of E
4) (AA)A’
5) Find the trace of matrix ∑, what does the trace of a covariance matrix represent?
6) Compute the correlation matrix ρ from the covariance matrix ∑.
7) Compute the eigenvalues and eigenvectors of the covariance matrix ∑.
8) Using matrix A above, prove that AA^(-1) = I.
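
As a way to check hand calculations for items 5 and 6, the following is a minimal Python sketch (the course itself works in R) of the trace and of the covariance-to-correlation conversion, where each entry is divided by the product of the two standard deviations. The matrix `example_cov` and the function names are hypothetical illustrations, not one of the exam matrices and not a required method.

```python
import math

def trace(m):
    """Sum of the diagonal entries of a square matrix (list of rows)."""
    return sum(m[i][i] for i in range(len(m)))

def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix.

    rho[i][j] = cov[i][j] / (sd_i * sd_j), where sd_i = sqrt(cov[i][i]).
    """
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

# Hypothetical 2x2 covariance matrix, NOT one of the exam matrices.
example_cov = [[4.0, 2.0],
               [2.0, 9.0]]

print(trace(example_cov))        # sum of the variances on the diagonal
print(cov_to_corr(example_cov))  # diagonal entries become 1
```

The diagonal of a covariance matrix holds the variances, so the trace in this sketch is the sum of those variances.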

2. Principal Component Analysis

The head of an airport is looking to determine issues related to efficiencies and operations at
the airport. However, they are uncertain where to look.

Given the datasets “airport_cancellations.csv” and “airport_operations.csv”, conduct the following
analysis.

a) First you will need to merge this data together - see the merge function in R to create a
combined dataset. Consider what you will need to merge the data on. What happened
after you merged the data? Anything interesting? Why?
b) Are there any outliers or missing observations that need to be dealt with? How will you
deal with them?
c) Conduct a principal component analysis. After the PCA, how many dimensions are
necessary? Explain how you determined this.
d) Are there any overlapping loadings? Explain what you will do with the overlapping
loadings.
e) Name the dimensions you are left with.
f) Is this analysis reasonable? Are there any issues with your final dimension list?
g) Create new variables in your dataset and compute values for each PCA dimension. Then
conduct a correlation analysis between the variables and each dimension. Why is this
interesting?
h) Is there a correlation between the PCA values? Is this surprising?
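
For part (a), the choice of merge key can change the number of rows in the combined dataset. The sketch below illustrates this with a hand-written inner join in Python (the exam itself points to R's merge function). The column names and values are assumptions made up for illustration; they are not the headers or contents of the real CSV files.

```python
def inner_merge(left, right, keys):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    merged = []
    for row in left:
        # Every matching right-hand row produces one output row.
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**row, **match})
    return merged

# Hypothetical rows; column names and numbers are invented for this sketch.
cancellations = [
    {"airport": "AAA", "year": 2004, "cancellations": 12},
    {"airport": "AAA", "year": 2005, "cancellations": 15},
    {"airport": "BBB", "year": 2004, "cancellations": 3},
]
operations = [
    {"airport": "AAA", "year": 2004, "departures": 500},
    {"airport": "AAA", "year": 2005, "departures": 520},
]

by_both = inner_merge(cancellations, operations, ["airport", "year"])
by_airport = inner_merge(cancellations, operations, ["airport"])
print(len(by_both), len(by_airport))
```

In this toy data, joining on both columns keeps one row per airport-year and drops the airport with no match, while joining on the airport alone pairs every year with every other year and inflates the row count.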

3. Factor Analysis

It is a well-known fact that sports analytics is very popular. The dataset fifa.csv contains
information about soccer (football) players obtained from FIFA19 information. It would be
interesting if the ratings that are used could be analyzed as a simple set of factors.

a) Identify the columns that are the best candidates for analysis as factors.
b) Conduct a factor analysis using these columns (if you get an error reduce the number of
factors in the function, until the error disappears).
c) Justify your use of the rotation method. What does it tell you?
d) Are there any correlations between the factors?
e) Create the diagram (by hand or in PowerPoint) of the factors; be sure to label everything.
f) Split the data set into two parts (left-footed players and right-footed players). Conduct a
factor analysis and explain if you see any differences between right- and left-footed
players. Note that you do not need to draw this one.

4. K-Nearest Neighbors (KNN)

A supervisor wishes to conduct a classification analysis on the breast_cancer.csv dataset to see
if new observations can be properly classified.

a) Examine the dataset and explain why KNN might be used, discuss the benefits of this
algorithm, and discuss the drawbacks.
b) Conduct a correlation plot of the relevant variables, and make an initial assessment of
the relationships between each of the variables and the classifier.
c) Conduct a KNN classification using all of the relevant columns
i) Discuss your initial approach including how you will set this up and what the
steps are including number of variables in your training set and number of
variables in your test set.
ii) Provide a measure of accuracy (since this is a straight classification, you only need
a simple assessment).
iii) Reduce the number of variables from the original 10. How would you decide
which variables might be better to use in this assessment?
iv) Conduct KNN analysis using these variables, and determine what would be a
reasonable KNN model considering the lowest number of columns needed, with
the most reasonable accuracy. Defend your position.
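
The setup described in part (c), a train/test split followed by a simple accuracy measure, can be sketched as follows. This is a minimal pure-Python illustration on synthetic two-feature data, not the breast_cancer.csv columns; the class labels, the 70/30 split and k = 5 are all assumptions chosen for the example.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, point, k):
    """Classify `point` by majority vote among its k nearest training rows."""
    dists = sorted(
        (math.dist(row, point), label) for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)

# Synthetic, well-separated clusters (hypothetical data, fixed seed).
random.seed(0)
rows = [([random.gauss(2, 0.5), random.gauss(2, 0.5)], "benign") for _ in range(50)]
rows += [([random.gauss(8, 0.5), random.gauss(8, 0.5)], "malignant") for _ in range(50)]
random.shuffle(rows)

cut = int(0.7 * len(rows))  # 70% of observations for training, 30% for testing
train, test = rows[:cut], rows[cut:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)
print(accuracy(train_x, train_y, test_x, test_y, k=5))
```

Because KNN relies on Euclidean distance, variables measured on larger scales dominate the vote unless the columns are standardized first; the synthetic features here share a scale, so the sketch skips that step.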

5. Decision Tree

People Analytics is becoming a more popular and demanding area for Data Mining. The Head
of Human Resources is looking to identify reasons for attrition (people leaving, either voluntarily
or otherwise). The task is to develop a decision tree to determine this. You will need to merge
three datasets (employee_survey_data.csv, general_data.csv, manager_survey_data.csv) to
accomplish this task.

a) Examine the dataset and determine if there are any issues. Produce histograms and
summary stats for the appropriate columns and identify any issues.
b) Determine and list any columns that are not needed in a first “full” model.
c) After examining the data, which methods (algorithms) do you think are appropriate?
d) Develop the decision tree model and compare the methods based on accuracy.
e) Provide a chart of your final model.
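
When comparing tree methods in parts (c) and (d), it can help to see how a single split is scored. The sketch below computes Gini impurity, the splitting criterion used by CART-style trees, and searches for the best threshold on one numeric column. The variable names and the tiny dataset are hypothetical and are not taken from the three HR files.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    Returns (None, gini(labels)) when no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    # Skip the maximum value: splitting there leaves the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: years at the company vs. whether the employee left.
years = [1, 2, 2, 3, 7, 8, 10, 12]
left_company = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
print(best_threshold(years, left_company))
```

A full tree repeats this search over every column and then recurses into each side of the chosen split; other tree algorithms swap the Gini score for a different criterion, such as information gain.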

6. Theoretical Problems

a) Explain why the covariance matrix is important for analysis. Additionally, explain what
the eigenvectors and eigenvalues are and why they are important.

b) In machine learning, explain the importance of splitting up the dataset. What are the
different ways to split, and how should an analyst split the data?

c) What are the differences and similarities between factor analysis and PCA? Focus on the
equations that are produced.

d) What are some of the challenges with KNN? How would you advise someone who
decides to use KNN?

e) What is the difference between orthogonal and oblique rotations? When would
we choose either, or neither?
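
For item (b) above, two common ways to split a dataset are a single holdout split and k-fold cross-validation. The following is a minimal Python sketch of both, using only the standard library. The function names, the 30% test fraction and the fixed seed are assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

train, test = train_test_split(list(range(10)))
print(len(train), len(test))

for train_idx, test_idx in k_fold_indices(10, 5):
    print(sorted(test_idx))
```

The holdout split is cheap but depends on one random cut, while k-fold uses every observation for testing once at the cost of fitting the model k times.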