Problem Set 5
(Due on Wednesday, Nov 11th, 11:59pm)
ECON 406: Data Science Computing for Economics

Expectations and deliverables:
● A PDF that includes your code, graphs, and your explanation for each question. Your code
should be attached at the end of the file. Make sure you upload a PDF file instead of a
doc file. Points may be deducted if it is not a PDF file.
● A Python file named regression.py
○ If Canvas changes your file name, that's fine. Don't worry about it.
○ The formatting will be checked with pylint, and points will be deducted if the
formatting is wrong. If you think a pylint error is mistaken, please
contact us.
● No zip file is needed. No git repository is needed.
● This homework focuses on practicing OLS regression and logistic regression.
● For problems that require you to run some type of regression, use statsmodels (you may
choose whether to use sm, i.e. statsmodels.api, or smf, i.e. statsmodels.formula.api).
Besides that, you are free to use any packages you feel are needed.
● When loading files in your code, assume they are in the same directory as your code.
For example, if you are loading a .csv file called my_data.csv, your pandas call would be
"pd.read_csv('my_data.csv')", not something like
"pd.read_csv('~/Documents/UW/SickEconClass/EvenSickerAssignment/my_data.csv')"
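
The deliverables above suggest a file layout along the following lines. This is only a minimal sketch of regression.py, laid out with pylint's default docstring and naming checks in mind. The function names come from the exercises below; the docstring wording and the main guard are assumptions, not requirements.

```python
"""Regression exercises for Problem Set 5 (illustrative skeleton only)."""


def first_exercise():
    """Generate all output for Exercise 1 (wage data)."""
    # Load 'wage.csv' by its bare file name here, then prepare the data,
    # make the plots, and estimate the model.


def second_exercise():
    """Generate all output for Exercise 2 (diabetes data)."""
    # Load 'diabetes.csv' by its bare file name here, then prepare the data,
    # make the plots, and estimate the model.


if __name__ == "__main__":
    first_exercise()
    second_exercise()
```

Neither function takes arguments, so running the file directly produces all output for both exercises.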

Exercise 1: Wage
You will be given the dataset “wage.csv” to better understand the impact of different
variables on expected wage rates. Please finish the following exercise with this dataset.
It’s a cross-sectional dataset on wages. In doing this exercise, you should make one
function, called “first_exercise”, that generates all your output. Note: this function should
not take any arguments.

1. Prep your data: this should include loading the data and making sure it's ready
for analysis (deal with missing values, generate any transformed variables if
needed, etc.).

2. Data visualization: make at least one plot with one or more variables in the
dataset. These can be any plots you think are appropriate to help this exercise.
One possibility includes using a scatterplot to see whether any obvious
relationship exists between wage and years of education. You could do similar
plots of education, experience, or tenure against logged wages.

3. Based on the visualization exercise and what you know about the variables, do
you think OLS or logistic regression is more suitable for understanding what
factors explain variability in wages? Why?

4. Propose a data generating process for wages. Write out your proposed model,
use any specification you think is valid. Note that wages (or some function of
wages) should be on the left hand side. A combination of additional variables,
coefficients, and an error term should be on the right hand side. And don’t forget
the subscripts!

5. Estimate the model you proposed in problem 1.4 above. Show the regression
table.

6. Interpret each coefficient. Is each one statistically significant at α = 0.05?

7. How much of the variation in wages across individuals can be explained by
the variables you chose for your model? (How well will your model do at
prediction?)

8. Given your estimated model, give an example of a type of person that could
expect to have hourly wages of $150 (for example, how much education,
experience, tenure, etc.).
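
As a rough illustration of steps 1, 5, 7, and 8, here is a minimal sketch that assumes a log-wage specification and hypothetical column names (wage, educ, exper, tenure). The data are simulated in memory as a stand-in for wage.csv, so the estimates say nothing about the real dataset, and your own specification may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for wage.csv. Column names and all numbers are
# assumptions for illustration, not the real data.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "educ": rng.integers(8, 21, n),
    "exper": rng.integers(0, 40, n),
    "tenure": rng.integers(0, 20, n),
})
data["wage"] = np.exp(
    0.5 + 0.09 * data["educ"] + 0.01 * data["exper"]
    + 0.02 * data["tenure"] + rng.normal(0, 0.4, n)
)

# Step 1: drop incomplete rows and add the transformed dependent variable.
data = data.dropna()
data["lwage"] = np.log(data["wage"])

# Step 5: estimate log(wage_i) = b0 + b1*educ_i + b2*exper_i + b3*tenure_i + u_i
results = smf.ols("lwage ~ educ + exper + tenure", data=data).fit()
print(results.summary())

# Step 7: share of the variation in log wages explained by the regressors.
print("R-squared:", results.rsquared)

# Step 8: hold experience and tenure fixed (arbitrary example values) and
# solve for the years of education at which the predicted log wage equals
# log(150).
params = results.params
exper, tenure = 20, 10
educ_needed = (
    np.log(150) - params["Intercept"]
    - params["exper"] * exper - params["tenure"] * tenure
) / params["educ"]
print("Education needed:", educ_needed)
```

Two caveats: exponentiating a predicted log wage only approximates the expected wage, and an answer far outside the range of the observed data is an extrapolation and should be flagged as such in your explanation.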

Exercise 2: Diabetes
You will be provided with a dataset "diabetes.csv". The objective of the data set is to
diagnostically predict whether or not a patient has diabetes, based on certain diagnostic
measurements (independent variables) included in the data set. In doing this exercise,
you should make one function, called “second_exercise”, that generates all your output.
Note: this function should not take any arguments.

1. Prep your data: this should include loading the data and making sure it's ready
for analysis (deal with missing values, generate any transformed variables if
needed, etc.). As a part of this, you will need to convert the dependent variable
into something that can be used by statsmodels. [Hint: convert the strings into integers
by mapping neg: 0 and pos: 1 using the .map( ) method.]

2. Data visualization: make at least one plot with one or more variables in the
dataset. These can be any plots you think are appropriate to help this exercise.
For example, you could visualize how the probability of having diabetes changes
with the pedigree variable. In this case, "pedigree" would be plotted on the x-axis and
"diabetes" on the y-axis. Consider comparing a linear probability model (LPM) plot to a
logistic regression plot.

3. Based on your plots and your understanding of the data, do you think OLS or
logistic regression is more suitable for analyzing this problem? Why?

4. Propose a data generating process for diabetes. Write out your proposed model,
use any specification you think is valid. If you decide on OLS, use the regular
regression model. If you decide on Logistic Regression, use the sigmoid function.

5. Estimate the model you proposed in 2.4 and show the regression table.

6. Interpret and compare each coefficient.

7. Consider a patient who has the median value of each of your independent
variables (the 50th percentile). What is this patient's probability of having
diabetes? How much more or less likely is it for a patient who is at the 75th
percentile of each of your independent variables? What about for a patient at the
25th percentile?
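
As a rough illustration of steps 1, 5, and 7, here is a minimal sketch that assumes a logistic specification with two hypothetical regressors (glucose and pedigree) and a dependent variable column named diabetes. The data are simulated in memory as a stand-in for diabetes.csv, so the numbers carry no meaning for the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for diabetes.csv. Column names and all numbers are
# assumptions for illustration, not the real data.
rng = np.random.default_rng(1)
n = 600
data = pd.DataFrame({
    "glucose": rng.normal(120, 30, n),
    "pedigree": rng.gamma(2.0, 0.25, n),
})
latent = -6.0 + 0.04 * data["glucose"] + 1.0 * data["pedigree"]
prob = 1 / (1 + np.exp(-latent))  # sigmoid of the linear index
data["diabetes"] = np.where(rng.random(n) < prob, "pos", "neg")

# Step 1: convert the string labels into integers with .map().
data["diabetes"] = data["diabetes"].map({"neg": 0, "pos": 1})

# Step 5: estimate P(diabetes_i = 1) = sigmoid(b0 + b1*glucose_i + b2*pedigree_i)
results = smf.logit("diabetes ~ glucose + pedigree", data=data).fit(disp=False)
print(results.summary())

# Step 7: predicted probabilities for patients at the 25th, 50th, and 75th
# percentile of every independent variable.
profiles = data[["glucose", "pedigree"]].quantile([0.25, 0.50, 0.75])
probabilities = results.predict(profiles)
print(probabilities)
```

The rows of probabilities are indexed by quantile, so differences or ratios between the 0.25, 0.50, and 0.75 entries answer the "how much more or less likely" part of step 7.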
