Machine Learning - Lab Mini-Project 1
PPHA 30545 - Professor Clapp
Winter 2021
This assignment must be handed in via Gradescope on Canvas by 11:45pm Central Time on
Monday, February 1st. You are welcome (and encouraged!) to form study groups (of no more
than 3 students) to work on the problem sets and mini-projects together. But you must write your
own code and your own solutions. Please be sure to include the names of those in your group on
your submission.
You should submit your code as a single Python (*.py) file and the write up of your solutions as a
single PDF. For the former, please also be sure to practice the good coding practices you learned in
PPHA 30535/6 and comment your code, cite any sources you consult, etc. For the latter, you may
type your answers or write them out by hand and scan them (as long as they are legible).
You are allowed to consult the textbook authors’ websites, Python documentation, and websites
like StackOverflow for general coding questions. You are not allowed to consult material from
other classes (e.g., old problem sets, exams, answer keys) or websites that post solutions under the
guise of tutoring.
1 Overview
After graduating from Harris, you are quickly hired to work for the President’s Council of Economic
Advisors (CEA).1 The CEA is an agency within the Executive Branch that provides the
President with objective advice to inform both domestic and international policy. According to its
webpage, the “[CEA] bases its recommendations and analysis on economic research and empirical
evidence, using the best data available to support the President in setting our nation’s economic
policy.”
Your boss has asked you to conduct research using data from the American Community Survey
(ACS) Public Use Microdata Sample (PUMS) to predict the returns to education and inform policy.
Your analysis will help shape your office’s recommendations to the President and help set her
education agenda with a specific focus on the expansion of access to higher education.2 The project
has three parts: (1) obtaining data from the Internet, (2) cleaning that data, and (3) performing data
analysis and answering questions.
1Your family is very proud and all of your friends are jealous of your great gig. You tell them you’re so glad that
you took Machine Learning, as it really helped you land the job.
2The ACS contains information similar to the Decennial Census Long Form Questionnaire that it replaced after the
2000 Census. It is an annual sample of one in 40 households in the country. For reference, every decade the Long Form
sampled one in 6 households. See https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html
for more information.
2 Obtaining the Data
1. First, navigate to the IPUMS USA website: https://usa.ipums.org/usa/index.shtml.3
2. Choose “Browse and Select Data” from the menu on the left.
3. Choose “Select Samples” by clicking the light blue box.
4. Select the most current year of ACS data only. Do not include the 3- and 5-year versions of
the data.4 Then “Submit sample selections.”
5. Now you get to go shopping for data.5 Under “Select Variables” ->
(a) “Person” -> “Demographics,” add the following to your cart
i. AGE
ii. SEX
iii. MARST
(b) “Person” -> “Family Interrelationship,” add the following to your cart
i. NCHILD
ii. NCHLT5
(c) “Person” -> “Race, Ethnicity and Nativity,” add the following to your cart
i. RACE
ii. HISPAN
(d) “Person” -> “Education,” add the following to your cart
i. EDUC
(e) “Person” -> “Work,” add the following to your cart
i. EMPSTAT
(f) “Person” -> “Income,” add the following to your cart
i. INCWAGE
(g) “Person” -> “Veteran Status,” add the following to your cart
i. VETSTAT
6. Click on the “View Cart” button. Check to make sure you got everything. Click on the
“Create data extract” button.
3Census Bureau datasets are notoriously difficult to download in usable forms. In order to make the data more
accessible, the wonderful people at the Institute for Social Research and Data Innovation at the University of Minnesota
created the Integrated Public Use Microdata Series (IPUMS) which is an awesomely streamlined way to get your hands
on the data you want. Note that they make many additional datasets available for download via (for example)
IPUMS International, IPUMS Global Health, and IPUMS Time Use, among others.
4In order to ensure large enough sample sizes to maintain confidentiality, the Census pools data over multiple years
for geographic units with fewer people.
5This is like opening birthday presents for a data scientist!
7. Click on “Customize sample sizes.”
(a) Since we’re dealing with a large sample from the national population, we have far more
observations than we can easily process. Under “Households,” enter “10” so the dataset
you create has 10,000 households. This makes working with the data easier, but will still
give us a “big data” dataset.6
(b) Click “Submit.”
8. Click on “Select cases.”
(a) Since we’re looking at wages as a function of education, we’re only going to keep
those involved in the labor force. Select EMPSTAT. Just to be safe, let’s also restrict
our sample by age. Select AGE.
(b) Click “Submit.”
(c) Check “Include only those persons meeting case selection criteria.”
i. Under EMPSTAT, check the box for “Employed” workers.
ii. For AGE, select ages from 18 to 65.
(d) Click “Submit.”
9. To the right of “Data Format,” click on “Change.”
(a) Select “Comma delimited (.csv)” or whatever your preferred format is.
(b) Select “Rectangular, person (default).”
(c) Click “Submit.”
10. Give your extract a brief description.
11. Click on the box that says “Submit extract.”
12. Request an account or sign in.
13. Finally, hit “Submit extract.”7
14. Once your extract has been created, navigate to the IPUMS download page:
https://usa.ipums.org/usa-action/data_requests/download.
(a) Click on the “Download CSV” link (in the first column). Save the file to your hard
drive.
6When you “Customize sample sizes,” IPUMS will randomly draw 10,000 households for you. Since this is
a random process and the “Select cases” step occurs after the random draw, don’t be worried if a study
partner has a slightly (up to a few hundred) different number of observations.
7It will take the IPUMS system a little while to create your extract, so go take a break or work on something else.
The IPUMS system will email you once your extract has been created. Try to contain your excitement over the fun
data that you’ll soon get to play with, lest friends and family think you’re weird.
(b) Right-click on the “Basic” codebook file and save the *.cbk (text) file to your hard
drive.
(c) Unzip the data file and load the data in Python. For help unzipping a *.gz file (Unix’s
version of *.zip), check out “Step 2: Decompress the data file” here: https://usa.ipums.org/usa/extract_instructions.shtml
(just note that the instructions are for the *.dat (text) file and you want the *.csv file).
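As a minimal sketch of the loading step: pandas can read a gzip-compressed CSV directly, so unzipping by hand is optional. The file name `usa_demo.csv.gz` and the helper `load_extract` are hypothetical, and the two data rows are made up so that the example runs on its own. Your extract will have a different name and more columns.

```python
import gzip
import os
import tempfile

import pandas as pd


def load_extract(path):
    """Read an IPUMS CSV extract (plain or gzip-compressed) into a DataFrame."""
    # compression="infer" selects gzip when the file name ends in ".gz".
    df = pd.read_csv(path, compression="infer")
    # Lowercase the column names to match the names used in this assignment.
    df.columns = df.columns.str.lower()
    return df


# Demonstration on a tiny made-up file; replace demo_path with your own extract.
demo_path = os.path.join(tempfile.mkdtemp(), "usa_demo.csv.gz")
with gzip.open(demo_path, "wt") as fh:
    fh.write("YEAR,AGE,SEX,EDUCD,INCWAGE\n")
    fh.write("2019,30,2,101,52000\n")
    fh.write("2019,45,1,63,38000\n")

demo = load_extract(demo_path)
print(demo.head())
```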
3 Preparing the Data
1. First, take a few minutes to become familiar with the data.
2. For our analysis, we’ll need to use the codebook we saved to clean and create a few variables.
(a) Education - We have a categorical measurement of education (educd). For some of our
analysis, we need a continuous variable. Use the educd variable to create a continuous
measure of education called educdc using the crosswalk at the end of this document.
A *.csv version of the crosswalk is available on Canvas.
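One possible way to apply the crosswalk is a left merge on educd, sketched below. Only five rows of Table 1 are typed in here for brevity, and the code 999 is a made-up value included to show how an unmatched code surfaces as a missing educdc. In practice you would read the full crosswalk file from Canvas instead.

```python
import pandas as pd

# A few rows copied from Table 1; the full crosswalk covers many more codes.
crosswalk = pd.DataFrame({
    "educd": [2, 63, 81, 101, 114],
    "educdc": [0, 12, 14, 16, 18],
})

# Made-up respondents; 999 is a hypothetical code that is not in the crosswalk.
people = pd.DataFrame({"educd": [63, 101, 114, 2, 999]})

# validate="many_to_one" raises an error if a code appears twice in the crosswalk.
merged = people.merge(crosswalk, on="educd", how="left", validate="many_to_one")

# Any code missing from the crosswalk shows up as NaN and should be inspected.
unmatched = int(merged["educdc"].isna().sum())
print(merged)
print("unmatched codes:", unmatched)
```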
(b) Dummy Variables - Create the following dummy variables:
i. A dummy, hsdip, equal to 1 if the individual has a high school diploma (but not
a bachelor’s or higher degree). Note: in general, how one codes individuals with
a GED or associate’s degree is a decision the researcher has to make based on the
context of his/her research question. To keep things standard for the project, code
these individuals as having a high school diploma.
ii. A dummy, coldip, equal to 1 if the individual has a four-year college diploma (or
a higher degree that required earning a college diploma first).
iii. A dummy, white, equal to 1 if the individual is white.
iv. A dummy, black, equal to 1 if the individual is black.
v. A dummy, hispanic, equal to 1 if the individual is of Hispanic origin.
vi. A dummy, married, equal to 1 if the individual is married.
vii. A dummy, female, equal to 1 if the individual is female.
viii. A dummy, vet, equal to 1 if the individual is a veteran.
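The recodes above can be sketched as boolean comparisons cast to integers. The numeric codes in the comments are assumptions based on the usual IPUMS USA coding, so confirm each one against the codebook you downloaded. The cutoffs used for hsdip and coldip are one reading of the instructions, not the only defensible one, and the two respondents are made up.

```python
import pandas as pd

# Two made-up respondents, used only to exercise the recodes.
df = pd.DataFrame({
    "sex": [1, 2],
    "marst": [1, 6],
    "race": [1, 2],
    "hispan": [0, 1],
    "vetstat": [2, 1],
    "educd": [63, 101],
})

# Assumed IPUMS codes; verify every value in your own codebook.
df["female"] = (df["sex"] == 2).astype(int)               # 2 = female
df["married"] = df["marst"].isin([1, 2]).astype(int)      # 1, 2 = married
df["white"] = (df["race"] == 1).astype(int)               # 1 = white
df["black"] = (df["race"] == 2).astype(int)               # 2 = black
df["hispanic"] = df["hispan"].isin([1, 2, 3, 4]).astype(int)  # 0 = not Hispanic
df["vet"] = (df["vetstat"] == 2).astype(int)              # 2 = veteran

# Assumption: educd 62-100 covers diploma/GED through associate's degree,
# and 101-116 covers bachelor's degree and above.
df["hsdip"] = ((df["educd"] >= 62) & (df["educd"] <= 100)).astype(int)
df["coldip"] = ((df["educd"] >= 101) & (df["educd"] <= 116)).astype(int)

print(df[["female", "married", "white", "black", "hispanic", "vet", "hsdip", "coldip"]])
```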
(c) Interaction Terms - Create an interaction between each of the education dummy variables
(hsdip and coldip) and the continuous education measure (educdc).
(d) Created Variables - Create the following
i. Age squared.
ii. The natural log of incwage.
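A short sketch of the interaction terms and created variables follows. The column names `age2`, `lnincwage`, `hsdip_educdc`, and `coldip_educdc` are illustrative choices, and the three rows are made up. Note that the log of a zero wage is undefined. Treating non-positive wages as missing, as done here, is one option the assignment does not prescribe, so state whatever rule you adopt.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second has zero wage income to show the log issue.
df = pd.DataFrame({
    "age": [25, 40, 60],
    "incwage": [30000, 0, 85000],
    "educdc": [12.0, 16.0, 10.0],
    "hsdip": [1, 0, 0],
    "coldip": [0, 1, 0],
})

df["age2"] = df["age"] ** 2

# .where() keeps positive wages and sets the rest to NaN before taking logs.
df["lnincwage"] = np.log(df["incwage"].where(df["incwage"] > 0))

# Interactions: degree dummy times years of education.
df["hsdip_educdc"] = df["hsdip"] * df["educdc"]
df["coldip_educdc"] = df["coldip"] * df["educdc"]

print(df)
```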
4 Data Analysis
1. Compute descriptive (summary) statistics for the following variables: year, incwage, lnincwage,
educdc, female, age, age2, white, black, hispanic, married, nchild, vet, hsdip, coldip, and
the interaction terms. In other words, compute sample means, standard deviations, etc.
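For the summary statistics, `DataFrame.describe` gives counts, means, standard deviations, and extremes in one call. The sketch below uses three made-up rows and only three of the variables; extend `cols` to the full list above.

```python
import pandas as pd

# Made-up data so the example runs on its own.
df = pd.DataFrame({
    "incwage": [20000, 40000, 60000],
    "age": [25, 35, 45],
    "female": [1, 0, 1],
})

cols = ["incwage", "age", "female"]

# Transpose so each variable is a row, then keep the most commonly reported columns.
summary = df[cols].describe().T[["count", "mean", "std", "min", "max"]]
print(summary)
```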
2. Scatter plot ln(incwage) and education. Include a linear fit line. Be sure to label all axes and
include an informative title.
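One way to draw such a plot is with matplotlib and a degree-one `numpy.polyfit`, as sketched here. The data are simulated from a made-up relationship purely so the example runs. The axis labels and title are suggestions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Simulated data from a made-up relationship; substitute your own columns.
rng = np.random.default_rng(1)
educ = rng.integers(8, 21, 200).astype(float)
lnw = 9.0 + 0.1 * educ + rng.normal(0, 0.5, 200)

# polyfit with deg=1 returns the slope first, then the intercept.
slope, intercept = np.polyfit(educ, lnw, deg=1)
grid = np.linspace(educ.min(), educ.max(), 50)

fig, ax = plt.subplots()
ax.scatter(educ, lnw, s=10, alpha=0.5, label="Individuals")
ax.plot(grid, intercept + slope * grid, color="black", label="Linear fit")
ax.set_xlabel("Years of education (educdc)")
ax.set_ylabel("ln(incwage)")
ax.set_title("Log wage income and years of education")
ax.legend()
# Call fig.savefig("your_file_name.png") to write the figure to disk.
```

For separate fit lines by group, the same `polyfit` call can be repeated on each subset of the data and plotted on the same axes.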
3. Estimate the following model:
ln(incwage) = β0 + β1 educdc + β2 female + β3 age + β4 age^2
+ β5 white + β6 black + β8 hispanic
+ β9 married + β10 nchild + β11 vet + ε,
and report your results.
(a) What fraction of the variation in log wages does the model explain?
(b) Test the hypothesis that
H0 : β1 = β2 = . . . = β11 = 0
HA : βj ≠ 0 for some j
with α = 0.10.
(c) What is the return to an additional year of education? Is this statistically significant? Is
it practically significant? Briefly explain.
(d) At what age does the model predict an individual will achieve the highest wage?
(e) Does the model predict that men or women will have higher wages, all else equal?
Briefly explain why we might observe this pattern in the data.
(f) Interpret the coefficients on the white, black, and hispanic variables.
(g) Test the hypothesis that race has no effect on wages. Be sure to explicitly state the null
and alternative hypotheses and show your calculations.
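The calculations behind parts (a), (b), (d), and (g) can be sketched with plain NumPy least squares. Everything here is illustrative: the data are simulated from a made-up process, only a subset of the regressors is included, and the helper `ols` is hypothetical. With n observations and k slope coefficients, the overall F statistic is (R^2 / k) / ((1 - R^2) / (n - k - 1)). The F statistic for q exclusion restrictions is ((SSR_r - SSR_u) / q) / (SSR_u / (n - k - 1)). With a quadratic in age, predicted log wage peaks at age = -β3 / (2 β4) when β4 < 0.

```python
import numpy as np


def ols(y, X):
    """Return (coefficients, sum of squared residuals); X must include a constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


# Simulated data from a made-up process, used only so the example runs.
rng = np.random.default_rng(0)
n = 500
educ = rng.integers(8, 21, n).astype(float)
age = rng.integers(18, 66, n).astype(float)
white = rng.integers(0, 2, n).astype(float)
black = (1 - white) * rng.integers(0, 2, n)
lnw = 8.0 + 0.10 * educ + 0.08 * age - 0.001 * age**2 + rng.normal(0, 0.5, n)

const = np.ones(n)
X_u = np.column_stack([const, educ, age, age**2, white, black])  # unrestricted
X_r = np.column_stack([const, educ, age, age**2])                # race terms dropped

beta_u, ssr_u = ols(lnw, X_u)
_, ssr_r = ols(lnw, X_r)
sst = float(((lnw - lnw.mean()) ** 2).sum())

k = X_u.shape[1] - 1               # number of slope coefficients
q = X_u.shape[1] - X_r.shape[1]    # number of restrictions being tested

r2 = 1 - ssr_u / sst
f_overall = (r2 / k) / ((1 - r2) / (n - k - 1))
f_race = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - 1))

# Columns 2 and 3 of X_u hold age and age squared.
peak_age = -beta_u[2] / (2 * beta_u[3])

print(f"R^2 = {r2:.3f}, overall F = {f_overall:.1f}, race F = {f_race:.2f}")
print(f"predicted peak age = {peak_age:.1f}")
```

Each F statistic is then compared with the critical value of the F distribution at the chosen α. If SciPy is available, `scipy.stats.f.sf(f_race, q, n - k - 1)` returns the corresponding p-value.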
4. Graph ln(incwage) and education. Include three distinct linear fit lines specific to individuals
with no high school diploma, a high school diploma, and a college degree. Be sure to
label all axes and include an informative title.
5. Since the President is considering new education legislation, she asks you to determine
whether a college degree is a strong predictor of wages. Write down a model that will allow
the returns to education to vary by degree acquired (use the three categories in the previous
question).8 Be sure to include the controls from question 3. Explain/justify why you think
your model is the best possible representation of the way the world works.
6. Estimate the model you proposed in the previous question and report your results.
(a) Predict the wages of a 22-year-old female individual (who is neither white, black,
nor Hispanic, is not married, has no children, and is not a veteran) with a high school
diploma and an all else equal individual with a college diploma. Assume that it takes
someone 12 years to graduate high school and 16 years to graduate college.
8These are known as “sheepskin” effects.
(b) The President wants to know, given your results, do individuals with college degrees
have higher predicted wages than those without? By how much? Briefly explain.
(c) The President asked you to look into this question because she is considering legislation
that will expand access to college education (for instance, by increasing student loan
subsidies). She will only support the legislation if there are cost offsets (if college
education increases wages and therefore, future income tax revenues that help reduce
the net cost of the subsidy). Given that criterion, how would you advise the President?
7. There are many ways that this model could be improved. How would you do things differently
if you were asked to predict the returns to education given the data available on
IPUMS?
Table 1: Crosswalk
educd educdc
2 0
10 0
11 2
12 0
13 2.5
14 1
15 2
16 3
17 4
20 6.5
21 5.5
22 5
23 6
24 7.5
25 7
26 8
30 9
40 10
50 11
61 12
62 12
63 12
64 12
65 13
70 13
71 14
80 14
81 14
82 14
83 14
90 15
100 16
101 16
110 17
111 18
112 19
113 20
114 18
115 18
116 22
