Actuarial Data and Analysis, T2 2020
Assignment Part A
Due time: Week 5 Wednesday, 1 July 2020, 11.55 am (sharp)
1 Skills developed
This assignment provides you with an opportunity to get familiar with the given datasets before applying
modeling techniques you are learning in the course lectures to a business task involving data. In addition,
your skills in understanding/applying data manipulation and analysis methods (from the course materials
and any additional reference material you consider) will be developed via this assignment. Communication
of the results of your investigations and analysis is also an important skill developed.
2 Task
You are a fresh actuarial graduate who has just joined the US Medicare Fraud Department as an analyst.
Your team is in charge of analyzing Medicare data to detect Medicare fraud committed by providers.
Your manager has tasked you with providing a preliminary report on the attached datasets so that you
become familiar with the data and the Medicare provider characteristics, and get ready for further analysis.
Your main tasks involve data manipulation and analysis, as well as a report and a recommendation for
further analysis (i.e. modeling).
Note that all relevant steps in the data manipulation as well as data analysis results should be included in
the report or appendix.
3 Additional information and mark allocation
3.1 Data manipulation and analysis (17 marks)
For the data you have, you should manipulate the data to prepare for data analysis. This includes (but is
not limited to): data exploration, data cleaning (if necessary), combining all the datasets and aggregating
the data per provider (see the Resources section for documentation).
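The combining and per-provider aggregation steps described above can be sketched as follows. This is a minimal illustration in Python (standard library only), not a prescribed method: the records are made up, the helper names `combine` and `aggregate_per_provider` are hypothetical, and the summary statistics chosen are only examples of what could be computed per provider.

```python
from collections import defaultdict

# Made-up records for illustration only; the real data comes from the CSV files.
claims = [
    {"ClaimID": "C1", "BeneID": "B1", "ProviderID": "P1", "InscClaimAmtReimbursed": 1000.0},
    {"ClaimID": "C2", "BeneID": "B2", "ProviderID": "P1", "InscClaimAmtReimbursed": 500.0},
    {"ClaimID": "C3", "BeneID": "B1", "ProviderID": "P2", "InscClaimAmtReimbursed": 200.0},
]
beneficiaries = {"B1": {"Gender": 1}, "B2": {"Gender": 2}}
providers = {"P1": {"Fraud": "yes"}, "P2": {"Fraud": "no"}}


def combine(claims, beneficiaries, providers):
    """Left-join beneficiary and provider details onto each claim."""
    combined = []
    for claim in claims:
        row = dict(claim)
        row.update(beneficiaries.get(claim["BeneID"], {}))
        row.update(providers.get(claim["ProviderID"], {}))
        combined.append(row)
    return combined


def aggregate_per_provider(combined):
    """Summarise the combined claims into one row per provider."""
    groups = defaultdict(list)
    for row in combined:
        groups[row["ProviderID"]].append(row)
    summary = {}
    for provider_id, rows in groups.items():
        total = sum(r["InscClaimAmtReimbursed"] for r in rows)
        summary[provider_id] = {
            "n_claims": len(rows),
            "n_beneficiaries": len({r["BeneID"] for r in rows}),
            "total_reimbursed": total,
            "mean_reimbursed": total / len(rows),
            "Fraud": rows[0].get("Fraud"),
        }
    return summary


combined = combine(claims, beneficiaries, providers)
per_provider = aggregate_per_provider(combined)
```

Keeping both `combined` (one row per claim) and `per_provider` (one row per provider) makes it possible to analyse the data at either level.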
The analysis of the data should provide a good sense of the datasets and insights into beneficiary, claim and
provider characteristics, as well as motivation for further analysis. You may find interesting insights by
analysing both the combined and aggregate datasets.
This task does not consist of modeling but you should keep in mind that the question your team will
ultimately be looking at is which providers are likely to have fraudulent claims.
See the section on data for details.
Mark allocation for the assignment can be found in the rubrics (on the course Moodle webpage).
3.2 Presentation Format (3 marks)
Communication of quantitative results in a concise and easy-to-read manner is a skill that is vital in practice.
As such, marks will be given for the presentation of your results. In order to maximize your marks for
presentation you may wish to consider issues such as: table size/readability, figure axes/formatting, ease
of reading, grammar/spelling, and report structure. You may also wish to consider the use of executive
summaries and appendices, where appropriate. Provide sufficient detail to the reader so that they can
judge what you are doing, using appendices for non-essential but useful results for the report as necessary.
Note that sufficient detail must be provided (in either the report body and/or appendices) so that the
reviewer can follow all the steps and derivations required in your work.
Note that a maximum page limit of 2 pages (excluding tables and graphs) is applicable to the main body
of the report (see footnote 1). You should also consider the rubric for the presentation component (on the course Moodle
webpage). There is no limit to the length of the appendix.
3.3 Software
You may choose which software package to use (e.g. R, Python or other); however, nearly every function you
will need for this task is available in R. Note also that code enabling you to perform most of
the computing can be found in the learning activities of the course and the Resources section. Any
assumptions you make must be clearly identified and justified.
4 Data
The data is related to US Medicare claims and beneficiary details of 4436 providers from 2008 to 2009 and
consists of 4 datasets:
1. Medicare_Provider.csv
2. Medicare_Inpatient.csv
3. Medicare_Outpatient.csv
4. Medicare_Beneficiary.csv
Similar (but not identical) datasets are provided here. You may wish to check that webpage for further
information about the context, data and problem (see footnote 2).
You may also wish to have a look at the following exploratory data analysis based on the Kaggle datasets
to give you an idea of why and how to start the data analysis: Healthcare Fraud Detection With Python:
The importance of exploratory data analysis (weblink here). This data analysis is just a brief example and
is not based on your datasets. Different or additional variables may be of interest for your analysis.
4.1 Medicare_Provider.csv (Provider Data)
This dataset gives each provider's ID and whether or not the provider is fraudulent.
Variable Description
ProviderID: A unique ID assigned to each provider (character)
Fraud: Is fraudulent? (categorical: “no”,“yes”)
Footnote 1: Please note that this is a maximum; you should feel free to use fewer pages if that is sufficient!
Footnote 2: Optional readings for extra information and context on Medicare fraud in the US can be found here: link 1 and link 2.
4.2 Medicare_Inpatient.csv (Inpatient Data)
This dataset provides insights about the claims filed for those patients who are admitted to hospital. It also
provides additional details about the admission, discharge dates and diagnosis code.
Variable Description
BeneID: A unique ID assigned to each beneficiary (chr)
ClaimID: A unique ID assigned to each claim (chr)
ClaimStartDt: Start date of the claim (date)
ClaimEndDt: End date of the claim (date)
InscClaimAmtReimbursed: Claim amount reimbursed (num)
AttendingPhysician: Attending physician (chr)
OperatingPhysician: Operating physician (chr)
OtherPhysician: Other physician (chr)
AdmissionDt: Admission date (date)
ClmAdmitDiagnosisCode: Claim admission diagnosis code (chr)
DeductibleAmtPaid: Deductible amount paid (num)
DischargeDt: Discharge date (date)
DiagnosisGroupCode: Diagnosis group code (chr)
ClmDiagnosisCode_1: Claim diagnosis code 1 (chr)
ClmProcedureCode_1: Claim procedure code 1 (num)
ProviderID: A unique ID assigned to each provider (chr)
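The date variables in this table can be turned into durations, for example the length of a hospital stay or of a claim. The following is a minimal sketch only: it assumes dates are stored as ISO `YYYY-MM-DD` strings (the actual format in the CSV file should be checked), the example claim is made up, and `days_between` is a hypothetical helper name.

```python
from datetime import date


def days_between(start, end):
    """Number of days from `start` to `end`, both ISO 'YYYY-MM-DD' strings (assumed format)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Made-up inpatient claim for illustration only.
claim = {
    "ClaimStartDt": "2009-04-12",
    "ClaimEndDt": "2009-04-18",
    "AdmissionDt": "2009-04-12",
    "DischargeDt": "2009-04-18",
}

length_of_stay = days_between(claim["AdmissionDt"], claim["DischargeDt"])
claim_duration = days_between(claim["ClaimStartDt"], claim["ClaimEndDt"])
```

Whether a same-day admission and discharge should count as zero or one day is a modelling choice that should be stated in the report.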
Important remark: Variables ClmAdmitDiagnosisCode, DiagnosisGroupCode, ClmDiagnosisCode_1 and
ClmProcedureCode_1 correspond to specific international or national coding systems (see footnote 3). You don’t need to know
or understand the details of what the codes mean. You can treat those variables as categorical
and investigate only the most significant levels.
• ClmAdmitDiagnosisCode represents the diagnosis code on the institutional encounter indicating the
beneficiary’s initial diagnosis at admission. This diagnosis code may not be confirmed after the patient
is evaluated; it may be different than the eventual diagnoses.
• DiagnosisGroupCode represents the diagnostic group to which a hospital claim belongs. It is a unique
identifier of a hospital case type that is based on similar clinical problems.
• ClmDiagnosisCode_1 represents the diagnosis code in the 1st position identifying the condition(s) for
which the beneficiary is receiving care.
• ClmProcedureCode_1 indicates the principal procedure performed during the period covered by the
institutional claim.
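One way to restrict a code variable to its most significant levels is to keep the most frequent codes and lump the rest into a single category. The sketch below is an illustration under assumptions: "most significant" is interpreted here as "most frequent", the code values are made up, and `lump_rare_levels` is a hypothetical helper name.

```python
from collections import Counter


def lump_rare_levels(values, top_n=2, other_label="Other"):
    """Keep the `top_n` most frequent levels and replace all others with `other_label`."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(top_n)}
    return [v if v in keep else other_label for v in values]


# Made-up diagnosis codes for illustration only.
codes = ["4019", "4019", "2724", "4019", "2724", "V5869", "78650"]

top_levels = Counter(codes).most_common(2)
lumped = lump_rare_levels(codes, top_n=2)
```

The choice of `top_n` is arbitrary here; in practice it could be guided by how much of the data the retained levels cover.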
4.3 Medicare_Outpatient.csv (Outpatient Data)
This dataset provides details about the claims filed for those patients who visited hospitals as outpatients.
Variable Description
BeneID: A unique ID assigned to each beneficiary (chr)
ClaimID: A unique ID assigned to each claim (chr)
ClaimStartDt: Start date of the claim (date)
ClaimEndDt: End date of the claim (date)
InscClaimAmtReimbursed: Claim amount reimbursed (num)
AttendingPhysician: Attending physician (chr)
OperatingPhysician: Operating physician (chr)
OtherPhysician: Other physician (chr)
ClmDiagnosisCode_1: Claim diagnosis code 1 (chr)
ClmProcedureCode_1: Claim procedure code 1 (num)
DeductibleAmtPaid: Deductible amount paid (num)
ClmAdmitDiagnosisCode: Claim admission diagnosis code (chr)
ProviderID: A unique ID assigned to each provider (chr)
Footnote 3: Reference: Research Data Assistance Center, weblink here.
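The outpatient table has no AdmissionDt, DischargeDt or DiagnosisGroupCode columns, so stacking it with the inpatient claims requires filling those fields with missing values. The following is a minimal sketch of one way to do this; the records are made up, and both the helper name `stack_claims` and the added `ClaimType` column are assumptions rather than part of the datasets.

```python
def stack_claims(inpatient, outpatient):
    """Stack both claim lists, filling absent columns with None and flagging the source."""
    # Collect the union of column names, preserving first-seen order.
    columns = []
    for row in inpatient + outpatient:
        for name in row:
            if name not in columns:
                columns.append(name)

    stacked = []
    for claim_type, rows in (("Inpatient", inpatient), ("Outpatient", outpatient)):
        for row in rows:
            full = {name: row.get(name) for name in columns}
            full["ClaimType"] = claim_type  # hypothetical flag column
            stacked.append(full)
    return stacked


# Made-up claims for illustration only.
inpatient = [{"ClaimID": "C1", "ProviderID": "P1", "AdmissionDt": "2009-04-12"}]
outpatient = [{"ClaimID": "C2", "ProviderID": "P1"}]

all_claims = stack_claims(inpatient, outpatient)
```

The `ClaimType` flag keeps the origin of each claim visible after stacking, which allows inpatient and outpatient behaviour to be compared per provider.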
4.4 Medicare_Beneficiary.csv (Beneficiary Details Data)
This dataset contains beneficiary individual details (e.g. date of birth, date of death, health conditions, state,
etc).
Variable Description
BeneID: A unique ID assigned to each beneficiary (chr)
DOB: Date of birth (date)
DOD: Date of death (date)
Gender: Gender 1 or 2 (categorical)
Race: Race 1 to 5 (categorical)
RenalDiseaseIndicator: Renal disease indicator “0” (No) or “Y” (Yes) (chr)
State: US state number (num)
County: County (num)
NoOfMonths_PartACov: Number of months Medicare Part A covered (num)
NoOfMonths_PartBCov: Number of months Medicare Part B covered (num)
ChronicCond_Alzheimer: Chronic condition Alzheimer 1 (Yes) or 2 (No) (num)
ChronicCond_Heartfailure: Chronic condition Heart failure 1 (Yes) or 2 (No) (num)
ChronicCond_KidneyDisease: Chronic condition Kidney Disease 1 (Yes) or 2 (No) (num)
ChronicCond_Cancer: Chronic condition Cancer 1 (Yes) or 2 (No) (num)
ChronicCond_ObstrPulmonary: Chronic condition Obstructive Pulmonary 1 (Yes) or 2 (No) (num)
ChronicCond_Depression: Chronic condition Depression 1 (Yes) or 2 (No) (num)
ChronicCond_Diabetes: Chronic condition Diabetes 1 (Yes) or 2 (No) (num)
ChronicCond_IschemicHeart: Chronic condition Ischemic Heart 1 (Yes) or 2 (No) (num)
ChronicCond_Osteoporasis: Chronic condition Osteoporosis 1 (Yes) or 2 (No) (num)
ChronicCond_rheumatoidarthritis: Chronic condition Rheumatoid arthritis 1 (Yes) or 2 (No) (num)
ChronicCond_stroke: Chronic condition Stroke 1 (Yes) or 2 (No) (num)
IPAnnualReimbursementAmt: Inpatient annual reimbursement amount (num)
IPAnnualDeductibleAmt: Inpatient annual deductible amount (num)
OPAnnualReimbursementAmt: Outpatient annual reimbursement amount (num)
OPAnnualDeductibleAmt: Outpatient annual deductible amount (num)
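The indicator variables above use different encodings (1/2 for chronic conditions, "0"/"Y" for renal disease), so recoding them to consistent labels can make the analysis easier to read. The sketch below is illustrative only: it assumes DOB is an ISO `YYYY-MM-DD` string, that a missing DOD is represented as `None`, and that age is measured at 31 December 2009 (an arbitrary reference date chosen for this example). `recode_beneficiary`, `Age` and `Deceased` are hypothetical names.

```python
from datetime import date

CHRONIC_LABELS = {1: "Yes", 2: "No"}
RENAL_LABELS = {"0": "No", "Y": "Yes"}


def recode_beneficiary(row, as_of=date(2009, 12, 31)):
    """Return a copy of `row` with labelled indicators plus derived Age and Deceased."""
    out = dict(row)
    for name, value in row.items():
        if name.startswith("ChronicCond_"):
            out[name] = CHRONIC_LABELS.get(value)  # None if the code is unexpected
    out["RenalDiseaseIndicator"] = RENAL_LABELS.get(row["RenalDiseaseIndicator"])
    dob = date.fromisoformat(row["DOB"])
    # Subtract one year if the birthday has not yet occurred by the reference date.
    out["Age"] = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    out["Deceased"] = row["DOD"] is not None
    return out


# Made-up beneficiary for illustration only.
beneficiary = {
    "BeneID": "B1",
    "DOB": "1943-06-15",
    "DOD": None,
    "RenalDiseaseIndicator": "Y",
    "ChronicCond_Diabetes": 1,
    "ChronicCond_stroke": 2,
}

recoded = recode_beneficiary(beneficiary)
```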
5 Resources
• Data manipulation with R: dplyr (weblink here)
• Merging with R (weblink here)
• Tidy data in R (weblink here)
• Exploratory Data Analysis with R (weblink here)
• Data visualisation in R with ggplot2 for fancy plots (weblink here)
• For any code-related question, google.com and stackoverflow.com are pretty helpful!
• As usual you can ask your questions on the course Ed forum.
6 Assignment submission procedure
6.1 Turnitin submission
Your assignment report must be uploaded as a single document and all parts must be in portrait
format. As long as the due date has not passed, you can resubmit your work; the previous version of your
assignment will be replaced by the new version.
Assignments must be submitted via the Turnitin submission box that is available on the course Moodle
website. Turnitin reports on any similarities between your cohort’s assignments, and also with regard to
other sources (such as the internet or all assignments submitted all around the world via Turnitin). More
information is available at: [click]. Please read this page, as we will assume that you are familiar with its
content. You can also find on the Moodle webpage the Turnitin Similarity Report Interpretation Guide
(2019).
Please also submit any programming code used in your analysis as a separate file in the dedicated
“Code only” Moodle assignment box on the course webpage. These will be referred to by the marker only if
needed, and in particular the report (with appendix) should be self-contained.
You need to check your document once it is submitted (check it on-screen). We will not mark assignments
that cannot be read on screen.
Students are reminded of the risk that technical issues may delay or even prevent their submission (such
as internet connection and/or computer breakdowns). Students should allow enough time (at least 24
hours is recommended) between their submission and the due time. The Turnitin module will not
let you submit a late report. Paper copies will be neither accepted nor graded.
6.2 Late submission
Please note that it is School policy that late submission of assignments will incur a penalty.
The penalty is 25% of the mark the student would otherwise have obtained for each full (or part) day of
lateness (e.g., 0 days 1 minute = 25% penalty, 2 days 21 hours = 75% penalty). Students who are late
must submit their assignment to the LIC via e-mail. The LIC will then upload documents to the relevant
submission boxes. The date and time of reception of the e-mail determines the submission time for the
purposes of calculating the penalty.
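The penalty arithmetic can be checked with a short calculation. This is only a sketch of the rule as stated above: the function name `late_penalty_fraction` is hypothetical, and capping the penalty at 100% is an assumption not stated in the policy.

```python
import math
from datetime import timedelta


def late_penalty_fraction(lateness):
    """Fraction of the mark deducted: 25% per full or part day late."""
    if lateness <= timedelta(0):
        return 0.0
    days_late = math.ceil(lateness / timedelta(days=1))  # part days count as full days
    return min(1.0, 0.25 * days_late)  # cap at 100% (assumption)


one_minute_late = late_penalty_fraction(timedelta(minutes=1))
nearly_three_days_late = late_penalty_fraction(timedelta(days=2, hours=21))
```

Under this reading, a submission that would otherwise have earned 16 marks and is 1 minute late would receive 16 × (1 − 0.25) = 12 marks.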
More information on Late submissions, extensions and special consideration is available in the Moodle course
webpage section Additional resources from UNSW (at the bottom).
6.3 Plagiarism awareness
Students are reminded that the work they submit must be their own. While we have no problem with
students working together on the assignment problems, the material students submit for assessment must
be their own.
Students should make sure they understand what plagiarism is: cases of plagiarism have a very high
probability of being discovered. For issues of collective work, having different persons marking the assignment
does not decrease this probability.
More information on Academic integrity and plagiarism is available in the Moodle course webpage section
Additional resources from UNSW (at the bottom).