Mid-term Research Project, COM SCI M226-1 (30% of final score)
Due date: January 19th, 2020

Pick one of the two:

1. Machine learning is known to pick up bias captured in the training data. For example, many natural language processing models have suggested doctors as male and nurses as female. This bias creates implicit discrimination, even when the algorithm excludes the variable from modeling, because of the exposure distribution. In this task, you are asked to:
   A) Build a gradient boosting model for the Allstate Claim Prediction Challenge: https://www.kaggle.com/c/ClaimPredictionChallenge/data
   B) Introduce 3 bias assessment measures to rank the variables by how biased they are. See https://towardsdatascience.com/machine-learning-and-discrimination-2ed1a8b01038 for reference.
   C) Identify the 3 most biased variables and illustrate the measures with supporting visualizations.
   D) Propose 1 method to reduce or remove the bias of 2 variables simultaneously. Using the same measures, report the improvement.
   E) Prepare a technical report illustrating your modeling and findings from A) to D).

2. Gradient boosting with both tree and linear base learners: xgboost and lightgbm are the most popular boosting libraries among data scientists. However, most applications presume trees as the base learners; other base learners are rarely utilized. xgboost includes a linear predictor as a base learner option (by setting booster="gblinear" in the parameters). However, the existing library does not allow both tree and linear predictors to estimate parameters in the same model run. In this assignment, you need to modify the source code of the lightgbm package (https://github.com/microsoft/LightGBM) to:
   A) Include a booster similar to gblinear from xgboost. The module should allow users to train a linear booster and predict on any admissible dataset. You can safely assume the dataset is fully numeric and treat missing values as zeros.
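As a starting point, the behaviour expected of the linear booster in part A can be sketched outside of LightGBM: at each boosting iteration, fit a (ridge-regularized) linear model to the current residuals and add a shrunken copy of it to the ensemble, with missing values treated as zeros as the assignment allows. This is a minimal illustrative sketch, not LightGBM source code; the class name `LinearBooster` and its parameters are assumptions for illustration only.

```python
import numpy as np

class LinearBooster:
    """Toy gblinear-style booster: boosts ridge-regression base learners
    on squared loss. Illustrative only, not the LightGBM API."""

    def __init__(self, n_iter=10, learning_rate=0.1, reg_lambda=1.0):
        self.n_iter = n_iter
        self.learning_rate = learning_rate
        self.reg_lambda = reg_lambda   # L2 penalty, analogous to gblinear's lambda
        self.coefs_ = []               # one weight vector (incl. bias) per iteration

    def fit(self, X, y):
        X = np.nan_to_num(np.asarray(X, dtype=float))  # missing values -> 0
        y = np.asarray(y, dtype=float)
        pred = np.zeros(len(y))
        Xb = np.hstack([X, np.ones((len(y), 1))])      # append bias column
        A = Xb.T @ Xb + self.reg_lambda * np.eye(Xb.shape[1])
        for _ in range(self.n_iter):
            residual = y - pred                        # negative gradient of squared loss
            # Ridge solution on the residuals: (X'X + lambda*I)^{-1} X'r
            w = np.linalg.solve(A, Xb.T @ residual)
            self.coefs_.append(w)
            pred += self.learning_rate * (Xb @ w)      # shrunken additive update
        return self

    def predict(self, X):
        X = np.nan_to_num(np.asarray(X, dtype=float))
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return sum(self.learning_rate * (Xb @ w) for w in self.coefs_)
```

Inside LightGBM the same fit-to-the-gradient step would be implemented in C++ with the same class interface as gbdt, as the tip below suggests.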
   (Tip: create a new module in the LightGBM/src/boosting/ folder with the same class structure as gbdt, and modify the other parts of the source code so that the new module can be called exactly like the gbdt class.)
   B) Enable the library to call a different booster at each iteration. For example, in a 500-iteration run, the model might select the tree learner (gbdt) in the first iteration and the linear learner in the 2nd and 3rd. The flow should be: in each iteration, a base-learner assignment mechanism is driven by a probability parameter provided by the user; the algorithm first generates a random number so that the base learners are assigned with the appropriate probability. You can safely assume gbdt and linear are the only members.
   (Bonus: instead of a constant probability, the probability of assigning each base learner can be adjusted per iteration using each learner's latest loss improvement. At iteration 0, set x_tree = x_linear = 999, so the probability of assigning the tree base learner is x_tree / (x_linear + x_tree) = 0.5. The resulting loss improvement (say 10) is captured and used as x_tree in the next iteration; in iteration 1, the probability of tree = 10 / (10 + 999).)
   C) The resulting booster (called gbdt_and_linear) should have the same class functions and objects as gbdt, i.e., training, prediction, metric calculation, etc.
   D) Write Python/R code to train the algorithm for the Allstate Claim Prediction Challenge and make one submission. Data can be found at https://www.kaggle.com/c/ClaimPredictionChallenge/data
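The per-iteration assignment mechanism from part B, including the bonus adaptive rule, can be sketched as follows. The function names (`choose_learners`, the two `train_*_iteration` callbacks) are illustrative stand-ins for the real gbdt and linear training steps, not LightGBM API; each callback is assumed to run one boosting iteration and return the loss improvement it achieved.

```python
import random

def choose_learners(n_iter, train_tree_iteration, train_linear_iteration,
                    init_score=999.0, rng=random.random):
    """Pick a base learner each iteration with probability proportional to
    its latest loss improvement (the bonus rule from part B)."""
    # Iteration 0: x_tree = x_linear = 999, so P(tree) = 999 / (999 + 999) = 0.5
    x_tree = x_linear = init_score
    schedule = []
    for _ in range(n_iter):
        p_tree = x_tree / (x_tree + x_linear)
        if rng() < p_tree:
            schedule.append("gbdt")
            x_tree = train_tree_iteration()      # latest loss improvement for trees
        else:
            schedule.append("linear")
            x_linear = train_linear_iteration()  # latest loss improvement for linear
    return schedule
```

For the constant-probability variant required by the base assignment, `p_tree` would simply be a user-supplied parameter instead of the adaptive ratio. Note how the bonus rule behaves: if the tree step improves the loss by 10, the next iteration's tree probability drops to 10 / (10 + 999), so learners that recently improved little are selected less often.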