Mid-term Research Project, COM SCI M226-1 (30% of final score)
Due date: January 19th, 2020

Pick one of the two:

1. Machine learning is known to pick up bias captured in the training data. For example, many natural language processing models have suggested doctors as male and nurses as female. This bias creates implicit discrimination, even when the algorithm excludes the variable from modeling, because of the exposure distribution. In this task, you are asked to:
   A) Build a gradient boosting model for the Allstate Claim Prediction Challenge: https://www.kaggle.com/c/ClaimPredictionChallenge/data
   B) Introduce 3 bias assessment measures to rank the variables by how biased they are. See https://towardsdatascience.com/machine-learning-and-discrimination-2ed1a8b01038 for reference.
   C) Identify the 3 most biased variables and illustrate the measures with supporting visualizations.
   D) Propose 1 method to reduce or remove the bias of 2 variables simultaneously. Using the same measures, report the improvement.
   E) Prepare a technical report illustrating your modeling and findings from A) to D).

2. Gradient boosting with both tree and linear base learners: xgboost and lightgbm are the most popular boosting libraries among data scientists. However, most applications presume trees as the base learners; other base learners are rarely utilized. xgboost includes a linear predictor as a base learner option (by setting booster="gblinear" in the parameters). However, the existing library does not allow both tree and linear predictors to estimate parameters in the same model run. In this assignment, you need to modify the source code of the lightgbm package (https://github.com/microsoft/LightGBM) to:
   A) Include a booster similar to gblinear from xgboost. The module should allow users to train a linear booster and predict on any admissible dataset. You can safely assume the dataset is fully numeric and treat missing values as zeros.
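As a starting point, the behaviour expected of the linear booster in part A can be sketched outside of LightGBM: at each boosting iteration, fit a (ridge-regularized) linear model to the current residuals and add a shrunken copy of it to the ensemble, with missing values treated as zeros as the assignment allows. This is a minimal illustrative sketch, not LightGBM source code; the class name `LinearBooster` and its parameters are assumptions for illustration only.

```python
import numpy as np

class LinearBooster:
    """Toy gblinear-style booster: boosts ridge-regression base learners
    on squared loss. Illustrative only, not the LightGBM API."""

    def __init__(self, n_iter=10, learning_rate=0.1, reg_lambda=1.0):
        self.n_iter = n_iter
        self.learning_rate = learning_rate
        self.reg_lambda = reg_lambda   # L2 penalty, analogous to gblinear's lambda
        self.coefs_ = []               # one weight vector (incl. bias) per iteration

    def fit(self, X, y):
        X = np.nan_to_num(np.asarray(X, dtype=float))  # missing values -> 0
        y = np.asarray(y, dtype=float)
        pred = np.zeros(len(y))
        Xb = np.hstack([X, np.ones((len(y), 1))])      # append bias column
        A = Xb.T @ Xb + self.reg_lambda * np.eye(Xb.shape[1])
        for _ in range(self.n_iter):
            residual = y - pred                        # negative gradient of squared loss
            # Ridge solution on the residuals: (X'X + lambda*I)^{-1} X'r
            w = np.linalg.solve(A, Xb.T @ residual)
            self.coefs_.append(w)
            pred += self.learning_rate * (Xb @ w)      # shrunken additive update
        return self

    def predict(self, X):
        X = np.nan_to_num(np.asarray(X, dtype=float))
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return sum(self.learning_rate * (Xb @ w) for w in self.coefs_)
```

Inside LightGBM the same fit-to-the-gradient step would be implemented in C++ with the same class interface as gbdt, as the tip below suggests.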
   (Tip: create a new module in the LightGBM/src/boosting/ folder with the same class structure as gbdt, and modify the other parts of the source code so that the new module can be called exactly like the gbdt class.)
   B) Enable the library to call a different booster at each iteration. For example, in a 500-iteration run, the model might select the tree learner (gbdt) in the first iteration and the linear learner in the 2nd and 3rd. The flow should be: in each iteration, a base-learner assignment mechanism is driven by a probability parameter provided by the user; the algorithm first generates a random number so that the base learners are assigned with the appropriate probability. You can safely assume gbdt and linear are the only members.
   (Bonus: instead of a constant probability, the probability of assigning each base learner can be adjusted per iteration using each learner's latest loss improvement. At iteration 0, set x_tree = x_linear = 999, so the probability of assigning the tree base learner is x_tree / (x_linear + x_tree) = 0.5. The resulting loss improvement (say 10) is captured and used as x_tree in the next iteration; in iteration 1, the probability of tree = 10 / (10 + 999).)
   C) The resulting booster (called gbdt_and_linear) should have the same class functions and objects as gbdt, i.e., training, prediction, metric calculation, etc.
   D) Write Python/R code to train the algorithm for the Allstate Claim Prediction Challenge and make one submission. Data can be found at https://www.kaggle.com/c/ClaimPredictionChallenge/data
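The per-iteration assignment mechanism from part B, including the bonus adaptive rule, can be sketched as follows. The function names (`choose_learners`, the two `train_*_iteration` callbacks) are illustrative stand-ins for the real gbdt and linear training steps, not LightGBM API; each callback is assumed to run one boosting iteration and return the loss improvement it achieved.

```python
import random

def choose_learners(n_iter, train_tree_iteration, train_linear_iteration,
                    init_score=999.0, rng=random.random):
    """Pick a base learner each iteration with probability proportional to
    its latest loss improvement (the bonus rule from part B)."""
    # Iteration 0: x_tree = x_linear = 999, so P(tree) = 999 / (999 + 999) = 0.5
    x_tree = x_linear = init_score
    schedule = []
    for _ in range(n_iter):
        p_tree = x_tree / (x_tree + x_linear)
        if rng() < p_tree:
            schedule.append("gbdt")
            x_tree = train_tree_iteration()      # latest loss improvement for trees
        else:
            schedule.append("linear")
            x_linear = train_linear_iteration()  # latest loss improvement for linear
    return schedule
```

For the constant-probability variant required by the base assignment, `p_tree` would simply be a user-supplied parameter instead of the adaptive ratio. Note how the bonus rule behaves: if the tree step improves the loss by 10, the next iteration's tree probability drops to 10 / (10 + 999), so learners that recently improved little are selected less often.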