Ctrl+4/5: comment
Data download link
1. Import the packages:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

2. Read in the data:

train_A=pd.read_csv('./input/A_train.csv')
train_B=pd.read_csv('./input/B_train.csv')
test_B=pd.read_csv('./input/B_test.csv')
train_B=train_B.drop('UserInfo_170',axis=1)  # drop the all-zero column

3. Handling features with severe missingness:

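train_B_info=train_B.describe()  # per-column summary; its first row is 'count'
meaningful_col=[]
for col in train_B_info.columns:
    # .ix was removed in pandas 1.0; .loc['count',col] reads the same cell as .ix[0,col]
    if train_B_info.loc['count',col]>train_B.shape[0]*0.01:
        meaningful_col.append(col)
train_B_1=train_B[meaningful_col].copy()
#print(train_B.shape)
#print(train_B_1.shape)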

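The first row of info (the output of train_B.describe()) is count. Missing values are not counted, so if a feature has a count of 1 while train_B.shape[0], the number of data points (rows), is 4000, that feature has 3999 missing values.
train_B_info.ix[0,col] is the value in row 0 (the count row), column col, of the describe table. In pandas 1.0 and later .ix no longer exists; train_B_info.loc['count',col] reads the same cell.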
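
The count-based filter can be tried on a toy frame. This is a minimal sketch: the column names and values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy data: 'mostly_missing' has only 1 non-null value out of 200 rows.
toy = pd.DataFrame({
    'full': np.arange(200, dtype=float),
    'mostly_missing': [1.0] + [np.nan] * 199,
})

info = toy.describe()
# The 'count' row holds the number of non-null values per column.
counts = info.loc['count']

# Keep columns whose non-null count exceeds 1% of the rows.
threshold = toy.shape[0] * 0.01
kept = [col for col in info.columns if counts[col] > threshold]
print(kept)
```

Only the column with enough non-null values survives the filter, which is the same rule the loop above applies to train_B.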

4. Filling missing values
Here we simply fill them with -999.
train_B_1=train_B_1.fillna(-999)

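5. Handling highly linearly correlated features

Analysis: if two features are perfectly linearly correlated, we only need to keep one of them, because the information in the second is almost entirely contained in the first. If both are kept, model performance will often degrade.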

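  • We choose to delete the highly linearly correlated features.
    relation=train_B_1.corr() computes the correlation matrix:
    [Figure: the correlation matrix relation, from the notes: Qianhai Credit "Haoxin Cup" Big Data Algorithm Competition, Beginner Notes]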

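What is the difference between ix and iloc in pandas? [1]

  • loc selects rows and columns by label (index label, column name)
  • iloc selects rows and columns by integer position
  • ix was a hybrid of loc and iloc; it was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use loc or iloc instead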
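
A small sketch of the difference between label-based and position-based access, on a toy frame whose labels and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]}, index=['x', 'y', 'z'])

by_label = df.loc['y', 'b']     # row label 'y', column label 'b'
by_position = df.iloc[1, 1]     # second row, second column (zero indexed)

# On a describe() table, row 0 is the 'count' row, so both forms agree.
info = df.describe()
count_by_label = info.loc['count', 'a']
count_by_position = info.iloc[0, 0]
print(by_label, by_position, count_by_label, count_by_position)
```

Both lookups reach the same cell; loc is usually the more readable choice when the row has a meaningful label such as 'count'.
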
# Refer to the correlation figure: entry (i, j) is row i, column j
length=relation.shape[0]
high_corr=list()
final_cols=[]
del_cols=[]
for i in range(length):
    if relation.columns[i] not in del_cols:
        final_cols.append(relation.columns[i])
        for j in range(i+1,length):
            # .iloc replaces the removed .ix for positional access
            if relation.iloc[i,j]>0.98 and relation.columns[j] not in del_cols:
                del_cols.append(relation.columns[j])
train_B_1=train_B_1[final_cols]
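
To see the filter at work, the same greedy loop can be wrapped in a function and run on synthetic data. This is a sketch under stated assumptions: the helper name drop_high_corr and the toy columns are hypothetical and are not part of the original notes.

```python
import numpy as np
import pandas as pd

def drop_high_corr(df, threshold=0.98):
    """Greedy filter: keep a column, then discard every later column
    whose correlation with it exceeds `threshold`."""
    relation = df.corr()
    kept, dropped = [], set()
    for i, col_i in enumerate(relation.columns):
        if col_i in dropped:
            continue
        kept.append(col_i)
        for j in range(i + 1, len(relation.columns)):
            if relation.iloc[i, j] > threshold:
                dropped.add(relation.columns[j])
    return df[kept]

rng = np.random.default_rng(0)
x = rng.normal(size=100)
toy = pd.DataFrame({
    'fea1': x,
    'fea2': 2 * x + 1,             # exact linear function of fea1
    'fea3': rng.normal(size=100),  # independent noise
})
reduced = drop_high_corr(toy)
print(list(reduced.columns))
```

Because fea2 is an exact linear function of fea1, it is discarded, while the independent fea3 is kept. Note that the condition only checks for correlation above the threshold, so strongly negatively correlated pairs are not removed.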

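6. Model training and testing.

6.2. Although the preprocessing and analysis above are simple, the competition data had already been preprocessed, so after this light processing the data is already reasonably expressive. Next we apply the xgboost model.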

# Model training
train_B_flag=train_B_1['flag']
train_B_1.drop('no',axis=1,inplace=True)
train_B_1.drop('flag',axis=1,inplace=True)

dtrain_B=xgb.DMatrix(data=train_B_1,label=train_B_flag)
Trate=0.25
params={'booster':'gbtree',
        'eta':0.1,
        'max_depth':4,
        'max_delta_step':0,
        'subsample':0.9,
        'colsample_bytree':0.9,
        'base_score':Trate,
        'objective':'binary:logistic',
        'lambda':5,
        'alpha':8,
        'seed':100  # the native xgboost parameter name is 'seed'
        }
params['eval_metric']='auc'
xgb_model=xgb.train(params,dtrain_B,num_boost_round=200,maximize=True,verbose_eval=True)

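  • drop removes rows by default; to remove columns, pass axis=1.
  • With inplace=True, the object behind the original name is modified directly (cases 2 and 3 below). With inplace=False, the original object is left unchanged, and the result must be assigned to a new name or back to the original name (case 1).
1. DF = DF.drop('column_name', axis=1)
2. DF.drop('column_name', axis=1, inplace=True)
3. DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)   # Note: zero indexed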
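
The two behaviours can be compared side by side. A minimal sketch on a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace: a new frame is returned and df itself is untouched.
smaller = df.drop('a', axis=1)
cols_after_copy_drop = list(df.columns)

# With inplace=True: df is modified and the call returns None.
result = df.drop('b', axis=1, inplace=True)
cols_after_inplace_drop = list(df.columns)
print(cols_after_copy_drop, cols_after_inplace_drop, result)
```

Since the in-place call returns None, writing DF = DF.drop(..., inplace=True) would overwrite the frame with None, so the two styles should not be mixed.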

6.3
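We pass the test features as test_B[train_B_1.columns]. The benefit is that it guards against accidentally misaligned features. For example, if the training features are ordered fea2, fea1, fea3 but the test features are ordered fea1, fea2, fea3, the predictions will be very poor.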

# Model testing
res=xgb_model.predict(xgb.DMatrix(test_B[train_B_1.columns].fillna(-999)))
test_B['pre']=res
test_B[['no','pre']].to_csv('submit.csv',index=False)  # only these two columns are saved to the csv
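
The alignment effect of indexing the test set with the training columns can be checked on toy frames. A minimal sketch: the frames train_X and test_X are made up for illustration, reusing the feature names from the example above.

```python
import pandas as pd

# Training features in one order, test features in another, plus an id column.
train_X = pd.DataFrame({'fea2': [1, 2], 'fea1': [3, 4], 'fea3': [5, 6]})
test_X = pd.DataFrame({'fea1': [7], 'fea2': [8], 'fea3': [9], 'no': [101]})

# Selecting by the training columns reorders the test features to match
# and leaves out columns the model was not trained on.
aligned = test_X[train_X.columns]
print(list(aligned.columns))
print(aligned.iloc[0].tolist())
```

The selection both reorders the features and drops the extra no column, so the model receives the features in the order it was trained on.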

7. Full source code:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

train_A=pd.read_csv('./input/A_train.csv')
train_B=pd.read_csv('./input/B_train.csv')
test_B=pd.read_csv('./input/B_test.csv')
train_B=train_B.drop('UserInfo_170',axis=1)
train_B_info=train_B.describe()


meaningful_col=[]
for col in train_B_info.columns:
    if train_B_info.loc['count',col]>train_B.shape[0]*0.01:  # .ix was removed in pandas 1.0
        meaningful_col.append(col)
train_B_1=train_B[meaningful_col].copy()
#print(train_B.shape)        
#print(train_B_1.shape)
train_B_1=train_B_1.fillna(-999)
relation=train_B_1.corr()
# Refer to the correlation figure: entry (i, j) is row i, column j
length=relation.shape[0]
high_corr=list()
final_cols=[]
del_cols=[]
for i in range(length):
    if relation.columns[i] not in del_cols:
        final_cols.append(relation.columns[i])
        for j in range(i+1,length):
            if relation.iloc[i,j]>0.98 and relation.columns[j] not in del_cols:
                del_cols.append(relation.columns[j])
train_B_1=train_B_1[final_cols]

# Model training
train_B_flag=train_B_1['flag']
train_B_1.drop('no',axis=1,inplace=True)
train_B_1.drop('flag',axis=1,inplace=True)

dtrain_B=xgb.DMatrix(data=train_B_1,label=train_B_flag)
Trate=0.25
params={'booster':'gbtree',
        'eta':0.1,
        'max_depth':4,
        'max_delta_step':0,
        'subsample':0.9,
        'colsample_bytree':0.9,
        'base_score':Trate,
        'objective':'binary:logistic',
        'lambda':5,
        'alpha':8,
        'seed':100  # the native xgboost parameter name is 'seed'
        }
params['eval_metric']='auc'
xgb_model=xgb.train(params,dtrain_B,num_boost_round=200,maximize=True,verbose_eval=True)

# Model testing
res=xgb_model.predict(xgb.DMatrix(test_B[train_B_1.columns].fillna(-999)))
test_B['pre']=res
test_B[['no','pre']].to_csv('submit.csv',index=False)  # only these two columns are saved to the csv

[1] https://www.zhihu.com/question/47362048
These notes follow https://www.kesci.com/home/project/59ca5ff521100106623f3db3

01-16 17:16