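Project link: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
Reference book: "Python for Data Analysis"
Reference link: https://pandas.pydata.org/pandas-docs/stable/index.html
Tool: Jupyter Notebook
Today covers data cleaning and initial data wrangling. I will post the prediction models as they are built over a planned period of 10 days.
Reflections: A. My first pass over the data structure was too hasty. For the predictions to hold up in practical use, the data needs further exploration.
B. When filling missing data, numeric columns were filled with the column mean and text columns with the ffill method, and both steps modified the original data in place. Be careful with this in future, or back up the original data first.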
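The backup mentioned in reflection B can be as simple as filling a copy. The sketch below uses a small made-up frame (the values are hypothetical, only the column names come from the loan data) and shows that the original keeps its missing cells while the copy is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data.
raw = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0],
    "Gender": ["Male", None, "Female"],
})

# Work on a copy so the original stays untouched.
filled = raw.copy()
filled["LoanAmount"] = filled["LoanAmount"].fillna(filled["LoanAmount"].mean())
filled["Gender"] = filled["Gender"].ffill()

print(raw.isnull().sum().sum())     # missing cells still present in the original
print(filled.isnull().sum().sum())  # missing cells left in the filled copy
```

Keeping `raw` around makes it possible to compare different filling strategies later without re-reading the CSV.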
(I) Understanding the data structure
[Screenshot]
Initial exploration: average loan amount, credit history, self-employment, and income
(II) Initial wrangling

import pandas as pd
import numpy as np
data=pd.read_csv("C:\\Users\\lx\\Desktop\\prediction\\train_u6lujuX_CVtuZ9i.csv",
                 index_col="Loan_ID")
import matplotlib.pyplot as plt
%matplotlib inline
#(1) Inspect the data, e.g. list the women who have not graduated and whose loan was approved
data.loc[(data['Gender']=='Female')
         &(data['Education']=='Not Graduate')
         &(data['Loan_Status']=='Y'),
        ['Gender','Education','Loan_Status']]

Output: [screenshot]

#(2) Count missing values
def missing(x):
    # x is one column (axis=0) or one row (axis=1) of the DataFrame
    return x.isnull().sum()
print('Missing values from every column:')
print(data.apply(missing,axis=0))
print('\nMissing values from every row:')
print(data.apply(missing,axis=1).head())

Output: [screenshot]
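The same counts can be obtained without a helper function by chaining `isnull()` and `sum()`. A minimal sketch on a hypothetical toy frame (values made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the loan data.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "LoanAmount": [np.nan, np.nan, 120.0],
})

per_column = toy.isnull().sum()     # missing values in each column
per_row = toy.isnull().sum(axis=1)  # missing values in each row
print(per_column)
print(per_row)
```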

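#(3) Fill missing values: replace missing numeric values with the column mean
data.fillna(data.mean(numeric_only=True),inplace=True)
print(data)
# Fill the remaining (text) columns by forward fill; inplace=True overwrites the original data
data.ffill(inplace=True)
print(data)
#(4) Check that the missing values have been filled
print(data.apply(missing, axis=0))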

Output: [screenshot]
As shown, all missing values have been filled.
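One caveat with a blanket mean fill: if a 0/1 flag such as Credit_History is stored as a number, it is also filled with the mean, which is neither 0 nor 1. A possible alternative (not what the post does) is to fill continuous columns with the mean and flags or text columns with the most frequent value. The sketch below assumes a hypothetical toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Credit_History": [1.0, np.nan, 0.0, 1.0],
    "Self_Employed": ["No", None, "No", "Yes"],
})

# The mean of a 0/1 flag is a fraction, not a valid category.
flag_mean = toy["Credit_History"].mean()

filled = toy.copy()
# Continuous column: fill with the mean.
filled["LoanAmount"] = filled["LoanAmount"].fillna(filled["LoanAmount"].mean())
# Flags and text columns: fill with the most frequent value (mode).
for col in ["Credit_History", "Self_Employed"]:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(flag_mean)
print(filled)
```

Whether the mode is a sensible fill depends on how the missing values arose, so this is only one option to compare against the mean and ffill approach.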

#(5) Group by 'Gender', 'Married' and 'Self_Employed' and look at the mean 'LoanAmount' of each group (missing values were filled above)
# Build a pivot table with pivot_table
Graphics=data.pivot_table(values=['LoanAmount'],index=['Gender','Married','Self_Employed'],aggfunc='mean')
print(Graphics)

Output: [screenshot]

As shown, self-employed women stand out with the highest mean loan amount by a clear margin.
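A group mean can rest on very few rows, so it may help to look at the group size next to the mean before reading much into it. The sketch below uses `groupby` on a hypothetical toy frame (made-up values) and also shows that it gives the same means as `pivot_table`.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Married": ["No", "No", "Yes", "Yes"],
    "Self_Employed": ["Yes", "Yes", "No", "No"],
    "LoanAmount": [100.0, 200.0, 120.0, 130.0],
})

keys = ["Gender", "Married", "Self_Employed"]
# Same means as the pivot table, computed with groupby.
by_pivot = toy.pivot_table(values="LoanAmount", index=keys, aggfunc="mean")
# Mean and number of rows per group side by side.
summary = toy.groupby(keys)["LoanAmount"].agg(["mean", "count"])
print(by_pivot)
print(summary)
```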

#(5) Consider the effect of credit history
# Build a cross-tabulation with crosstab
pd.crosstab(data['Credit_History'],data['Loan_Status'],margins=True)

Output: [screenshot]

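# Convert counts to proportions
def percConvert(ser):
  # Divide each row by its last entry (the 'All' total); use .iloc for positional access
  return ser/float(ser.iloc[-1])
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)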

Output: [screenshot]
As shown, applicants with a credit history were approved 79.58% of the time, whereas applicants without one were approved only 7.87% of the time.
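Row proportions can also be requested directly from `pd.crosstab` through its `normalize` parameter, which avoids the helper function. A minimal sketch on a hypothetical toy frame (the counts are made up and do not reproduce the figures above):

```python
import pandas as pd

# Hypothetical toy frame; counts are made up for illustration.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "Y"],
})

# normalize="index" divides each row by its row total.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```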

#(6) Consider whether the applicant is self-employed
pd.crosstab(data['Self_Employed'],data['Loan_Status'],margins=True)

Output: [screenshot]

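# Convert counts to proportions
def percConvert(ser):
  # Divide each row by its last entry (the 'All' total); use .iloc for positional access
  return ser/float(ser.iloc[-1])
pd.crosstab(data["Self_Employed"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)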

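Output: [screenshot]
As shown, self-employed applicants were approved 68.29% of the time and non-self-employed applicants 68.60% of the time, so loan status appears to be largely independent of self-employment (Y/N).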
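Two nearly equal percentages suggest independence, and a chi-square statistic on the count table is one way to check that more formally. The sketch below computes it by hand with NumPy. The counts are an assumption: they are illustrative numbers chosen only because they reproduce the two percentages, not values read from the screenshot.

```python
import numpy as np

# Illustrative counts (assumption): rows = self-employed Yes / No,
# columns = Loan_Status N / Y.
observed = np.array([[26.0, 56.0],
                     [157.0, 343.0]])

# Approval rate per row, for comparison with the percentages in the text.
approval_rates = observed[:, 1] / observed.sum(axis=1)

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic (no continuity correction).
chi2 = ((observed - expected) ** 2 / expected).sum()

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5PCT_1DOF = 3.841
print(approval_rates, chi2, chi2 < CRITICAL_5PCT_1DOF)
```

A statistic below the critical value means the table gives no evidence against independence at the 5% level, which agrees with the reading above.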

#(7) Consider the effect of income
# Build a small lookup DataFrame (one rate per Property_Area) to merge into the data
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates

Output: [screenshot]

data_merged = data.merge(right=prop_rates, how='inner',left_on='Property_Area',right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History',index=['Property_Area','rates'], aggfunc=len)

Output: [screenshot]
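When the lookup table has a single value column, `Series.map` is a lighter alternative to `merge`: it adds the column in place and keeps the original row order and index. A minimal sketch, with a hypothetical toy frame and made-up loan IDs:

```python
import pandas as pd

# Lookup table, same shape as prop_rates in the post.
prop_rates = pd.DataFrame([1000, 5000, 12000],
                          index=["Rural", "Semiurban", "Urban"],
                          columns=["rates"])

# Hypothetical toy frame with made-up IDs.
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Urban", "Semiurban"]},
                   index=["LP1", "LP2", "LP3", "LP4"])

# Look up each row's Property_Area in the index of prop_rates.
toy["rates"] = toy["Property_Area"].map(prop_rates["rates"])
print(toy)
```

Unlike an inner merge, `map` does not drop rows whose area is missing from the lookup table; those rows simply get NaN.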

# Sort the DataFrame by income, highest first
data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(20)

Output: [screenshot]

# Box plot and histogram of income by loan status
data.boxplot(column="ApplicantIncome",by="Loan_Status")
data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)

Output: [screenshot]
As shown, the income distributions are roughly the same for both loan statuses, so loan status does not appear to depend on income level.
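The visual comparison can be backed with per-group summary numbers. Because income data often contains a few very large values, the median is usually a steadier comparison than the mean. The sketch below uses a hypothetical toy frame whose values are made up to include one such outlier.

```python
import pandas as pd

# Hypothetical toy frame; the 81000 entry plays the role of an outlier.
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 5500, 3000, 4000, 81000],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# Median and mean income for each loan status.
summary = toy.groupby("Loan_Status")["ApplicantIncome"].agg(["median", "mean"])
print(summary)
```

In this toy data the medians match while the means differ sharply, which is the pattern a single outlier produces.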

10-07 13:15