马上AI全球挑战者大赛 - Default User Risk Prediction

2022-10-28

1. Solution Overview

In recent years, internet finance has become a clear trend in the financial industry. In finance, whether the business is investment or lending, risk control is always the core foundation. Consumer finance in particular serves customers characterized by small loan amounts, a very large customer base, and short loan terms, which is why it is widely regarded as the highest-risk segment.

Take lending as an example. Traditional finance relies mainly on the single channel of asset documents provided by the borrower, whereas internet finance can combine a user's offline asset profile with his or her online consumption behavior for a comprehensive analysis, providing a better service experience for the user and a fuller, better-informed assessment of the borrower for the lender.

As artificial intelligence and big data technologies continue to spread, using fintech to actively collect, analyze, and organize all kinds of financial data, and to deliver more precise risk-control services for specific customer segments, has become an effective way to solve risk control in consumer finance. In short, telling likely defaulters apart from other users is the key to more precise risk control in finance.

For this competition task, default user risk prediction in big-data finance, the solution described here consists of the following steps:

1. Preprocess the users' historical behavior data;
2. Split the historical data into training and validation sets;
3. Apply feature engineering to the users' historical data;
4. Apply feature selection to the constructed sample set;
5. Build several machine learning models and ensemble them;
6. Use the resulting model to predict, from historical behavior, whether a user will be overdue on repayment in the coming month.

Figure 1 shows the flowchart of this solution.

Figure 1: Flowchart of the default user risk prediction solution

2. Data Insights

2.1 Data Preprocessing

1. Outlier handling: the data contain unknown abnormal values. Filtering these rows out directly would shrink the training set, so instead they are filled with -1 (or some other value clearly distinguishable from the feature's normal range);

2. Multi-dimensional handling of missing values: in credit scoring, the completeness of a user's profile can affect the user's credit rating; a user whose profile is 100% complete is more likely to pass review and obtain a loan than one whose profile is only 50% complete. Missing values are therefore analyzed and handled from several angles. Counting missing values per column (attribute) gives each column's missing rate: MissRate_i = x_i / Count, where x_i is the number of missing values in attribute column i, Count is the total number of samples, and MissRate_i is that column's missing rate;

3. Other cleaning: whitespace handling, since some values contain stray spaces, e.g. "货到付款" (cash on delivery) and "货到付款 " are clearly the same value, so spaces are stripped; city-name handling, since values such as "重庆" and "重庆市" (both Chongqing) refer to the same city, so the trailing "市" (city) suffix is removed, which greatly reduces the number of distinct city values. A short sketch of these steps follows this list.
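As a rough sketch of these preprocessing steps, the per-column missing rate MissRate_i = x_i / Count and the cleaning rules can be expressed in pandas as follows (the toy frame and column names are illustrative, not the real competition tables):

import pandas as pd

# toy frame standing in for one of the raw tables
df = pd.DataFrame({'type_pay': ['货到付款', '货到付款 ', None],
                   'city': ['重庆', '重庆市', None]})

# per-column missing rate: MissRate_i = x_i / Count
miss_rate = df.isnull().sum() / len(df)
print(miss_rate)

df['type_pay'] = df['type_pay'].str.strip()                  # unify '货到付款' and '货到付款 '
df['city'] = df['city'].str.replace('市', '', regex=False)   # unify '重庆' and '重庆市'
df = df.fillna(-1)                                           # fill unknown values with -1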

2.2 Discovering Temporal Patterns

From the users' historical data, the numbers of defaulting and non-defaulting users are counted over time and visualized as shown below:

Figure 2: Defaulting and non-defaulting counts over time

The plot shows that default behavior varies with a clear periodic pattern over time, and that 2017 has noticeably higher volumes than 2016; the solution therefore includes many time-related features.
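The aggregation behind Figure 2 can be sketched as below, using the train_target table from the training data; the plotting step is omitted, and grouping by calendar month is an assumption about how the figure was produced:

import pandas as pd

target = pd.read_csv('../AI_risk_train_V3.0/train_target.csv', parse_dates=['appl_sbm_tm'])
monthly = (target
           .assign(month=target['appl_sbm_tm'].dt.to_period('M'))
           .groupby(['month', 'target'])
           .size()
           .unstack(fill_value=0))   # one column per target value: 0 = repaid, 1 = overdue
print(monthly)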

2.3 Splitting the Training and Validation Sets

Default risk builds up over a long period, so the conventional sliding-window split, in which training and test sets correspond to adjacent time windows, is not the best choice here. Instead, all historical user data are kept in the training set so that user behavior patterns can be learned more fully, and the validation set is built by cross-validation, as illustrated below:

Figure 3: Cross-validation diagram
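A minimal sketch of the k-fold cross-validation used to build validation folds from the full training history; the fold count, shuffling, and xgboost parameters here are illustrative defaults, not the competition settings:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, n_splits=5):
    # average validation AUC over k folds; X is the engineered feature frame, y the 0/1 target
    aucs = []
    for tr_idx, va_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        dtrain = xgb.DMatrix(X.iloc[tr_idx], label=y[tr_idx])
        dvalid = xgb.DMatrix(X.iloc[va_idx])
        model = xgb.train({'objective': 'binary:logistic', 'eval_metric': 'auc'},
                          dtrain, num_boost_round=200)
        aucs.append(roc_auc_score(y[va_idx], model.predict(dvalid)))
    return np.mean(aucs)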

3. Feature Engineering

3.1 Binary (0-1) Features

These are extracted mainly from the auth, credit, and user tables, in which each id appears only once. A sketch follows the list below.

Flag whether id_card, auth_time, and phone in the auth table are null;
Flag whether credit_score, overdraft, and quota in the credit table are null;
Flag whether sex, birthday, hobby, merriage, income, id_card, degree, industry, qq_bound, wechat_bound, and account_grade in the user table are null.
Flag whether id_card, auth_time, and phone in the auth table are normal (non-null);
Flag whether credit_score, overdraft, and quota in the credit table are normal (non-null);
Flag whether sex, birthday, hobby, merriage, income, id_card, degree, industry, qq_bound, wechat_bound, and account_grade in the user table are normal (non-null).
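A sketch of these null flags for the auth table; the same pattern applies to the credit and user tables. The file path is the one used in the script below, while the flag column names are illustrative:

import pandas as pd

auth = pd.read_csv('../AI_risk_train_V3.0/train_auth_info.csv')
flags = pd.DataFrame({'id': auth['id']})
for col in ['id_card', 'auth_time', 'phone']:
    flags['isnull_' + col] = auth[col].isnull().astype(int)     # 1 if the value is missing
    flags['normal_' + col] = auth[col].notnull().astype(int)    # 1 if the value is present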

3.2 Profile Completeness Features

These are extracted mainly from the auth, credit, and user tables: for every sample in these three tables, record an information-completeness value, defined as the number of non-null attributes divided by the total number of attributes.
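A sketch of the per-row completeness ratio, shown here for the user table (the column selection is illustrative):

import pandas as pd

user = pd.read_csv('../AI_risk_train_V3.0/train_user_info.csv')
attr_cols = [c for c in user.columns if c != 'id']        # attributes, excluding the key
completeness = pd.DataFrame({'id': user['id'],
                             'completeness': user[attr_cols].notnull().mean(axis=1)})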

3.3 One-Hot Features

These are extracted mainly from the user table.

One-hot encode the sex, merriage, income, degree, qq_bound, wechat_bound, and account_grade attributes of the user table.
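A sketch with pandas get_dummies, which is also how the scripts below encode these categorical columns:

import pandas as pd

user = pd.read_csv('../AI_risk_train_V3.0/train_user_info.csv')
onehot_cols = ['sex', 'merriage', 'income', 'degree', 'qq_bound', 'wechat_bound', 'account_grade']
user_onehot = pd.get_dummies(user, columns=onehot_cols)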

3.4 Business Features

These features are derived from business logic and turn out to be the most effective group. They are extracted mainly from the credit, auth, bankcard, and order tables (a short sketch follows the list):

(1) Difference between the loan application time (appl_sbm_tm) and the authentication time (auth_time)
(2) Difference between the loan application time (appl_sbm_tm) and the birthday (birthday)
(3) Reversed credit score (credit_score)
(4) Unused credit (quota minus overdraft)
(5) Credit utilization ratio (overdraft divided by quota)
(6) Whether the amount used exceeds the credit limit (overdraft greater than quota)
(7) Number of bank cards (bank_name)
(8) Number of distinct banks (bank_name)
(9) Number of distinct card types (card_type)
(10) Number of distinct reserved phone numbers (phone)
(11) From the order table: the count of amt_order records, the counts of type_pay = 在线支付 (online payment), type_pay = 货到付款 (cash on delivery), and sts_order = 已完成 (completed); the order table is deduplicated by id, keeping the first row for each duplicated id
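A sketch of features (3) through (6) built from the credit table, matching the constants used in the script below; the over_quota column name is a placeholder of mine:

import pandas as pd

credit = pd.read_csv('../AI_risk_train_V3.0/train_credit_info.csv')
credit['credit_score_inverse'] = 605 - credit['credit_score']                  # (3) reversed score
credit['can_use'] = credit['quota'] - credit['overdraft']                      # (4) unused credit
credit['quota_use_ratio'] = credit['overdraft'] / (credit['quota'] + 0.01)     # (5) utilization ratio
credit['over_quota'] = (credit['overdraft'] > credit['quota']).astype(int)     # (6) over the limit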

4. Feature Selection

The feature engineering stage produces basic, temporal, business, combined, and discretized features that together run to several hundred dimensions. Such a high-dimensional feature space risks the curse of dimensionality on one hand and makes the model prone to overfitting on the other, so feature selection is used to reduce the dimensionality. Model-based feature ranking is an efficient choice here because model training and feature selection happen in the same pass. We therefore use xgboost for selection: once the xgboost model is trained it outputs feature importances (see Figure 4), and we keep the top n features.
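A sketch of importance-based selection with xgboost; the top-n cutoff and the training parameters here are illustrative, not the tuned competition settings:

import xgboost as xgb

def select_top_features(X, y, n_top=100):
    # train once, rank features by importance, and keep the names of the top n
    dtrain = xgb.DMatrix(X, label=y)
    model = xgb.train({'objective': 'binary:logistic', 'eval_metric': 'auc'},
                      dtrain, num_boost_round=300)
    scores = model.get_score(importance_type='gain')    # {feature_name: importance}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]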

5. Model Training

Four xgboost models are trained in total, perturbed in both parameters and features. Each single model is tuned through parameter search and feature selection so that it is as strong as possible on its own, and the four models are then blended with different weights to produce the final result.
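A sketch of the weighted blend of several normalized model outputs; the weights are placeholders, not the ratios used in the competition:

import numpy as np

def minmax(p):
    # scale predictions to [0, 1] so differently scaled models can be blended for AUC
    p = np.asarray(p, dtype=float)
    return (p - p.min()) / (p.max() - p.min())

def blend(preds, weights):
    # weighted average of normalized prediction vectors
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * minmax(p) for wi, p in zip(w, preds))

# example: final = blend([p1, p2, p3, p4], [0.4, 0.3, 0.2, 0.1])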

6. Important Features

Feature importances output by the XGBOOST model are sorted in descending order and the top 20 are visualized below:

Figure 4: Feature importance ranking

The top 20 important features selected by the model are then listed in a table.
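A sketch of how such a top-20 ranking can be read out of a trained xgboost Booster; the importance_type and the helper name are my own choices:

import pandas as pd
import xgboost as xgb

def top20_table(model):
    # model is a trained xgb.Booster; returns a two-column feature/importance table
    scores = model.get_score(importance_type='weight')
    return (pd.Series(scores, name='importance')
              .sort_values(ascending=False)
              .head(20)
              .rename_axis('feature')
              .reset_index())

# xgb.plot_importance(model, max_num_features=20)   # equivalent bar chart (needs matplotlib)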

7. Innovations

7.1 Features

Many attributes in the raw data are messy, so attributes such as dates were cleaned to make feature extraction easier. Profile-completeness features were added, which makes good use of samples that contain null values. For the order table, whose id column contains duplicates, we tried extracting features after deduplicating by time, by first record, and by last record, and found that keeping the first record per id works best, which makes good use of the order information. Features were filtered by importance ranking, which also showed that the business-related features are the most important ones.

7.2 Model

The main innovation on the modeling side is the model fusion. The evaluation metric is AUC, which depends on the ranking of the predictions, so each model's outputs are normalized before the weighted fusion; this makes the blend work well.

8. Reflections on the Competition

Data cleaning is extremely important. Attributes such as timestamps were very messy and tedious to handle; we only processed them in a simple way, and more careful handling should give better results. Some attributes, such as hobby, were too complex to use, yet they certainly contain a lot of valuable information; unfortunately we did not find a good way to exploit them.

04094.py

# -*- coding: utf-8 -*-
import pandas as pd
import datetime
import sys
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import xgboost as xgb
import re
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc


def xgb_feature(train_set_x, train_set_y, test_set_x, test_set_y):
    # xgboost parameters
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.02,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'silent': 1
              }
    dtrain = xgb.DMatrix(train_set_x, label=train_set_y)
    dvali = xgb.DMatrix(test_set_x)
    model = xgb.train(params, dtrain, num_boost_round=800)
    predict = model.predict(dvali)
    return predict, model


if __name__ == '__main__':
    IS_OFFLine = False

    """Read auth data"""
    train_auth = pd.read_csv('../AI_risk_train_V3.0/train_auth_info.csv', parse_dates=['auth_time'])
    # flag id_card: 0 if null, 1 otherwise
    auth_idcard = train_auth['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # put the flag into a DataFrame
    auth_idcard_df = pd.DataFrame()
    auth_idcard_df['id'] = train_auth['id']
    auth_idcard_df['auth_idcard_df'] = auth_idcard
    # flag phone: 0 if null, 1 otherwise
    auth_phone = train_auth['phone'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # put the flag into a DataFrame
    auth_phone_df = pd.DataFrame()
    auth_phone_df['id'] = train_auth['id']
    auth_idcard_df['auth_phone_df'] = auth_phone

    """Read bankcard data"""
    train_bankcard = pd.read_csv('../AI_risk_train_V3.0/train_bankcard_info.csv')
    # number of bank cards (bank_name)
    train_bankcard_bank_count = train_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bankcard_count': lambda x: len(x)})
    # number of distinct card types (card_type)
    train_bankcard_card_count = train_bankcard.groupby(by=['id'], as_index=False)['card_type'].agg(
        {'card_type_count': lambda x: len(set(x))})
    # number of distinct reserved phone numbers (phone)
    train_bankcard_phone_count = train_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'phone_count': lambda x: len(set(x))})

    """Read credit data"""
    train_credit = pd.read_csv('../AI_risk_train_V3.0/train_credit_info.csv')
    # reversed credit score
    train_credit['credit_score_inverse'] = train_credit['credit_score'].map(lambda x: 605 - x)
    # unused credit = quota - overdraft
    train_credit['can_use'] = train_credit['quota'] - train_credit['overdraft']

    """Read order data"""
    train_order = pd.read_csv('../AI_risk_train_V3.0/train_order_info.csv', parse_dates=['time_order'])
    # clean amt_order: 'NA' or 'null' becomes NaN, otherwise cast to float
    train_order['amt_order'] = train_order['amt_order'].map(
        lambda x: np.nan if ((x == 'NA') | (x == 'null')) else float(x))
    # clean time_order: 0/NA/nan becomes NaT, otherwise parse a datetime string or a unix timestamp
    train_order['time_order'] = train_order['time_order'].map(
        lambda x: pd.NaT if (str(x) == '0' or x == 'NA' or x == 'nan') else (
            datetime.datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S') if ':' in str(x) else (
                datetime.datetime.utcfromtimestamp(int(x[0:10])) + datetime.timedelta(hours=8))))
    train_order_time_max = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'train_order_time_max': lambda x: max(x)})
    train_order_time_min = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'train_order_time_min': lambda x: min(x)})
    train_order_type_zaixian = train_order.groupby(by=['id']).apply(
        lambda x: x['type_pay'][(x['type_pay'] == ' 在线支付').values].count()).reset_index(name='type_pay_zaixian')
    train_order_type_huodao = train_order.groupby(by=['id']).apply(
        lambda x: x['type_pay'][(x['type_pay'] == '货到付款').values].count()).reset_index(name='type_pay_huodao')

    """Read receive address data"""
    train_recieve = pd.read_csv('../AI_risk_train_V3.0/train_recieve_addr_info.csv')
    # keep the first two characters of region
    train_recieve['region'] = train_recieve['region'].map(lambda x: str(x)[:2])
    tmp_tmp_recieve = pd.crosstab(train_recieve.id, train_recieve.region)
    tmp_tmp_recieve = tmp_tmp_recieve.reset_index()
    # number of fix_phone values per id
    tmp_tmp_recieve_phone_count = train_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].count())
    tmp_tmp_recieve_phone_count = tmp_tmp_recieve_phone_count.reset_index()
    # number of distinct fix_phone values per id
    tmp_tmp_recieve_phone_count_unique = train_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].nunique())
    tmp_tmp_recieve_phone_count_unique = tmp_tmp_recieve_phone_count_unique.reset_index()

    """Read target data"""
    train_target = pd.read_csv('../AI_risk_train_V3.0/train_target.csv', parse_dates=['appl_sbm_tm'])

    """Read user data"""
    train_user = pd.read_csv('../AI_risk_train_V3.0/train_user_info.csv')
    # flag hobby: 0 if nan, 1 otherwise
    is_hobby = train_user['hobby'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # put the flag into a DataFrame
    is_hobby_df = pd.DataFrame()
    is_hobby_df['id'] = train_user['id']
    is_hobby_df['is_hobby'] = is_hobby
    # flag id_card: 0 if nan, 1 otherwise
    is_idcard = train_user['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # put the flag into a DataFrame
    is_idcard_df = pd.DataFrame()
    is_idcard_df['id'] = train_user['id']
    is_idcard_df['is_hobby'] = is_idcard

    # user_birthday: flag special placeholder values
    tmp_tmp = train_user[['id', 'birthday']]
    tmp_tmp = tmp_tmp.set_index(['id'])
    is_double_ = tmp_tmp['birthday'].map(lambda x: (str(x) == '--') * 1).reset_index(name='is_double_')
    is_0_0_0 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0-0-0') * 1).reset_index(name='is_0_0_0')
    is_1_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '1-1-1') * 1).reset_index(name='is_1_1_1')
    is_0000_00_00 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0000-00-00') * 1).reset_index(name='is_0000_00_00')
    is_0001_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0001-1-1') * 1).reset_index(name='is_0001_1_1')
    is_hou_in = tmp_tmp['birthday'].map(lambda x: ('后' in str(x)) * 1).reset_index(name='is_hou_in')
    # is_nan = tmp_tmp['birthday'].map(lambda x: (str(x) == 'nan') * 1).reset_index(name='is_nan')
    # parse birthday into a datetime (only well-formed 19xx-x-x values)
    train_user['birthday'] = train_user['birthday'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d') if (
        re.match(r'19\d{2}-\d{1,2}-\d{1,2}', str(x)) and '-0' not in str(x)) else pd.NaT)
    # merge base features
    train_data = pd.merge(train_target, train_auth, on=['id'], how='left')
    train_data = pd.merge(train_data, train_user, on=['id'], how='left')
    train_data = pd.merge(train_data, train_credit, on=['id'], how='left')
    train_data['hour'] = train_data['appl_sbm_tm'].map(lambda x: x.hour)
    train_data['month'] = train_data['appl_sbm_tm'].map(lambda x: x.month)
    train_data['year'] = train_data['appl_sbm_tm'].map(lambda x: x.year)
    train_data['quota_use_ratio'] = train_data['overdraft'] / (train_data['quota'] + 0.01)
    train_data['nan_num'] = train_data.isnull().sum(axis=1)
    train_data['diff_day'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['auth_time']).days, axis=1)
    train_data['how_old'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['birthday']).days / 365, axis=1)

    # whether the two id_card fields agree
    auth_idcard = list(train_data['id_card_x'])
    user_idcard = list(train_data['id_card_y'])
    idcard_result = []
    for indexx, uu in enumerate(auth_idcard):
        if (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(0)
        elif (str(auth_idcard[indexx]) != 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(1)
        elif (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) != 'nan'):
            idcard_result.append(2)
        else:
            ttt1 = str(auth_idcard[indexx])[0] + str(auth_idcard[indexx])[-1]
            ttt2 = str(user_idcard[indexx])[0] + str(user_idcard[indexx])[-1]
            if ttt1 == ttt2:
                idcard_result.append(3)
            if ttt1 != ttt2:
                idcard_result.append(4)
    train_data['the_same_id'] = idcard_result

    train_bankcard_phone_list = train_bankcard.groupby(by=['id'])['phone'].apply(
        lambda x: list(set(x.tolist()))).reset_index(name='bank_phone_list')
    # merge
    train_data = pd.merge(train_data, train_bankcard_phone_list, on=['id'], how='left')
    train_data['exist_phone'] = train_data.apply(lambda x: x['phone'] in x['bank_phone_list'], axis=1)
    train_data['exist_phone'] = train_data['exist_phone'] * 1
    train_data = train_data.drop(['bank_phone_list'], axis=1)

    """bank_info"""
    bank_name = train_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bank_name_len': lambda x: len(set(x))})
    bank_num = train_bankcard.groupby(by=['id'], as_index=False)['tail_num'].agg(
        {'tail_num_len': lambda x: len(set(x))})
    bank_phone_num = train_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'bank_phone_num': lambda x: x.nunique()})

    train_data = pd.merge(train_data, bank_name, on=['id'], how='left')
    train_data = pd.merge(train_data, bank_num, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_time_max, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_time_min, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_type_zaixian, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_type_huodao, on=['id'], how='left')
    train_data = pd.merge(train_data, is_double_, on=['id'], how='left')
    train_data = pd.merge(train_data, is_0_0_0, on=['id'], how='left')
    train_data = pd.merge(train_data, is_1_1_1, on=['id'], how='left')
    train_data = pd.merge(train_data, is_0000_00_00, on=['id'], how='left')
    train_data = pd.merge(train_data, is_0001_1_1, on=['id'], how='left')
    train_data = pd.merge(train_data, is_hou_in, on=['id'], how='left')
    train_data = pd.merge(train_data, tmp_tmp_recieve, on=['id'], how='left')
    train_data = pd.merge(train_data, tmp_tmp_recieve_phone_count, on=['id'], how='left')
    train_data = pd.merge(train_data, tmp_tmp_recieve_phone_count_unique, on=['id'], how='left')
    train_data = pd.merge(train_data, bank_phone_num, on=['id'], how='left')
    train_data = pd.merge(train_data, is_hobby_df, on=['id'], how='left')
    train_data = pd.merge(train_data, is_idcard_df, on=['id'], how='left')
    train_data = pd.merge(train_data, auth_idcard_df, on=['id'], how='left')
    train_data = pd.merge(train_data, auth_phone_df, on=['id'], how='left')

    train_data['day_order_max'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['train_order_time_max']).days,
                                                   axis=1)
    train_data = train_data.drop(['train_order_time_max'], axis=1)
    train_data['day_order_min'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['train_order_time_min']).days,
                                                   axis=1)
    train_data = train_data.drop(['train_order_time_min'], axis=1)

    """order_info"""
    order_time = train_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_time': len})
    order_mean = train_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_mean': np.mean})
    unit_price_mean = train_order.groupby(by=['id'], as_index=False)['unit_price'].agg({'unit_price_mean': np.mean})
    order_time_set = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'order_time_set': lambda x: len(set(x))})

    train_data = pd.merge(train_data, order_time, on=['id'], how='left')
    train_data = pd.merge(train_data, order_mean, on=['id'], how='left')
    train_data = pd.merge(train_data, order_time_set, on=['id'], how='left')
    train_data = pd.merge(train_data, unit_price_mean, on=['id'], how='left')

    if IS_OFFLine == False:
        # evaluate on the test set
        train_data = train_data.drop(
            ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)

    if IS_OFFLine == True:
        # evaluate on the validation set
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        dummy_df = pd.get_dummies(train_data.loc[:, dummy_fea])
        # append the one-hot encoded columns
        train_data_copy = pd.concat([train_data, dummy_df], axis=1)
        train_data_copy = train_data_copy.fillna(0)
        # drop the original categorical columns
        vaild_train_data = train_data_copy.drop(dummy_fea, axis=1)
        # split into training and validation periods
        valid_train_train = vaild_train_data[vaild_train_data.appl_sbm_tm < datetime.datetime(2017, 4, 1)]
        valid_train_test = vaild_train_data[vaild_train_data.appl_sbm_tm >= datetime.datetime(2017, 4, 1)]
        valid_train_train = valid_train_train.drop(
            ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
        valid_train_test = valid_train_test.drop(
            ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
        # training features
        vaild_train_x = valid_train_train.drop(['target'], axis=1)
        # validation features
        vaild_test_x = valid_train_test.drop(['target'], axis=1)
        redict_result, modelee = xgb_feature(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
        print('valid auc', roc_auc_score(valid_train_test['target'].values, redict_result))
        sys.exit(23)

    """************************ test data processing ***********************************"""
"""auth_info"""
test_auth = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_auth_info.csv', parse_dates=['auth_time']) # 标记id_card,nan为0,否则为1
auth_idcard = test_auth['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
# 将标记数据加入到DataFrame中
auth_idcard_df = pd.DataFrame()
auth_idcard_df['id'] = test_auth['id']
auth_idcard_df['auth_idcard_df'] = auth_idcard
# 标记phone,nan为0,否则为1
auth_phone = test_auth['phone'].map(lambda x: 0 if str(x) == 'nan' else 1)
# 将标记数据加入到DataFrame中
auth_phone_df = pd.DataFrame()
auth_phone_df['id'] = test_auth['id']
auth_idcard_df['auth_phone_df'] = auth_phone test_auth['auth_time'].replace('0000-00-00', 'nan', inplace=True)
test_auth['auth_time'] = pd.to_datetime(test_auth['auth_time']) """bankcard_info"""
test_bankcard = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_bankcard_info.csv')
# 不同银行的银行卡(bankname)数目
test_bankcard_bank_count = test_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
{'bankcard_count': lambda x: len(x)})
# 不同银行卡类型(card_type)数目
test_bankcard_card_count = test_bankcard.groupby(by=['id'], as_index=False)['card_type'].agg(
{'card_type_count': lambda x: len(set(x))})
# 不同银行卡预留电话(phone)数目
test_bankcard_phone_count = test_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
{'phone_count': lambda x: len(set(x))}) """credit_info"""
test_credit = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_credit_info.csv')
# 信用评分反序
test_credit['credit_score_inverse'] = test_credit['credit_score'].map(lambda x: 605 - x)
# 额度-使用值
test_credit['can_use'] = test_credit['quota'] - test_credit['overdraft'] """order_info"""
test_order = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_order_info.csv', parse_dates=['time_order'])
# 标记amt_order数据,NA或null为nan,否则为本身
test_order['amt_order'] = test_order['amt_order'].map(
lambda x: np.nan if ((x == 'NA') | (x == 'null')) else float(x))
# 标记time_order,0、NA或nan为NaT,否则格式化
test_order['time_order'] = test_order['time_order'].map(
lambda x: pd.lib.NaT if (str(x) == '0' or x == 'NA' or x == 'nan')
else (datetime.datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S') if ':' in str(x)
else (datetime.datetime.utcfromtimestamp(int(x[0:10])) + datetime.timedelta(hours=8))))
# 根据id分组后的最大交易时间
test_order_time_max = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'test_order_time_max': lambda x: max(x)})
# 根据id分组后的最小交易时间
test_order_time_min = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'test_order_time_min': lambda x: min(x)})
# 根据id分组后的在线支付的数量
test_order_type_zaixian = test_order.groupby(by=['id']).apply(
lambda x: x['type_pay'][(x['type_pay'] == '在线支付').values].count()).reset_index(name='type_pay_zaixian')
# 根据id分组后的货到付款的数量
test_order_type_huodao = test_order.groupby(by=['id']).apply(
lambda x: x['type_pay'][(x['type_pay'] == '货到付款').values].count()).reset_index(name='type_pay_huodao') """recieve_addr_info"""
test_recieve = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_recieve_addr_info.csv')
# 截取region字符串前两位
test_recieve['region'] = test_recieve['region'].map(lambda x: str(x)[:2])
tmp_tmp_recieve = pd.crosstab(test_recieve.id, test_recieve.region)
tmp_tmp_recieve = tmp_tmp_recieve.reset_index()
# 根据id分组,fix_phone的数量
tmp_tmp_recieve_phone_count = test_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].count())
tmp_tmp_recieve_phone_count = tmp_tmp_recieve_phone_count.reset_index()
# 根据id分组,fix_phone的去重数量
tmp_tmp_recieve_phone_count_unique = test_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].nunique())
tmp_tmp_recieve_phone_count_unique = tmp_tmp_recieve_phone_count_unique.reset_index() """test_list"""
test_target = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm']) """test_user_info"""
test_user = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_user_info.csv', parse_dates=['birthday'])
# 标记hobby,nan为0,否则为1
is_hobby = test_user['hobby'].map(lambda x: 0 if str(x) == 'nan' else 1)
# 将标记数据加入到DataFrame中
is_hobby_df = pd.DataFrame()
is_hobby_df['id'] = test_user['id']
is_hobby_df['is_hobby'] = is_hobby
# 标记id_card,nan为0,否则为1
is_idcard = test_user['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
# 将标记数据加入到DataFrame中
is_idcard_df = pd.DataFrame()
is_idcard_df['id'] = test_user['id']
is_idcard_df['is_hobby'] = is_idcard # user_birthday
# 解决关联报错问题
train_user['id'] = pd.to_numeric(train_user['id'], errors='coerce') tmp_tmp = train_user[['id', 'birthday']]
tmp_tmp = tmp_tmp.set_index(['id'])
is_double_ = tmp_tmp['birthday'].map(lambda x: (str(x) == '--') * 1).reset_index(name='is_double_')
is_0_0_0 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0-0-0') * 1).reset_index(name='is_0_0_0')
is_1_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '1-1-1') * 1).reset_index(name='is_1_1_1')
is_0000_00_00 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0000-00-00') * 1).reset_index(name='is_0000_00_00')
is_0001_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0001-1-1') * 1).reset_index(name='is_0001_1_1')
is_hou_in = tmp_tmp['birthday'].map(lambda x: ('后' in str(x)) * 1).reset_index(name='is_hou_in')
# is_nan = tmp_tmp['birthday'].map(lambda x:(str(x) == 'nan')*1).reset_index(name='is_nan') # 格式化birthday
test_user['birthday'] = test_user['birthday'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d') if (
re.match('19\d{2}-\d{1,2}-\d{1,2}', str(x)) and '-0' not in str(x)) else pd.lib.NaT) test_data = pd.merge(test_target, test_auth, on=['id'], how='left')
    test_data = pd.merge(test_data, test_user, on=['id'], how='left')
    test_data = pd.merge(test_data, test_credit, on=['id'], how='left')

    test_data['hour'] = test_data['appl_sbm_tm'].map(lambda x: x.hour)
    test_data['month'] = test_data['appl_sbm_tm'].map(lambda x: x.month)
    test_data['year'] = test_data['appl_sbm_tm'].map(lambda x: x.year)
    test_data['quota_use_ratio'] = test_data['overdraft'] / (test_data['quota'] + 0.01)
    test_data['nan_num'] = test_data.isnull().sum(axis=1)
    test_data['diff_day'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['auth_time']).days, axis=1)
    test_data['how_old'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['birthday']).days / 365, axis=1)

    # whether the two id_card fields agree
    auth_idcard = list(test_data['id_card_x'])
    user_idcard = list(test_data['id_card_y'])
    idcard_result = []
    for indexx, uu in enumerate(auth_idcard):
        if (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(0)
        elif (str(auth_idcard[indexx]) != 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(1)
        elif (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) != 'nan'):
            idcard_result.append(2)
        else:
            ttt1 = str(auth_idcard[indexx])[0] + str(auth_idcard[indexx])[-1]
            ttt2 = str(user_idcard[indexx])[0] + str(user_idcard[indexx])[-1]
            if ttt1 == ttt2:
                idcard_result.append(3)
            if ttt1 != ttt2:
                idcard_result.append(4)
    test_data['the_same_id'] = idcard_result

    test_bankcard_phone_list = test_bankcard.groupby(by=['id'])['phone'].apply(
        lambda x: list(set(x.tolist()))).reset_index(name='bank_phone_list')
    test_data = pd.merge(test_data, test_bankcard_phone_list, on=['id'], how='left')
    test_data['exist_phone'] = test_data.apply(lambda x: x['phone'] in x['bank_phone_list'], axis=1)
    test_data['exist_phone'] = test_data['exist_phone'] * 1
    test_data = test_data.drop(['bank_phone_list'], axis=1)

    """bankcard_info"""
    bank_name = test_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bank_name_len': lambda x: len(set(x))})
    bank_num = test_bankcard.groupby(by=['id'], as_index=False)['tail_num'].agg(
        {'tail_num_len': lambda x: len(set(x))})
    bank_phone_num = test_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'bank_phone_num': lambda x: x.nunique()})

    test_data = pd.merge(test_data, bank_name, on=['id'], how='left')
    test_data = pd.merge(test_data, bank_num, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_time_max, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_time_min, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_type_zaixian, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_type_huodao, on=['id'], how='left')
    test_data = pd.merge(test_data, is_double_, on=['id'], how='left')
    test_data = pd.merge(test_data, is_0_0_0, on=['id'], how='left')
    test_data = pd.merge(test_data, is_1_1_1, on=['id'], how='left')
    test_data = pd.merge(test_data, is_0000_00_00, on=['id'], how='left')
    test_data = pd.merge(test_data, is_0001_1_1, on=['id'], how='left')
    test_data = pd.merge(test_data, is_hou_in, on=['id'], how='left')
    test_data = pd.merge(test_data, tmp_tmp_recieve, on=['id'], how='left')
    test_data = pd.merge(test_data, tmp_tmp_recieve_phone_count, on=['id'], how='left')
    test_data = pd.merge(test_data, tmp_tmp_recieve_phone_count_unique, on=['id'], how='left')
    test_data = pd.merge(test_data, bank_phone_num, on=['id'], how='left')
    test_data = pd.merge(test_data, is_hobby_df, on=['id'], how='left')
    test_data = pd.merge(test_data, is_idcard_df, on=['id'], how='left')
    test_data = pd.merge(test_data, auth_idcard_df, on=['id'], how='left')
    test_data = pd.merge(test_data, auth_phone_df, on=['id'], how='left')

    test_data['day_order_max'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['test_order_time_max']).days,
                                                 axis=1)
    test_data = test_data.drop(['test_order_time_max'], axis=1)
    test_data['day_order_min'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['test_order_time_min']).days,
                                                 axis=1)
    test_data = test_data.drop(['test_order_time_min'], axis=1)

    """order_info"""
    order_time = test_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_time': len})
    order_mean = test_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_mean': np.mean})
    unit_price_mean = test_order.groupby(by=['id'], as_index=False)['unit_price'].agg({'unit_price_mean': np.mean})
    order_time_set = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'order_time_set': lambda x: len(set(x))})

    test_data = pd.merge(test_data, order_time, on=['id'], how='left')
    test_data = pd.merge(test_data, order_mean, on=['id'], how='left')
    test_data = pd.merge(test_data, order_time_set, on=['id'], how='left')
    test_data = pd.merge(test_data, unit_price_mean, on=['id'], how='left')

    test_data = test_data.drop(
        ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
    test_data['target'] = -1

    test_data.to_csv('8288test.csv', index=None)
    train_data.to_csv('8288train.csv', index=None)

    dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
    train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
    train_test_data = train_test_data.fillna(0)
    dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
    train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
    train_test_data = train_test_data.drop(dummy_fea, axis=1)

    train_train = train_test_data.iloc[:train_data.shape[0], :]
    test_test = train_test_data.iloc[train_data.shape[0]:, :]
    train_train_x = train_train.drop(['target'], axis=1)
    test_test_x = test_test.drop(['target'], axis=1)

    predict_result, modelee = xgb_feature(train_train_x, train_train['target'].values, test_test_x, None)
    ans = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
    ans['PROB'] = predict_result
    ans = ans.drop(['appl_sbm_tm'], axis=1)
    minmin, maxmax = min(ans['PROB']), max(ans['PROB'])
    ans['PROB'] = ans['PROB'].map(lambda x: (x - minmin) / (maxmax - minmin))
    ans['PROB'] = ans['PROB'].map(lambda x: '%.4f' % x)
    ans.to_csv('04094test.csv', index=None)

stacking.py

# -*- coding: utf-8 -*-
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
import pandas as pd
import xgboost as xgb
import datetime
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np


def xgb_feature(X_train, y_train, X_test, y_test=None):
    # xgboost parameters
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.02,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'seed': 1111,
              'silent': 1
              }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvali = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain, num_boost_round=800)
    predict = model.predict(dvali)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def xgb_feature2(X_train, y_train, X_test, y_test=None):
    # xgboost parameters
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.015,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'seed': 11,
              'silent': 1
              }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvali = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain, num_boost_round=1200)
    predict = model.predict(dvali)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def xgb_feature3(X_train, y_train, X_test, y_test=None):
    # xgboost parameters
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.01,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'seed': 1,
              'silent': 1
              }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvali = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain, num_boost_round=2000)
    predict = model.predict(dvali)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def et_model(X_train, y_train, X_test, y_test=None):
    model = ExtraTreesClassifier(max_features='log2', n_estimators=1000, n_jobs=-1).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def gbdt_model(X_train, y_train, X_test, y_test=None):
    model = GradientBoostingClassifier(learning_rate=0.02, max_features=0.7, n_estimators=700, max_depth=5).fit(
        X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def logistic_model(X_train, y_train, X_test, y_test=None):
    model = LogisticRegression(penalty='l2').fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def lgb_feature(X_train, y_train, X_test, y_test=None):
    lgb_train = lgb.Dataset(X_train, y_train,
                            categorical_feature={'sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound',
                                                 'account_grade', 'industry'})
    lgb_test = lgb.Dataset(X_test,
                           categorical_feature={'sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound',
                                                'account_grade', 'industry'})
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': 25,
        'learning_rate': 0.01,
        'feature_fraction': 0.7,
        'bagging_fraction': 0.7,
        'bagging_freq': 5,
        'min_data_in_leaf': 5,
        'max_bin': 200,
        'verbose': 0,
    }
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=2000)
    predict = gbm.predict(X_test)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


if __name__ == '__main__':
    VAILD = False

    if VAILD == True:
        # training data
        train_data = pd.read_csv('8288train.csv', engine='python')
        # train_data = train_data.drop(
        #     ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
        # fill missing values
        train_data = train_data.fillna(0)
        # # test data
        # test_data = pd.read_csv('8288test.csv', engine='python')
        # # fill with 0
        # test_data = test_data.fillna(0)
        # One-Hot
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        dummy_df = pd.get_dummies(train_data.loc[:, dummy_fea])
        # append the one-hot encoded columns
        train_data_copy = pd.concat([train_data, dummy_df], axis=1)
        # fill with 0
        train_data_copy = train_data_copy.fillna(0)
        # drop the original categorical columns
        vaild_train_data = train_data_copy.drop(dummy_fea, axis=1)
        # training split
        valid_train_train = vaild_train_data[(vaild_train_data.year <= 2017) & (vaild_train_data.month < 4)]
        # validation split
        valid_train_test = vaild_train_data[(vaild_train_data.year >= 2017) & (vaild_train_data.month >= 4)]
        # training features
        vaild_train_x = valid_train_train.drop(['target'], axis=1)
        # validation features
        vaild_test_x = valid_train_test.drop(['target'], axis=1)
        # logistic regression model
        redict_result = logistic_model(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
        print('valid auc', roc_auc_score(valid_train_test['target'].values, redict_result))

        # dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        # for _fea in dummy_fea:
        #     print(_fea)
        #     le = LabelEncoder()
        #     le.fit(train_data[_fea].tolist())
        #     train_data[_fea] = le.transform(train_data[_fea].tolist())
        # train_data_copy = train_data.copy()
        # vaild_train_data = train_data_copy
        # valid_train_train = vaild_train_data[(vaild_train_data.year <= 2017) & (vaild_train_data.month < 4)]
        # valid_train_test = vaild_train_data[(vaild_train_data.year >= 2017) & (vaild_train_data.month >= 4)]
        # vaild_train_x = valid_train_train.drop(['target'], axis=1)
        # vaild_test_x = valid_train_test.drop(['target'], axis=1)
        # redict_result = lgb_feature(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
        # print('valid auc', roc_auc_score(valid_train_test['target'].values, redict_result))

    if VAILD == False:
        # training data
        train_data = pd.read_csv('8288train.csv', engine='python')
        train_data = train_data.fillna(0)
        # test data
        test_data = pd.read_csv('8288test.csv', engine='python')
        test_data = test_data.fillna(0)
        # concatenate the training and test data
        train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
        train_test_data = train_test_data.fillna(0)
        # split back
        train_data = train_test_data.iloc[:train_data.shape[0], :]
        test_data = train_test_data.iloc[train_data.shape[0]:, :]

        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        for _fea in dummy_fea:
            print(_fea)
            le = LabelEncoder()
            le.fit(train_data[_fea].tolist() + test_data[_fea].tolist())
            tmp = le.transform(train_data[_fea].tolist() + test_data[_fea].tolist())
            train_data[_fea] = tmp[:train_data.shape[0]]
            test_data[_fea] = tmp[train_data.shape[0]:]
        train_x = train_data.drop(['target'], axis=1)
        test_x = test_data.drop(['target'], axis=1)
        lgb_dataset = Dataset(train_x, train_data['target'], test_x, use_cache=False)

        """**********************************************************"""
        # training data
        train_data = pd.read_csv('8288train.csv', engine='python')
        train_data = train_data.fillna(0)
        # test data
        test_data = pd.read_csv('8288test.csv', engine='python')
        test_data = test_data.fillna(0)
        # concatenate the training and test data
        train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
        train_test_data = train_test_data.fillna(0)
        # One-Hot
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
        train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
        train_test_data = train_test_data.drop(dummy_fea, axis=1)

        train_train = train_test_data.iloc[:train_data.shape[0], :]
        test_test = train_test_data.iloc[train_data.shape[0]:, :]
        train_train_x = train_train.drop(['target'], axis=1)
        test_test_x = test_test.drop(['target'], axis=1)
        xgb_dataset = Dataset(X_train=train_train_x, y_train=train_train['target'], X_test=test_test_x, y_test=None,
                              use_cache=False)

        # heamy
        model_xgb = Regressor(dataset=xgb_dataset, estimator=xgb_feature, name='xgb', use_cache=False)
        model_xgb2 = Regressor(dataset=xgb_dataset, estimator=xgb_feature2, name='xgb2', use_cache=False)
        model_xgb3 = Regressor(dataset=xgb_dataset, estimator=xgb_feature3, name='xgb3', use_cache=False)
        model_lgb = Regressor(dataset=lgb_dataset, estimator=lgb_feature, name='lgb', use_cache=False)
        model_gbdt = Regressor(dataset=xgb_dataset, estimator=gbdt_model, name='gbdt', use_cache=False)
        pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt)
        stack_ds = pipeline.stack(k=5, seed=111, add_diff=False, full_test=True)
        stacker = Regressor(dataset=stack_ds, estimator=LinearRegression, parameters={'fit_intercept': False})
        predict_result = stacker.predict()

        ans = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
        ans['PROB'] = predict_result
        ans = ans.drop(['appl_sbm_tm'], axis=1)
        minmin, maxmax = min(ans['PROB']), max(ans['PROB'])
        ans['PROB'] = ans['PROB'].map(lambda x: (x - minmin) / (maxmax - minmin))
        ans['PROB'] = ans['PROB'].map(lambda x: '%.4f' % x)
        ans.to_csv('ans_stacking.csv', index=None)

Reference: https://github.com/chenkkkk/User-loan-risk-prediction
