3. Using LightAutoML

LightAutoML is an automated machine learning framework open-sourced in 2020. It targets a variety of tasks on tabular datasets, such as binary classification, multiclass classification, and regression, and handles different feature types: numeric, categorical, date, text, and so on. Installation is a single command: pip install -U lightautoml.
This article is a loose translation of "LightAutoML vs Titanic: 80% accuracy in several lines of code". Full code link .
We will walk through the Titanic dataset to see how it is used in practice. The main content is in Part 4; the earlier steps can be skimmed.
Step 1: Import packages

```python
import os
import time
import re

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
```
Step 2: Load the data

```python
from pathlib import Path

data_dir = Path("/data")

train_data = pd.read_csv(data_dir / 'train.csv')
test_data = pd.read_csv(data_dir / 'test.csv')
sample = pd.read_csv(data_dir / 'gender_submission.csv')

train_data.head()
```
The columns are:

Age: passenger age
Cabin: cabin number
Embarked: port of embarkation
Fare: ticket fare
Name: passenger name
Parch: number of parents/children aboard
SibSp: number of siblings/spouses aboard
PassengerId: passenger ID
Pclass: ticket class
Sex: sex
Ticket: ticket number
Survived: survived or not (the target)
Step 3: Clean the data and build new features

```python
def get_title(name):
    # Extract the title (the word directly before a period, e.g. "Mr.")
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

def create_extra_features(data):
    data['Ticket_type'] = data['Ticket'].map(lambda x: x[0:3])
    data['Name_Words_Count'] = data['Name'].map(lambda x: len(x.split()))
    # Cabin is NaN (a float) when missing, so this flags non-missing cabins
    data['Has_Cabin'] = data['Cabin'].map(lambda x: 1 - int(type(x) == float))
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['CategoricalFare'] = pd.qcut(data['Fare'], 5).astype(str)
    data['CategoricalAge'] = pd.cut(data['Age'], 5).astype(str)
    data['Title'] = data['Name'].apply(get_title).replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
         'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    data['Title'] = data['Title'].replace('Mlle', 'Miss')
    data['Title'] = data['Title'].replace('Ms', 'Miss')
    data['Title'] = data['Title'].replace('Mme', 'Mrs')
    data['Title'] = data['Title'].map(
        {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}).fillna(0)
    return data

train_data = create_extra_features(train_data)
test_data = create_extra_features(test_data)

tr_data, valid_data = train_test_split(train_data,
                                       test_size=0.2,
                                       stratify=train_data["Survived"],
                                       random_state=42)
```
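The title-extraction regex above can be sanity-checked on a few names in the dataset's "Surname, Title. Given names" format. A standalone sketch using only the standard library `re` module:

```python
import re

def get_title(name):
    # Grab the word that directly precedes a period, e.g. " Mr." -> "Mr"
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    return title_search.group(1) if title_search else ""

print(get_title("Braund, Mr. Owen Harris"))       # Mr
print(get_title("Heikkinen, Miss. Laina"))        # Miss
print(get_title("Oliva y Ocana, Dona. Fermina"))  # Dona (later mapped to 'Rare')
```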
Step 4: LightAutoML preset usage

1. Create a Task object

Choose this according to your problem; the constructor takes the task type and a metric. For example: task = Task('reg'). The task types are:
'reg': regression
'binary': binary classification
'multiclass': multiclass classification
The metric parameter specifies how model quality is measured. For example, to use the F1 metric, define:
```python
def f1_metric(y_true, y_pred):
    return f1_score(y_true, (y_pred > 0.5).astype(int))

task = Task('binary', metric=f1_metric)
```
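What this wrapper does: LightAutoML passes raw predicted probabilities as y_pred, the wrapper binarizes them at 0.5, and sklearn's f1_score scores the binarized labels. A pure-Python sketch of the same computation, for illustration only (not the sklearn implementation):

```python
def f1_at_half(y_true, y_pred):
    # Binarize probabilities at 0.5, then compute F1 by hand
    y_bin = [1 if p > 0.5 else 0 for p in y_pred]
    tp = sum(1 for t, p in zip(y_true, y_bin) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_bin) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_bin) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# precision = 1.0, recall = 2/3, so F1 is about 0.8
print(f1_at_half([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.8]))
```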
2. Set column roles

This tells LightAutoML which column is the target and which columns should be dropped from training.
```python
roles = {'target': 'Survived',
         'drop': ['PassengerId', 'Name', 'Ticket']}
```
3. Create an AutoML model from a preset

TabularAutoML creates a model that tunes itself automatically; we can then adjust its parameters based on LightAutoML's output.
```python
automl = TabularAutoML(task=task,
                       timeout=600,
                       cpu_limit=4,
                       general_params={'use_algos': [['lgb', 'lgb_tuned', 'cb']]})
```
The base algorithms are set via 'use_algos' in general_params. The available values are:
Linear model: 'linear_l2'
LightGBM with expert parameters tuned to the dataset: 'lgb'
LightGBM with parameters fine-tuned by Optuna: 'lgb_tuned'
CatBoost with expert parameters: 'cb'
CatBoost with parameters tuned by Optuna: 'cb_tuned'
 
Because stacking trains in two levels (LightAutoML calls them layers), 'use_algos': [['linear_l2', 'lgb', 'lgb_tuned'], ['lgb_tuned', 'cb']] means the first level trains the three listed algorithms and the second level trains the other two. After the second level finishes, the final prediction is built as a weighted average of those two algorithms' predictions. The whole parameter set can also be customized through a YAML config .
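That final weighted blend can be illustrated with plain Python. The prediction arrays and weights below are made up for the sketch; LightAutoML fits the blender weights itself on out-of-fold predictions:

```python
# Hypothetical last-level outputs of two models (e.g. lgb_tuned and cb)
pred_a = [0.80, 0.30, 0.55]
pred_b = [0.70, 0.10, 0.65]

# Hypothetical blender weights (LightAutoML learns these automatically)
w_a, w_b = 0.6, 0.4

# Final prediction = weighted average of the two models' outputs
final = [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]
print([round(x, 2) for x in final])  # [0.76, 0.22, 0.59]
```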
Another example:
```python
automl = TabularAutoML(task=task,
                       timeout=TIMEOUT,
                       cpu_limit=N_THREADS,
                       reader_params={'n_jobs': N_THREADS, 'cv': 10, 'random_state': RANDOM_SEED},
                       general_params={'use_algos': [['lgb', 'cb']],
                                       'return_all_predictions': True,
                                       'weighted_blender_max_nonzero_coef': 0.0},
                       verbose=2)
```
4. Predict

Get out-of-fold predictions on tr_data and predictions on the hold-out valid_data:
```python
oof_pred = automl.fit_predict(tr_data, roles=roles)
valid_pred = automl.predict(valid_data)
```
The log reports "Automl preset training completed in 67.70 seconds" — about a minute. The scores: OOF acc 0.85, val acc 0.83.
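"OOF" stands for out-of-fold: each training row is scored by a model that did not see that row during cross-validation, so OOF accuracy is an honest estimate on the training set. A toy illustration of the idea with two folds; the "model" here is just the mean label of the other fold (an assumption for the sketch — fit_predict uses the real trained pipeline):

```python
y = [0, 1, 1, 0, 1, 1]          # toy labels
folds = [[0, 1, 2], [3, 4, 5]]  # row indices held out per fold

oof = [None] * len(y)
for holdout in folds:
    train_idx = [i for i in range(len(y)) if i not in holdout]
    # "Model" = mean label of the training fold (illustration only)
    mean_pred = sum(y[i] for i in train_idx) / len(train_idx)
    for i in holdout:
        oof[i] = mean_pred  # predicted by a model that never saw row i

print([round(p, 3) for p in oof])
```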
```python
def acc_score(y_true, y_pred):
    return accuracy_score(y_true, (y_pred > 0.5).astype(int))

print('OOF acc: {}'.format(acc_score(tr_data['Survived'].values, oof_pred.data[:, 0])))
print('VAL acc: {}'.format(acc_score(valid_data['Survived'].values, valid_pred.data[:, 0])))
```

Output:

```
OOF acc: 0.851123595505618
VAL acc: 0.8324022346368715
```
5. Create AutoML with TabularUtilizedAutoML

The main difference between TabularUtilizedAutoML and TabularAutoML is that it utilizes the timeout as fully as possible; its config also appears to be more detailed.
```python
automl = TabularUtilizedAutoML(task=task,
                               timeout=600,
                               cpu_limit=4,
                               general_params={'use_algos': [['lgb', 'lgb_tuned', 'cb']]})

oof_pred = automl.fit_predict(tr_data, roles=roles)
```
The scores:
```python
valid_pred = automl.predict(valid_data)

print('OOF acc: {}'.format(acc_score(tr_data['Survived'].values, oof_pred.data[:, 0])))
print('VAL acc: {}'.format(acc_score(valid_data['Survived'].values, valid_pred.data[:, 0])))
```

Output:

```
OOF acc: 0.8693820224719101
VAL acc: 0.8212290502793296
```
6. Train on the full dataset

Train on all the data with the same setup, predict on the test set, and write the submission; the final leaderboard score is 0.79665.
```python
automl = TabularUtilizedAutoML(task=task,
                               timeout=600,
                               cpu_limit=4,
                               general_params={'use_algos': [['lgb', 'lgb_tuned', 'cb']]})

oof_pred = automl.fit_predict(train_data, roles=roles)
test_pred = automl.predict(test_data)

sample['Survived'] = (test_pred.data[:, 0] > 0.5).astype(int)
sample.to_csv('automl_utilized_600_f1_score.csv', index=False)
```
You can also check this  to learn more about further usage.