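I. Overview
In this article we first implement a wine quality prediction program with a regression algorithm, then reimplement it with AutoML, and compare the two implementations to learn how AutoML is applied.
The data is the UCI Wine Quality Dataset hosted on the competition site kaggle.com: https://www.kaggle.com/c/uci-wine-quality-dataset/data
The inputs are physicochemical measurements of each wine, such as alcohol content, and the output is the score given by wine tasters. The fields are described as follows: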
Data fields
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Other:
13 - id (unique ID for each sample, needed for submission)
II. Code
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Regression_WineQuality
{
    // One row of the CSV file; LoadColumn gives the zero-based column index.
    public class WineData
    {
        [LoadColumn(0)] public float FixedAcidity;
        [LoadColumn(1)] public float VolatileAcidity;
        [LoadColumn(2)] public float CitricACID;
        [LoadColumn(3)] public float ResidualSugar;
        [LoadColumn(4)] public float Chlorides;
        [LoadColumn(5)] public float FreeSulfurDioxide;
        [LoadColumn(6)] public float TotalSulfurDioxide;
        [LoadColumn(7)] public float Density;
        [LoadColumn(8)] public float PH;
        [LoadColumn(9)] public float Sulphates;
        [LoadColumn(10)] public float Alcohol;
        [LoadColumn(11), ColumnName("Label")] public float Quality;
        [LoadColumn(12)] public float Id;
    }

    // The regression trainer writes its prediction to the "Score" column.
    public class WinePrediction
    {
        [ColumnName("Score")] public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");

        static void Main(string[] args)
        {
            Train();
            Prediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void Train()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data: load the full file and hold out 20% for testing
            string trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-full.csv");
            var fulldata = mlContext.Data.LoadFromTextFile<WineData>(path: trainDataPath, separatorChar: ',', hasHeader: true);
            var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.2);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            // Build the learning pipeline and fit the model on the training data
            var dataProcessPipeline = mlContext.Transforms.DropColumns("Id")
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                .Append(mlContext.Transforms.Concatenate("Features", new string[] {
                    nameof(WineData.FixedAcidity),
                    nameof(WineData.VolatileAcidity),
                    nameof(WineData.CitricACID),
                    nameof(WineData.ResidualSugar),
                    nameof(WineData.Chlorides),
                    nameof(WineData.FreeSulfurDioxide),
                    nameof(WineData.TotalSulfurDioxide),
                    nameof(WineData.Density),
                    nameof(WineData.PH),
                    nameof(WineData.Sulphates),
                    nameof(WineData.Alcohol)}));

            var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
            var trainingPipeline = dataProcessPipeline.Append(trainer);
            var trainedModel = trainingPipeline.Fit(trainData);

            // Evaluate on the held-out test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            PrintRegressionMetrics(trainer.ToString(), metrics);

            // Save the model (make sure the target folder exists first)
            Console.WriteLine("====== Save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void Prediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricACID = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine Data Quality is:{wineQuality.PredictionQuality} ");
        }

        // Minimal helper; the original listing calls it without showing its body,
        // so this version is an assumption modeled on the output printed later in this article.
        static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
        {
            Console.WriteLine("*************************************************");
            Console.WriteLine($"*       Metrics for {name} regression model");
            Console.WriteLine("*------------------------------------------------");
            Console.WriteLine($"*       LossFn:        {metrics.LossFunction:0.##}");
            Console.WriteLine($"*       R2 Score:      {metrics.RSquared:0.##}");
            Console.WriteLine($"*       Absolute loss: {metrics.MeanAbsoluteError:#.##}");
            Console.WriteLine($"*       Squared loss:  {metrics.MeanSquaredError:#.##}");
            Console.WriteLine($"*       RMS loss:      {metrics.RootMeanSquaredError:#.##}");
            Console.WriteLine("*************************************************");
        }
    }
}
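We already covered the Poisson regression algorithm in the earlier article on judging facial attractiveness. This program introduces nothing new, so I won't explain it again; its main purpose is to serve as a baseline for comparison with the AutoML code below.
III. Automated learning (AutoML)
The general workflow of machine learning is much the same every time: prepare the data, define the features, choose an algorithm, train, and so on. Along the way we often face questions such as: which algorithm should we choose, and how should its parameters be configured? AutoML addresses this problem. The framework repeats the cycle of data selection, algorithm selection, parameter tuning, and evaluation many times, and through that process finds the model with the best evaluation result.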
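The search loop described above can be sketched in plain C#. This is only an illustration of the idea, not how ML.NET implements AutoML: the names TrialResult, AutoSearchSketch, and FindBest are hypothetical, each candidate is assumed to be a delegate that trains one configuration and returns its R-squared score, and the time budget is only checked between trials.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical types for illustration only; none of these are ML.NET APIs.
public sealed class TrialResult
{
    public string TrainerName { get; }
    public double RSquared { get; }
    public double RuntimeInSeconds { get; }

    public TrialResult(string trainerName, double rSquared, double runtimeInSeconds)
    {
        TrainerName = trainerName;
        RSquared = rSquared;
        RuntimeInSeconds = runtimeInSeconds;
    }
}

public static class AutoSearchSketch
{
    // Runs candidates one by one until the time budget is used up and
    // returns the run with the highest R-squared (null if nothing ran).
    public static TrialResult FindBest(
        IEnumerable<(string Name, Func<double> TrainAndScore)> candidates,
        TimeSpan budget,
        IProgress<TrialResult> progress = null)
    {
        var total = Stopwatch.StartNew();
        TrialResult best = null;

        foreach (var (name, trainAndScore) in candidates)
        {
            // The budget is checked before each trial; a running trial is not interrupted.
            if (total.Elapsed >= budget) break;

            var trialTimer = Stopwatch.StartNew();
            double rSquared = trainAndScore();   // train one configuration and evaluate it
            var result = new TrialResult(name, rSquared, trialTimer.Elapsed.TotalSeconds);

            progress?.Report(result);            // report every finished trial

            // Ignore failed runs (NaN) and keep the best score seen so far.
            if (!double.IsNaN(rSquared) && (best == null || rSquared > best.RSquared))
                best = result;
        }
        return best;
    }
}
```

The same trainer name may appear several times in the candidate list with different settings, which is how a single algorithm can be tried more than once during one experiment.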
The complete code is as follows:
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

namespace Regression_WineQuality
{
    public class WineData
    {
        [LoadColumn(0)] public float FixedAcidity;
        [LoadColumn(1)] public float VolatileAcidity;
        [LoadColumn(2)] public float CitricACID;
        [LoadColumn(3)] public float ResidualSugar;
        [LoadColumn(4)] public float Chlorides;
        [LoadColumn(5)] public float FreeSulfurDioxide;
        [LoadColumn(6)] public float TotalSulfurDioxide;
        [LoadColumn(7)] public float Density;
        [LoadColumn(8)] public float PH;
        [LoadColumn(9)] public float Sulphates;
        [LoadColumn(10)] public float Alcohol;
        [LoadColumn(11), ColumnName("Label")] public float Quality;
        [LoadColumn(12)] public float ID;
    }

    public class WinePrediction
    {
        [ColumnName("Score")] public float PredictionQuality;
    }

    // RegressionExperimentProgressHandler and Debugger are helper classes
    // from the sample project; see the code analysis in section IV.
    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");
        static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-train.csv");
        static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-test.csv");

        static void Main(string[] args)
        {
            TrainAndSave();
            LoadAndPrediction();

            Console.WriteLine("Hit any key to finish the app");
            Console.ReadKey();
        }

        public static void TrainAndSave()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data
            var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
            var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);

            // Run the AutoML experiment for at most ExperimentTime seconds
            var progressHandler = new RegressionExperimentProgressHandler();
            uint ExperimentTime = 200;

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
                .CreateRegressionExperiment(ExperimentTime)
                .Execute(trainData, "Label", progressHandler: progressHandler);

            Debugger.PrintTopModels(experimentResult);

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

            // Evaluate BestRun on the test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            Debugger.PrintRegressionMetrics(best.TrainerName, metrics);

            // Save the model (make sure the target folder exists first)
            Console.WriteLine("====== Save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void LoadAndPrediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricACID = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"Wine Data Quality is:{wineQuality.PredictionQuality} ");
        }
    }
}
IV. Code analysis
1. The AutoML process
var progressHandler = new RegressionExperimentProgressHandler();
uint ExperimentTime = 200;

ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
    .CreateRegressionExperiment(ExperimentTime)
    .Execute(trainData, "Label", progressHandler: progressHandler);

Debugger.PrintTopModels(experimentResult); // print the data for all models
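ExperimentTime is the maximum time allowed for the experiment, in seconds. progressHandler is a progress reporter: each time one training run completes, the framework calls its Report method once.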
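using System;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
{
    private int _iterationIndex;

    // Called by the framework once for every completed run.
    public void Report(RunDetail<RegressionMetrics> iterationResult)
    {
        _iterationIndex++;
        Console.WriteLine($"Report index:{_iterationIndex},TrainerName:{iterationResult.TrainerName},RuntimeInSeconds:{iterationResult.RuntimeInSeconds}");
    }
}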
The debug output is as follows:
Report index:1,TrainerName:SdcaRegression,RuntimeInSeconds:12.5244426
Report index:2,TrainerName:LightGbmRegression,RuntimeInSeconds:11.2034988
Report index:3,TrainerName:FastTreeRegression,RuntimeInSeconds:14.810409
Report index:4,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:14.7338553
Report index:5,TrainerName:FastForestRegression,RuntimeInSeconds:15.6224459
Report index:6,TrainerName:LbfgsPoissonRegression,RuntimeInSeconds:11.1668197
Report index:7,TrainerName:OnlineGradientDescentRegression,RuntimeInSeconds:10.5353
Report index:8,TrainerName:OlsRegression,RuntimeInSeconds:10.8905459
Report index:9,TrainerName:LightGbmRegression,RuntimeInSeconds:10.5703296
Report index:10,TrainerName:FastTreeRegression,RuntimeInSeconds:19.4470509
Report index:11,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:63.638882
Report index:12,TrainerName:LightGbmRegression,RuntimeInSeconds:10.7710518
After the experiment finishes, we print the data for all models with Debugger.PrintTopModels:
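using System;
using System.Linq;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class Debugger
{
    private const int Width = 114;

    // Prints every run that produced valid metrics, best R-squared first.
    public static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
    {
        var topRuns = experimentResult.RunDetails
            .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
            .OrderByDescending(r => r.ValidationMetrics.RSquared)
            .ToList();

        Console.WriteLine("Top models ranked by R-Squared --");
        PrintRegressionMetricsHeader();
        for (var i = 0; i < topRuns.Count; i++)
        {
            var run = topRuns[i];
            PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
        }
    }

    public static void PrintRegressionMetricsHeader()
    {
        CreateRow($"{"",-4} {"Trainer",-35} {"RSquared",8} {"Absolute-loss",13} {"Squared-loss",12} {"RMS-loss",8} {"Duration",9}", Width);
    }

    public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
    {
        CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds.Value,9:F1}", Width);
    }

    // Not shown in the original listing; this body is an assumption
    // modeled on the evaluation output printed later in this article.
    public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
    {
        Console.WriteLine("*************************************************");
        Console.WriteLine($"*       Metrics for {name} regression model");
        Console.WriteLine("*------------------------------------------------");
        Console.WriteLine($"*       LossFn:        {metrics.LossFunction:0.##}");
        Console.WriteLine($"*       R2 Score:      {metrics.RSquared:0.##}");
        Console.WriteLine($"*       Absolute loss: {metrics.MeanAbsoluteError:#.##}");
        Console.WriteLine($"*       Squared loss:  {metrics.MeanSquaredError:#.##}");
        Console.WriteLine($"*       RMS loss:      {metrics.RootMeanSquaredError:#.##}");
        Console.WriteLine("*************************************************");
    }

    // Pads one table row to a fixed width and frames it with '|'.
    public static void CreateRow(string message, int width)
    {
        Console.WriteLine("|" + message.PadRight(width - 2) + "|");
    }
}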
CreateRow only handles the layout of each row. The output is as follows:
Top models ranked by R-Squared --
|     Trainer                             RSquared Absolute-loss Squared-loss RMS-loss  Duration |
|1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6 |
|2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7 |
|3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4 |
|4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8 |
|5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8 |
|6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2 |
|7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6 |
|8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6 |
|9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9 |
|10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2 |
|11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5 |
|12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5 |
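The results show that some algorithms were tried more than once, but each run of the same algorithm used different hyperparameters, such as thresholds and tree depth.
2. Getting the best model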
RunDetail<RegressionMetrics> best = experimentResult.BestRun;
ITransformer trainedModel = best.Model;
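Once we have the best model, evaluating and saving it works the same way as in the earlier code. Evaluating it with the test data gives: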
*************************************************
*       Metrics for FastTreeTweedieRegression regression model
*------------------------------------------------
*       LossFn:        0.67
*       R2 Score:      0.34
*       Absolute loss: .63
*       Squared loss:  .67
*       RMS loss:      .82
*************************************************
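Judging from these results, prediction accuracy is only around 70%, and the R² on the test data is just 0.34 (lower than the 0.4731 measured during the experiment). A result like this cannot be used in production. The likely problem is that we have not found the key features that determine wine quality.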
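To make the two headline numbers easier to read, here is a minimal sketch of the textbook definitions of R² and RMS loss in plain C#. The class and method names are hypothetical, and this is not ML.NET's own implementation; it assumes the actual and predicted scores are already available as arrays.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of ML.NET.
public static class RegressionMetricsSketch
{
    // R² = 1 - (sum of squared errors) / (sum of squared deviations from the mean).
    // If every actual value is identical the denominator is zero and the result is not meaningful.
    public static double RSquared(double[] actual, double[] predicted)
    {
        CheckInputs(actual, predicted);
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }

    // RMS loss = square root of the mean squared error, in the same unit as the label.
    public static double RootMeanSquaredError(double[] actual, double[] predicted)
    {
        CheckInputs(actual, predicted);
        double mse = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();
        return Math.Sqrt(mse);
    }

    private static void CheckInputs(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");
    }
}
```

Under these definitions, a model that always predicts the mean quality score has an R² of 0 and a perfect model has an R² of 1, so 0.34 means the model explains roughly a third of the variation in the tasters' scores. The RMS loss of 0.82 is expressed on the same 0 to 10 scale as the quality score itself.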
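V. Summary
This article concludes the ML.NET Study Notes series. The original code used throughout the series comes mainly from: https://github.com/dotnet/machinelearning-samples .
That repository also contains examples of other algorithms, including clustering, matrix factorization, and anomaly detection. Their overall workflow is much the same, so with the foundation from this series, interested readers can explore them on their own.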
VI. Resources
Source code download: https://github.com/seabluescn/Study_ML.NET
Regression project name: Regression_WineQuality
AutoML project name: Regression_WineQuality_AutoML