Mu Li's Practical Machine Learning Notes (Class 1, Class 2)

Class 1: Industrial ML Applications

Manufacturing: Predictive maintenance, quality control

Retail: Recommendation, chatbot, demand forecasting

Healthcare: Alerts from real-time patient data, disease identification

Finance: Fraud detection, application processing

Automobile: Breakdown prediction, self-driving

  1. ML Workflow

Problem formulation → collect & process data → train & tune models → deploy models → monitor

  2. Challenges

Formulate problem: focus on the most impactful industrial problem.

Data: high-quality data is scarce; privacy issues

Train models: models are increasingly complex, data-hungry, and expensive to train

Deploy models: heavy computation is not suitable for real-time inference

Monitor: data distribution shifts, fairness issues
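
One way to make the monitoring challenge concrete is to compare a feature's distribution at training time with what the deployed model sees. The sketch below uses the Population Stability Index (PSI), a technique that is not part of the original notes and is chosen here only for illustration. The function name `psi`, the bin count, the smoothing constant `eps`, and the toy data are all assumptions.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty numeric samples.

    Bins are equal-width over the range of `expected` (the training data);
    values outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Toy feature values: training data, unchanged live data, shifted live data.
train = [i / 100 for i in range(100)]
live_same = list(train)
live_shifted = [x + 0.5 for x in train]

print("no shift:", psi(train, live_same))
print("shifted: ", psi(train, live_shifted))
```

A value of 0 means the binned distributions match, and larger values indicate a larger shift. Which threshold should trigger an alert or retraining depends on the application.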

  3. Roles

Domain experts: have business insights, know what data is important and where to find it, identify the real impact of an ML model

Data scientists: full stack on data mining, model training and deployment

ML experts: customize SOTA ML models

SDEs (software development engineers): develop/maintain data pipelines, model training and serving pipelines

Class 2: Data Acquisition

Flow chart for data acquisition


Discover what data is available

Identify existing datasets

Find benchmark datasets to evaluate a new idea

Sources of Popular ML datasets

MNIST: digits written by employees of the US Census Bureau

ImageNet: millions of images from image search engines

AudioSet: YouTube sound clips for sound classification

Kinetics: YouTube video clips for human action classification

Where to find datasets?

Papers with Code Datasets: academic datasets with leaderboards

Kaggle Datasets: ML datasets uploaded by data scientists

Google Dataset Search: search for datasets on the web

Datasets comparison

Academic datasets: clean, appropriate difficulty, but limited choices, oversimplified, usually small scale

Competition datasets: closer to real ML applications, still simplified, and only available for hot topics

Raw data: great flexibility, but needs a lot of effort to process

You often need to deal with raw data in industrial settings

Data integration

Combine data from multiple sources into a coherent dataset

Product data is often stored in multiple tables

Join tables by keys, which are often entity IDs

Key issues: identify IDs, missing rows, redundant columns, value conflicts
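
The key issues above can be sketched with a small left join written in plain Python. This is only an illustration: the tables `products` and `inventory`, their columns, and the helper `left_join` are hypothetical, and the helper assumes each ID appears at most once in the right-hand table.

```python
# Hypothetical tables, each a list of row dicts sharing the entity ID "product_id".
products = [
    {"product_id": 1, "name": "kettle", "price": 25.0},
    {"product_id": 2, "name": "toaster", "price": 40.0},
    {"product_id": 3, "name": "blender", "price": 60.0},
]
inventory = [
    {"product_id": 1, "stock": 12, "price": 25.0},  # redundant column, same value
    {"product_id": 2, "stock": 0, "price": 42.0},   # value conflict on "price"
    # product_id 3 has no inventory row (missing row)
]

def left_join(left, right, key):
    """Left-join two lists of row dicts on `key`.

    Returns (joined_rows, missing_keys, conflicts), where each conflict is
    (key_value, column, left_value, right_value). On a conflict the left
    value is kept.
    """
    right_by_key = {row[key]: row for row in right}
    joined, missing, conflicts = [], [], []
    for lrow in left:
        merged = dict(lrow)
        rrow = right_by_key.get(lrow[key])
        if rrow is None:
            missing.append(lrow[key])
        else:
            for col, value in rrow.items():
                if col in merged and merged[col] != value:
                    conflicts.append((lrow[key], col, merged[col], value))
                else:
                    merged[col] = value  # new column, or redundant one that agrees
        joined.append(merged)
    return joined, missing, conflicts

rows, missing, conflicts = left_join(products, inventory, "product_id")
print("missing rows for IDs:", missing)
print("value conflicts:", conflicts)
```

In practice this kind of join is usually done with a database or a dataframe library. The point here is only that missing rows, redundant columns, and conflicting values should be detected and reported rather than silently merged.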

Generate synthetic data

Use GANs (GANs generate similar-looking images)

Data augmentation (image augmentation)
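
As a rough illustration of image augmentation, the sketch below produces several variants of one image with a random crop followed by an optional horizontal flip. The image is a toy 4x4 grid of pixel values stored as nested lists, and the helper names `horizontal_flip`, `random_crop`, and `augment` are assumptions made for this example, not functions from any library.

```python
import random

def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, rng, crop_h, crop_w, flip_prob=0.5):
    """Random crop, then flip horizontally with probability flip_prob."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out

# Toy 4x4 grayscale "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
rng = random.Random(0)  # fixed seed so the run is repeatable
augmented = [augment(image, rng, 3, 3) for _ in range(5)]

for variant in augmented:
    print(variant)
```

Each variant keeps the label of the original image, so a single labeled example yields several training examples. Real pipelines typically add more transformations, such as rotation or color jitter.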

Summary

Finding the right data is challenging

Raw data in industrial settings vs. academic datasets

Data integration combines data from multiple sources

Data augmentation is a common practice

Synthesizing data is getting popular
