Manufacturing: Predictive maintenance, quality control
Retail: Recommendation, chatbots, demand forecasting
Healthcare: Alerts from real-time patient data, disease identification
Finance: Fraud detection, application processing
Automobile: Breakdown prediction, self-driving
Problem formulation – collect & process data – train & tune models – deploy models – monitor
Formulate problem: focus on the most impactful industrial problem.
Data: high-quality data is scarce, privacy issues
Train models: models are more and more complex, data-hungry, expensive
Deploy models: heavy computation is not suitable for real-time inference
Monitor: data distribution shifts, fairness issues
Domain experts: have business insights, know what data is important and where to find it, identify the real impact of an ML model
Data scientists: full stack on data mining, model training and deployment
ML experts: customize SOTA ML models
SDE: develop/maintain data pipelines, model training and serving pipelines
Flow chart for data acquisition
Discover what data is available
Identify existing datasets
Find benchmark datasets to evaluate a new idea
Sources of popular ML datasets
MNIST: digits written by employees of the US Census Bureau
ImageNet: millions of images from image search engines
AudioSet: YouTube sound clips for sound classification
Kinetics: YouTube video clips for human action classification
Where to find datasets?
Papers with Code Datasets: academic datasets with leaderboards
Kaggle Datasets: ML datasets uploaded by data scientists
Google Dataset Search: search for datasets on the web
Datasets comparison
Academic datasets: clean, proper difficulty, limited choices, too simplified, usually small scale
Competition datasets: closer to real ML applications, still simplified, and only available for hot topics
Raw data: great flexibility, needs a lot of effort to process
You often need to deal with raw data in industrial settings
Combine data from multiple sources into a coherent dataset
Product data is often stored in multiple tables
Join tables by keys, which are often entity IDs
Key issues: identify IDs, missing rows, redundant columns, value conflicts
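A minimal sketch of joining two tables on an entity ID, assuming pandas is available. The tables, column names (product_id, price, stock) and values are made up for illustration. It shows one way to surface two of the key issues above: rows missing from one table and values that conflict between tables.

```python
import pandas as pd

# Hypothetical product tables; names and values are illustrative only.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name": ["kettle", "toaster", "blender"],
    "price": [25.0, 40.0, 60.0],
})
inventory = pd.DataFrame({
    "product_id": [1, 2, 4],
    "price": [25.0, 42.0, 15.0],  # disagrees with products for product_id 2
    "stock": [10, 0, 7],
})

# Outer join on the entity ID. indicator=True adds a "_merge" column that
# records whether each row came from the left table, the right table, or both.
merged = products.merge(
    inventory,
    on="product_id",
    how="outer",
    suffixes=("_catalog", "_inventory"),
    indicator=True,
)

# Missing rows: IDs that appear in only one of the two tables.
missing = merged[merged["_merge"] != "both"]

# Value conflicts: the ID is in both tables but the prices disagree.
both = merged[merged["_merge"] == "both"]
conflicts = both[both["price_catalog"] != both["price_inventory"]]

print(sorted(missing["product_id"].tolist()))
print(conflicts["product_id"].tolist())
```

The duplicated price column is kept under two suffixed names so the conflict stays visible. Which value to trust is a decision for whoever owns the data, not something the join can settle.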
Use GANs (generate similar images)
Data augmentations (image augmentation)
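A minimal sketch of image augmentation, assuming NumPy and an image stored as an H x W x C array. The function name random_flip_and_crop and the sizes used are hypothetical, and random flips and crops are only two of many possible augmentations.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Return a randomly flipped and cropped copy of an H x W x C image."""
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    if crop_h > height or crop_w > width:
        raise ValueError("crop_size must fit inside the image")
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Pick the top-left corner of the crop uniformly at random.
    top = rng.integers(0, height - crop_h + 1)
    left = rng.integers(0, width - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w].copy()

rng = np.random.default_rng(0)
# Random pixels stand in for a real photo here.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
# Each call yields a slightly different view of the same image.
augmented = [random_flip_and_crop(image, (28, 28), rng) for _ in range(4)]
```

Each augmented copy keeps the original label, so one labeled image yields several training examples.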
Finding the right data is challenging
Raw data in industrial settings vs. academic datasets
Data integration combines data from multiple sources
Data augmentation is a common practice
Synthesizing data is getting popular