[Deep Learning] Feature Engineering

Feature engineering is the process of creating new features from raw data so that a learning algorithm performs better.
Feature engineering is not the same thing as feature selection:

  • feature engineering: This process attempts to create additional relevant features from the existing raw features in the data, in order to increase the predictive power of the learning algorithm.
  • feature selection: This process selects the key subset of original data features in an attempt to reduce the dimensionality of the training problem.

Normally feature engineering is applied first to generate additional features, and then the feature selection step is performed to eliminate irrelevant, redundant, or highly correlated features.

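The following uses TensorFlow as an example to outline how feature engineering is applied in practical projects:
Facets, Google's visualization tool for dataset analysis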

  • For numeric features (numeric values)
    e.g. age and income have a non-linear relationship



Bucketing can be used to split the value range into buckets, so that each bucket gets its own weight.


[Figure: using buckets for different weights]

In TensorFlow this is available directly:

import tensorflow as tf

# Bucketize the numeric 'age' column at the given boundaries
age_buckets = tf.feature_column.bucketized_column(
  tf.feature_column.numeric_column('age'),
  boundaries=[31, 46, 60, 75, 90]
)
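To make the bucketing idea concrete, here is a minimal pure-Python sketch of what a bucketized feature looks like to a linear model. It reuses the boundaries from the snippet above; the helper names (`bucket_index`, `one_hot_bucket`, `linear_score`) and the weights in the usage note are hypothetical, and the sketch assumes buckets include their left boundary and exclude their right one.

```python
from bisect import bisect_right

# Boundaries copied from the snippet above. They define len(BOUNDARIES) + 1 buckets:
# (-inf, 31), [31, 46), [46, 60), [60, 75), [75, 90), [90, +inf)
BOUNDARIES = [31, 46, 60, 75, 90]


def bucket_index(value, boundaries=BOUNDARIES):
    """Return the index of the bucket that `value` falls into."""
    # bisect_right sends a value equal to a boundary into the upper bucket.
    return bisect_right(boundaries, value)


def one_hot_bucket(value, boundaries=BOUNDARIES):
    """Encode `value` as a one-hot vector with one slot per bucket."""
    vector = [0] * (len(boundaries) + 1)
    vector[bucket_index(value, boundaries)] = 1
    return vector


def linear_score(value, bucket_weights, boundaries=BOUNDARIES):
    """A linear model on the one-hot encoding picks the weight of the active bucket."""
    encoded = one_hot_bucket(value, boundaries)
    return sum(w * x for w, x in zip(bucket_weights, encoded))
```

With made-up weights such as `[0.1, 0.5, 0.9, 0.7, 0.3, 0.1]`, `linear_score(50, ...)` simply returns the weight of the [46, 60) bucket. This is how a linear model can follow a non-linear relationship between age and income.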
  • For categorical features (categorical values)
    For small vocabulary: use the raw value
    For linear classifiers, feature crossing is often a useful way to create new features, e.g.


    [Figure: feature crossing]
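As a rough illustration of feature crossing, the sketch below combines two categorical features into a single crossed feature with one slot per value pair. The feature names and vocabularies (`EDUCATION`, `OCCUPATION`) are hypothetical and chosen only for the example; they are not taken from the original figure.

```python
from itertools import product

# Hypothetical small vocabularies, chosen only for illustration.
EDUCATION = ["bachelors", "masters"]
OCCUPATION = ["engineer", "teacher", "nurse"]

# The crossed vocabulary has one entry per (education, occupation) pair.
CROSSED_VOCAB = {pair: i for i, pair in enumerate(product(EDUCATION, OCCUPATION))}


def cross_one_hot(education, occupation):
    """One-hot encode the pair so a linear model can learn one weight per combination."""
    vector = [0] * len(CROSSED_VOCAB)
    vector[CROSSED_VOCAB[(education, occupation)]] = 1
    return vector
```

A linear model over the two separate one-hot features can only add up one weight per education and one per occupation; the crossed feature lets it assign an independent weight to each combination. In TensorFlow, `tf.feature_column.crossed_column` serves this purpose, hashing the pairs into a fixed number of buckets rather than enumerating them.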

For larger vocabulary: use hash or embedding
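Hashing suits cases where a complete vocabulary list cannot be provided, or where a fully connected neural network is being built (it saves memory, but hash collisions add noise to the data), e.g.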

occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation', hash_bucket_size=1080)
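The memory-versus-noise trade-off of hashing can be seen in a small pure-Python sketch. It uses SHA-256 purely as a stand-in for a stable hash (an assumption for illustration; it is not TensorFlow's internal hash function), and the helper names and occupation strings are hypothetical.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a string to a bucket id in [0, num_buckets) using a stable hash."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def bucket_ids(values, num_buckets):
    """Return the bucket id of every value; no vocabulary list is needed up front."""
    return {v: hash_bucket(v, num_buckets) for v in values}
```

Calling `bucket_ids(["engineer", "teacher", "nurse", "lawyer"], 3)` forces at least two occupations into the same bucket, so they end up sharing one weight. That collision is the noise mentioned above; a larger bucket count makes collisions rarer at the cost of more memory.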

Embeddings

Dense vectors vs. one-hot (sparse)
TensorFlow Projector, a website for visualizing embeddings
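The contrast between a sparse one-hot vector and a dense embedding can be sketched in a few lines of plain Python. The vocabulary, the embedding dimension, and the randomly initialized table are all hypothetical; in a real model the table values would be learned during training.

```python
import random

VOCAB = ["engineer", "teacher", "nurse", "lawyer"]  # hypothetical vocabulary
EMBEDDING_DIM = 2  # illustrative size, much smaller than a realistic one

rng = random.Random(0)
# Embedding table: one dense row per vocabulary entry (random here, learned in practice).
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in VOCAB]


def one_hot(word):
    """Sparse encoding: as long as the vocabulary, with a single non-zero entry."""
    vector = [0.0] * len(VOCAB)
    vector[VOCAB.index(word)] = 1.0
    return vector


def embed(word):
    """Dense encoding: look up the word's row in the embedding table."""
    return EMBEDDINGS[VOCAB.index(word)]


def embed_via_matmul(word):
    """Multiplying the one-hot vector by the table selects the same row."""
    x = one_hot(word)
    return [
        sum(x[i] * EMBEDDINGS[i][d] for i in range(len(VOCAB)))
        for d in range(EMBEDDING_DIM)
    ]
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed length of `EMBEDDING_DIM`. The lookup in `embed` is equivalent to the matrix product in `embed_via_matmul`, which is why an embedding can be viewed as a linear layer applied to a one-hot input.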

  • Word Embeddings
    • word2vec
      • skipgram
      • CBOW
    • GloVe
    • Word Regularities
    • Doc2vec
  • Image Embeddings
    • Single layer embeddings
      • DeCAF
      • CNN Features off-the-shelf
    • Studies of transferability
      • Transferability of features
      • Factors of transferability
    • Multiple layer embeddings
      • Full-Network embedding
  • Multimodal Embeddings
    • Introduction
    • Image and Text Multimodal Embeddings
      • Two separate embeddings
      • Pairwise Ranking Loss
      • Available datasets for Image Captioning
      • Applications today
    • Other multimodal combinations
