COMP9313 - Data Curation

Data Curation is a must in any Big Data Project.

Data Curation is the process of identifying which data sources are needed and putting that data in the context of the business, so that business users can interact with it and use it to create their analyses.

Data Quality Dimensions

[Figures: Data Quality Dimensions (1)–(4)]

Data Curation

  • Ingestion
  • Validation
  • Transformation
  • Correction
  • Consolidation
  • Visualization

Ingestion

Obtaining/importing data from a potentially large number of sources.

Streaming/storing data

Main V challenges: Volume, Velocity, Visibility
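
As a concrete illustration, here is a minimal Python sketch of ingesting records from two hypothetical sources (a CSV file and a JSON-lines file) into a single staging area, tagging each record with its provenance. The file names and record layout are assumptions for illustration only, not part of the course material.

```python
import csv
import json

def ingest_csv(path):
    """Yield rows from a CSV source as plain dictionaries."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def ingest_jsonl(path):
    """Yield records from a JSON-lines source (one JSON object per line)."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Hypothetical source files; a real pipeline may also pull from APIs, message queues, etc.
sources = {
    "customers.csv": ingest_csv("customers.csv"),
    "web_events.jsonl": ingest_jsonl("web_events.jsonl"),
}

staging = []
for name, records in sources.items():
    for record in records:
        record["_source"] = name   # keep provenance so later curation steps know where data came from
        staging.append(record)

print(f"Ingested {len(staging)} records from {len(sources)} sources")
```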

Validation

Is this data valid and does it represent true facts?

Is this data valuable for the goals of my big data project?

Main V challenges: Veracity, Value
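
A minimal sketch of rule-based validation, assuming a simple customer-record layout (the field names, rules and countries of interest are illustrative assumptions): one check approximates veracity (are the values plausible?) and one approximates value (is the record relevant to the project's goal?).

```python
from datetime import datetime

def is_valid(record):
    """Veracity check: required fields present and values plausible."""
    try:
        age = int(record["age"])
        datetime.strptime(record["signup_date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        return False
    return 0 <= age <= 120

def is_valuable(record, countries_of_interest=frozenset({"AU", "NZ"})):
    """Value check: is the record relevant to the goals of this project?"""
    return record.get("country") in countries_of_interest

records = [
    {"age": "29", "signup_date": "2023-05-01", "country": "AU"},
    {"age": "-3", "signup_date": "2023-05-01", "country": "AU"},   # implausible age -> fails veracity
    {"age": "41", "signup_date": "2023-06-11", "country": "US"},   # out of scope -> fails value
]

validated = [r for r in records if is_valid(r) and is_valuable(r)]
print(validated)   # only the first record survives both checks
```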

Transformation

  • Schema mapping
    Global schema creation
    Mapping of global-to-local schema
  • Record linkage (see the sketch after this list)
    Same logical entities, different data sources.
    Traditional record linkage -> static/structured records, same schema.
    Record linkage in Big Data -> heterogeneous sources, dynamic and continuously evolving.
  • Data fusion
    Resolving conflicts.
    Finding the truth about the real world -> veracity of data.
  • Main V challenges: Variety, Visibility
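
The sketch below illustrates all three steps on toy data: two hypothetical local schemas are mapped onto a small global schema, records are linked by fuzzy name similarity, and a conflicting city value is fused by majority vote. The field names, similarity threshold and voting rule are illustrative assumptions, not a prescribed method.

```python
from collections import Counter
from difflib import SequenceMatcher

# Schema mapping: each local schema is mapped onto a small global schema {name, city}.
def map_source_a(rec):                     # source A uses "full_name" / "town"
    return {"name": rec["full_name"], "city": rec["town"]}

def map_source_b(rec):                     # source B uses "customer" / "city"
    return {"name": rec["customer"], "city": rec["city"]}

# Record linkage: treat two records as the same entity if their names are similar enough.
def same_entity(r1, r2, threshold=0.85):
    return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio() >= threshold

# Data fusion: resolve conflicting attribute values by majority vote.
def fuse(records, field):
    return Counter(r[field] for r in records).most_common(1)[0][0]

a = map_source_a({"full_name": "Jane  Smith", "town": "Sydney"})
b = map_source_b({"customer": "Jane Smith", "city": "Sidney"})    # misspelled city
c = map_source_b({"customer": "Jane Smith", "city": "Sydney"})

linked = [r for r in (a, b, c) if same_entity(a, r)]
print({"name": fuse(linked, "name"), "city": fuse(linked, "city")})
# -> {'name': 'Jane Smith', 'city': 'Sydney'}  (conflict resolved by majority vote)
```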

Correction

  • Any good big data project needs to satisfy certain quality criteria (garbage in -> garbage out)

  • Main quality dimensions: Free-of-error, believability, objectivity (see the sketch after this list)

  • Main V challenges: Veracity
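
One common way the free-of-error dimension is quantified is as 1 - (number of erroneous records / total records). A minimal sketch, using a hypothetical negative-price rule as the error check:

```python
def free_of_error_rate(records, is_correct):
    """Free-of-error rating: 1 - (number of erroneous records / total records)."""
    if not records:
        return 1.0
    errors = sum(1 for r in records if not is_correct(r))
    return 1 - errors / len(records)

prices = [{"price": 10.0}, {"price": -5.0}, {"price": 7.5}]      # one negative price is an error
print(free_of_error_rate(prices, lambda r: r["price"] >= 0))     # -> 0.666...
```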

Consolidation

  • Schema integration
  • Consolidating data sources (see the sketch after this list)
    Use of synonyms, templates and authoritative tables.
    Incremental improvement of consolidation techniques.
  • Consolidating entities
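
A minimal sketch of value consolidation with an authoritative synonym table; the table entries are invented for illustration, and the table can be extended incrementally as new variants are discovered.

```python
# Hypothetical authoritative table mapping observed variants to canonical values.
SYNONYMS = {
    "ibm corp": "IBM",
    "international business machines": "IBM",
    "unsw sydney": "UNSW",
    "university of new south wales": "UNSW",
}

def consolidate(value):
    """Normalise a value to its canonical form via the synonym table."""
    return SYNONYMS.get(value.strip().lower(), value)   # unknown values pass through unchanged

# Incremental improvement: add newly discovered variants as curation proceeds.
SYNONYMS["i.b.m."] = "IBM"

print([consolidate(v) for v in ["IBM Corp", "University of New South Wales", "Google"]])
# -> ['IBM', 'UNSW', 'Google']
```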

Visualization

  • Data/results visualization

3rd Generation of Data Curation Approach

Focus on scalability and automation
Scalability: 1,000s to 10,000s of data sources.
Automation:
Use of ML and statistics for "low-hanging fruit".
Parallelization is a must (big data).

Data Curation at Scale
Data curation is an ongoing task.
Use of expert sourcing is a must.
Fitting into the organization's ecosystem is a must.
A scheme for finding data sources is a must.

Data curation is an ongoing task
Organizations add new data on a regular basis.
Streams of data arriving all the time.
Data properties and characteristics keep changing.
Mergers: integration, transformation and data fusion may totally change.
Recommendation: Global schema and curation algorithms must be incremental.
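
A minimal sketch of what "incremental" can mean in practice: a deduplicator that keeps a running index of already-curated entities, so each new batch is matched against existing state rather than reprocessing everything seen so far. The blocking key (lower-cased name plus postcode) is an illustrative assumption.

```python
class IncrementalDeduplicator:
    """Keeps a running index of curated entities so that each new batch is matched
    against existing state instead of reprocessing all previously seen data."""

    def __init__(self):
        self.seen = {}                     # blocking key -> canonical record

    def _key(self, record):
        # Hypothetical blocking key: lower-cased name plus postcode.
        return (record["name"].lower(), record["postcode"])

    def add_batch(self, batch):
        new_entities = []
        for record in batch:
            k = self._key(record)
            if k not in self.seen:
                self.seen[k] = record
                new_entities.append(record)
        return new_entities                # only genuinely new entities in this batch

dedup = IncrementalDeduplicator()
dedup.add_batch([{"name": "Jane Smith", "postcode": "2052"}])
new = dedup.add_batch([{"name": "jane smith", "postcode": "2052"},   # already known
                       {"name": "Li Wei", "postcode": "2000"}])      # new entity
print(new)   # -> [{'name': 'Li Wei', 'postcode': '2000'}]
```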

Use of expert sourcing is a must
Cannot always rely on purely automated curation.
Domain expertise required.
Expertise is hierarchical.
Use expertise in a smart way.

Fitting into the organization's ecosystem is a must
Ingest from all kinds of data sources (Variety)
Export to a variety of data sinks (data sharing)
Keep original data sources in situ (data governance)
Access control is a must (privacy and security)
Support for data partners (data sharing)

A scheme for finding data sources is a must
CIOs typically have no idea how many data sources they have.

CIOs typically have no idea how many duplicates they have in their data sources.

Use of templates for common integration problems can be useful here:

  • Procurement optimization
  • Customer data integration
  • Etc

Data Curation Tools

Data Tamer

  • Automated data integration
  • Automatic schema mapping
  • Entity de-duplication
  • Leverages human experts and crowd for integration verification.

ZenCrowd
Named entity to knowledge base linking.
Main goal:

  • Bridge the gap between automated and manual linking.
  • Combine automated linking with human input.

CrowdDB

  • Answers queries that cannot be answered with traditional DB systems or search engines.
  • Uses fuzzy operations with help of humans.
  • Ranks items by relevancy.

Talend

  • Data integration & cleaning
  • Provides Master Data Management (MDM) functionality.
  • Data governance (data catalog, data quality, data stewardship, etc.)

Pentaho Data Integration (Kettle)

  • Data integration
  • Extract-Transform-Load (ETL)
  • Uses dataflow programming
  • Integrates with various storage systems and web services
