COMP9313 - Data Curation

Data Curation is a must in any Big Data Project.

Data Curation is the process of identifying which data sources are needed and putting that data in the context of the business, so that business users can interact with it and use it to create their analyses.

Data Quality Dimensions

[Figures: Data Quality Dimensions (1)–(4)]

Data Curation

  • Ingestion
  • Validation
  • Transformation
  • Correction
  • Consolidation
  • Visualization

Ingestion

Obtaining/importing data from a potentially large number of sources.

Streaming/storing data

Main V challenges: Volume, Velocity, Visibility
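
As a concrete illustration, here is a minimal Python sketch of ingesting records from two hypothetical sources (a CSV file and a JSON-lines file) into a single staging area, tagging each record with its provenance. The file names and record layout are assumptions for illustration only, not part of the course material.

```python
import csv
import json

def ingest_csv(path):
    """Yield rows from a CSV source as plain dictionaries."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def ingest_jsonl(path):
    """Yield records from a JSON-lines source (one JSON object per line)."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Hypothetical source files; a real pipeline may also pull from APIs, message queues, etc.
sources = {
    "customers.csv": ingest_csv("customers.csv"),
    "web_events.jsonl": ingest_jsonl("web_events.jsonl"),
}

staging = []
for name, records in sources.items():
    for record in records:
        record["_source"] = name   # keep provenance so later curation steps know where data came from
        staging.append(record)

print(f"Ingested {len(staging)} records from {len(sources)} sources")
```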

Validation

Is this data valid and does it represent true facts?

Is this data valuable for the goals of my big data project?

Main V challenges: Veracity, Value
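
A minimal sketch of rule-based validation, assuming a simple customer-record layout (the field names, rules and countries of interest are illustrative assumptions): one check approximates veracity (are the values plausible?) and one approximates value (is the record relevant to the project's goal?).

```python
from datetime import datetime

def is_valid(record):
    """Veracity check: required fields present and values plausible."""
    try:
        age = int(record["age"])
        datetime.strptime(record["signup_date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        return False
    return 0 <= age <= 120

def is_valuable(record, countries_of_interest=frozenset({"AU", "NZ"})):
    """Value check: is the record relevant to the goals of this project?"""
    return record.get("country") in countries_of_interest

records = [
    {"age": "29", "signup_date": "2023-05-01", "country": "AU"},
    {"age": "-3", "signup_date": "2023-05-01", "country": "AU"},   # implausible age -> fails veracity
    {"age": "41", "signup_date": "2023-06-11", "country": "US"},   # out of scope -> fails value
]

validated = [r for r in records if is_valid(r) and is_valuable(r)]
print(validated)   # only the first record survives both checks
```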

Transformation

  • Schema mapping
    Global schema creation
    Mapping of global-to-local schema
  • Record linkage (see the sketch after this list)
    Same logical entities, different data sources.
    Traditional record linkage -> static/structured records, same schema.
    Record linkage in Big Data -> heterogeneous sources, dynamic and continuously evolving.
  • Data fusion
    Resolving conflicts.
    Finding the truth about the real world -> veracity of data.
  • Main V challenges: Variety, Visibility
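
The sketch below illustrates all three steps on toy data: two hypothetical local schemas are mapped onto a small global schema, records are linked by fuzzy name similarity, and a conflicting city value is fused by majority vote. The field names, similarity threshold and voting rule are illustrative assumptions, not a prescribed method.

```python
from collections import Counter
from difflib import SequenceMatcher

# Schema mapping: each local schema is mapped onto a small global schema {name, city}.
def map_source_a(rec):                     # source A uses "full_name" / "town"
    return {"name": rec["full_name"], "city": rec["town"]}

def map_source_b(rec):                     # source B uses "customer" / "city"
    return {"name": rec["customer"], "city": rec["city"]}

# Record linkage: treat two records as the same entity if their names are similar enough.
def same_entity(r1, r2, threshold=0.85):
    return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio() >= threshold

# Data fusion: resolve conflicting attribute values by majority vote.
def fuse(records, field):
    return Counter(r[field] for r in records).most_common(1)[0][0]

a = map_source_a({"full_name": "Jane  Smith", "town": "Sydney"})
b = map_source_b({"customer": "Jane Smith", "city": "Sidney"})    # misspelled city
c = map_source_b({"customer": "Jane Smith", "city": "Sydney"})

linked = [r for r in (a, b, c) if same_entity(a, r)]
print({"name": fuse(linked, "name"), "city": fuse(linked, "city")})
# -> {'name': 'Jane Smith', 'city': 'Sydney'}  (conflict resolved by majority vote)
```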

Correction

  • Any good big data project needs to satisfy certain quality criteria (garbage in -> garbage out)

  • Main quality dimensions: Free-of-error, believability, objectivity (see the sketch after this list)

  • Main V challenges: Veracity
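
One common way the free-of-error dimension is quantified is as 1 - (number of erroneous records / total records). A minimal sketch, using a hypothetical negative-price rule as the error check:

```python
def free_of_error_rate(records, is_correct):
    """Free-of-error rating: 1 - (number of erroneous records / total records)."""
    if not records:
        return 1.0
    errors = sum(1 for r in records if not is_correct(r))
    return 1 - errors / len(records)

prices = [{"price": 10.0}, {"price": -5.0}, {"price": 7.5}]      # one negative price is an error
print(free_of_error_rate(prices, lambda r: r["price"] >= 0))     # -> 0.666...
```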

Consolidation

  • Schema integration
  • Consolidating data sources (see the sketch after this list)
    Use of synonyms, templates and authoritative tables.
    Incremental improvement of consolidation techniques.
  • Consolidating entities
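
A minimal sketch of value consolidation with an authoritative synonym table; the table entries are invented for illustration, and the table can be extended incrementally as new variants are discovered.

```python
# Hypothetical authoritative table mapping observed variants to canonical values.
SYNONYMS = {
    "ibm corp": "IBM",
    "international business machines": "IBM",
    "unsw sydney": "UNSW",
    "university of new south wales": "UNSW",
}

def consolidate(value):
    """Normalise a value to its canonical form via the synonym table."""
    return SYNONYMS.get(value.strip().lower(), value)   # unknown values pass through unchanged

# Incremental improvement: add newly discovered variants as curation proceeds.
SYNONYMS["i.b.m."] = "IBM"

print([consolidate(v) for v in ["IBM Corp", "University of New South Wales", "Google"]])
# -> ['IBM', 'UNSW', 'Google']
```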

Visualization

  • Data/results visualization

3rd Generation of Data Curation Approach

Focus on scalability and automation
Scalability: 1,000s to 10,000s of data sources.
Automation:
Use of ML and statistics for "low-hanging fruit".
Parallelization is a must (big data).

Data Curation at Scale
Data curation is an ongoing task.
Use of expert sourcing is a must.
Fitting into the organization's ecosystem is a must.
A scheme for finding data sources is a must.

Data curation is an ongoing task
Organizations add new data on a regular basis.
Streams of data arriving all the time.
Data properties and characteristics keep changing.
Mergers: integration, transformation and data fusion may totally change.
Recommendation: Global schema and curation algorithms must be incremental.
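
A minimal sketch of what "incremental" can mean in practice: a deduplicator that keeps a running index of already-curated entities, so each new batch is matched against existing state rather than reprocessing everything seen so far. The blocking key (lower-cased name plus postcode) is an illustrative assumption.

```python
class IncrementalDeduplicator:
    """Keeps a running index of curated entities so that each new batch is matched
    against existing state instead of reprocessing all previously seen data."""

    def __init__(self):
        self.seen = {}                     # blocking key -> canonical record

    def _key(self, record):
        # Hypothetical blocking key: lower-cased name plus postcode.
        return (record["name"].lower(), record["postcode"])

    def add_batch(self, batch):
        new_entities = []
        for record in batch:
            k = self._key(record)
            if k not in self.seen:
                self.seen[k] = record
                new_entities.append(record)
        return new_entities                # only genuinely new entities in this batch

dedup = IncrementalDeduplicator()
dedup.add_batch([{"name": "Jane Smith", "postcode": "2052"}])
new = dedup.add_batch([{"name": "jane smith", "postcode": "2052"},   # already known
                       {"name": "Li Wei", "postcode": "2000"}])      # new entity
print(new)   # -> [{'name': 'Li Wei', 'postcode': '2000'}]
```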

Use of expert sourcing is a must
Cannot always rely on purely automated curation.
Domain expertise required.
Expertise is hierarchical.
Use expertise in a smart way.

Fitting into the organization's ecosystem is a must
Ingest from all kinds of data sources (Variety)
Export to a variety of data sinks (data sharing)
Keep original data sources in situ (data governance)
Access control is a must (privacy and security)
Support for data partners (data sharing)

A scheme for finding data sources is a must
CIOs typically have no idea how many data sources they have.

CIOs typically have no idea how many duplicates they have in their data sources.

Use of templates for common integration problems can be useful here:

  • Procurement optimization
  • Customer data integration
  • Etc

Data Curation Tools

Data Tamer

  • Automated data integration
  • Automatic schema mapping
  • Entity de-duplication
  • Leverages human experts and crowd for integration verification.

ZenCrowd
Named entity to knowledge base linking.
Main goal:

  • Bridge the gap between automated and manual linking.
  • Combine automated linking with human input.

CrowdDB

  • Answers queries that cannot be answered with traditional DB systems or search engines.
  • Uses fuzzy operations with help of humans.
  • Ranks items by relevancy.

Talend

  • Data integration & cleaning
  • Provides Master Data Management (MDM) functionality.
  • Data governance (data catalog, data quality, data stewardship, etc.)

Pentaho Data Integration (Kettle)

  • Data integration
  • Extract-Transform-Load (ETL)
  • Uses dataflow programming
  • Integrates with various storage systems and web services
