Theme 1: Data Integration
Core Idea: Modern systems often combine multiple specialized tools (databases, caches, search indices) rather than relying on a single monolithic database.
Key Concepts:
Derived Data: Indexes, caches, and materialized views are derived from a system of record and can always be rebuilt from it.
Batch vs. Stream Processing: Batch jobs reprocess complete historical datasets; stream processors apply the same transformations incrementally as events arrive (see the sketch below).
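To make the batch/stream duality concrete, here is a minimal sketch (all function names and the word-count example are illustrative, not from the chapter) in which one pure update function serves both a full batch reprocess and incremental stream updates:

```python
# Hypothetical sketch: one transform shared by batch and stream paths.
# The derived state here is an event-count index.

from collections import Counter

def apply_event(state: Counter, event: str) -> Counter:
    """Pure update function: fold one event into the derived state."""
    state[event] += 1
    return state

def batch_process(events: list[str]) -> Counter:
    """Batch mode: reprocess the complete history from scratch."""
    state = Counter()
    for event in events:
        state = apply_event(state, event)
    return state

def stream_process(state: Counter, new_event: str) -> Counter:
    """Stream mode: apply the same logic incrementally as events arrive."""
    return apply_event(state, new_event)

history = ["click", "view", "click"]
state = batch_process(history)          # full reprocess
state = stream_process(state, "view")   # incremental catch-up
assert state == batch_process(history + ["view"])  # both paths agree
```

Because the same function is folded over events in both modes, the batch path can be rerun at any time to rebuild or audit the streamed state.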
Theme 2: Unbundling Databases
Core Idea: Databases are being “unbundled” into composable components (storage, processing, indexing).
Key Concepts:
Disaggregated Components: Storage, stream processing, and indexes run as separate systems (e.g., a log like Kafka feeding a processor like Flink and an index like Elasticsearch), stitched together by streams of change events (see the fan-out sketch below).
Challenges: There is no shared transaction across components, so ordering, duplication, and failure recovery must be handled explicitly.
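The following toy sketch shows the unbundling idea: one ordered change log fans out to independent components. The in-memory dicts standing in for a key-value store, a cache, and an inverted index are illustrative assumptions:

```python
# Hypothetical sketch of an unbundled "database": a single ordered change
# log consumed, in the same order, by several specialized components.

changelog = [
    {"op": "put", "key": "user:1", "value": "Ada Lovelace"},
    {"op": "put", "key": "user:2", "value": "Alan Turing"},
    {"op": "del", "key": "user:2", "value": None},
]

store, cache, index = {}, {}, {}

for entry in changelog:
    key, value = entry["key"], entry["value"]
    if entry["op"] == "put":
        store[key] = value
        cache[key] = value
        for word in value.lower().split():       # crude inverted index
            index.setdefault(word, set()).add(key)
    elif entry["op"] == "del":
        old = store.pop(key, None)
        cache.pop(key, None)
        if old:
            for word in old.lower().split():
                index.get(word, set()).discard(key)

print(store)   # {'user:1': 'Ada Lovelace'}
print(index)   # 'ada' and 'lovelace' map to {'user:1'}; Turing's entries emptied
```

Because every component applies the same log in the same order, they stay mutually consistent without a cross-component transaction.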
Theme 3: Designing Applications Around Dataflow
Core Idea: Build systems around explicit dataflow models to ensure reliability and evolvability.
Key Concepts:
Immutable Logs: An append-only event log serves as the system of record; consumers derive their own state from it and can be added or rebuilt independently.
Stream Processing: Operators consume the log, update derived state incrementally, and can deterministically replay it from the beginning (a sketch follows).
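Here is a minimal sketch of that decoupling, with an in-memory stand-in for a log like Kafka; the class and attribute names are invented for illustration:

```python
# Hypothetical sketch: an append-only log plus a consumer that tracks its
# own offset. State can always be rebuilt by replaying from offset 0.

class Log:
    """Append-only, immutable event log (in-memory stand-in)."""
    def __init__(self):
        self._events = []

    def append(self, event) -> int:
        self._events.append(event)
        return len(self._events) - 1          # offset of the new event

    def read_from(self, offset: int):
        return self._events[offset:]          # past events never change

class Consumer:
    """Derives state from the log; producers never know it exists."""
    def __init__(self, log: Log):
        self.log, self.offset, self.balance = log, 0, 0

    def catch_up(self):
        for event in self.log.read_from(self.offset):
            self.balance += event             # deterministic fold
            self.offset += 1

log = Log()
for amount in (100, -30, 45):
    log.append(amount)

c = Consumer(log)
c.catch_up()
assert c.balance == 115

rebuilt = Consumer(log)                       # replay from scratch
rebuilt.catch_up()
assert rebuilt.balance == c.balance           # same log, same state
```

The second consumer rebuilds identical state from offset 0, which is exactly what makes reprocessing and recovery cheap in log-centric designs.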
Theme 4: Aiming for Correctness
Core Idea: Prioritize correctness in distributed systems despite inherent challenges.
Key Concepts:
End-to-End Argument: Guarantees at a lower layer (TCP retries, per-node transactions) are not sufficient; deduplication and integrity checks must span the whole path from client to final effect (see the request-ID sketch below).
Enforcing Constraints: Invariants such as uniqueness can be enforced by funneling conflicting requests through a single ordered, partitioned log.
Auditing and Lineage: Verify integrity after the fact by checking derived data against its source events and recording where each result came from.
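The end-to-end argument can be shown in a few lines. In this hypothetical sketch (the function name and in-memory ID set are assumptions; a real system would persist the IDs durably), the client attaches a unique request ID so that a retry at any lower layer cannot cause a double effect:

```python
# Hypothetical sketch of end-to-end idempotence via client request IDs.

import uuid

processed_ids: set[str] = set()   # in practice: a durable store
balance = 0

def transfer(request_id: str, amount: int) -> None:
    global balance
    if request_id in processed_ids:   # the end-to-end check
        return                        # duplicate delivery: ignore
    processed_ids.add(request_id)
    balance += amount

req = str(uuid.uuid4())
transfer(req, 50)
transfer(req, 50)                     # a network retry redelivers the request
assert balance == 50                  # applied exactly once
```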
Theme 5: Doing the Right Thing (Ethics)
Core Idea: Data systems have societal impacts; engineers must address privacy, bias, and transparency.
Key Concepts:
Privacy: Treat collected data as a liability, not just an asset: minimize retention, support deletion, and prefer aggregate techniques such as differential privacy (see the sketch below).
Bias in ML: Models trained on historical data can encode and amplify past discrimination; audit outcomes, not just inputs.
Transparency: Users should be able to find out what data is held about them and how automated decisions affecting them are made.
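Privacy-by-design can be made concrete with a toy differential-privacy example. This is a minimal sketch, not a production mechanism: the function name, the epsilon value, and the sensitivity-1 assumption are illustrative choices, not from the chapter:

```python
# Hypothetical sketch of the Laplace mechanism for a counting query.

import random

def private_count(true_count: int, epsilon: float = 0.5) -> float:
    """Return the count plus Laplace(1/epsilon) noise."""
    # A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    # Laplace(b) can be sampled as the difference of two exponentials
    # with mean b.
    b = 1.0 / epsilon
    noise = random.expovariate(1.0 / b) - random.expovariate(1.0 / b)
    return true_count + noise

print(private_count(1042))  # useful in aggregate, noisy for any individual
```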
| Challenge | Solution |
|---|---|
| Cross-component consistency | Use immutable logs and idempotent processing. |
| Handling large-scale data | Leverage distributed frameworks (Spark, Flink) with horizontal scaling. |
| Ensuring data ethics | Implement privacy-by-design and fairness audits. |
| Schema evolution | Store raw data; reprocess with new schemas. |
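To illustrate the last row of the table, here is a hypothetical sketch of schema evolution by reprocessing: raw events are kept verbatim and a new projection is derived by replaying them, with no in-place migration. All field and function names are invented for the example:

```python
# Hypothetical sketch: evolve a derived schema by replaying raw events.

raw_events = [
    {"name": "Ada Lovelace", "ts": 1},
    {"name": "Alan Turing", "ts": 2},
]

def project_v1(event: dict) -> dict:
    """Original schema: a single full-name field."""
    return {"full_name": event["name"]}

def project_v2(event: dict) -> dict:
    """New schema: split name fields, rebuilt from the same raw events."""
    first, _, last = event["name"].partition(" ")
    return {"first_name": first, "last_name": last}

table_v1 = [project_v1(e) for e in raw_events]   # old derived view
table_v2 = [project_v2(e) for e in raw_events]   # rebuilt under new schema
print(table_v2[0])   # {'first_name': 'Ada', 'last_name': 'Lovelace'}
```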
In sum, Chapter 12 emphasizes integrating specialized tools through derived data, unbundling the database into composable components, designing around dataflow, aiming for end-to-end correctness, and doing the right thing with respect to privacy, bias, and transparency.
Question 1: Data Integration Approaches
Which of the following statements are TRUE about modern data integration strategies?
A. Batch processing is obsolete in favor of real-time stream processing.
B. Deriving datasets across specialized tools reduces vendor lock-in.
C. Unbundling databases requires sacrificing consistency guarantees.
D. Materialized views can enforce cross-system invariants.
E. Event sourcing simplifies reprocessing historical data.
Question 2: Correctness in Distributed Systems
Which techniques are critical for achieving end-to-end correctness in data systems?
A. Relying solely on database transactions for atomicity.
B. Implementing idempotent operations to handle retries.
C. Using checksums for data integrity across pipelines.
D. Assuming timeliness ensures eventual consistency.
E. Performing post-hoc reconciliation of derived data.
Question 3: Stream-Batch Unification
Select valid advantages of unifying batch and stream processing:
A. Simplified codebase using the same logic for both paradigms.
B. Elimination of exactly-once processing requirements.
C. Native support for reprocessing historical data via streams.
D. Reduced latency for real-time analytics.
E. Automatic handling of out-of-order events.
Question 4: Unbundling Databases
Which challenges arise when replacing monolithic databases with specialized components?
A. Increased operational complexity for coordination.
B. Loss of ACID transactions across components.
C. Improved performance for all use cases.
D. Stronger consistency guarantees by default.
E. Higher risk of data duplication and inconsistency.
Question 5: Dataflow-Centric Design
What are key benefits of designing systems around dataflow principles?
A. Tight coupling between storage and computation.
B. Decoupling producers and consumers via immutable logs.
C. Enabling incremental processing of state changes.
D. Simplified debugging due to deterministic workflows.
E. Reduced need for schema evolution.
Question 6: Correctness and Integrity
Which approaches help enforce correctness in derived data pipelines?
A. Using distributed transactions for all writes.
B. Embedding cryptographic hashes in event streams.
C. Periodically validating invariants with offline jobs.
D. Relying on idempotent writes to avoid duplicates.
E. Assuming monotonic processing of events.
Question 7: Timeliness vs. Integrity
Which statements accurately describe trade-offs between timeliness and integrity?
A. Real-time systems prioritize integrity over latency.
B. Batch processing ensures integrity but sacrifices timeliness.
C. Stream processors can enforce integrity with synchronous checks.
D. Approximate algorithms (e.g., HyperLogLog) balance both.
E. Exactly-once semantics eliminate the need for reconciliation.
Question 8: Privacy and Ethics
Which practices align with ethical data system design?
A. Storing raw user data indefinitely for flexibility.
B. Implementing differential privacy in analytics.
C. Using dark patterns to maximize data collection.
D. Providing user-accessible data deletion pipelines.
E. Anonymizing data by removing obvious PII fields.
Question 9: System Models and CAP
Which scenarios demonstrate practical CAP trade-offs?
A. A CP system rejecting writes during network partitions.
B. An AP system serving stale data to ensure availability.
C. A CA system using multi-region synchronous replication.
D. A CP system using leaderless replication with quorums.
E. An AP system employing CRDTs for conflict resolution.
Question 10: Observability and Auditing
Which techniques improve auditability in data-intensive systems?
A. Logging all data transformations with lineage metadata.
B. Using procedural code without declarative configurations.
C. Immutable event logs as the source of truth.
D. Ephemeral storage for intermediate processing results.
E. Periodic sampling instead of full trace collection.
Answer Key
Question 1
Correct Answers: B, D, E. Deriving data across specialized tools loosens dependence on any one vendor (B); materialized views can keep invariants in sync across systems (D); an event log makes reprocessing history straightforward (E). Batch processing (A) remains essential alongside streams.
Question 2
Correct Answers: B, C, E. Idempotent operations make retries safe (B), checksums catch corruption between stages (C), and reconciliation catches whatever still slips through (E); database transactions alone (A) stop at the database boundary.
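To make option C concrete, here is a minimal sketch of checksums guarding a pipeline hop; the message format and function names are assumptions for illustration:

```python
# Hypothetical sketch: attach a checksum at the producer, verify at the sink.

import hashlib
import json

def with_checksum(payload: dict) -> dict:
    body = json.dumps(payload, sort_keys=True).encode()  # canonical form
    return {"payload": payload, "sha256": hashlib.sha256(body).hexdigest()}

def verify(message: dict) -> dict:
    body = json.dumps(message["payload"], sort_keys=True).encode()
    if hashlib.sha256(body).hexdigest() != message["sha256"]:
        raise ValueError("corruption detected between pipeline stages")
    return message["payload"]

msg = with_checksum({"account": "a1", "amount": 50})
assert verify(msg) == {"account": "a1", "amount": 50}
```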
Question 3
Correct Answers: A, C. One codebase can serve both paradigms (A), and a replayable stream doubles as the input for batch-style reprocessing (C); unification does not by itself remove exactly-once requirements (B) or handle out-of-order events automatically (E).
Question 4
Correct Answers: A, B, E. Coordination burden grows (A), cross-component ACID transactions disappear (B), and data copied into multiple systems can drift apart (E).
Question 5
Correct Answers: B, C. Immutable logs decouple producers from consumers (B) and make incremental processing of state changes natural (C); dataflow design favors loose coupling (not A) and still requires schema evolution (not E).
Question 6
Correct Answers: B, C, D. Hash chains make tampering detectable (B), periodic offline checks catch invariant violations (C), and idempotent writes neutralize duplicates (D).
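A toy version of option B follows: a hash chain over an event stream makes after-the-fact tampering detectable, since each entry commits to its predecessor. The "genesis" seed and tuple layout are illustrative choices:

```python
# Hypothetical sketch: a tamper-evident hash chain over events.

import hashlib

def chain(events: list[str]) -> list[tuple[str, str]]:
    prev, out = "genesis", []
    for event in events:
        digest = hashlib.sha256((prev + event).encode()).hexdigest()
        out.append((event, digest))
        prev = digest                 # each entry commits to the last
    return out

def verify(entries: list[tuple[str, str]]) -> bool:
    prev = "genesis"
    for event, digest in entries:
        if hashlib.sha256((prev + event).encode()).hexdigest() != digest:
            return False              # chain broken: tampering detected
        prev = digest
    return True

log = chain(["deposit 100", "withdraw 30"])
assert verify(log)
log[0] = ("deposit 999", log[0][1])   # attacker edits history...
assert not verify(log)                # ...and the chain check exposes it
```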
Question 7
Correct Answers: B, D. Batch processing trades timeliness for integrity (B), while sketches such as HyperLogLog trade exactness for bounded cost on both axes (D); even exactly-once semantics (E) leave room for bugs, so reconciliation stays useful.
Question 8
Correct Answers: B, D. Differential privacy in analytics (B) and working deletion pipelines (D) are design-level protections; merely stripping obvious PII fields (E) is routinely defeated by re-identification.
Question 9
Correct Answers: A, B, E. Refusing writes during a partition is the CP choice (A), serving stale data is the AP choice (B), and CRDTs let AP systems merge conflicting updates (E); a "CA" system with multi-region synchronous replication (C) cannot survive partitions, and leaderless quorums (D) are generally not linearizable.
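Option E can be illustrated with the simplest CRDT, a grow-only counter. The replica names and dict layout are assumptions for the example:

```python
# Hypothetical sketch: a G-counter CRDT. Each replica increments only its
# own slot; merging takes the per-replica maximum, so concurrent updates
# converge without coordination.

def merge(a: dict, b: dict) -> dict:
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

r1 = {"replica-1": 3}            # replica 1 saw three increments
r2 = {"replica-2": 2}            # replica 2 saw two, during a partition
merged = merge(r1, r2)
assert value(merged) == 5        # converges once the partition heals
assert merge(r1, r2) == merge(r2, r1)   # merge is commutative
```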
Question 10
Correct Answers: A, C. Lineage metadata (A) and an immutable event log as the source of truth (C) let any derived result be traced back to its inputs; ephemeral intermediates (D) and sampling (E) discard evidence an audit would need.
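Option A might look like the following sketch, where each transformation records lineage metadata as it runs; the log structure and step names are invented for illustration:

```python
# Hypothetical sketch: record lineage metadata for every transformation.

import time

lineage_log = []   # in practice: a metadata store or data catalog

def transform(name: str, inputs: list[str], fn, data):
    output = fn(data)
    lineage_log.append({
        "step": name,             # which transformation ran
        "inputs": inputs,         # where its data came from
        "output_rows": len(output),
        "ts": time.time(),
    })
    return output

orders = [{"amount": 10}, {"amount": 25}]
big = transform("filter_big_orders", ["raw.orders"],
                lambda rows: [r for r in rows if r["amount"] > 15], orders)
print(lineage_log[0]["step"], "<-", lineage_log[0]["inputs"])
```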
These questions test deep understanding of Chapter 12's central themes: balancing correctness, scalability, and ethics in evolving data systems. The rationales above tie each answer back to real-world tools and design patterns, reinforcing the chapter's key arguments.