Fundamentals of Data Engineering

Plan and Build Robust Data Systems
by Joe Reis and Matt Housley · 2022 · 426 pages

Key Takeaways

1. Data Engineering: The Essential Foundation for Data-Driven Success

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.

Defining the role. Data engineering has emerged as a critical field, building the robust foundation necessary for data science and analytics to thrive in production environments. It's the intersection of various disciplines, including:

  • Security
  • Data management
  • DataOps
  • Data architecture
  • Orchestration
  • Software engineering

Beyond data science. While data science often captures the spotlight, data engineering is its upstream enabler. Data scientists typically spend 70-80% of their time on data collection, cleaning, and preparation, tasks that data engineers are better equipped to handle. By providing a solid data foundation, data engineers free data scientists to focus on advanced analytics and machine learning, maximizing their impact.

Evolution of the field. From its roots in data warehousing (1980s) to the "big data" era (2000s-2010s) with Hadoop and early cloud services, data engineering has constantly evolved. Today, the focus is shifting from managing monolithic frameworks to orchestrating decentralized, modular, and managed cloud services, making the role more about lifecycle management and less about low-level infrastructure hacking.

2. The Data Engineering Lifecycle: A Holistic Framework for Data Management

The data engineering lifecycle shifts the conversation away from technology and toward the data itself and the end goals that it must serve.

Cradle-to-grave framework. The data engineering lifecycle is a comprehensive framework describing the "cradle to grave" journey of data, from its origin to its ultimate consumption. It comprises five core stages:

  • Generation: Data originates from source systems.
  • Storage: Data is persisted throughout the lifecycle.
  • Ingestion: Data is moved from sources to storage.
  • Transformation: Raw data is refined into useful information.
  • Serving: Data is delivered for analysis, ML, or other uses.
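
To make the flow concrete, here is a minimal, purely illustrative sketch of the five stages as plain Python functions; the function names and the in-memory "storage" dictionary are hypothetical, not anything prescribed by the book.

```python
# Illustrative only: each lifecycle stage as a small function.
def generate():
    """Generation: a source system emits raw records."""
    return [{"user_id": 1, "action": "click", "ts": "2024-01-01T00:00:00Z"},
            {"user_id": 2, "action": "view", "ts": "2024-01-01T00:00:05Z"}]

def ingest(records, storage):
    """Ingestion: move raw records from the source into storage."""
    storage.setdefault("raw_events", []).extend(records)

def transform(storage):
    """Transformation: refine raw data into a useful shape."""
    counts = {}
    for event in storage.get("raw_events", []):
        counts[event["action"]] = counts.get(event["action"], 0) + 1
    storage["action_counts"] = counts

def serve(storage):
    """Serving: deliver the transformed data to a downstream consumer."""
    return storage.get("action_counts", {})

storage = {}                      # Storage: persists data across the lifecycle
ingest(generate(), storage)       # Generation -> Ingestion
transform(storage)                # Transformation
print(serve(storage))             # Serving: {'click': 1, 'view': 1}
```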

Undercurrents of success. Supporting these stages are six critical "undercurrents" that permeate the entire lifecycle. These foundational elements ensure that data engineering efforts are robust, secure, and aligned with business objectives:

  • Security: Protecting data at all stages.
  • Data Management: Ensuring data quality, governance, and usability.
  • DataOps: Applying DevOps principles to data for agility and reliability.
  • Data Architecture: Designing scalable and flexible data systems.
  • Orchestration: Coordinating complex data workflows.
  • Software Engineering: Building and maintaining data systems with code.

Beyond linear flow. While presented sequentially, the lifecycle stages are not always a neat, continuous flow. They can repeat, overlap, or occur out of order, reflecting the dynamic nature of real-world data projects. The undercurrents act as a bedrock, ensuring stability and quality across all these dynamic interactions.

3. Designing Good Data Architecture: Balancing Flexibility, Scalability, and Cost

Data architecture is the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.

Strategic blueprint. Data architecture is the strategic blueprint for an organization's data systems, aligning technology with business strategy. Good architecture is agile, flexible, and easily maintainable, evolving in response to changing business needs and new technologies. Key principles include:

  • Flexible and reversible decisions: Prioritizing "two-way doors" that are easy to reverse, reducing risk and enabling rapid iteration.
  • Planning for failure: Designing systems with high availability, reliability, and clear recovery objectives (recovery time objective, RTO; recovery point objective, RPO).
  • Architecting for scalability: Building systems that can dynamically scale up and down to handle varying data loads and save costs.
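
As a quick illustration of planning for failure, the snippet below converts an availability target into an annual downtime budget; the SLO values shown are examples, not figures from the book.

```python
# Convert an availability target into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - slo)
    print(f"{slo:.2%} availability -> about {downtime_minutes:,.0f} minutes of downtime per year")
```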

Leadership and collaboration. Architecture is not a solitary, "ivory tower" activity. It requires strong technical leadership to disseminate choices, mentor engineers, and foster collaboration. Architects must constantly engage with stakeholders to understand business problems and translate them into technical solutions, ensuring that technology serves business goals, not the other way around.

Key architectural concepts. Understanding concepts like domains, services, distributed systems, and coupling (monoliths vs. microservices) is crucial. Modern architectures increasingly favor:

  • Loosely coupled systems: Enabling independent development and deployment of components.
  • Event-driven architectures: Facilitating asynchronous communication and state distribution.
  • FinOps: Integrating financial accountability into cloud spending decisions to optimize cost and value.
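
As a toy illustration of loose coupling and event-driven communication, here is a minimal in-memory publish/subscribe bus; real systems would use a platform such as Kafka or a managed event service, and every name in the sketch is invented.

```python
# A toy in-memory event bus: producers and consumers never reference each other
# directly, which is the essence of loose coupling in an event-driven design.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
bus.subscribe("orders.created", lambda e: print("analytics consumer saw:", e))
bus.subscribe("orders.created", lambda e: print("notification consumer saw:", e))
bus.publish("orders.created", {"order_id": 42, "amount": 19.99})
```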

4. Choosing Technologies: Prioritizing Value, Interoperability, and Future-Proofing

Architecture is strategic; tools are tactical.

Architecture first, tools second. A common pitfall is selecting technologies before defining the architecture. This often leads to a "Dr. Seuss fantasy machine" rather than a coherent data system. Technology choices must always be driven by the architectural blueprint and the value they add to data products and the broader business.

Key considerations for selection:

  • Team size and capabilities: Matching technology complexity to team bandwidth and skill sets, avoiding "cargo-cult engineering."
  • Speed to market: Prioritizing tools that enable rapid delivery of features and data while maintaining quality and security.
  • Interoperability: Ensuring seamless connection and data exchange between different technologies and systems.

Cost and longevity. Cost optimization is paramount, especially in the cloud era. Data engineers must consider:

  • Total Cost of Ownership (TCO): Direct and indirect costs, favoring operational expenses (opex) over capital expenses (capex).
  • Total Opportunity Cost of Ownership (TOCO): The cost of lost opportunities from choosing one technology over others, avoiding "bear traps."
  • Immutable vs. transitory technologies: Building on stable, long-lasting technologies (e.g., object storage, SQL) while being prepared to swap out rapidly evolving ones.

5. Data Generation and Storage: Understanding the Raw Ingredients and Systems

Knowing the use case of the data and the way you will retrieve it in the future is the first step to choosing the proper storage solutions for your data architecture.

Data's origin story. Data generation is the first stage, where data originates from diverse source systems. Data engineers must understand:

  • How data is created: Analog-to-digital conversion or native digital production.
  • Source system types: Files (CSV, JSON), APIs (REST, GraphQL), application databases (OLTP, NoSQL), logs, message queues, and event streams.
  • Data characteristics: Bounded vs. unbounded, frequency, schema evolution, and consistency (ACID vs. BASE).

Storage fundamentals. Storage is the bedrock of the entire lifecycle. Understanding the "raw ingredients" is crucial for making informed decisions:

  • Physical media: Magnetic disks (HDDs), Solid-State Drives (SSDs), and Random Access Memory (RAM), each with distinct cost, latency, and bandwidth profiles.
  • Software elements: Serialization (encoding data for storage/transmission) and compression (reducing data size for efficiency).
  • Caching: Storing frequently accessed data in faster, more expensive layers (cache hierarchy).
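
A small standard-library example of the serialization and compression ingredients mentioned above; the record and the resulting size comparison are purely illustrative.

```python
# Serialize records to JSON bytes, then compress with gzip and compare sizes.
import gzip
import json

record = {"user_id": 123, "event": "page_view", "url": "/pricing", "ts": "2024-01-01T00:00:00Z"}

serialized = json.dumps([record] * 1000).encode("utf-8")   # encode for storage/transmission
compressed = gzip.compress(serialized)                     # reduce size for efficiency

print(f"serialized: {len(serialized):,} bytes")
print(f"compressed: {len(compressed):,} bytes "
      f"({len(compressed) / len(serialized):.1%} of original)")
```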

Storage systems and abstractions. Data engineers work with various storage systems and higher-level abstractions:

  • Distributed storage: Spreading data across multiple servers for scalability, durability, and availability.
  • Object storage: Highly scalable, cost-effective, immutable storage (e.g., S3) forming the backbone of data lakes and cloud data warehouses.
  • Data warehouses/lakes/lakehouses: Evolving abstractions that combine structured query capabilities with flexible, large-scale storage.

6. Efficient Data Ingestion: The Lifeline of Your Data Pipelines

Data ingestion is the process of moving data from one place to another.

The critical transfer. Data ingestion is the process of moving data from source systems into storage, acting as a crucial intermediate step in the data engineering lifecycle. It's often a bottleneck, making efficient design paramount. Key considerations include:

  • Bounded vs. unbounded data: Deciding between batch processing (discrete chunks) or streaming (continuous flow).
  • Frequency: From annual batches to real-time, near-instantaneous event processing.
  • Reliability and durability: Ensuring data is not lost or corrupted during transit, often requiring redundancy and self-healing systems.

Ingestion patterns. Data engineers employ various patterns to move data:

  • Batch ingestion:
    • Snapshot or differential extraction: Capturing full data states or only the changes since the last run (the differential approach is sketched after this list).
    • File-based export: Moving data via files (CSV, Parquet) for security and control.
    • ETL vs. ELT: Extract-Transform-Load vs. Extract-Load-Transform, with ELT gaining popularity in cloud data warehouses.
  • Streaming ingestion:
    • Change Data Capture (CDC): Extracting every change from a database for real-time replication or event streams.
    • Message queues/event-streaming platforms: Handling high-velocity, high-volume event data (e.g., Kafka, Kinesis).
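
As referenced in the list above, here is a minimal sketch of differential (incremental) batch extraction driven by a high watermark; the SQLite source, table, and column names are hypothetical stand-ins for a real source system.

```python
# Differential batch extraction: pull only rows changed since the last watermark.
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for a real source system
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.0, "2024-01-02"), (3, 7.5, "2024-01-03")],
)

def extract_incremental(conn, last_watermark):
    """Return rows updated after the previous run's watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

rows, watermark = extract_incremental(conn, "2024-01-01")
print(rows)       # [(2, 25.0, '2024-01-02'), (3, 7.5, '2024-01-03')]
print(watermark)  # '2024-01-03'
```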

Practical considerations. Beyond patterns, engineers must manage payload characteristics (kind, shape, size, schema), schema evolution (using schema registries), late-arriving data, and error handling (dead-letter queues). Leveraging managed data connectors and object storage for data movement can significantly reduce undifferentiated heavy lifting.
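
A tiny sketch of the dead-letter-queue idea mentioned above: records that fail validation are routed aside for later inspection instead of failing the whole batch. The validation rule and field names are invented for illustration.

```python
# Route bad records to a dead-letter queue instead of failing the entire batch.
incoming = [
    {"user_id": 1, "amount": 9.99},
    {"user_id": None, "amount": 5.00},    # malformed: missing user_id
    {"user_id": 3, "amount": "oops"},     # malformed: non-numeric amount
]

accepted, dead_letter_queue = [], []

for record in incoming:
    if record["user_id"] is not None and isinstance(record["amount"], (int, float)):
        accepted.append(record)
    else:
        dead_letter_queue.append(record)   # park for inspection and replay

print(f"accepted={len(accepted)} dead-lettered={len(dead_letter_queue)}")
```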

7. Queries, Modeling, and Transformation: Unlocking Data's Business Value

Transformations manipulate, enhance, and save data for downstream use, increasing its value in a scalable, reliable, and cost-effective manner.

Making data useful. This stage is where raw data is refined and structured to create business value. It involves:

  • Queries: Retrieving and acting on data using languages like SQL (DDL, DML, DCL, TCL). Understanding query optimizers and performance tuning (joins, pruning, commits, caching) is crucial.
  • Data Modeling: Deliberately structuring data to reflect organizational processes and business logic. This ensures consistency and usability.

Data modeling techniques. Various approaches exist for modeling batch analytical data:

  • Inmon: Highly normalized (3NF) enterprise data warehouse, serving department-specific data marts.
  • Kimball: Dimensional modeling (fact and dimension tables) with star schemas, often denormalized for query performance (a small star-schema example follows this list).
  • Data Vault: Separates business keys (hubs), relationships (links), and attributes (satellites) for agility and auditability.
  • Wide denormalized tables: Common in columnar databases, combining many fields for flexible schemas and faster analytics.
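
As referenced above, a small SQLite example of a Kimball-style star schema: a fact table joined to a dimension table and aggregated for analytics. The tables and data are made up.

```python
# A tiny star schema: one fact table joined to a dimension table, then aggregated.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, sale_date TEXT, amount REAL);

    INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
    INSERT INTO fact_sales VALUES (1, '2024-01-01', 100.0), (2, '2024-01-01', 250.0), (1, '2024-01-02', 50.0);
""")

query = """
    SELECT d.region, SUM(f.amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_customer AS d USING (customer_key)
    GROUP BY d.region
    ORDER BY total_sales DESC
"""
for region, total in conn.execute(query):
    print(region, total)   # AMER 250.0, then EMEA 150.0
```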

Transformations in action. Transformations persist the results of queries and models, often in complex pipelines spanning multiple systems. Key aspects include:

  • Distributed joins: Optimizing joins across clusters (broadcast, shuffle hash).
  • SQL vs. general-purpose code: Choosing the right tool for the job, leveraging SQL for declarative tasks and Spark/Python for complex logic.
  • Update patterns: Managing inserts, deletes, upserts, and schema changes efficiently in columnar systems (an upsert sketch follows this list).
  • Data wrangling: Cleaning and preparing messy data, often with visual tools.
  • Streaming transformations: Continuously processing and enriching data streams, often using dynamic windows and triggers.
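
As referenced in the update-patterns item, a minimal upsert sketch using SQLite's INSERT ... ON CONFLICT clause (SQLite 3.24+); warehouse engines express the same idea with MERGE or similar statements, and the table here is hypothetical.

```python
# Upsert: insert new keys, update existing ones in place.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account_balance (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account_balance VALUES (1, 100.0)")

updates = [(1, 175.0), (2, 40.0)]   # account 1 exists, account 2 is new
conn.executemany(
    """
    INSERT INTO account_balance (account_id, balance) VALUES (?, ?)
    ON CONFLICT(account_id) DO UPDATE SET balance = excluded.balance
    """,
    updates,
)
print(conn.execute("SELECT * FROM account_balance ORDER BY account_id").fetchall())
# [(1, 175.0), (2, 40.0)]
```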

8. Serving Data: Delivering Actionable Insights and Driving Automation

Above all else, trust is the root consideration in serving data; end users need to trust the data they’re receiving.

The ultimate goal. Serving data is the final stage, where processed data is delivered for consumption, driving action and value. Trust is paramount; data must be high-quality, reliable, and consistently available according to agreed-upon SLAs and SLOs. Key considerations include:

  • Use case and user: Understanding who will use the data (analysts, data scientists, executives) and for what purpose (strategic decisions, real-time alerts, ML models).
  • Data products: Designing data outputs that solve specific user "jobs to be done" and provide clear ROI.
  • Self-service vs. guided access: Balancing user empowerment with data governance and quality control.

Serving analytics. Data is served for various types of analytics:

  • Business analytics: Dashboards, reports, and ad hoc analysis for strategic decisions (often batch-driven).
  • Operational analytics: Real-time monitoring and alerts for immediate action (e.g., application performance, factory defects).
  • Embedded analytics: User-facing dashboards within applications, requiring low latency, high performance, and concurrency.

Serving machine learning. Data engineers provide the data foundation for ML applications, often collaborating with data scientists and ML engineers on:

  • Feature engineering: Preparing data for model training.
  • Model training: Supplying data for batch or online learning.
  • Model deployment: Ensuring data is available for real-time predictions.

Reverse ETL. A growing practice involves sending processed data back to source systems (e.g., CRM, ad platforms) to close feedback loops and automate actions, directly impacting business operations.
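
A hedged sketch of the reverse-ETL idea: attributes computed in the warehouse are pushed back into an operational tool through its API. The CRM endpoint, payload fields, and the commented-out HTTP call are hypothetical; a real implementation would add authentication, batching, and retries.

```python
# Reverse ETL sketch: push warehouse-derived attributes back to an operational system.
# The endpoint and payload shape are made up; dry_run avoids any real network call.
import json

def push_segments_to_crm(rows, endpoint="https://crm.example.com/api/contacts", dry_run=True):
    for row in rows:
        payload = {"email": row["email"], "lifetime_value": row["ltv"], "segment": row["segment"]}
        if dry_run:
            print(f"POST {endpoint} {json.dumps(payload)}")
        else:
            # e.g. requests.post(endpoint, json=payload, timeout=10)  # needs the requests package
            raise NotImplementedError("wire up the real CRM client here")

warehouse_rows = [
    {"email": "a@example.com", "ltv": 1200.0, "segment": "high_value"},
    {"email": "b@example.com", "ltv": 80.0, "segment": "standard"},
]
push_segments_to_crm(warehouse_rows)
```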

9. Security and Privacy: Non-Negotiable Pillars Across the Data Lifecycle

Security is vital to the practice of data engineering. This should be blindingly obvious, but we’re constantly amazed at how often data engineers view security as an afterthought.

Pervasive importance. Security and privacy are not optional add-ons but fundamental requirements at every stage of the data engineering lifecycle. A single breach can be catastrophic, leading to financial penalties, reputational damage, and loss of trust. Data engineers must adopt a "security-first" mindset.

People are the weakest link. Human factors are the most common cause of security compromises. Data engineers must:

  • Practice "negative thinking": Actively anticipate and plan for potential security threats and data leaks.
  • Be paranoid: Exercise extreme caution with credentials and sensitive information, verifying all requests.
  • Foster a security culture: Make security a habit, not just a compliance checklist, through regular training and communication.

Robust processes. Effective security relies on clear, actionable processes:

  • Principle of least privilege: Granting users and systems only the minimum access required for their tasks, and revoking it when no longer needed.
  • Shared responsibility model: Understanding that in the cloud, while providers secure the infrastructure, users are responsible for securing their data and applications.
  • Regular backups: Essential for disaster recovery and protection against ransomware attacks.

Leveraging technology. Technology reinforces security processes:

  • Patching and updates: Keeping all systems and software current to mitigate known vulnerabilities.
  • Encryption: Implementing encryption at rest (storage) and in transit (network) as a baseline defense.
  • Logging, monitoring, and alerting: Proactively detecting unusual access patterns, resource spikes, or billing anomalies that may indicate a breach.
  • Network access control: Strictly limiting inbound and outbound connections to only what is necessary, avoiding broad public access.
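
As a baseline illustration of encryption at rest, a minimal sketch using symmetric encryption; it assumes the third-party cryptography package is installed, and real deployments would lean on managed key services and TLS for data in transit rather than hand-rolled code.

```python
# Symmetric encryption of a payload before it is written to storage.
# Assumes: pip install cryptography. Key management is out of scope here;
# in practice, keys live in a managed KMS or secrets manager, never in code.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in production, fetch this from a secrets manager
cipher = Fernet(key)

plaintext = b'{"user_id": 123, "ssn": "000-00-0000"}'
ciphertext = cipher.encrypt(plaintext)        # safe to persist to disk or object storage
restored = cipher.decrypt(ciphertext)         # only holders of the key can read it

assert restored == plaintext
print(ciphertext[:40], b"...")
```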

10. The Future of Data Engineering: Simplified, Interoperable, and Live

The data engineering lifecycle isn’t going away anytime soon.

Evolving, not disappearing. Despite the rise of increasingly simple tools, the data engineering role is not diminishing; it's evolving. As data becomes more central to business, data engineers will move up the value chain, focusing on higher-level architectural design, data governance, and strategic problem-solving, rather than low-level infrastructure management.

The cloud-scale data OS. The future will see data engineering coalesce around a "cloud-scale data operating system." This involves:

  • Simplified tools: Continued reduction in complexity and increased functionality of data tools, making them accessible to more companies.
  • Improved interoperability: Standardization of data APIs, file formats (Parquet, Avro), and metadata catalogs (replacing Hive Metastore) to enable seamless data exchange across services and clouds.
  • Enhanced orchestration: Next-generation platforms (e.g., Airflow, Dagster, Prefect) integrating data cataloging, lineage, Infrastructure as Code (IaC), and CI/CD for automated pipeline deployment and monitoring.
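
To ground the orchestration point, a minimal sketch of a pipeline definition in one such platform; it assumes Apache Airflow 2.x is installed, and the task callables are placeholders rather than real pipeline steps.

```python
# A minimal Airflow DAG sketch: ingest -> transform -> serve, run daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull from source systems into object storage")

def transform():
    print("refine raw data into models the business can use")

def serve():
    print("refresh downstream tables and dashboards")

with DAG(
    dag_id="daily_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    serve_task = PythonOperator(task_id="serve", python_callable=serve)

    ingest_task >> transform_task >> serve_task
```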

"Enterprisey" data engineering. The field will embrace "enterprisey" practices, focusing on robust data management, DataOps, and governance. This means:

  • Data quality and reliability: Proactive monitoring and automation to ensure data integrity.
  • Financial accountability (FinOps): Optimizing cloud spending and demonstrating clear ROI for data initiatives.
  • Ethical and privacy compliance: Integrating legal and ethical considerations into data handling by design.

The "Live Data Stack." The ultimate evolution moves beyond the "Modern Data Stack" (MDS) to a "Live Data Stack." This paradigm shift will:

  • Prioritize real-time data: Replacing batch-centric analytics with streaming-first approaches for immediate action and automation.
  • Fuse operational and analytical data: Blending traditional software applications with real-time analytics and machine learning to power entire businesses and applications with minimal latency.
  • Democratize sophistication: Making the advanced real-time data capabilities currently exclusive to large tech companies accessible to all.


Review Summary

4.18 out of 5
Average of 500+ ratings from Goodreads and Amazon.

Fundamentals of Data Engineering receives mostly positive reviews for providing a comprehensive overview of data engineering concepts, though some find it repetitive and lacking depth. Readers appreciate its focus on high-level principles and business context rather than specific tools. The book is praised for its structure and coverage of the data engineering lifecycle. It's recommended for those new to the field or seeking to broaden their perspective, but may be less useful for experienced practitioners looking for detailed technical information.

About the Author

Joe Reis is a data engineering expert and author. He co-wrote Fundamentals of Data Engineering to provide a comprehensive overview of the field, drawing on his extensive experience in the industry. Reis focuses on teaching fundamental concepts and best practices rather than specific tools, aiming to create a resource that remains relevant despite rapid technological changes. His approach emphasizes the importance of understanding the broader context of data engineering within business operations. Reis's work is appreciated for its pragmatic tone and focus on real-world applications of data engineering principles.
