Introduction
Query engines are the invisible workhorses powering modern data infrastructure. Every time you run a SQL query against a database, execute a Spark job, or query a data lake, a query engine transforms your high-level request into an efficient execution plan. Understanding how query engines work gives you insight into one of the most important abstractions in computing.
This book takes a hands-on approach to demystifying query engines. Rather than surveying existing systems, we will build a fully functional query engine from scratch, covering each component in enough depth that you could implement your own.
Who This Book Is For
This book is for software engineers who want to understand the internals of query engines. You might be:
- A data engineer who wants to understand why queries perform the way they do
- A database developer looking to learn foundational concepts
- A software engineer curious about compiler-like systems
- Someone building tooling that needs to parse or analyze SQL
Basic programming knowledge is assumed. The examples use Kotlin, chosen for its conciseness, but the concepts apply to any language.
What You Will Learn
By the end of this book, you will understand how to:
- Design a columnar type system using Apache Arrow
- Build data source connectors for CSV and Parquet files
- Represent queries as logical and physical plans
- Create a DataFrame API for building queries programmatically
- Translate logical plans into executable physical plans
- Implement query optimizations like projection and predicate push-down
- Parse SQL and convert it to query plans
- Execute queries in parallel across multiple CPU cores
- Design distributed query execution across a cluster
How This Book Is Organized
The book follows the natural architecture of a query engine, building each layer on top of the previous.
Chapters 1 through 4 cover the foundations. We start with what a query engine is, then introduce Apache Arrow as the memory model, a type system for representing data, and data source abstractions for reading files.
Chapters 5 through 7 cover query representation. We define logical plans and expressions to represent queries abstractly, build a DataFrame API for constructing plans programmatically, and add SQL support so that queries can also be written in a familiar query language.
Chapters 8 through 10 cover execution. We translate logical plans into physical plans containing executable code, then cover joins and subqueries, two of the most complex operations in query processing.
Chapters 11 through 13 cover planning and optimization. We implement a query planner to automate the translation from logical to physical plans, build optimizer rules to transform plans into more efficient forms, and execute queries to compare performance.
Chapters 14 and 15 cover scaling. We extend the engine to execute queries in parallel across CPU cores, then across distributed clusters.
Chapters 16 and 17 cover quality. We discuss testing strategies, including fuzzing, and benchmarking approaches for measuring performance.
This book is also available for purchase in ePub, MOBI, and PDF formats from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.