Introduction

Query engines are the invisible workhorses powering modern data infrastructure. Every time you run a SQL query against a database, execute a Spark job, or query a data lake, a query engine transforms your high-level request into an efficient execution plan and runs it. Understanding how query engines work gives you insight into one of the most important abstractions in computing.

This book takes a hands-on approach to demystifying query engines. Rather than surveying existing systems, we will build a fully functional query engine from scratch, covering each component in enough depth that you could implement your own.

Who This Book Is For

This book is for software engineers who want to understand the internals of query engines. You might be:

A data engineer who wants to understand why queries perform the way they do
A database developer looking to learn foundational concepts
A software engineer curious about compiler-like systems
Someone building tooling that needs to parse or analyze SQL

Basic programming knowledge is assumed. The examples use Kotlin, chosen for its conciseness, but the concepts apply to any language.

What You Will Learn

By the end of this book, you will understand how to:

Design a columnar type system using Apache Arrow
Build data source connectors for CSV and Parquet files
Represent queries as logical and physical plans
Create a DataFrame API for building queries programmatically
Translate logical plans into executable physical plans
Implement query optimizations like projection and predicate push-down
Parse SQL and convert it to query plans
Execute queries in parallel across multiple CPU cores
Design distributed query execution across a cluster

How This Book Is Organized

The book follows the natural architecture of a query engine, building each layer on top of the previous.

The first four chapters cover the foundations: What is a Query Engine?, Apache Arrow, Type System, and Data Sources. We start by defining what a query engine is, then introduce Apache Arrow as the memory model, a type system for representing data, and data source abstractions for reading files.

The next three chapters cover query representation: Logical Plans, DataFrames, and SQL Support. We define logical plans and expressions to represent queries abstractly, build a DataFrame API for constructing plans programmatically, and add SQL support so queries can be written in the familiar query language.
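To make the idea of query representation concrete, here is a minimal sketch of a logical plan tree and a DataFrame-style builder on top of it. All names here (LogicalPlan, Scan, Filter, Projection, DataFrame, format) are hypothetical and chosen for illustration, and predicates are plain strings rather than real expressions. It is not the API developed in later chapters.

```kotlin
// Illustrative sketch only: every name below is hypothetical.

// A logical plan describes what to compute, not how to compute it.
sealed interface LogicalPlan

data class Scan(val table: String) : LogicalPlan
data class Filter(val input: LogicalPlan, val predicate: String) : LogicalPlan
data class Projection(val input: LogicalPlan, val columns: List<String>) : LogicalPlan

// A DataFrame wraps a plan. Each call returns a new DataFrame whose plan
// has one more node on top of the previous plan.
class DataFrame(val plan: LogicalPlan) {
    fun filter(predicate: String) = DataFrame(Filter(plan, predicate))
    fun select(vararg columns: String) = DataFrame(Projection(plan, columns.toList()))
}

// Render the plan as an indented tree, root first.
fun format(plan: LogicalPlan, indent: Int = 0): String {
    val pad = "  ".repeat(indent)
    return when (plan) {
        is Scan -> pad + "Scan: " + plan.table + "\n"
        is Filter -> pad + "Filter: " + plan.predicate + "\n" +
            format(plan.input, indent + 1)
        is Projection -> pad + "Projection: " + plan.columns.joinToString(", ") + "\n" +
            format(plan.input, indent + 1)
    }
}

fun main() {
    val df = DataFrame(Scan("employee"))
        .filter("state = 'CO'")
        .select("id", "name")
    print(format(df.plan))
}
```

Running this prints the projection at the root, the filter beneath it, and the scan as the leaf. Nothing is executed: the builder only assembles a description of the query, which is the point of a logical plan.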

The Physical Plans, Query Planning, and Joins chapters cover execution. We translate logical plans into physical plans containing executable code, implement a query planner to automate that translation, then cover joins, one of the most complex operations in query processing.

The Subqueries, Query Optimizers, and Query Execution chapters continue with more advanced topics. We handle subqueries, build optimizer rules to transform plans into more efficient forms, and execute queries to compare performance.

The Parallel Query Execution and Distributed Query Execution chapters cover scaling. We extend the engine to execute queries in parallel across CPU cores, then across distributed clusters.

The Testing and Benchmarks chapters address quality: testing strategies, including fuzzing, and benchmarking approaches for measuring performance.

This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
