Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Logical Plans and Expressions

The source code discussed in this chapter can be found in the logical-plan module of the KQuery project.

When you write a SQL query, you describe what data you want. The query engine must figure out how to get it. The first step is building a logical plan: a tree structure that represents the computation without specifying exactly how to execute it.

Consider this query:

SELECT name, salary * 1.1 AS new_salary
FROM employees
WHERE department = 'Engineering'

Before we can execute this, we need a structured representation. The logical plan captures:

  • Read from the employees table
  • Keep only rows where department = 'Engineering'
  • Compute name and salary * 1.1 for each remaining row

This chapter covers how to represent these operations as a tree of logical plans and expressions.

Why Separate Logical from Physical?

We could jump straight from SQL to execution, but separating logical and physical planning has advantages:

  1. Validation: We can check that columns exist and types are compatible before doing any work
  2. Optimization: We can transform the logical plan to make it more efficient
  3. Flexibility: The same logical plan might execute differently depending on data size or available resources

A logical plan says “filter rows where X” without specifying whether to use an index, a hash table, or a sequential scan. Those are physical execution details decided later.

The LogicalPlan Interface

A logical plan represents a relation: a set of rows with a known schema. Each plan can have child plans as inputs, forming a tree.

interface LogicalPlan {

  / Returns the schema of the data that will be produced by this logical plan. */
  fun schema(): Schema

  /
   * Returns the children (inputs) of this logical plan. This method is used to
   * enable use of the visitor pattern to walk a query tree.
   */
  fun children(): List<LogicalPlan>
}

schema() returns the output schema, the columns and their types that this plan produces. This is essential for validation. If a later plan references a column, we can check that it exists in the input schema.

children() returns the input plans. A scan has no children (it reads from a data source). A filter has one child (its input). A join has two children (left and right inputs). This method enables walking the plan tree.

Printing Logical Plans

Debugging query engines requires seeing what the plan looks like. We print plans as indented trees where children are nested under parents:

fun format(plan: LogicalPlan, indent: Int = 0): String {
  val b = StringBuilder()
  0.until(indent).forEach { b.append("\t") }
  b.append(plan.toString()).append("\n")
  plan.children().forEach { b.append(format(it, indent + 1)) }
  return b.toString()
}

Our example query might print as:

Projection: #name, #salary * 1.1 AS new_salary
  Filter: #department = 'Engineering'
    Scan: employees; projection=None

Read this bottom-up: scan the employees table, filter to Engineering, project the columns we want.

Logical Expressions

Plans describe data flow. Expressions describe computations within a plan. A filter plan contains an expression that evaluates to true or false for each row. A projection plan contains expressions that compute output columns.

Expressions can be simple (a column reference, a literal value) or complex (nested arithmetic, function calls). Here are common expression types:

Expression TypeExamples
Literal Value"hello", 12.34, true
Column Referenceuser_id, first_name, salary
Math Expressionsalary * 0.1, price + tax
Comparison Expressionage >= 21, status != 'inactive'
Boolean Expressionage >= 21 AND country = 'US'
Aggregate ExpressionMIN(salary), MAX(salary), SUM(amount), COUNT(*)
Scalar FunctionUPPER(name), CONCAT(first_name, ' ', last_name)
Aliased Expressionsalary * 1.1 AS new_salary

Expressions form trees. The expression (a + b) * c has a multiply at the root with two children: an add expression (with children a and b) and a column reference c.

The LogicalExpr Interface

During planning, we need to know what type of value an expression produces. If you write a + b, that is only valid if both columns are numeric. The interface captures this:

interface LogicalExpr {

  /
   * Return meta-data about the value that will be produced by this expression
   * when evaluated against a particular input.
   */
  fun toField(input: LogicalPlan): Field
}

The toField method returns the name and data type of the expression’s output. It takes the input plan because some expressions depend on the input schema. A column reference has the type of whatever column it references. A comparison expression always returns boolean regardless of its inputs.

Column Expressions

The simplest expression references a column by name:

class Column(val name: String) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return input.schema().fields.find { it.name == name }
        ?: throw SQLException("No column named '$name'")
  }

  override fun toString(): String {
    return "#$name"
  }
}

The toField implementation looks up the column in the input schema. If it does not exist, that is an error, which we catch during planning rather than execution.

The # prefix in toString is a convention to distinguish column references from literal strings when printing plans.

Literal Expressions

Expressions like salary * 1.1 need to represent the literal value 1.1. We need literal expressions for each data type:

class LiteralString(val str: String) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(str, ArrowTypes.StringType)
  }

  override fun toString(): String {
    return "'$str'"
  }
}

class LiteralLong(val n: Long) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(n.toString(), ArrowTypes.Int64Type)
  }

  override fun toString(): String {
    return n.toString()
  }
}

class LiteralDouble(val n: Double) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(n.toString(), ArrowTypes.DoubleType)
  }

  override fun toString(): String {
    return n.toString()
  }
}

Literal expressions do not depend on the input plan since their type is fixed.

Binary Expressions

Most operators take two inputs: comparison (=, <, >), boolean logic (AND, OR), and arithmetic (+, -, *, /). We can share structure across these:

abstract class BinaryExpr(
    val name: String,
    val op: String,
    val l: LogicalExpr,
    val r: LogicalExpr
) : LogicalExpr {

  override fun toString(): String {
    return "$l $op $r"
  }
}

The name identifies the expression type. The op is the operator symbol for printing. The l and r are the left and right operands.

Comparison and Boolean Expressions

Comparisons and boolean operators always produce boolean results:

abstract class BooleanBinaryExpr(
    name: String,
    op: String,
    l: LogicalExpr,
    r: LogicalExpr
) : BinaryExpr(name, op, l, r) {

  override fun toField(input: LogicalPlan): Field {
    return Field(name, ArrowTypes.BooleanType)
  }
}

// Comparisons
class Eq(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("eq", "=", l, r)
class Neq(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("neq", "!=", l, r)
class Gt(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("gt", ">", l, r)
class GtEq(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("gteq", ">=", l, r)
class Lt(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("lt", "<", l, r)
class LtEq(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("lteq", "<=", l, r)

// Boolean logic
class And(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("and", "AND", l, r)
class Or(l: LogicalExpr, r: LogicalExpr) : BooleanBinaryExpr("or", "OR", l, r)

Math Expressions

In KQuery, arithmetic expressions preserve the type of the left operand. This is a simplified approach. A production system would handle type promotion):

abstract class MathExpr(
    name: String,
    op: String,
    l: LogicalExpr,
    r: LogicalExpr
) : BinaryExpr(name, op, l, r) {

  override fun toField(input: LogicalPlan): Field {
    return Field(name, l.toField(input).dataType)
  }
}

class Add(l: LogicalExpr, r: LogicalExpr) : MathExpr("add", "+", l, r)
class Subtract(l: LogicalExpr, r: LogicalExpr) : MathExpr("subtract", "-", l, r)
class Multiply(l: LogicalExpr, r: LogicalExpr) : MathExpr("mult", "*", l, r)
class Divide(l: LogicalExpr, r: LogicalExpr) : MathExpr("div", "/", l, r)
class Modulus(l: LogicalExpr, r: LogicalExpr) : MathExpr("mod", "%", l, r)

Aggregate Expressions

Aggregates reduce multiple rows to a single value: SUM, MIN, MAX, AVG, COUNT. They appear in aggregate plans (covered later) and have special semantics.

abstract class AggregateExpr(
    val name: String,
    val expr: LogicalExpr
) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(name, expr.toField(input).dataType)
  }

  override fun toString(): String {
    return "$name($expr)"
  }
}

class Sum(input: LogicalExpr) : AggregateExpr("SUM", input)
class Min(input: LogicalExpr) : AggregateExpr("MIN", input)
class Max(input: LogicalExpr) : AggregateExpr("MAX", input)
class Avg(input: LogicalExpr) : AggregateExpr("AVG", input)

Most aggregates return the same type as their input. COUNT is different since it always returns an integer:

class Count(input: LogicalExpr) : AggregateExpr("COUNT", input) {

  override fun toField(input: LogicalPlan): Field {
    return Field("COUNT", ArrowTypes.Int32Type)
  }
}

Aliased Expressions

SQL’s AS keyword renames an expression’s output:

class Alias(val expr: LogicalExpr, val alias: String) : LogicalExpr {

  override fun toField(input: LogicalPlan): Field {
    return Field(alias, expr.toField(input).dataType)
  }

  override fun toString(): String {
    return "$expr as $alias"
  }
}

The alias changes the name but preserves the type.

Logical Plans

With expressions defined, we can build the plans that use them.

Scan

Scan reads from a data source. It is the leaf node in every query tree, the place where data enters the plan.

class Scan(
    val path: String,
    val dataSource: DataSource,
    val projection: List<String>
) : LogicalPlan {

  val schema = deriveSchema()

  override fun schema(): Schema {
    return schema
  }

  private fun deriveSchema(): Schema {
    val schema = dataSource.schema()
    if (projection.isEmpty()) {
      return schema
    } else {
      return schema.select(projection)
    }
  }

  override fun children(): List<LogicalPlan> {
    return listOf()
  }

  override fun toString(): String {
    return if (projection.isEmpty()) {
      "Scan: $path; projection=None"
    } else {
      "Scan: $path; projection=$projection"
    }
  }
}

The projection parameter lists which columns to read. If empty, read all columns. This optimization matters because reading fewer columns means less I/O and less memory.

Selection (Filter)

Selection keeps only rows where an expression evaluates to true. This corresponds to SQL’s WHERE clause.

class Selection(
    val input: LogicalPlan,
    val expr: LogicalExpr
) : LogicalPlan {

  override fun schema(): Schema {
    return input.schema()  // filtering doesn't change the schema
  }

  override fun children(): List<LogicalPlan> {
    return listOf(input)
  }

  override fun toString(): String {
    return "Filter: $expr"
  }
}

The schema passes through unchanged since filtering only removes rows, not columns.

Projection

Projection computes new columns from expressions. This corresponds to SQL’s SELECT list.

class Projection(
    val input: LogicalPlan,
    val expr: List<LogicalExpr>
) : LogicalPlan {

  override fun schema(): Schema {
    return Schema(expr.map { it.toField(input) })
  }

  override fun children(): List<LogicalPlan> {
    return listOf(input)
  }

  override fun toString(): String {
    return "Projection: ${expr.map { it.toString() }.joinToString(", ")}"
  }
}

The output schema comes from the expressions. If you project name and salary * 1.1 AS bonus, the output schema has two columns with those names and appropriate types.

Aggregate

Aggregate groups rows and computes aggregate functions. This corresponds to SQL’s GROUP BY with aggregate functions.

class Aggregate(
    val input: LogicalPlan,
    val groupExpr: List<LogicalExpr>,
    val aggregateExpr: List<AggregateExpr>
) : LogicalPlan {

  override fun schema(): Schema {
    return Schema(
      groupExpr.map { it.toField(input) } +
      aggregateExpr.map { it.toField(input) }
    )
  }

  override fun children(): List<LogicalPlan> {
    return listOf(input)
  }

  override fun toString(): String {
    return "Aggregate: groupExpr=$groupExpr, aggregateExpr=$aggregateExpr"
  }
}

The output schema has the grouping columns first, followed by the aggregate results. For SELECT department, AVG(salary) FROM employees GROUP BY department, the output has two columns: department and AVG(salary).

Join

Join combines rows from two inputs based on a condition. Unlike the plans we have seen so far, joins have two children: a left input and a right input.

class Join(
    val left: LogicalPlan,
    val right: LogicalPlan,
    val joinType: JoinType,
    val condition: LogicalExpr
) : LogicalPlan {

  override fun schema(): Schema {
    return Schema(left.schema().fields + right.schema().fields)
  }

  override fun children(): List<LogicalPlan> {
    return listOf(left, right)
  }

  override fun toString(): String {
    return "Join: type=$joinType, condition=$condition"
  }
}

enum class JoinType {
  INNER, LEFT, RIGHT, FULL, SEMI, ANTI
}

The output schema combines columns from both inputs. The join type determines which rows appear in the output: inner joins return only matching rows, outer joins include unmatched rows with nulls, and so on.

Joins are fundamental to relational queries but come with significant complexity: multiple join types, various join algorithms with different performance characteristics, and optimization challenges. The Joins chapter covers these topics in depth.

Putting It Together

Here is how our example query becomes a logical plan:

SELECT name, salary * 1.1 AS new_salary
FROM employees
WHERE department = 'Engineering'

Building bottom-up:

val scan = Scan("employees", employeeDataSource, listOf())

val filter = Selection(
    scan,
    Eq(Column("department"), LiteralString("Engineering"))
)

val project = Projection(
    filter,
    listOf(
        Column("name"),
        Alias(Multiply(Column("salary"), LiteralDouble(1.1)), "new_salary")
    )
)

Printed:

Projection: #name, #salary * 1.1 as new_salary
  Filter: #department = 'Engineering'
    Scan: employees; projection=None

This logical plan can now be validated (do the columns exist?), optimized (can we push the projection into the scan?), and eventually converted to a physical plan for execution.

Serialization

Query plans sometimes need to be serialized: sent across a network, stored for later, or passed between systems in different languages.

Options include language-specific serialization (JSON with Jackson in Java, kotlinx.serialization in Kotlin) or language-agnostic formats like Protocol Buffers or Avro.

A newer standard called Substrait aims to provide cross-language serialization for relational algebra. This is exciting because it enables mixing components: use Apache Calcite for query planning in Java, serialize to Substrait, execute in a Rust or C++ engine. If you are building a query engine today, Substrait is worth investigating.

This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work

Copyright © 2020-2025 Andy Grove. All rights reserved.