Type System
The source code discussed in this chapter can be found in the datatypes module of the KQuery project.
Every query engine needs a type system—a way to represent and reason about data types. When you write SELECT price * quantity, the engine must know that price and quantity are numbers that can be multiplied, and what type the result will be. This chapter builds KQuery’s type system on top of Apache Arrow.
Why Types Matter
Consider this query:
```sql
SELECT name, salary * 1.1 AS new_salary
FROM employees
WHERE department = 'Engineering'
```
Before executing, the query engine must answer:
- Is `salary` a numeric type that supports multiplication?
- What’s the result type of `salary * 1.1`? (If `salary` is an integer, should the result be an integer or a floating point?)
- Is `department` a string type that supports equality comparison?
- What columns and types does the result have?
These questions arise during query planning, before we touch any data. A well-designed type system catches errors early (“you can’t multiply a string by a number”) and enables optimizations (“this column is never null, so skip null checks”).
Building on Arrow
Rather than invent our own type system, we’ll use Apache Arrow’s types directly. This gives us:
- A rich set of standard types (integers, floats, strings, dates, etc.)
- Efficient in-memory representation (as covered in the previous chapter)
- Compatibility with Arrow-based tools and file formats
Arrow’s type system includes:
| Category | Types |
|---|---|
| Boolean | Bool |
| Integers | Int8, Int16, Int32, Int64 (signed and unsigned) |
| Floating point | Float32, Float64 |
| Text | Utf8 (variable-length strings) |
| Binary | Binary (variable-length bytes) |
| Temporal | Date32, Date64, Timestamp, Time32, Time64, Duration |
| Nested | List, Struct, Map |
For KQuery, we’ll focus on the common types: booleans, integers, floating point numbers, and strings. We define convenient constants for these:
```kotlin
object ArrowTypes {
    val BooleanType = ArrowType.Bool()
    val Int8Type = ArrowType.Int(8, true)
    val Int16Type = ArrowType.Int(16, true)
    val Int32Type = ArrowType.Int(32, true)
    val Int64Type = ArrowType.Int(64, true)
    val UInt8Type = ArrowType.Int(8, false)
    val UInt16Type = ArrowType.Int(16, false)
    val UInt32Type = ArrowType.Int(32, false)
    val UInt64Type = ArrowType.Int(64, false)
    val FloatType = ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
    val DoubleType = ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    val StringType = ArrowType.Utf8()
}
```
Schemas and Fields
A schema describes the structure of a dataset: what columns exist and what type each has. Schemas are essential metadata that flows through the entire query engine.
Arrow represents schemas as a list of fields, where each field has:
- A name (string)
- A data type (ArrowType)
- A nullable flag (can this column contain nulls?)
For example, an employee table might have this schema:
```
Schema:
- id: Int32, not nullable
- name: Utf8, nullable
- department: Utf8, nullable
- salary: Float64, not nullable
```
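As a sketch of what this looks like in code, the schema above can be built directly with Arrow's `Field`, `FieldType`, and `Schema` classes from `org.apache.arrow.vector.types.pojo` (KQuery itself layers its own abstractions on top of these):

```kotlin
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.arrow.vector.types.pojo.Field
import org.apache.arrow.vector.types.pojo.FieldType
import org.apache.arrow.vector.types.pojo.Schema

// Each field pairs a name and a data type with a nullable flag.
val employeeSchema = Schema(listOf(
    Field("id", FieldType.notNullable(ArrowType.Int(32, true)), null),
    Field("name", FieldType.nullable(ArrowType.Utf8()), null),
    Field("department", FieldType.nullable(ArrowType.Utf8()), null),
    Field("salary", FieldType.notNullable(ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)), null)
))
```

The `FieldType.nullable` and `FieldType.notNullable` factory methods capture the nullable flag alongside the type.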
Schemas serve multiple purposes:
- Validation: Reject queries that reference non-existent columns or misuse types
- Planning: Determine output types for expressions
- Optimization: Skip null checks for non-nullable columns
- Execution: Allocate correctly-typed storage for results
Column Vectors
Arrow stores column data in vectors (also called arrays). Each vector type has its own class: IntVector, Float8Vector, VarCharVector, etc. This is efficient but inconvenient—code that processes columns would need type-specific branches everywhere.
KQuery introduces a ColumnVector interface to abstract over the underlying Arrow vectors:
```kotlin
/** Abstraction over different implementations of a column vector. */
interface ColumnVector : AutoCloseable {
    fun getType(): ArrowType
    fun getValue(i: Int): Any?
    fun size(): Int
}
```
This interface lets us write generic code that works with any column type. The getValue method returns Any? (Kotlin’s equivalent of Java’s Object), which isn’t ideal for performance but keeps our code simple.
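For example, a helper that counts nulls in a column (a hypothetical utility, not part of KQuery) needs no knowledge of the concrete vector class:

```kotlin
// Works with any ColumnVector implementation, regardless of the
// underlying Arrow vector type.
fun countNulls(v: ColumnVector): Int {
    var nulls = 0
    for (i in 0 until v.size()) {
        if (v.getValue(i) == null) {
            nulls++
        }
    }
    return nulls
}
```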
We can wrap an Arrow FieldVector in this interface:
```kotlin
/** Wrapper around Arrow FieldVector. */
class ArrowFieldVector(val field: FieldVector) : ColumnVector {

    override fun getType(): ArrowType {
        return when (field) {
            is BitVector -> ArrowTypes.BooleanType
            is TinyIntVector -> ArrowTypes.Int8Type
            is SmallIntVector -> ArrowTypes.Int16Type
            // ... other vector types ...
            else -> throw IllegalStateException("Unsupported vector type: ${field.javaClass.name}")
        }
    }

    override fun getValue(i: Int): Any? {
        if (field.isNull(i)) {
            return null
        }
        return when (field) {
            is BitVector -> field.get(i) == 1
            is TinyIntVector -> field.get(i)
            is SmallIntVector -> field.get(i)
            // ... other vector types ...
            else -> throw IllegalStateException("Unsupported vector type: ${field.javaClass.name}")
        }
    }

    override fun size(): Int {
        return field.valueCount
    }

    override fun close() {
        field.close()
    }
}
```
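To sketch how the wrapper might be used, here we populate an Arrow `IntVector` (which requires an allocator) and read it back through the `ColumnVector` interface, assuming the elided branches above cover `IntVector`:

```kotlin
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

fun main() {
    RootAllocator().use { allocator ->
        IntVector("id", allocator).use { vector ->
            vector.allocateNew(3)
            vector.set(0, 1)
            vector.set(1, 2)
            vector.setNull(2)
            vector.setValueCount(3)

            // Generic access: the caller no longer cares that this is an IntVector.
            val column: ColumnVector = ArrowFieldVector(vector)
            println(column.getValue(0)) // 1
            println(column.getValue(2)) // null
        }
    }
}
```

Kotlin's `use` ensures the vector and allocator are closed, releasing the off-heap memory Arrow allocates.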
Literal Values
Sometimes we need a “column” that contains the same value repeated. For example, when evaluating salary * 1.1, the literal 1.1 needs to act like a column of 1.1 values (one per row in the batch).
Rather than allocate memory for thousands of identical values, we create a virtual column:
```kotlin
class LiteralValueVector(
    private val arrowType: ArrowType,
    private val value: Any?,
    private val size: Int
) : ColumnVector {

    override fun getType(): ArrowType = arrowType

    override fun getValue(i: Int): Any? {
        if (i < 0 || i >= size) {
            throw IndexOutOfBoundsException()
        }
        return value
    }

    override fun size(): Int = size

    // ColumnVector extends AutoCloseable; nothing to release here.
    override fun close() {}
}
```
This returns the same value for any valid index, using constant memory regardless of how many rows we’re processing.
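For the salary example, the literal `1.1` for a 1,024-row batch can be represented as:

```kotlin
// One 1.1 "per row", but only a single value is actually stored.
val multiplier = LiteralValueVector(ArrowTypes.DoubleType, 1.1, 1024)
multiplier.getValue(0)    // 1.1
multiplier.getValue(1023) // 1.1
```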
Record Batches
A record batch groups multiple columns together with a schema. This is the fundamental unit of data that flows through our query engine.
```kotlin
class RecordBatch(
    val schema: Schema,
    val fields: List<ColumnVector>
) {
    fun rowCount(): Int = fields.first().size()

    fun columnCount(): Int = fields.size

    fun field(i: Int): ColumnVector = fields[i]
}
```
Record batches enable batch processing: rather than processing one row at a time (slow due to function call overhead) or loading entire datasets into memory (impractical for large data), we process chunks of typically 1,000 to 100,000 rows.
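For illustration, a debugging utility (this helper is hypothetical, not part of KQuery) can walk a batch row by row without knowing the concrete vector types behind each column:

```kotlin
// Print a batch as CSV-style rows; purely illustrative.
fun printBatch(batch: RecordBatch) {
    for (row in 0 until batch.rowCount()) {
        val values = (0 until batch.columnCount())
            .map { col -> batch.field(col).getValue(row) }
        println(values.joinToString(","))
    }
}
```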
Type Coercion
Real query engines need type coercion: automatically converting types when necessary. For example:
- `5 + 3.14` should promote the integer `5` to a float
- Comparing `Int32` to `Int64` should work without explicit casts
- String-to-date conversion for predicates like `date > '2024-01-01'`
KQuery keeps things simple and requires explicit types in most cases. A production query engine would implement coercion rules, typically promoting to the “wider” type (Int32 → Int64 → Float64).
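As a sketch of what such a rule looks like, here is a simplified promotion function over a stand-in enum (a real implementation would pattern-match on Arrow's `ArrowType` instances instead):

```kotlin
// Simplified stand-in for Arrow's numeric types, ordered narrowest to widest.
enum class NumericType { INT32, INT64, FLOAT64 }

// Promote a pair of input types to the wider of the two.
fun widerType(l: NumericType, r: NumericType): NumericType =
    if (l.ordinal >= r.ordinal) l else r

fun main() {
    println(widerType(NumericType.INT32, NumericType.FLOAT64)) // FLOAT64
    println(widerType(NumericType.INT32, NumericType.INT64))   // INT64
}
```

Encoding the widening order in the enum declaration keeps the rule a one-liner; adding a type means inserting it at the right position in the ordering.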
Putting It Together
Here’s how these pieces fit together. When processing a query:
- Data sources provide schemas describing available columns
- Query planning uses schemas to validate expressions and determine result types
- Execution passes record batches between operators
- Each operator produces output batches with a known schema
The type system is the foundation that makes this possible. Without it, we couldn’t validate queries, plan efficiently, or allocate storage correctly.
The next chapter builds on this foundation to create abstractions for reading data from files.
This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.