Type System
The source code discussed in this chapter can be found in the datatypes module of the KQuery project.
Every query engine needs a type system—a way to represent and reason about data types. When you write SELECT price * quantity, the engine must know that price and quantity are numbers that can be multiplied, and what type the result will be. This chapter builds KQuery’s type system on top of Apache Arrow.
Why Types Matter
Consider this query:
```sql
SELECT name, salary * 1.1 AS new_salary
FROM employees
WHERE department = 'Engineering'
```
Before executing, the query engine must answer:
- Is `salary` a numeric type that supports multiplication?
- What’s the result type of `salary * 1.1`? (If `salary` is an integer, should the result be an integer or a floating point?)
- Is `department` a string type that supports equality comparison?
- What columns and types does the result have?
These questions arise during query planning, before we touch any data. A well-designed type system catches errors early (“you can’t multiply a string by a number”) and enables optimizations (“this column is never null, so skip null checks”).
Building on Arrow
Rather than invent our own type system, we’ll use Apache Arrow’s types directly. This gives us:
- A rich set of standard types (integers, floats, strings, dates, etc.)
- Efficient in-memory representation (as covered in the previous chapter)
- Compatibility with Arrow-based tools and file formats
Arrow’s type system includes:
| Category | Types |
|---|---|
| Boolean | Bool |
| Integers | Int8, Int16, Int32, Int64 (signed and unsigned) |
| Floating point | Float32, Float64 |
| Text | Utf8 (variable-length strings) |
| Binary | Binary (variable-length bytes) |
| Temporal | Date32, Date64, Timestamp, Time32, Time64, Duration |
| Nested | List, Struct, Map |
For KQuery, we’ll focus on the common types: booleans, integers, floating point numbers, and strings. We define convenient constants for these:
```kotlin
object ArrowTypes {
    val BooleanType = ArrowType.Bool()
    val Int8Type = ArrowType.Int(8, true)
    val Int16Type = ArrowType.Int(16, true)
    val Int32Type = ArrowType.Int(32, true)
    val Int64Type = ArrowType.Int(64, true)
    val UInt8Type = ArrowType.Int(8, false)
    val UInt16Type = ArrowType.Int(16, false)
    val UInt32Type = ArrowType.Int(32, false)
    val UInt64Type = ArrowType.Int(64, false)
    val FloatType = ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
    val DoubleType = ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    val StringType = ArrowType.Utf8()
}
```
Schemas and Fields
A schema describes the structure of a dataset: what columns exist and what type each has. Schemas are essential metadata that flows through the entire query engine.
Arrow represents schemas as a list of fields, where each field has:
- A name (string)
- A data type (ArrowType)
- A nullable flag (can this column contain nulls?)
For example, an employee table might have this schema:
```
Schema:
- id: Int32, not nullable
- name: Utf8, nullable
- department: Utf8, nullable
- salary: Float64, not nullable
```
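As a sketch of what this looks like in code, the schema above can be built directly with Arrow's `Field`, `FieldType`, and `Schema` classes from `org.apache.arrow.vector.types.pojo` (KQuery itself layers its own abstractions on top of these):

```kotlin
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.arrow.vector.types.pojo.Field
import org.apache.arrow.vector.types.pojo.FieldType
import org.apache.arrow.vector.types.pojo.Schema

// Each field pairs a name and a data type with a nullable flag.
val employeeSchema = Schema(listOf(
    Field("id", FieldType.notNullable(ArrowType.Int(32, true)), null),
    Field("name", FieldType.nullable(ArrowType.Utf8()), null),
    Field("department", FieldType.nullable(ArrowType.Utf8()), null),
    Field("salary", FieldType.notNullable(ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)), null)
))
```

The `FieldType.nullable` and `FieldType.notNullable` factory methods capture the nullable flag alongside the type.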
Schemas serve multiple purposes:
- Validation: Reject queries that reference non-existent columns or misuse types
- Planning: Determine output types for expressions
- Optimization: Skip null checks for non-nullable columns
- Execution: Allocate correctly-typed storage for results
Column Vectors
Arrow stores column data in vectors (also called arrays). Each vector type has its own class: IntVector, Float8Vector, VarCharVector, etc. This is efficient but inconvenient—code that processes columns would need type-specific branches everywhere.
KQuery introduces a ColumnVector interface to abstract over the underlying Arrow vectors:
```kotlin
/** Abstraction over different implementations of a column vector. */
interface ColumnVector : AutoCloseable {
    fun getType(): ArrowType
    fun getValue(i: Int): Any?
    fun size(): Int
}
```
This interface lets us write generic code that works with any column type. The getValue method returns Any? (Kotlin’s equivalent of Java’s Object), which isn’t ideal for performance but keeps our code simple.
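For example, a helper that counts nulls in a column (a hypothetical utility, not part of KQuery) needs no knowledge of the concrete vector class:

```kotlin
// Works with any ColumnVector implementation, regardless of the
// underlying Arrow vector type.
fun countNulls(v: ColumnVector): Int {
    var nulls = 0
    for (i in 0 until v.size()) {
        if (v.getValue(i) == null) {
            nulls++
        }
    }
    return nulls
}
```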
We can wrap an Arrow FieldVector in this interface:
```kotlin
/** Wrapper around Arrow FieldVector. */
class ArrowFieldVector(val field: FieldVector) : ColumnVector {

    override fun getType(): ArrowType {
        return when (field) {
            is BitVector -> ArrowTypes.BooleanType
            is TinyIntVector -> ArrowTypes.Int8Type
            is SmallIntVector -> ArrowTypes.Int16Type
            // ... other vector types ...
            else -> throw IllegalStateException("Unsupported vector type: ${field.javaClass.name}")
        }
    }

    override fun getValue(i: Int): Any? {
        if (field.isNull(i)) {
            return null
        }
        return when (field) {
            is BitVector -> field.get(i) == 1
            is TinyIntVector -> field.get(i)
            is SmallIntVector -> field.get(i)
            // ... other vector types ...
            else -> throw IllegalStateException("Unsupported vector type: ${field.javaClass.name}")
        }
    }

    override fun size(): Int {
        return field.valueCount
    }

    override fun close() {
        field.close()
    }
}
```
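To sketch how the wrapper might be used, here we populate an Arrow `IntVector` (which requires an allocator) and read it back through the `ColumnVector` interface, assuming the elided branches above cover `IntVector`:

```kotlin
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

fun main() {
    RootAllocator().use { allocator ->
        IntVector("id", allocator).use { vector ->
            vector.allocateNew(3)
            vector.set(0, 1)
            vector.set(1, 2)
            vector.setNull(2)
            vector.setValueCount(3)

            // Generic access: the caller no longer cares that this is an IntVector.
            val column: ColumnVector = ArrowFieldVector(vector)
            println(column.getValue(0)) // 1
            println(column.getValue(2)) // null
        }
    }
}
```

Kotlin's `use` ensures the vector and allocator are closed, releasing the off-heap memory Arrow allocates.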
Literal Values
Sometimes we need a “column” that contains the same value repeated. For example, when evaluating salary * 1.1, the literal 1.1 needs to act like a column of 1.1 values (one per row in the batch).
Rather than allocate memory for thousands of identical values, we create a virtual column:
```kotlin
class LiteralValueVector(
    private val arrowType: ArrowType,
    private val value: Any?,
    private val size: Int
) : ColumnVector {

    override fun getType(): ArrowType = arrowType

    override fun getValue(i: Int): Any? {
        if (i < 0 || i >= size) {
            throw IndexOutOfBoundsException()
        }
        return value
    }

    override fun size(): Int = size

    // ColumnVector extends AutoCloseable; nothing to release here.
    override fun close() {}
}
```
This returns the same value for any valid index, using constant memory regardless of how many rows we’re processing.
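For the salary example, the literal `1.1` for a 1,024-row batch can be represented as:

```kotlin
// One 1.1 "per row", but only a single value is actually stored.
val multiplier = LiteralValueVector(ArrowTypes.DoubleType, 1.1, 1024)
multiplier.getValue(0)    // 1.1
multiplier.getValue(1023) // 1.1
```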
Record Batches
A record batch groups multiple columns together with a schema. This is the fundamental unit of data that flows through our query engine.
```kotlin
class RecordBatch(
    val schema: Schema,
    val fields: List<ColumnVector>
) {
    fun rowCount(): Int = fields.first().size()

    fun columnCount(): Int = fields.size

    fun field(i: Int): ColumnVector = fields[i]
}
```
Record batches enable batch processing: rather than processing one row at a time (slow due to function call overhead) or loading entire datasets into memory (impractical for large data), we process chunks of typically 1,000 to 100,000 rows.
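For illustration, a debugging utility (this helper is hypothetical, not part of KQuery) can walk a batch row by row without knowing the concrete vector types behind each column:

```kotlin
// Print a batch as CSV-style rows; purely illustrative.
fun printBatch(batch: RecordBatch) {
    for (row in 0 until batch.rowCount()) {
        val values = (0 until batch.columnCount())
            .map { col -> batch.field(col).getValue(row) }
        println(values.joinToString(","))
    }
}
```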
Type Coercion
Real query engines need type coercion: automatically converting types when necessary. For example:
- `5 + 3.14` should promote the integer `5` to a float
- Comparing `Int32` to `Int64` should work without explicit casts
- String-to-date conversion for predicates like `date > '2024-01-01'`
KQuery keeps things simple and requires explicit types in most cases. A production query engine would implement coercion rules, typically promoting to the “wider” type (Int32 → Int64 → Float64).
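As a sketch of what such a rule looks like, here is a simplified promotion function over a stand-in enum (a real implementation would pattern-match on Arrow's `ArrowType` instances instead):

```kotlin
// Simplified stand-in for Arrow's numeric types, ordered narrowest to widest.
enum class NumericType { INT32, INT64, FLOAT64 }

// Promote a pair of input types to the wider of the two.
fun widerType(l: NumericType, r: NumericType): NumericType =
    if (l.ordinal >= r.ordinal) l else r

fun main() {
    println(widerType(NumericType.INT32, NumericType.FLOAT64)) // FLOAT64
    println(widerType(NumericType.INT32, NumericType.INT64))   // INT64
}
```

Encoding the widening order in the enum declaration keeps the rule a one-liner; adding a type means inserting it at the right position in the ordering.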
Putting It Together
Here’s how these pieces fit together. When processing a query:
- Data sources provide schemas describing available columns
- Query planning uses schemas to validate expressions and determine result types
- Execution passes record batches between operators
- Each operator produces output batches with a known schema
The type system is the foundation that makes this possible. Without it, we couldn’t validate queries, plan efficiently, or allocate storage correctly.
The next chapter builds on this foundation to create abstractions for reading data from files.
This book is also available for purchase in ePub, MOBI, and PDF format from https://leanpub.com/how-query-engines-work
Copyright © 2020-2025 Andy Grove. All rights reserved.