Apache Arrow

Before we start building our query engine, we need to choose how to represent data in memory. This choice affects everything: how we read files, how we pass data between operators, and how fast our computations run. We will use Apache Arrow, which has become the standard for in-memory columnar data.

Why Columnar?

Traditional databases and programming languages typically store data row by row. If you have a table of employees, each employee record sits together in memory:

Row 0: [1, "Alice", "Engineering", 95000]
Row 1: [2, "Bob", "Sales", 87000]
Row 2: [3, "Carol", "Engineering", 102000]

This layout is intuitive and works well when you need to access entire records. But query engines often process just a few columns at a time. Consider:

SELECT AVG(salary) FROM employees WHERE department = 'Engineering'

This query only needs the department and salary columns. With row-based storage, we would load entire rows into memory just to access two fields.

Columnar storage flips this around. Each column is stored contiguously:

ids:         [1, 2, 3]
names:       ["Alice", "Bob", "Carol"]
departments: ["Engineering", "Sales", "Engineering"]
salaries:    [95000, 87000, 102000]

Now reading salary means reading one contiguous block of memory, with no jumping around to skip unwanted fields. This matters enormously for performance, for three reasons (the sketch after this list makes the contrast concrete):

  • Modern CPUs load memory in cache lines (typically 64 bytes). Columnar data packs more useful values per cache line.
  • Similar values grouped together compress much better. A column of departments might compress to just a few distinct values.
  • CPUs can apply the same operation to multiple values simultaneously (Single Instruction, Multiple Data). Columnar layout enables this.
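
To make the contrast concrete, here is a minimal sketch in plain Java (not Arrow code); the class and method names are purely illustrative. The columnar sum walks one contiguous primitive array, while the row-oriented sum chases a pointer per row and drags the unused fields through the cache.

import java.util.List;

// Plain-Java sketch contrasting the two layouts when summing the salary column.
class LayoutComparison {

    // Row-oriented: each row is an object; summing salaries follows one
    // pointer per row and pulls the unused fields into the cache as well.
    static class Employee {
        int id; String name; String department; double salary;
    }

    static double sumRowOriented(List<Employee> rows) {
        double total = 0;
        for (Employee e : rows) {
            total += e.salary;
        }
        return total;
    }

    // Column-oriented: the salary column is one contiguous primitive array;
    // the loop reads memory sequentially and the JIT can vectorize it.
    static double sumColumnOriented(double[] salaries) {
        double total = 0;
        for (double s : salaries) {
            total += s;
        }
        return total;
    }
}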

What Is Apache Arrow?

Apache Arrow is a specification for how to represent columnar data in memory, plus libraries implementing that specification in many languages. Think of it as a universal format that different systems can share.

The key insight behind Arrow is that data analytics involves many tools: Python for exploration, Spark for distributed processing, databases for storage, visualization tools for presentation. Traditionally, each tool has its own internal format, so moving data between tools means serializing and deserializing, converting formats repeatedly.

Arrow eliminates this overhead. If your Python code and your Java code both use Arrow format, they can share memory directly. A pointer to Arrow data is meaningful to any Arrow-aware system.

Arrow Memory Layout

An Arrow column (called a “vector” or “array”) consists of:

  1. A data buffer containing the actual values, packed contiguously
  2. A validity buffer, which is a bitmap indicating which values are null
  3. An optional offset buffer for variable-length types like strings

For a column of 32-bit integers [1, null, 3, 4]:

Validity bitmap: [1, 0, 1, 1]  (bit per value: 1=valid, 0=null)
Data buffer:     [1, ?, 3, 4]  (? = undefined, since null)
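
The following plain-Java sketch (illustrative only, not Arrow library code) shows how a reader combines the two buffers. Arrow packs the validity bitmap one bit per value, least-significant bit first, so value i is tracked by bit i % 8 of byte i / 8.

class Int32Column {
    final byte[] validity;  // one bit per value: 1 = valid, 0 = null
    final int[] data;       // the values; slots for nulls hold undefined data

    Int32Column(byte[] validity, int[] data) {
        this.validity = validity;
        this.data = data;
    }

    boolean isNull(int index) {
        return ((validity[index / 8] >> (index % 8)) & 1) == 0;
    }

    Integer get(int index) {
        return isNull(index) ? null : data[index];
    }
}

// For the column [1, null, 3, 4]:
//   new Int32Column(new byte[] {0b00001101}, new int[] {1, 0, 3, 4})
//   get(1) returns null, get(2) returns 3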

For strings, we need offsets because strings vary in length:

Values: ["hello", "world", "!"]

Offsets: [0, 5, 10, 11]  (byte offset where each string starts, plus the end of the last)
Data:    "helloworld!"   (all strings concatenated)

To get string at index 1: read from offset[1]=5 to offset[2]=10, giving “world”.
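
Here is the same lookup as a small plain-Java sketch (illustrative only, not Arrow library code):

import java.nio.charset.StandardCharsets;

// Resolves one value of a variable-length string column from its two buffers.
class StringColumn {
    final int[] offsets;  // one more entry than there are strings
    final byte[] data;    // all strings concatenated, UTF-8 encoded

    StringColumn(int[] offsets, byte[] data) {
        this.offsets = offsets;
        this.data = data;
    }

    String get(int index) {
        int start = offsets[index];      // where this string begins
        int end = offsets[index + 1];    // where the next string begins
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }
}

// new StringColumn(new int[] {0, 5, 10, 11},
//                  "helloworld!".getBytes(StandardCharsets.UTF_8)).get(1) returns "world"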

This layout is simple but powerful. Fixed-width types like integers require just one memory access per value. The validity bitmap uses just one bit per value, so checking for nulls is cheap.

Record Batches

A single column is not very useful on its own. We need multiple columns together. Query engines typically group columns into record batches: a collection of equal-length columns with a schema describing their names and types. KQuery defines its own RecordBatch class for this purpose.

RecordBatch:
  Schema: {id: Int32, name: Utf8, salary: Float64}
  Columns:
    id:     [1, 2, 3]
    name:   ["Alice", "Bob", "Carol"]
    salary: [95000.0, 87000.0, 102000.0]
  Length: 3
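
Conceptually, a record batch is little more than a schema plus a list of equal-length columns. The following sketch shows one way to model it using the Arrow Java classes introduced later in this chapter; it is illustrative only and not necessarily how KQuery defines its RecordBatch class.

import java.util.List;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.types.pojo.Schema;

// Illustrative sketch of a record batch: a schema plus equal-length columns.
class RecordBatch {
    final Schema schema;
    final List<FieldVector> columns;

    RecordBatch(Schema schema, List<FieldVector> columns) {
        this.schema = schema;
        this.columns = columns;
    }

    int rowCount() {
        // all columns have the same length, so any column will do
        return columns.isEmpty() ? 0 : columns.get(0).getValueCount();
    }

    FieldVector column(int index) {
        return columns.get(index);
    }
}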

Query engines process data in batches rather than row by row or all at once. Batch processing hits the sweet spot:

  • Small enough to fit in CPU cache
  • Large enough to amortize per-batch overhead
  • Enables streaming (process data as it arrives)

Typical batch sizes range from 1,000 to 100,000 rows depending on the workload.

Schemas and Types

Arrow defines a rich type system covering:

  • Integers: Int8, Int16, Int32, Int64 (signed and unsigned)
  • Floating point: Float32 (single), Float64 (double)
  • Binary: Variable-length byte arrays
  • Strings: UTF-8 encoded text
  • Temporal: Date, Time, Timestamp, Duration, Interval
  • Nested: List, Struct, Map, Union

A schema assigns each column a name and a type:

Schema:
  - id: Int32, not nullable
  - name: Utf8, nullable
  - hire_date: Date32, nullable
  - salary: Float64, not nullable

Having explicit nullability in the schema lets the query engine optimize. If a column cannot be null, we can skip null checks entirely.
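
Expressed with the Arrow Java type classes (introduced in the next section), the schema above looks roughly like the sketch below, assuming the class and factory names of recent Arrow Java releases.

import java.util.List;
import org.apache.arrow.vector.types.DateUnit;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Each Field carries a name, a type, and an explicit nullability flag.
Schema schema = new Schema(List.of(
    new Field("id", FieldType.notNullable(new ArrowType.Int(32, true)), null),
    new Field("name", FieldType.nullable(new ArrowType.Utf8()), null),
    new Field("hire_date", FieldType.nullable(new ArrowType.Date(DateUnit.DAY)), null),
    new Field("salary",
        FieldType.notNullable(new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)),
        null)));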

Language Implementations

Arrow has official implementations in C++, Java, Python, Rust, Go, JavaScript, C#, Ruby, and more. For this book, we will use the Java implementation.

The Java API provides:

  • FieldVector: Base class for column vectors (IntVector, VarCharVector, etc.)
  • VectorSchemaRoot: Container for a record batch
  • ArrowType: Represents data types
  • Schema/Field: Describes the structure of data

Here is a simple example creating an Arrow vector in Java. Vectors allocate their buffers through a BufferAllocator, so we create a RootAllocator first:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

try (BufferAllocator allocator = new RootAllocator();
     IntVector vector = new IntVector("id", allocator)) {
    vector.allocateNew(3);      // reserve space for three values
    vector.set(0, 1);
    vector.set(1, 2);
    vector.set(2, 3);
    vector.setValueCount(3);    // declare how many values are populated
    // vector now contains [1, 2, 3]
}
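
Building on this, VectorSchemaRoot groups several vectors into a record batch. Below is a minimal sketch, again assuming the API of recent Arrow Java releases.

import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

try (BufferAllocator allocator = new RootAllocator();
     IntVector ids = new IntVector("id", allocator);
     VarCharVector names = new VarCharVector("name", allocator)) {

    ids.allocateNew(2);
    ids.set(0, 1);
    ids.set(1, 2);
    ids.setValueCount(2);

    names.allocateNew();
    names.setSafe(0, "Alice".getBytes(StandardCharsets.UTF_8));
    names.setSafe(1, "Bob".getBytes(StandardCharsets.UTF_8));
    names.setValueCount(2);

    // Group the two equal-length vectors into a record batch.
    VectorSchemaRoot batch = VectorSchemaRoot.of(ids, names);
    batch.setRowCount(2);
    System.out.println(batch.contentToTSVString());  // dump the batch as text
}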

Why Arrow for Our Query Engine?

We will use Arrow as the foundation of our query engine for several reasons:

  1. It is a standard format, so we can easily read/write Parquet files and integrate with other tools
  2. The Java library is mature and well-tested
  3. The columnar layout enables efficient query processing
  4. Understanding Arrow helps you understand other data systems

The next chapter builds our type system on top of Arrow, adding the abstractions our query engine needs.

Further Reading

The Arrow specification describes the memory format in detail. The Arrow website has documentation for each language implementation.

Arrow also provides protocols for transferring data over networks (Arrow Flight) and file formats (Arrow IPC, Feather), but we will not use those directly in this book. The core value for us is the in-memory columnar format.
