Documentation

geolith processes data through a four-phase pipeline. Each phase is designed for streaming, so memory usage stays bounded regardless of input size.

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Phase 1         │    │  Phase 2         │    │  Phase 3         │    │  Phase 4         │
│  Read + Process  │───▶│  External Sort   │───▶│  Tile Encode     │───▶│  Output Write    │
│                  │    │                  │    │                  │    │                  │
│  6 sources ──▶   │    │  Sort key pack   │    │  MVT protobuf    │    │  PMTiles v3 /    │
│  GERS conflate   │    │  LZ4 chunks ──▶  │    │  encoding ──▶    │    │  MBTiles ──▶     │
│  Project ──▶     │    │  K-way merge     │    │  Deduplication   │    │  Tile dedup      │
│  Clip + Simplify │    │                  │    │                  │    │                  │
└──────────────────┘    └──────────────────┘    └──────────────────┘    └──────────────────┘

Phase 1: Read + Process

Input: 6 data sources (Overture GeoParquet, OSM PBF, Natural Earth SQLite, land/water shapefiles, India boundary GeoJSON)
Output: Stream of SortEntry (sort key + TileFeature) pairs

Source Reading

Reader            Format      Key Details
Overture Maps     GeoParquet  Column projection, bbox predicate pushdown, hive-partitioned
OpenStreetMap     PBF         2-pass: node coordinate cache (mmap) + feature extraction. Auto osmium extract for --bbox
Natural Earth     SQLite      Spatialite blob decoding for land/ocean/admin boundaries
Land Polygons     Shapefile   EPSG:3857 → WGS84 reprojection
Water Polygons    Shapefile   EPSG:3857 → WGS84 reprojection
India Boundary    GeoJSON     India border override for correct rendering

GERS Conflation

When GERS bridge files are present (auto-discovered or via --bridge-dir), geolith loads a BridgeIndex mapping OSM element IDs → GERS UUIDs. During OSM reading, features with matching GERS IDs are skipped — the Overture version is preferred. This eliminates duplicate buildings, roads, and divisions.

For themes without bridge files (places, POIs, landcover, water), an R-tree/H3 spatial deduplication fallback is used.
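The bridge lookup itself is a simple keep-or-skip decision per OSM element. The sketch below shows the pattern under illustrative names: BridgeIndex is named in the text, but the field and method here are stand-ins, not geolith's actual API.

```rust
use std::collections::HashMap;

/// Sketch of the bridge index: OSM element ID -> GERS UUID.
/// Field and method names are illustrative stand-ins.
struct BridgeIndex {
    osm_to_gers: HashMap<i64, String>,
}

impl BridgeIndex {
    /// An OSM feature is skipped when a GERS-bridged Overture
    /// counterpart exists; the Overture version is preferred.
    fn should_skip_osm(&self, osm_id: i64) -> bool {
        self.osm_to_gers.contains_key(&osm_id)
    }
}

fn main() {
    let mut osm_to_gers = HashMap::new();
    osm_to_gers.insert(123_456, "gers-uuid-placeholder".to_string());
    let index = BridgeIndex { osm_to_gers };
    assert!(index.should_skip_osm(123_456)); // bridged: keep the Overture copy
    assert!(!index.should_skip_osm(999));    // no bridge entry: keep the OSM copy
}
```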

Per-Feature Processing (rayon parallel)

  1. Layer dispatch — route InputFeature to one of 11 layer processors
  2. Web Mercator projection — WGS84 → normalized world coordinates in [0, 1]
  3. Tile assignment — world_to_tile() at each zoom level (0 to --max-zoom)
  4. Stripe clipping — two-pass Sutherland-Hodgman (X→Y) partitions geometry to tile boundaries
  5. Simplification — Douglas-Peucker with zoom-dependent tolerance
  6. MVT geometry encoding — coordinates snapped to 4096×4096 grid, delta + zigzag encoded
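Steps 2 and 3 can be sketched with the standard Web Mercator formulas. world_to_tile() is named in the pipeline description; the body below is an illustrative stand-in, not geolith's implementation.

```rust
/// Project WGS84 lon/lat into the [0, 1] x [0, 1] Web Mercator world square.
fn wgs84_to_world(lon: f64, lat: f64) -> (f64, f64) {
    let x = (lon + 180.0) / 360.0;
    let lat_rad = lat.to_radians();
    let y = (1.0 - (lat_rad.tan() + 1.0 / lat_rad.cos()).ln()
        / std::f64::consts::PI) / 2.0;
    (x, y)
}

/// Assign a world coordinate to a tile at the given zoom level.
fn world_to_tile(wx: f64, wy: f64, zoom: u8) -> (u32, u32) {
    let n = 1u32 << zoom;
    let nf = n as f64;
    // Clamp so points exactly on the right/bottom world edge stay in range.
    let tx = ((wx * nf) as u32).min(n - 1);
    let ty = ((wy * nf) as u32).min(n - 1);
    (tx, ty)
}

fn main() {
    // Null island projects to the exact center of the world square.
    let (wx, wy) = wgs84_to_world(0.0, 0.0);
    assert!((wx - 0.5).abs() < 1e-12 && (wy - 0.5).abs() < 1e-12);
    assert_eq!(world_to_tile(wx, wy, 1), (1, 1));
}
```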

InputFeature.geometry uses Arc<geo::Geometry<f64>> to eliminate deep clones across zoom levels (critical for 400M+ buildings).
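The point of the Arc is that handing the same geometry to every zoom level copies a pointer, not the coordinate data. A minimal sketch of that sharing pattern, using a stand-in type in place of geo::Geometry:

```rust
use std::sync::Arc;

// Stand-in for geo::Geometry<f64>: a ring of coordinates.
struct Geometry {
    ring: Vec<(f64, f64)>,
}

fn main() {
    let geom = Arc::new(Geometry {
        ring: vec![(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)],
    });
    // One handle per zoom level (0..=15), all sharing a single allocation;
    // Arc::clone bumps a refcount instead of deep-copying the ring.
    let per_zoom: Vec<Arc<Geometry>> = (0..=15).map(|_| Arc::clone(&geom)).collect();
    assert_eq!(Arc::strong_count(&geom), 17); // original + 16 zoom handles
    assert_eq!(per_zoom[7].ring.len(), 3);
}
```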

Phase 2: External Sort

Input: Stream of SortEntry pairs
Output: Sorted stream of entries ordered by (tile_id, layer, sort_key)

Planet-scale datasets produce billions of tile entries that don't fit in memory. geolith uses an external merge sort:

  1. Accumulate — thread-local SortBuffer collects entries until threshold (auto-sized or --chunk-size-mb)
  2. Flush — sort by packed u64 key, serialize with postcard, compress with LZ4 to chunk files
  3. Merge — KWayMerge uses a BinaryHeap min-heap to stream-merge all chunks
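The merge step can be sketched with the standard heap-based k-way merge. Real chunks are postcard-serialized, LZ4-compressed files; plain in-memory iterators of packed u64 keys stand in here, and this is not geolith's actual KWayMerge code.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stream-merge several individually sorted chunks with a min-heap.
fn k_way_merge(chunks: Vec<Vec<u64>>) -> Vec<u64> {
    let mut iters: Vec<_> = chunks.into_iter().map(|c| c.into_iter()).collect();
    // Heap holds (key, chunk index); Reverse turns BinaryHeap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(k) = it.next() {
            heap.push(Reverse((k, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((k, i))) = heap.pop() {
        out.push(k);
        // Refill from the chunk that just yielded the smallest key.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

fn main() {
    let merged = k_way_merge(vec![vec![1, 4, 7], vec![2, 3, 9], vec![5]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 7, 9]);
}
```

Because each chunk is consumed front-to-back and the heap only ever holds one entry per chunk, memory stays proportional to the number of chunks, not the total entry count.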

Sort key layout — single u64 enables comparison without deserializing:

MSB                                                              LSB
[5 bits zoom][29 bits tile_id][4 bits layer][26 bits feature_sort]

This groups all features for a tile together, then by layer within a tile, then by feature sort key within a layer.
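The layout above packs directly into shifts and ORs. A sketch of the packing, with field widths taken from the diagram (the function name and argument types are assumptions):

```rust
/// Pack the u64 sort key:
/// [5 bits zoom][29 bits tile_id][4 bits layer][26 bits feature_sort]
fn pack_sort_key(zoom: u8, tile_id: u32, layer: u8, feature_sort: u32) -> u64 {
    debug_assert!(zoom < 32 && tile_id < (1u32 << 29));
    debug_assert!(layer < 16 && feature_sort < (1u32 << 26));
    ((zoom as u64) << 59)
        | ((tile_id as u64) << 30)
        | ((layer as u64) << 26)
        | (feature_sort as u64)
}

fn main() {
    // Zoom dominates: a higher zoom sorts after a lower one regardless
    // of the remaining fields, so plain u64 comparison yields the
    // (tile, layer, feature) grouping the text describes.
    let a = pack_sort_key(4, 1_000, 3, 42);
    let b = pack_sort_key(5, 0, 0, 0);
    assert!(a < b);
    // Same zoom and tile: layer dominates feature_sort.
    let c = pack_sort_key(4, 1_000, 2, 9_999);
    assert!(c < a);
}
```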

Phase 3: Tile Encode

Input: Sorted stream of entries grouped by tile
Output: Encoded MVT blobs, gzip compressed

  1. Group — consecutive entries with the same tile ID are grouped
  2. Polygon merge — adjacent features with identical properties merged
  3. Feature deduplication — ahash fingerprinting detects duplicates
  4. Height quantization — per-zoom level bucketing for building heights
  5. MVT encoding — group by layer, deduplicated key/value string tables, protobuf via prost
  6. Gzip compression — flate2 compresses each tile
  7. Batched parallelism — 512 tiles per rayon batch

Coordinate space: TILE_EXTENT = 4096 (standard MVT resolution).
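The fingerprint-based deduplication in step 3 amounts to hashing each feature's geometry and properties and dropping repeats within a tile. geolith uses ahash; this sketch substitutes std's DefaultHasher, and the fingerprint inputs shown are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Fingerprint a feature from its encoded geometry and its properties.
fn fingerprint(geometry: &[u32], properties: &[(String, String)]) -> u64 {
    let mut h = DefaultHasher::new();
    geometry.hash(&mut h);
    properties.hash(&mut h);
    h.finish()
}

fn main() {
    let props = vec![("class".to_string(), "building".to_string())];
    let mut seen = HashSet::new();
    // HashSet::insert returns false for an already-seen fingerprint.
    assert!(seen.insert(fingerprint(&[9, 18, 2, 0], &props)));  // first copy kept
    assert!(!seen.insert(fingerprint(&[9, 18, 2, 0], &props))); // duplicate dropped
    assert!(seen.insert(fingerprint(&[9, 18, 4, 0], &props)));  // distinct geometry kept
}
```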

Phase 4: Output Write

Input: Deduplicated tile blobs
Output: A single .pmtiles or .mbtiles file

PMTiles v3

  • Hilbert curve ordering — adjacent map tiles are adjacent in the file for efficient HTTP range requests
  • Content deduplication — ocean/empty tiles share storage (30–50% size reduction)
  • Two-level directory — root + leaf entries, any tile found in ≤2 range requests
  • PMTiles tile ID: sum(4^z for z in 0..tile.z) + hilbert(tile.x, tile.y, tile.z)
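The tile ID formula above has two parts: a per-zoom base offset, for which sum(4^z for z in 0..tz) has the closed form (4^tz - 1) / 3, plus a Hilbert index within the level. The sketch below uses the textbook xy-to-d Hilbert mapping; the exact curve orientation the PMTiles spec mandates may differ from this illustrative version.

```rust
/// Textbook Hilbert curve index for (x, y) on a 2^z x 2^z grid.
fn hilbert(z: u8, mut x: u32, mut y: u32) -> u64 {
    let n = 1u32 << z;
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = u32::from((x & s) != 0);
        let ry = u32::from((y & s) != 0);
        d += (s as u64) * (s as u64) * u64::from((3 * rx) ^ ry);
        // Rotate the quadrant so each sub-curve is oriented consistently.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Base offset for zoom z plus the Hilbert index within that level.
fn tile_id(z: u8, x: u32, y: u32) -> u64 {
    let base = ((1u64 << (2 * z)) - 1) / 3; // (4^z - 1) / 3
    base + hilbert(z, x, y)
}

fn main() {
    assert_eq!(tile_id(0, 0, 0), 0); // the single z0 tile
    assert_eq!(tile_id(1, 0, 0), 1); // z1 tiles occupy IDs 1..=4
    assert_eq!(tile_id(1, 1, 0), 4);
}
```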

MBTiles

SQLite-based tile storage with WAL mode for concurrent reads during write. Same content deduplication as PMTiles.

The PMTiles output conforms to the PMTiles v3 specification and is compatible with any PMTiles reader.