geolith processes data through a four-phase pipeline. Each phase is designed for streaming, so memory usage stays bounded regardless of input size.
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│     Phase 1      │    │     Phase 2      │    │     Phase 3      │    │     Phase 4      │
│  Read + Process  │───▶│  External Sort   │───▶│   Tile Encode    │───▶│   Output Write   │
│                  │    │                  │    │                  │    │                  │
│  6 sources ──▶   │    │  Sort key pack   │    │  MVT protobuf    │    │  PMTiles v3 /    │
│  GERS conflate   │    │  LZ4 chunks ──▶  │    │  encoding ──▶    │    │  MBTiles ──▶     │
│  Project ──▶     │    │  K-way merge     │    │  Deduplication   │    │  Tile dedup      │
│  Clip + Simplify │    │                  │    │                  │    │                  │
└──────────────────┘    └──────────────────┘    └──────────────────┘    └──────────────────┘
Phase 1: Read + Process
Input: 6 data sources (Overture GeoParquet, OSM PBF, Natural Earth SQLite, land/water shapefiles, India boundary GeoJSON)
Output: Stream of SortEntry (sort key + TileFeature) pairs
Source Reading
| Reader | Format | Key Details |
|---|---|---|
| Overture Maps | GeoParquet | Column projection, bbox predicate pushdown, hive-partitioned |
| OpenStreetMap | PBF | 2-pass: node coordinate cache (mmap) + feature extraction. Auto osmium extract for --bbox |
| Natural Earth | SQLite | Spatialite blob decoding for land/ocean/admin boundaries |
| Land Polygons | Shapefile | EPSG:3857→WGS84 reprojection |
| Water Polygons | Shapefile | EPSG:3857→WGS84 reprojection |
| India Boundary | GeoJSON | India border override for correct rendering |
GERS Conflation
When GERS bridge files are present (auto-discovered or via --bridge-dir), geolith loads a BridgeIndex mapping OSM element IDs → GERS UUIDs. During OSM reading, features with matching GERS IDs are skipped — the Overture version is preferred. This eliminates duplicate buildings, roads, and divisions.
For themes without bridge files (places, POIs, landcover, water), an R-tree/H3 spatial deduplication fallback is used.
Per-Feature Processing (rayon parallel)
- Layer dispatch — route InputFeature to one of 11 layer processors
- Web Mercator projection — WGS84 → world coordinates in [0, 1]
- Tile assignment — world_to_tile() at each zoom level (0 to --max-zoom)
- Stripe clipping — two-pass Sutherland-Hodgman (X→Y) partitions geometry to tile boundaries
- Simplification — Douglas-Peucker with zoom-dependent tolerance
- MVT geometry encoding — coordinates snapped to 4096×4096 grid, delta + zigzag encoded
InputFeature.geometry uses Arc<geo::Geometry<f64>> to eliminate deep clones across zoom levels (critical for 400M+ buildings).
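The projection and tile-assignment steps can be sketched with the standard Web Mercator math; the helper names below mirror the text but are illustrative, not geolith's actual API:

```rust
use std::f64::consts::PI;

/// Project WGS84 lon/lat (degrees) into Web Mercator world
/// coordinates in [0, 1] (illustrative helper, not geolith's API).
fn wgs84_to_world(lon: f64, lat: f64) -> (f64, f64) {
    let x = (lon + 180.0) / 360.0;
    let lat_rad = lat.to_radians();
    let y = (1.0 - (lat_rad.tan() + 1.0 / lat_rad.cos()).ln() / PI) / 2.0;
    (x, y)
}

/// Map a world coordinate to a tile at a given zoom, in the spirit
/// of a world_to_tile()-style helper.
fn world_to_tile(x: f64, y: f64, zoom: u32) -> (u32, u32) {
    let n = (1u32 << zoom) as f64;
    let tx = (x * n).floor().clamp(0.0, n - 1.0) as u32;
    let ty = (y * n).floor().clamp(0.0, n - 1.0) as u32;
    (tx, ty)
}

fn main() {
    // Null Island sits at the centre of the world grid...
    let (x, y) = wgs84_to_world(0.0, 0.0);
    assert!((x - 0.5).abs() < 1e-12 && (y - 0.5).abs() < 1e-12);
    // ...which at zoom 1 is tile (1, 1).
    assert_eq!(world_to_tile(x, y, 1), (1, 1));
}
```

Because world coordinates are normalized to [0, 1] once, tile assignment at every zoom level reduces to a multiply and a floor.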
Phase 2: External Sort
Input: Stream of SortEntry pairs
Output: Sorted stream of entries ordered by (tile_id, layer, sort_key)
Planet-scale datasets produce billions of tile entries that don't fit in memory. geolith uses an external merge sort:
- Accumulate — thread-local SortBuffer collects entries until a threshold (auto-sized or --chunk-size-mb)
- Flush — sort by packed u64 key, serialize with postcard, compress with LZ4 to chunk files
- Merge — KWayMerge uses a BinaryHeap min-heap to stream-merge all chunks
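The merge step follows the standard heap-based k-way merge. This simplified sketch merges plain u64 sort keys and omits the postcard/LZ4 chunk decoding that the real KWayMerge performs:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stream-merge k sorted sequences into one sorted sequence.
/// Simplified sketch: in-memory Vecs stand in for LZ4 chunk files.
fn k_way_merge(chunks: Vec<Vec<u64>>) -> Vec<u64> {
    let mut iters: Vec<_> = chunks.into_iter().map(|c| c.into_iter()).collect();
    // BinaryHeap is a max-heap; Reverse turns it into the min-heap we need.
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(key) = it.next() {
            heap.push(Reverse((key, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((key, i))) = heap.pop() {
        out.push(key);
        // Refill from the chunk that just yielded the smallest key.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

fn main() {
    let merged = k_way_merge(vec![vec![1, 4, 7], vec![2, 5], vec![3, 6]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 6, 7]);
}
```

Only one entry per chunk is resident in the heap at a time, so memory stays O(k) for k chunks regardless of total entry count, and each pop costs O(log k).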
Sort key layout — single u64 enables comparison without deserializing:
MSB                                                            LSB
[5 bits zoom][29 bits tile_id][4 bits layer][26 bits feature_sort]
This groups all features for a tile together, then by layer within a tile, then by feature sort key within a layer.
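A sketch of the packing, with field widths taken from the layout above; the function name and exact shift arithmetic are illustrative, not geolith's actual code. Because the fields are ordered most-significant first, plain integer comparison on the packed u64 yields the desired sort order:

```rust
/// Pack (zoom, tile_id, layer, feature_sort) into one u64 using the
/// 5/29/4/26-bit layout described above (illustrative sketch).
fn pack_sort_key(zoom: u8, tile_id: u32, layer: u8, feature_sort: u32) -> u64 {
    debug_assert!(zoom < 32 && tile_id < (1 << 29) && layer < 16 && feature_sort < (1 << 26));
    ((zoom as u64) << 59)          // bits 63..59: zoom (5 bits)
        | ((tile_id as u64) << 30) // bits 58..30: tile_id (29 bits)
        | ((layer as u64) << 26)   // bits 29..26: layer (4 bits)
        | (feature_sort as u64)    // bits 25..0:  feature_sort (26 bits)
}

fn main() {
    // Same tile: a higher layer sorts later, whatever the feature key...
    let a = pack_sort_key(10, 42, 0, 999);
    let b = pack_sort_key(10, 42, 1, 0);
    // ...and any later tile at the same zoom sorts after both.
    let c = pack_sort_key(10, 43, 0, 0);
    assert!(a < b && b < c);
}
```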
Phase 3: Tile Encode
Input: Sorted stream of entries grouped by tile
Output: Encoded MVT blobs, gzip compressed
- Group — consecutive entries with the same tile ID are grouped
- Polygon merge — adjacent features with identical properties merged
- Feature deduplication — ahash fingerprinting detects duplicates
- Height quantization — per-zoom level bucketing for building heights
- MVT encoding — group by layer, deduplicated key/value string tables, protobuf via prost
- Gzip compression — flate2 compresses each tile
- Batched parallelism — 512 tiles per rayon batch
Coordinate space: TILE_EXTENT = 4096 (standard MVT resolution).
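The fingerprint-based feature deduplication can be sketched as follows, substituting std's default hasher for ahash so the example is dependency-free:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Fingerprint a feature's geometry bytes and properties.
/// geolith uses ahash; DefaultHasher stands in here.
fn fingerprint(geometry: &[u8], properties: &[(&str, &str)]) -> u64 {
    let mut h = DefaultHasher::new();
    geometry.hash(&mut h);
    properties.hash(&mut h);
    h.finish()
}

fn main() {
    let mut seen = HashSet::new();
    // Hypothetical features: (encoded geometry, properties).
    let features = [
        (b"geomA".as_slice(), vec![("class", "building")]),
        (b"geomA".as_slice(), vec![("class", "building")]), // duplicate
        (b"geomB".as_slice(), vec![("class", "road")]),
    ];
    // Keep a feature only if its fingerprint has not been seen before.
    let kept: Vec<_> = features
        .iter()
        .filter(|(g, p)| seen.insert(fingerprint(g, p)))
        .collect();
    assert_eq!(kept.len(), 2);
}
```

Hashing a fixed-size fingerprint instead of comparing full geometries keeps the per-tile duplicate check O(1) per feature.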
Phase 4: Output Write
Input: Deduplicated tile blobs
Output: A single .pmtiles or .mbtiles file
PMTiles v3
- Hilbert curve ordering — adjacent map tiles are adjacent in the file for efficient HTTP range requests
- Content deduplication — ocean/empty tiles share storage (30–50% size reduction)
- Two-level directory — root + leaf entries, any tile found in ≤2 range requests
- PMTiles tile ID:
sum(4^z for z in 0..tile.z) + hilbert(tile.x, tile.y, tile.z)
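The tile-ID formula can be implemented with the classic xy-to-d Hilbert index plus a cumulative offset for all lower zooms. A minimal sketch, not geolith's actual code:

```rust
/// Hilbert d-index of (x, y) on a 2^z × 2^z grid
/// (the standard iterative xy-to-d algorithm).
fn hilbert(z: u32, mut x: u64, mut y: u64) -> u64 {
    let n = 1u64 << z;
    let mut d = 0u64;
    let mut s = n >> 1;
    while s > 0 {
        let rx = u64::from(x & s > 0);
        let ry = u64::from(y & s > 0);
        d += s * s * ((3 * rx) ^ ry);
        // Rotate the quadrant so the sub-curve is oriented correctly.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s >>= 1;
    }
    d
}

/// PMTiles v3 tile ID: all tiles of lower zooms come first
/// (sum of 4^i for i < z), then Hilbert order within zoom z.
fn pmtiles_tile_id(z: u32, x: u64, y: u64) -> u64 {
    let base: u64 = (0..z).map(|i| 1u64 << (2 * i)).sum(); // 4^i = 2^(2i)
    base + hilbert(z, x, y)
}

fn main() {
    assert_eq!(pmtiles_tile_id(0, 0, 0), 0); // the single z0 tile
    assert_eq!(pmtiles_tile_id(1, 0, 0), 1); // first z1 tile on the curve
    assert_eq!(pmtiles_tile_id(1, 1, 0), 4); // last z1 tile on the curve
    assert_eq!(pmtiles_tile_id(2, 0, 0), 5); // z2 starts after 1 + 4 tiles
}
```

IDs 0 through 4 cover zoom 0 and the four zoom-1 tiles in Hilbert order, matching the PMTiles v3 addressing scheme.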
MBTiles
SQLite-based tile storage with WAL mode for concurrent reads during write. Same content deduplication as PMTiles.
The output conforms to the PMTiles v3 specification and is compatible with any PMTiles reader.