Documentation

geolith processes data through a four-phase pipeline. Each phase is designed for streaming, so memory usage stays bounded regardless of input size.

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Phase 1         │    │  Phase 2         │    │  Phase 3         │    │  Phase 4         │
│  Read + Process  │───▶│  External Sort   │───▶│  Tile Encode     │───▶│  Output Write    │
│                  │    │                  │    │                  │    │                  │
│  6 sources ──▶   │    │  Sort key pack   │    │  MVT protobuf    │    │  PMTiles v3 /    │
│  GERS conflate   │    │  LZ4 chunks ──▶  │    │  encoding ──▶    │    │  MBTiles ──▶     │
│  Project ──▶     │    │  K-way merge     │    │  Deduplication   │    │  Tile dedup      │
│  Clip + Simplify │    │                  │    │                  │    │                  │
└──────────────────┘    └──────────────────┘    └──────────────────┘    └──────────────────┘

Phase 1: Read + Process

Input: 6 data sources (Overture GeoParquet, OSM PBF, Natural Earth SQLite, land/water shapefiles, India boundary GeoJSON)
Output: Stream of SortEntry (sort key + TileFeature) pairs

Source Reading

Reader            Format      Key Details
Overture Maps     GeoParquet  Column projection, bbox predicate pushdown, hive-partitioned
OpenStreetMap     PBF         2-pass: node coordinate cache (mmap) + feature extraction. Auto osmium extract for --bbox
Natural Earth     SQLite      Spatialite blob decoding for land/ocean/admin boundaries
Land Polygons     Shapefile   EPSG:3857 → WGS84 reprojection
Water Polygons    Shapefile   EPSG:3857 → WGS84 reprojection
India Boundary    GeoJSON     India border override for correct rendering

GERS Conflation

When GERS bridge files are present (auto-discovered or via --bridge-dir), geolith loads a BridgeIndex mapping OSM element IDs → GERS UUIDs. During OSM reading, features with matching GERS IDs are skipped — the Overture version is preferred. This eliminates duplicate buildings, roads, and divisions.

For themes without bridge files (places, POIs, landcover, water), an R-tree/H3 spatial deduplication fallback is used.
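The bridge lookup itself is a simple keep-or-skip decision per OSM element. The sketch below shows the pattern under illustrative names: BridgeIndex is named in the text, but the field and method here are stand-ins, not geolith's actual API.

```rust
use std::collections::HashMap;

/// Sketch of the bridge index: OSM element ID -> GERS UUID.
/// Field and method names are illustrative stand-ins.
struct BridgeIndex {
    osm_to_gers: HashMap<i64, String>,
}

impl BridgeIndex {
    /// An OSM feature is skipped when a GERS-bridged Overture
    /// counterpart exists; the Overture version is preferred.
    fn should_skip_osm(&self, osm_id: i64) -> bool {
        self.osm_to_gers.contains_key(&osm_id)
    }
}

fn main() {
    let mut osm_to_gers = HashMap::new();
    osm_to_gers.insert(123_456, "gers-uuid-placeholder".to_string());
    let index = BridgeIndex { osm_to_gers };
    assert!(index.should_skip_osm(123_456)); // bridged: keep the Overture copy
    assert!(!index.should_skip_osm(999));    // no bridge entry: keep the OSM copy
}
```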

Per-Feature Processing (rayon parallel)

  1. Layer dispatch — route InputFeature to one of 11 layer processors
  2. Web Mercator projection — WGS84 → normalized world coordinates in [0, 1]
  3. Tile assignment — world_to_tile() at each zoom level (0 to --max-zoom)
  4. Stripe clipping — two-pass Sutherland-Hodgman (X→Y) partitions geometry to tile boundaries
  5. Simplification — Douglas-Peucker with zoom-dependent tolerance
  6. MVT geometry encoding — coordinates snapped to 4096×4096 grid, delta + zigzag encoded
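Steps 2 and 3 can be sketched with the standard Web Mercator formulas. world_to_tile() is named in the pipeline description; the body below is an illustrative stand-in, not geolith's implementation.

```rust
/// Project WGS84 lon/lat into the [0, 1] x [0, 1] Web Mercator world square.
fn wgs84_to_world(lon: f64, lat: f64) -> (f64, f64) {
    let x = (lon + 180.0) / 360.0;
    let lat_rad = lat.to_radians();
    let y = (1.0 - (lat_rad.tan() + 1.0 / lat_rad.cos()).ln()
        / std::f64::consts::PI) / 2.0;
    (x, y)
}

/// Assign a world coordinate to a tile at the given zoom level.
fn world_to_tile(wx: f64, wy: f64, zoom: u8) -> (u32, u32) {
    let n = 1u32 << zoom;
    let nf = n as f64;
    // Clamp so points exactly on the right/bottom world edge stay in range.
    let tx = ((wx * nf) as u32).min(n - 1);
    let ty = ((wy * nf) as u32).min(n - 1);
    (tx, ty)
}

fn main() {
    // Null island projects to the exact center of the world square.
    let (wx, wy) = wgs84_to_world(0.0, 0.0);
    assert!((wx - 0.5).abs() < 1e-12 && (wy - 0.5).abs() < 1e-12);
    assert_eq!(world_to_tile(wx, wy, 1), (1, 1));
}
```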

InputFeature.geometry uses Arc<geo::Geometry<f64>> to eliminate deep clones across zoom levels (critical for 400M+ buildings).
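The point of the Arc is that handing the same geometry to every zoom level copies a pointer, not the coordinate data. A minimal sketch of that sharing pattern, using a stand-in type in place of geo::Geometry:

```rust
use std::sync::Arc;

// Stand-in for geo::Geometry<f64>: a ring of coordinates.
struct Geometry {
    ring: Vec<(f64, f64)>,
}

fn main() {
    let geom = Arc::new(Geometry {
        ring: vec![(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)],
    });
    // One handle per zoom level (0..=15), all sharing a single allocation;
    // Arc::clone bumps a refcount instead of deep-copying the ring.
    let per_zoom: Vec<Arc<Geometry>> = (0..=15).map(|_| Arc::clone(&geom)).collect();
    assert_eq!(Arc::strong_count(&geom), 17); // original + 16 zoom handles
    assert_eq!(per_zoom[7].ring.len(), 3);
}
```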

Phase 2: External Sort

Input: Stream of SortEntry pairs
Output: Sorted stream of entries ordered by (tile_id, layer, sort_key)

Planet-scale datasets produce billions of tile entries that don't fit in memory. geolith uses an external merge sort:

  1. Accumulate — thread-local SortBuffer collects entries until threshold (auto-sized or --chunk-size-mb)
  2. Flush — sort by packed u64 key, serialize with postcard, compress with LZ4 to chunk files
  3. Merge — KWayMerge uses a BinaryHeap min-heap to stream-merge all chunks
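The merge step can be sketched with the standard heap-based k-way merge. Real chunks are postcard-serialized, LZ4-compressed files; plain in-memory iterators of packed u64 keys stand in here, and this is not geolith's actual KWayMerge code.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stream-merge several individually sorted chunks with a min-heap.
fn k_way_merge(chunks: Vec<Vec<u64>>) -> Vec<u64> {
    let mut iters: Vec<_> = chunks.into_iter().map(|c| c.into_iter()).collect();
    // Heap holds (key, chunk index); Reverse turns BinaryHeap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(k) = it.next() {
            heap.push(Reverse((k, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((k, i))) = heap.pop() {
        out.push(k);
        // Refill from the chunk that just yielded the smallest key.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

fn main() {
    let merged = k_way_merge(vec![vec![1, 4, 7], vec![2, 3, 9], vec![5]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 7, 9]);
}
```

Because each chunk is consumed front-to-back and the heap only ever holds one entry per chunk, memory stays proportional to the number of chunks, not the total entry count.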

Sort key layout — single u64 enables comparison without deserializing:

MSB                                                              LSB
[5 bits zoom][29 bits tile_id][4 bits layer][26 bits feature_sort]

This groups all features for a tile together, then by layer within a tile, then by feature sort key within a layer.
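The layout above packs directly into shifts and ORs. A sketch of the packing, with field widths taken from the diagram (the function name and argument types are assumptions):

```rust
/// Pack the u64 sort key:
/// [5 bits zoom][29 bits tile_id][4 bits layer][26 bits feature_sort]
fn pack_sort_key(zoom: u8, tile_id: u32, layer: u8, feature_sort: u32) -> u64 {
    debug_assert!(zoom < 32 && tile_id < (1u32 << 29));
    debug_assert!(layer < 16 && feature_sort < (1u32 << 26));
    ((zoom as u64) << 59)
        | ((tile_id as u64) << 30)
        | ((layer as u64) << 26)
        | (feature_sort as u64)
}

fn main() {
    // Zoom dominates: a higher zoom sorts after a lower one regardless
    // of the remaining fields, so plain u64 comparison yields the
    // (tile, layer, feature) grouping the text describes.
    let a = pack_sort_key(4, 1_000, 3, 42);
    let b = pack_sort_key(5, 0, 0, 0);
    assert!(a < b);
    // Same zoom and tile: layer dominates feature_sort.
    let c = pack_sort_key(4, 1_000, 2, 9_999);
    assert!(c < a);
}
```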

Phase 3: Tile Encode

Input: Sorted stream of entries grouped by tile
Output: Encoded MVT blobs, gzip compressed

  1. Group — consecutive entries with the same tile ID are grouped
  2. Polygon merge — adjacent features with identical properties merged
  3. Feature deduplication — ahash fingerprinting detects duplicates
  4. Height quantization — per-zoom level bucketing for building heights
  5. MVT encoding — group by layer, deduplicated key/value string tables, protobuf via prost
  6. Gzip compression — flate2 compresses each tile
  7. Batched parallelism — 512 tiles per rayon batch

Coordinate space: TILE_EXTENT = 4096 (standard MVT resolution).
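The fingerprint-based deduplication in step 3 amounts to hashing each feature's geometry and properties and dropping repeats within a tile. geolith uses ahash; this sketch substitutes std's DefaultHasher, and the fingerprint inputs shown are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Fingerprint a feature from its encoded geometry and its properties.
fn fingerprint(geometry: &[u32], properties: &[(String, String)]) -> u64 {
    let mut h = DefaultHasher::new();
    geometry.hash(&mut h);
    properties.hash(&mut h);
    h.finish()
}

fn main() {
    let props = vec![("class".to_string(), "building".to_string())];
    let mut seen = HashSet::new();
    // HashSet::insert returns false for an already-seen fingerprint.
    assert!(seen.insert(fingerprint(&[9, 18, 2, 0], &props)));  // first copy kept
    assert!(!seen.insert(fingerprint(&[9, 18, 2, 0], &props))); // duplicate dropped
    assert!(seen.insert(fingerprint(&[9, 18, 4, 0], &props)));  // distinct geometry kept
}
```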

Phase 4: Output Write

Input: Deduplicated tile blobs
Output: A single .pmtiles or .mbtiles file

PMTiles v3

  • Hilbert curve ordering — adjacent map tiles are adjacent in the file for efficient HTTP range requests
  • Content deduplication — ocean/empty tiles share storage (30–50% size reduction)
  • Two-level directory — root + leaf entries, any tile found in ≤2 range requests
  • PMTiles tile ID: sum(4^z for z in 0..tile.z) + hilbert(tile.x, tile.y, tile.z)
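The tile ID formula above has two parts: a per-zoom base offset, for which sum(4^z for z in 0..tz) has the closed form (4^tz - 1) / 3, plus a Hilbert index within the level. The sketch below uses the textbook xy-to-d Hilbert mapping; the exact curve orientation the PMTiles spec mandates may differ from this illustrative version.

```rust
/// Textbook Hilbert curve index for (x, y) on a 2^z x 2^z grid.
fn hilbert(z: u8, mut x: u32, mut y: u32) -> u64 {
    let n = 1u32 << z;
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = u32::from((x & s) != 0);
        let ry = u32::from((y & s) != 0);
        d += (s as u64) * (s as u64) * u64::from((3 * rx) ^ ry);
        // Rotate the quadrant so each sub-curve is oriented consistently.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Base offset for zoom z plus the Hilbert index within that level.
fn tile_id(z: u8, x: u32, y: u32) -> u64 {
    let base = ((1u64 << (2 * z)) - 1) / 3; // (4^z - 1) / 3
    base + hilbert(z, x, y)
}

fn main() {
    assert_eq!(tile_id(0, 0, 0), 0); // the single z0 tile
    assert_eq!(tile_id(1, 0, 0), 1); // z1 tiles occupy IDs 1..=4
    assert_eq!(tile_id(1, 1, 0), 4);
}
```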

MBTiles

SQLite-based tile storage with WAL mode for concurrent reads during write. Same content deduplication as PMTiles.

The PMTiles output conforms to the PMTiles v3 specification and is compatible with any PMTiles reader.