Skip to main content

Fast Hashing – hash-nch

Scope: Fonts & Images (WASM / Embedded)

feature idstatusdescriptionPRs
hash-nchdraftFast non‑cryptographic hashing for engines that run in browsers (WASM) and embedded systems.#415

This document proposes a lightweight, stable, and portable hashing strategy for resource identity (images, fonts, and other binary blobs) in Grida engines that target wasm32-unknown-emscripten and embedded environments.


TL;DR / Decision

  • Winner: SeaHash ([crate: seahash])
  • Class: Non‑cryptographic hash (NCH)
  • Seed: fixed (implicit, no per‑run randomness)
  • Canonical output:
    • Internal: u64
    • External (text): big‑endian lowercase hex (16 chars)
    • Optional: base64url (no padding) for URL/file‑safe compact form
  • Why: Pure Rust, tiny footprint, very fast, fully portable, deterministic across platforms and builds, and ideal for WASM (no special intrinsics).
  • Interop: Ecosystem/db‑native functions are non‑goals for now; if needed later, we can add an opt‑in xxhash backend behind a feature flag without changing the public API.

Goals

  • Consistency: identical results across builds, architectures, and engine versions for the same bytes.
  • Performance: near memory‑bandwidth throughput for large files (e.g., 16 MB PNGs in a few ms).
  • Portability: runs the same in wasm32 (emscripten) and native; no reliance on CPU intrinsics.
  • Simplicity: small dependency surface; straightforward API (one‑shot and streaming).
  • Determinism: fixed algorithm/seed; no randomized hashers.

Non‑Goals

  • Ecosystem / DB‑native function compatibility (e.g., xxh64() inside SQL engines).
  • Cryptographic integrity (tamper resistance). Use cryptographic hashes when required.
  • Serving as GUID / global identifiers (not suitable as service-wide database IDs).

Terminology

  • NCH (Non‑Cryptographic Hash): fast hash optimized for speed & distribution, not for security (e.g., SeaHash, xxHash, Murmur).
  • Checksum: simple error‑detection codes (e.g., CRC32) — useful for transport/storage integrity, not general identity.
  • Cryptographic hash: collision‑resistant (e.g., SHA‑256, BLAKE3) — slower, used for security/integrity.

Rationale: Why SeaHash?

  1. Pure Rust & Portable: No platform intrinsics; compiles cleanly to wasm32-unknown-emscripten.
  2. Speed: Among the fastest general NCH functions. On large buffers, often on par with or slightly ahead of xxHash64. In practice, hashing time is near memory‑bandwidth‑limited.
  3. Deterministic & Stable: Outputs do not vary by endianness or build options. Perfect for persistent IDs.
  4. Footprint: Small crate, minimal code size growth in WASM.
  5. Fit to our needs: We control both producer and consumer of the hash. Cross‑tool recomputation (e.g., computing the same hash inside SQL) is not required at this stage.

Note: xxHash (especially XXH3) is also excellent and widely adopted. Our current requirements favor portability, size, and control over cross‑ecosystem parity. We can still support xxHash in the future behind a feature flag without breaking the API (see Extensibility).


API Proposal

One‑shot

/// Compute a SeaHash of the entire byte slice.
pub fn hash_bytes(bytes: &[u8]) -> u64;

Incremental (streaming)

pub struct Hasher { /* seahash state */ }
impl Hasher {
pub fn new() -> Self;
pub fn update(&mut self, chunk: &[u8]);
pub fn finish(&self) -> u64; // non‑consuming
pub fn finalize(self) -> u64; // consuming
}

Encoding helpers (canonical outputs)

/// Big‑endian lowercase hex (16 chars for u64)
pub fn to_hex_be(h: u64) -> String;

/// URL‑safe Base64 without padding (shortest ASCII form)
pub fn to_b64url_nopad(h: u64) -> String;

Conventions

  • Store/compare u64 internally.
  • Render to big‑endian bytes before converting to text (hex/base64url).
  • Keep the algorithm name with serialized values when persisted externally: e.g., seahash:3f1a…e6ab.

FormatExample (64‑bit)LengthProsConsUse
Raw bytes [u8; 8]\x8a\x1f…\x3c8 BSmallest; fastest compareOpaque; awkward in JSONIn‑memory keys, binary caches
Hex (BE, lower)3f1a9b0c7d42e6ab16 charsUniversal, readable2× size of bytesLogs, JSON, CLI, filenames
Base64URL (no pad)PxqbDH1C5qs11 charsShort ASCII, URL‑safeLess ubiquitous than hexShort IDs, URLs, slugs

WASM / Embedded Notes

  • wasm32 means 32‑bit addressing; 64‑bit integer ops are available and compile to native WASM i64 ops. Both SeaHash and xxHash64 run efficiently.
  • For large buffers (e.g., 16 MB PNG), hashing time is typically ~1–3 ms on desktop‑class hardware; transfer/copy often dominates.
  • No CPU‑specific intrinsics required; works consistently in browsers and embedded runtimes.

Candidate Comparison (what we considered)

AlgorithmCrateClassPerf (native)wasm32 noteAdoptionWhy pick / not pick
SeaHashseahashNCH★★★★☆ (very fast)✅ Portable; greatLow‑mid (Rust‑centric)Chosen: pure Rust, tiny, fast, deterministic
xxHash (XXH64/XXH3)xxhash-rustNCH★★★★★ (XXH3 top‑tier)✅ Portable; greatHigh (cross‑ecosystem)Not chosen now: interop not needed; larger footprint; optional future backend
rustc‑hash (FxHash)rustc-hashNCH★★★★☆ (small keys)Mid (Rust compilers/tools)Great for hashmaps; not ideal for large blobs
FoldHashfoldhashNCH★★★★☆ (maps)Mid (hashbrown)Map workloads; not stable for persisted IDs
Murmur3murmur3NCH★★★☆☆✅ (pure Rust impls)Mid‑high (legacy)Fine, but older; alignment gotchas in C/asm.js histories
CityHash/FarmHashfasthashNCH★★★★☆✅ (baseline)Mid‑high (Google/Chromium)Good, but not Rust‑first; platform‑tuned variants
CRC32crc32fastChecksum★★☆☆☆ (SW)⚠️ slower in wasmVery highInterop/integrity; 32‑bit only; not for dedup
Cryptographic (SHA‑256/BLAKE3)sha2 / blake3Crypto★★–★★★☆Very highUse when tamper resistance is needed; heavier

Stars are relative to NCH peers for our workloads (large binary blobs). Exact numbers are environment‑dependent.


What about “ecosystem‑friendly” xxHash?

  • Some DBs/engines expose xxh64() in SQL (e.g., Spark/Databricks; Postgres via extension). That matters only if we want to recompute the same hash inside those systems.
  • Our current pipeline computes hashes inside the engine and stores values; external recomputation is not a requirement.
  • If that changes, we can expose an adapter:
    • HashAlgo::SeaHash (default)
    • HashAlgo::XxHash64 / HashAlgo::Xxh3_64 (feature‑gated)
    • Same API, same output encodings.

Usage Examples

// One‑shot
let h: u64 = hash_bytes(&bytes);
let hex = to_hex_be(h);

// Streaming
let mut st = Hasher::new();
st.update(&bytes[0..8192]);
st.update(&bytes[8192..]);
let h2 = st.finalize();

Expected Collisions & Safety

  • NCHs are not collision‑proof. For our dedup/resource identity use, u64 is sufficient (astronomically low collision odds at our scale).
  • Do not use NCHs for untrusted security contexts or as password digests. Use cryptographic hashes instead.
  • For hash‑table DoS resistance with untrusted keys, prefer randomized map hashers (e.g., hashbrown default) rather than a fixed NCH.

Intended Scope

The hash-nch output is intended as a scoped identifier for resources (such as images, fonts, or documents) inside the engine. It provides fast, deterministic identity useful for deduplication, cache keys, and change detection within a project.

It is not meant to serve as a global, service-wide database primary key. For service-wide uniqueness or external identifiers, use UUIDs, ULIDs, or database-native sequences. For adversarial or tamper-resistant contexts, prefer cryptographic hashes (e.g., BLAKE3, SHA-256).


Non‑fast / Non‑NCH Options we may adopt later

ClassOptionWhy we’d use itStatus
Checksumcrc32fastInterop with PNG/gzip/protocols; quick error detectionNot planned
Cryptoblake3High‑speed cryptographic integrity; parallelizableNot planned
Cryptosha2 (SHA‑256)Industry compatibility, legal/compliance contextsNot planned

These aren’t needed for the rendering engine’s internal identity/dedup today, but the API leaves room to add them as separate, explicit functions.


Testing & Validation

  • Determinism tests across targets: native vs wasm32-unknown-emscripten (same bytes → same u64).
  • Incremental vs one‑shot equivalence.
  • Encoding round‑trips: u64hex/base64url.
  • Throughput sanity checks on representative files (PNG, TTF/OTF).

FAQ (from the WG discussion)

  • Does wasm32 make a difference? Not materially. Both SeaHash and xxHash compile to efficient i64 ops; performance is largely memory‑bound.
  • Seed vs Salt? NCHs use a seed (we keep it fixed). No per‑run randomness; outputs are stable across versions.
  • Why not xxHash if it’s popular? Ecosystem parity matters only if we compute the same hash in external systems. We don’t. SeaHash gives us speed + portability with a smaller footprint.

Extensibility

  • Keep a stable HashAlgo enum and adapter layer so we can add xxhash or cryptographic options without breaking callers.
  • Persisted values should be self‑describing when crossing system boundaries (e.g., seahash:<hex>).

References