webgpu-dawn: Haskell bindings to WebGPU Dawn for GPU computing and graphics

[ gpu, graphics, library, mit, program ]

This package provides Haskell bindings to Google's Dawn WebGPU implementation, enabling GPU computing and graphics programming from Haskell. It wraps the gpu.cpp library, which provides a high-level C++ interface to Dawn.



Flags

Manual Flags

  • glfw: Enable GLFW support for windowed graphics applications (default: Enabled)

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Versions 0.1.0.0, 0.1.1.0
Dependencies aeson (>=2.0 && <2.3), base (>=4.14 && <5), base64-bytestring (>=1.0 && <1.3), binary (>=0.8 && <0.9), bytestring (>=0.10 && <0.13), clock (>=0.8 && <0.9), containers (>=0.6 && <0.8), filepath (>=1.4 && <1.6), mtl (>=2.2 && <2.4), stm (>=2.5 && <2.6), text (>=1.2 && <2.1), transformers (>=0.5 && <0.7), unordered-containers (>=0.2.14 && <0.3), vector (>=0.12 && <0.14), webgpu-dawn
License MIT
Author Junji Hashimoto
Maintainer junji.hashimoto@gmail.com
Category Graphics, GPU
Home page https://github.com/junjihashimoto/webgpu-dawn
Source repo head: git clone https://github.com/junjihashimoto/webgpu-dawn
Uploaded by junjihashimoto at 2025-12-30T08:54:13Z
Executables bench-async-matmul, bench-optimized-matmul, bench-subgroup-matmul, bench-linear, async-pipeline-demo, chrome-tracing-demo, high-level-api, struct-field-offset, particle-system, kernel-fusion, layout-demo, struct-generics-dsl, vector-add-dsl, matmul-subgroup-dsl, shared-memory-reduction
Downloads 3 total (3 in the last 30 days)
Rating (no votes yet)
Status Docs uploaded by user
Build status unknown [no reports yet]

Readme for webgpu-dawn-0.1.1.0


webgpu-dawn

High-level, type-safe Haskell bindings to Google's Dawn WebGPU implementation.

This library enables portable GPU computing through a production-ready DSL designed for high-throughput inference workloads (e.g., LLMs), with a performance target of 300 tokens per second (TPS).


⚡ Core Design Principles

To achieve high performance and type safety, this library adheres to the following strict patterns:

  1. Type-Safe Monadic DSL: No raw strings. We use ShaderM for composability and type safety.
  2. Natural Math & HOAS: Standard operators (+, *) and Higher-Order Abstract Syntax (HOAS) for loops (loop ... $ \i -> ...).
  3. Profile-Driven: Performance tuning is based on Roofline Analysis.
  4. Async Execution: Prefer AsyncPipeline to hide CPU latency and maximize GPU occupancy.
  5. Hardware Acceleration: Mandatory use of Subgroup Operations and F16 precision for heavy compute (MatMul/Reduction).
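
The HOAS idea in principle 2 can be illustrated with a minimal standalone sketch. This is not the library's actual `ShaderM`/`Exp` machinery, just the general technique: a loop body is an ordinary Haskell lambda whose argument stands for the loop index, so the host language's binders replace string-typed variable names.

```haskell
-- Minimal HOAS sketch (illustrative only; the real library's types differ).
data Exp = Var String | Lit Float | Add Exp Exp | Mul Exp Exp

data Stmt = Loop Int Int Int (Exp -> Stmt)  -- start, end, step, HOAS body
          | Assign String Exp

renderE :: Exp -> String
renderE (Var v)   = v
renderE (Lit x)   = show x
renderE (Add a b) = "(" ++ renderE a ++ " + " ++ renderE b ++ ")"
renderE (Mul a b) = "(" ++ renderE a ++ " * " ++ renderE b ++ ")"

-- Render to WGSL-like text: the lambda is applied to a fresh variable,
-- so user code never manipulates the loop index as a string.
render :: Stmt -> String
render (Assign n e) = n ++ " = " ++ renderE e ++ ";"
render (Loop s e step body) =
  "for (var i = " ++ show s ++ "; i < " ++ show e
    ++ "; i += " ++ show step ++ ") { "
    ++ render (body (Var "i")) ++ " }"

main :: IO ()
main = putStrLn (render (Loop 0 4 1 (\i -> Assign "acc" (Add (Var "acc") i))))
```

Because the body receives the index as a typed expression, an ill-typed use of the index is rejected at Haskell compile time rather than at shader compile time.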

🏎️ Performance & Profiling

We utilize a Profile-Driven Development (PDD) workflow to maximize throughput.

1. Standard Benchmarks & Roofline Analysis

Run the optimized benchmark to determine TFLOPS and check the Roofline classification (Compute vs Memory Bound).

# Run 2D Block-Tiling MatMul Benchmark (FP32)
cabal run bench-optimized-matmul -- --size 4096 --iters 50

Output Example:

[Compute]  137.4 GFLOPs
[Memory]   201.3 MB
[Status]   COMPUTE BOUND (limited by GPU FLOPs)
[Hint]     Use F16 and Subgroup Operations to break the roofline.
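
The classification above follows from simple roofline arithmetic: compare the kernel's arithmetic intensity (FLOPs per byte of global memory traffic) against the machine balance (peak FLOP/s divided by peak bandwidth). The sketch below uses made-up peak numbers and a naive matmul traffic estimate; the library's WGSL.Analyze module is the authoritative implementation.

```haskell
-- Roofline classification sketch (illustrative peak numbers,
-- not the WGSL.Analyze implementation).
classify
  :: Double  -- kernel FLOPs
  -> Double  -- bytes moved to/from global memory
  -> Double  -- peak GPU FLOP/s
  -> Double  -- peak memory bandwidth (bytes/s)
  -> String
classify flops bytes peakFlops peakBw
  | intensity > balance = "COMPUTE BOUND (limited by GPU FLOPs)"
  | otherwise           = "MEMORY BOUND (limited by bandwidth)"
  where
    intensity = flops / bytes     -- FLOPs per byte the kernel performs
    balance   = peakFlops / peakBw -- FLOPs per byte the machine can sustain

main :: IO ()
main = do
  -- 2*N^3 FLOPs for an N x N matmul; three N x N f32 matrices of
  -- traffic as a lower bound on bytes moved.
  let n     = 4096 :: Double
      flops = 2 * n ** 3
      bytes = 3 * n * n * 4
  putStrLn (classify flops bytes 10e12 400e9)
```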

2. Visual Profiling (Chrome Tracing)

Generate a trace file to visualize CPU/GPU overlap and kernel duration.

cabal run bench-optimized-matmul -- --size 4096 --trace

  • Load: Open chrome://tracing or ui.perfetto.dev
  • Analyze: Import trace.json to identify gaps between kernel executions (CPU overhead).

3. Debugging

Use the GPU printf-style debug buffer to inspect values inside kernels.

-- In DSL:
debugPrintF "intermediate_val" val


🚀 Quick Start

1. High-Level API (Data Parallelism)

Zero boilerplate. Ideal for simple map/reduce tasks.

import WGSL.API
import qualified Data.Vector.Storable as V

main :: IO ()
main = withContext $ \ctx -> do
  input  <- toGPU ctx (V.fromList [1..100] :: V.Vector Float)
  result <- gpuMap (\x -> x * 2.0 + 1.0) input
  out    <- fromGPU' result
  print out

2. Core DSL (Explicit Control)

Required for tuning Shared Memory, Subgroups, and F16.

import WGSL.DSL

shader :: ShaderM ()
shader = do
  input  <- declareInputBuffer "in" (TArray 1024 TF16)
  output <- declareOutputBuffer "out" (TArray 1024 TF16)
   
  -- HOAS Loop: Use lambda argument 'i', NOT string "i"
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- f16 literals for 2x throughput
    let res = val * litF16 2.0 + litF16 1.0
    writeBuffer output i res


📚 DSL Syntax Cheatsheet

Types & Literals

| Haskell Type | WGSL Type | Literal Constructor | Note |
|---|---|---|---|
| Exp F32 | f32 | litF32 1.0 or 1.0 | Standard float |
| Exp F16 | f16 | litF16 1.0 | Half precision (Fast!) |
| Exp I32 | i32 | litI32 1 or 1 | Signed int |
| Exp U32 | u32 | litU32 1 | Unsigned int |
| Exp Bool_ | bool | litBool True | Boolean |

Casting Helpers: i32(e), u32(e), f32(e), f16(e)
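
One common way such casting helpers stay type-safe is a phantom type parameter on the expression type; the cast changes the phantom tag while emitting the corresponding WGSL call. A minimal illustrative sketch (the library's actual `Exp` may differ):

```haskell
-- Phantom-typed cast sketch (illustrative; not the library's Exp type).
newtype Expr t = Expr { code :: String }

-- Empty tag types standing for the WGSL scalar types.
data F32; data F16; data I32; data U32

-- Each cast re-tags the expression and wraps it in the WGSL cast call.
i32 :: Expr a -> Expr I32
i32 (Expr e) = Expr ("i32(" ++ e ++ ")")

f16 :: Expr a -> Expr F16
f16 (Expr e) = Expr ("f16(" ++ e ++ ")")

main :: IO ()
main = putStrLn (code (f16 (i32 (Expr "x"))))
```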

Control Flow (HOAS)

-- For Loop
loop start end step $ \i -> do ...

-- If Statement
if_ (val > 10.0) 
    (do ... {- then block -} ...) 
    (do ... {- else block -} ...)

-- Barrier
barrier  -- workgroupBarrier()


🧩 Kernel Fusion

For maximum performance, fuse multiple operations (Load -> Calc -> Store) into a single kernel to reduce global memory traffic.

import WGSL.Kernel

-- Fuse: Load -> Process -> Store
let pipeline = loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf

-- Execute inside shader
unKernel pipeline i
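
The payoff of fusion can be seen in a plain CPU analogue: composing per-element stages means intermediate values stay in registers (here, ordinary Haskell values) instead of being written to and re-read from an intermediate buffer. This sketch is not the library's `Kernel` type, only the underlying idea:

```haskell
-- CPU-side sketch of kernel fusion (not the WGSL.Kernel API).
relu :: Float -> Float
relu x = max 0 x

-- Unfused: two passes over the data, one intermediate buffer.
unfused :: [Float] -> [Float]
unfused xs = map relu (map (* 2.0) xs)

-- Fused: stages composed per element, a single pass, no intermediate.
fused :: [Float] -> [Float]
fused = map (relu . (* 2.0))

main :: IO ()
main = print (fused [-1, 0.5, 2])  -- same result as unfused
```

On the GPU the saving is global memory traffic: the unfused version materialises the `(* 2.0)` result in a buffer, the fused one does not.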


📚 Architecture & Modules

Execution Model (Latency Hiding)

To maximize GPU occupancy, encoding is separated from submission.

  • WGSL.Async.Pipeline: Use for main loops. Allows CPU to encode Token N+1 while GPU processes Token N.
  • WGSL.Execute: Low-level synchronous execution (primarily for debugging).
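
The overlap described above can be sketched with ordinary `base` concurrency primitives. This is not the WGSL.Async.Pipeline API, just the pattern: a producer thread "encodes" the next item while the consumer "executes" the current one, so encoding latency hides behind execution.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM)

-- Latency-hiding sketch (not the WGSL.Async.Pipeline API): the CPU
-- "encoder" thread stays ahead of the "GPU" consumer, so the pretend
-- 1ms encode cost overlaps with the pretend 2ms execute cost.
runPipeline :: IO [Int]
runPipeline = do
  encoded <- newChan            -- encoded command buffers, in flight
  done    <- newChan            -- completion signals
  _ <- forkIO $ forM_ [1 .. 4] $ \n -> do
         threadDelay 1000       -- encode token n
         writeChan encoded n    -- hand off while token n-1 executes
  forM_ [1 .. 4 :: Int] $ \_ -> do
    n <- readChan encoded
    threadDelay 2000            -- execute token n
    writeChan done n
  replicateM 4 (readChan done)

main :: IO ()
main = runPipeline >>= print
```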

Module Guide

| Feature | Module | Description |
|---|---|---|
| Subgroup Ops | WGSL.DSL | subgroupMatrixLoad, mma, subgroupMatrixStore |
| F16 Math | WGSL.DSL | litF16, vec4<f16> for 2x throughput |
| Structs | WGSL.Struct | Generic derivation for std430 layout compliance |
| Analysis | WGSL.Analyze | Roofline analysis logic |
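
The std430-style layout that WGSL.Struct derives generically boils down to rounding each field's offset up to that field's alignment. A standalone sketch of the offset computation (illustrative field sizes/alignments; the library derives these from Haskell record types):

```haskell
-- std430-style offset sketch (not the WGSL.Struct implementation).
data Field = Field { name :: String, align :: Int, size :: Int }

roundUp :: Int -> Int -> Int
roundUp n a = ((n + a - 1) `div` a) * a

-- Place each field at the next offset satisfying its alignment.
offsets :: [Field] -> [(String, Int)]
offsets = go 0
  where
    go _ [] = []
    go off (f : fs) =
      let o = roundUp off (align f)
      in (name f, o) : go (o + size f) fs

main :: IO ()
main = print (offsets
  [ Field "a" 4 4     -- f32:       align 4,  size 4
  , Field "b" 16 12   -- vec3<f32>: align 16, size 12
  , Field "c" 4 4 ])  -- f32:       align 4,  size 4
```

Note how the vec3 field forces a gap after "a" (offset 4 rounds up to 16); deriving this generically is what keeps Haskell-side reads in sync with GPU-side buffers.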

📦 Installation

Pre-built Dawn binaries are downloaded automatically during installation.

cabal install webgpu-dawn


License

MIT License - see LICENSE file for details.

Acknowledgments

  • Dawn (Google): Core WebGPU runtime.
  • gpu.cpp (Answer.AI): High-level C++ API wrapper inspiration.
  • GLFW: Window management.

Contact

Maintainer: Junji Hashimoto junji.hashimoto@gmail.com