Building a GPU from Scratch

An open-source journey from specification to silicon, streamed live.

Live Emulator Output
Input A[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Input B[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
Output[0.0, 11.0, 22.0, 33.0, 44.0, 55.0, 66.0, 77.0]
✓ Vector addition computed on PumpGPU emulator
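
For reference, the vecadd result above is just element-wise addition; a minimal Python check (the function name is illustrative, not the project's actual test harness) looks like this:

```python
def vecadd_reference(a, b):
    """Element-wise addition; the golden result the emulator output is compared against."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

# The inputs and expected output shown in the emulator log above.
a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
b = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
assert vecadd_reference(a, b) == [0.0, 11.0, 22.0, 33.0, 44.0, 55.0, 66.0, 77.0]
```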

Development Roadmap

From software emulation to hardware implementation

25-Day Sprint
Phase 0: Vision + Definitions
3 days · Complete

Without a clear spec, we'll waste months on rework. The spec is the contract between all components: assembler, emulator, and eventually RTL must agree on every bit.
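
To make "agree on every bit" concrete, a shared encoding table is the kind of artifact SPEC.md pins down. The sketch below is purely hypothetical (placeholder opcodes and field widths, not the real encoding):

```python
# Hypothetical 32-bit instruction layout: [opcode:8][dst:8][src1:8][src2:8].
# Placeholder values only; the real encoding lives in SPEC.md.
OPCODES = {"NOP": 0x00, "ADD": 0x01, "LD": 0x10, "ST": 0x11}

def encode(op, dst=0, src1=0, src2=0):
    """Pack one instruction word; assembler and emulator must share this table exactly."""
    return (OPCODES[op] << 24) | (dst << 16) | (src1 << 8) | src2

def decode(word):
    """Inverse of encode(); used by the emulator and, later, the disassembler."""
    return word >> 24, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF
```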

Deliverables
  • SPEC.md v0 with ISA encoding table
  • Memory model documentation (global, shared, registers)
  • Kernel launch ABI defined
  • Repository CI green
  • Emulator boots and executes NOP
Exit Criteria
  • SPEC.md v0 committed and reviewed
  • Emulator compiles and runs empty kernel
  • CI pipeline passes
  • At least one stream episode completed
Risks & Mitigations
  • Over-engineering ISA → Start with minimal ops, add later
  • SIMD vs SIMT paralysis → Commit to SIMD v0, document SIMT path
  • Scope creep → Timebox to 2 weeks max
Phase 1: Software Golden Model (Emulator)
5 days · In Progress

The emulator is the golden reference. Every future component (RTL, optimized emulator, debugger) must produce identical results. Getting this right is non-negotiable.
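
As one concrete instance of the golden-model contract, a reduce check might look like the sketch below; run_kernel() and its signature are assumptions standing in for the emulator's real entry point:

```python
import numpy as np

def check_reduce_sum(emulator, n=1024):
    """The emulator's reduce kernel must match the Python reference exactly."""
    data = np.arange(n, dtype=np.int32)
    expected = int(data.sum())  # reference result computed in plain Python/NumPy
    # run_kernel() is a hypothetical entry point: load the kernel binary, pass the
    # input buffer via the launch ABI, return the output buffer it wrote.
    result = emulator.run_kernel("reduce", inputs=[data], output_len=1)
    assert int(result[0]) == expected
```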

Deliverables
  • All arithmetic ops implemented (int32, float32)
  • Load/store to global memory
  • Parameter passing via launch ABI
  • Lane ID / Global ID intrinsics
  • Predicated execution
  • Basic barrier (workgroup sync)
  • Atomic add (int32)
  • vecadd kernel passes
  • reduce kernel passes
Exit Criteria
  • vecadd kernel: emulator matches Python reference
  • reduce kernel: emulator matches Python reference
  • All ISA ops have unit tests
  • Memory model documented with examples
  • Performance counter skeleton exists
Risks & Mitigations
  • Memory model ambiguity → Document every edge case as we find it
  • Float precision issues → Use IEEE 754 strictly, document rounding
  • Barrier semantics → Copy CUDA barrier semantics initially
  • Happy path only → Property-based tests for edge cases (see the sketch below)
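
A property-based test for the "happy path only" risk could look like this sketch (using Hypothesis; emulator_add_i32 is a placeholder for the real ALU hook):

```python
from hypothesis import given, strategies as st

int32 = st.integers(min_value=-2**31, max_value=2**31 - 1)

def add_i32_reference(a, b):
    """Python reference for ADD: wrap the sum to signed 32 bits like the ISA does."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s >= 2**31 else s

# Placeholder: wire this to the emulator's actual int32 ADD implementation.
emulator_add_i32 = add_i32_reference

@given(int32, int32)
def test_add_matches_reference(a, b):
    assert emulator_add_i32(a, b) == add_i32_reference(a, b)
```
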
Phase 2: Tooling
4 days · Upcoming

Good tooling accelerates everything. A flaky assembler means hours of "is this a bug or a tooling issue?" debugging, and the disassembler is essential for verifying the RTL later.
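
A round-trip check is the simplest way to keep the two tools honest; assemble() and disassemble() below are hypothetical entry points, not the project's actual API:

```python
def check_roundtrip(source_lines, assemble, disassemble):
    """Disassembler output must reassemble to the identical binary."""
    binary = assemble(source_lines)
    recovered = disassemble(binary)   # text listing recovered from the binary
    assert assemble(recovered) == binary, "round-trip produced a different binary"
```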

Deliverables
  • Assembler handles all ISA ops
  • Assembler has good error messages with line numbers
  • Disassembler round-trips cleanly
  • Emulator tracks instruction count, memory ops
  • matmul (tiled) kernel runs correctly (a Python reference sketch follows this list)
  • At least 3 example kernels documented
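
A pure-Python reference for the tiled matmul deliverable, written to mirror how the kernel would stage tiles (tile size and row-major layout are assumptions):

```python
def matmul_tiled_reference(A, B, n, tile=4):
    """C = A @ B for n x n row-major matrices stored as flat lists."""
    C = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):      # one "shared-memory tile" per iteration
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i * n + j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i * n + k] * B[k * n + j]
                        C[i * n + j] = acc
    return C
```
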
Exit Criteria
  • matmul kernel passes tests
  • Assembler fuzz-tested (no crashes on garbage input)
  • Disassembler output reassembles to identical binary
  • Performance counters report instruction mix
Risks & Mitigations
  • Grammar ambiguities → Use a proper parser (pest, nom), not regex
  • Binary format changes → Version header in the binary format
  • DSL scope creep → Keep the DSL optional; the assembler is primary
Phase 3: Microarchitecture Plan
3 days · Upcoming

RTL without a microarchitecture plan is like coding without design docs: possible, but painful. This phase prevents "architecture astronaut" problems in Phase 4.

Deliverables
  • Execution unit design (how many ALUs, what ops)
  • Lane width decision (4, 8, 16, 32?)
  • Register file design (ports, banking)
  • Scheduler design (in-order v0, scoreboard v1)
  • Memory coalescer rules documented
  • Command processor / queue design
  • DMA engine interface
  • Interface timing diagrams
Exit Criteria
  • ARCHITECTURE.md has complete block diagram
  • All module interfaces documented
  • Resource estimate for target FPGA
  • Scheduling algorithm documented
  • Memory coalescing rules with examples (see the sketch below)
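
One way to state a coalescing rule as an example (the segment size and the "one transaction per aligned segment" rule are placeholders until ARCHITECTURE.md fixes them):

```python
def count_transactions(lane_addresses, seg_bytes=64):
    """Lanes whose byte addresses fall in the same aligned segment share one transaction."""
    return len({addr // seg_bytes for addr in lane_addresses})

# 8 lanes reading consecutive float32 elements -> 1 transaction (fully coalesced).
assert count_transactions([i * 4 for i in range(8)]) == 1
# 8 lanes striding by 64 bytes -> 8 transactions (worst case).
assert count_transactions([i * 64 for i in range(8)]) == 8
```
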
Risks & Mitigations
Over-designingKeep v0 simple, document "v1 ideas" separately
Ignoring FPGA limitsCheck target FPGA resources early
Mismatched interfacesDefine contracts before modules
Phase 4: RTL on FPGA
8 days · Upcoming · High Risk

This is where PumpGPU becomes a real (soft) GPU. All previous phases were building toward this moment.

Deliverables
  • Fetch/decode unit
  • Scalar ALU
  • Vector ALU (SIMD lanes)
  • Register file
  • Load/store unit
  • Scratchpad (shared memory)
  • Command processor
  • DMA engine
  • UART interface for debugging
  • Ethernet or PCIe interface (stretch)
  • "Hello kernel" executes on FPGA
Target FPGA Classes
  • Entry: Arty A7, DE10-Lite - limited resources, good for the core
  • Mid: Nexys Video, DE10-Nano - enough for full v0
  • High: KCU105, VCU118 - PCIe, HBM possible
Exit Criteria
  • vecadd runs on FPGA, matches emulator
  • reduce runs on FPGA, matches emulator
  • RTL passes all emulator test vectors
  • Resource utilization documented
  • Clock frequency achieved documented
Risks & Mitigations
  • Timing closure → Start with a low clock, optimize later
  • RTL-emulator mismatch → Co-simulation from day 1 (see the sketch below)
  • Resource exhaustion → Check utilization after every module
  • Debugging hell → Extensive ILA triggers, UART logging
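
The co-simulation mitigation boils down to comparing memory images after the same kernel runs on both models; the sketch below assumes the dumps are already available as address-indexed arrays (how the RTL dump is obtained, simulation or UART readback, is out of scope here):

```python
def compare_memory_dumps(emulator_mem, rtl_mem, start, end):
    """After the same kernel runs, the RTL result region must match the emulator word-for-word."""
    mismatches = [(addr, emulator_mem[addr], rtl_mem[addr])
                  for addr in range(start, end)
                  if emulator_mem[addr] != rtl_mem[addr]]
    assert not mismatches, f"first mismatch at {mismatches[0]}"
```
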
Phase 5: Optimization & Advanced Features
2 days · Upcoming

v0 will be slow. Phase 5 is about understanding why and fixing the bottlenecks systematically.
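
As an example of the kind of counter Phase 5 adds, a shared-memory bank-conflict model might look like this (bank count and word size are placeholders):

```python
from collections import Counter

def extra_conflict_cycles(lane_addresses, num_banks=16, word_bytes=4):
    """Accesses in one cycle that map to the same bank are serialized."""
    banks = Counter((addr // word_bytes) % num_banks for addr in lane_addresses)
    return max(banks.values()) - 1   # cycles beyond the conflict-free case

# 8 lanes reading consecutive words hit 8 different banks: no conflicts.
assert extra_conflict_cycles([i * 4 for i in range(8)]) == 0
# 8 lanes striding by num_banks words all collide in one bank: 7 extra cycles.
assert extra_conflict_cycles([i * 16 * 4 for i in range(8)]) == 7
```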

Deliverables
  • Bank conflict counter in emulator
  • Coalescing efficiency metric
  • Scheduler improvements (reduce stalls)
  • Profile-guided optimization docs
  • Stable kernel suite (10+ kernels)
  • Performance comparison: emulator vs FPGA vs CPU
Exit Criteria
  • Measurable perf improvement on matmul
  • Profiling tools documented
  • At least 10 kernels in test suite
  • Performance numbers published
Phase 6 (Optional): Graphics Pipeline
Stretch · Post-Hackathon

Graphics is what makes GPUs "GPUs" in public perception. Even a simple triangle demo is hugely impactful for engagement.
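
For orientation, the standard edge-function approach to triangle rasterization fits in a few lines; this is a generic sketch, not necessarily the algorithm PumpGPU will implement:

```python
def rasterize_triangle(v0, v1, v2, width, height):
    """Yield (x, y) pixel coordinates covered by the triangle v0-v1-v2 in 2D screen space."""
    def edge(a, b, p):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)                     # sample at the pixel center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            # Inside if all edge functions agree in sign (handles either winding order).
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                yield x, y
```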

Deliverables
  • Triangle rasterization in emulator
  • Framebuffer memory region
  • Vertex shader subset (transform, project)
  • Fragment shader subset (color, texture sample)
  • HDMI output from FPGA
  • Spinning cube demo
Exit Criteria
  • Draws colored triangles
  • Runs on FPGA with display output
  • Mini-shader executes
Phase 7 (Optional): ASIC Path
Stretch · Post-Hackathon · High Risk

The ultimate goal of "building a GPU" is custom silicon. Even if we don't tape out, understanding the path is valuable.

Deliverables
  • OpenROAD flow documented
  • Synthesis results for core modules
  • Area/power estimates
  • MPW shuttle options researched (Google/eFabless, TinyTapeout)
  • Tapeout readiness checklist
  • Cost/timeline estimate
Exit Criteria
  • Synthesis completes for core
  • Feasibility document published
  • Go/no-go decision documented
Risks & Mitigations
  • Tool complexity → Start with OpenROAD tutorials
  • Cost prohibitive → Document shuttle options, crowdfunding
  • Design rule violations → Use proven PDKs (SKY130, GF180)
Pump.fun Build in Public

Why PumpGPU Will Win

Building a GPU from scratch, live, in 25 days.

Why PumpGPU Deserves to Win

PumpGPU isn't just another software project - it's an audacious attempt to build actual silicon from scratch, completely in public.

Unprecedented Technical Ambition

No one has ever live-streamed building a GPU from specification to silicon. This isn't a tutorial or a clone - it's original hardware design, documented from day zero. We're creating ISA specifications, writing assemblers, building emulators, and designing RTL that will run on real FPGAs.

📺 Ultimate Build in Public

Every commit, every design decision, every bug fix happens live. Viewers see the raw process of hardware engineering - the debugging sessions, the "aha" moments, the architectural pivots. This is transparency at its most extreme. No polished demos, just real engineering.

🎓 Massive Educational Value

GPU architecture is gatekept knowledge - mostly locked inside NVIDIA and AMD. PumpGPU breaks that barrier. Every episode teaches concepts that cost $50K+ in university courses: ISA design, memory coalescing, SIMD execution, RTL synthesis. We're creating the GPU course that doesn't exist.

🌍 Open Source Everything

MIT licensed from day one. Every line of code, every documentation file, every design document is public. The community can fork, learn, contribute, and build upon PumpGPU. We're not just building a GPU - we're building a foundation for open hardware.

🔥 Narrative That Captures Attention

"Solo dev builds GPU from scratch on stream" is a headline that writes itself. It's the kind of underdog story that resonates - combining extreme technical depth with accessible streaming content. Perfect for viral growth and community engagement.

🎯 Clear Milestones, Real Progress

We have a detailed roadmap with concrete deliverables: working emulator (done), assembler, disassembler, RTL modules, FPGA demo. Each stream advances toward a visible goal. Progress is measurable, verifiable, and undeniably real.

Perfect Fit for Pump.fun Hackathon

Build in Public

Every line of code written live on stream

Community Engagement

Real-time feedback shapes design decisions

Live Streaming

5+ streams per week minimum

Transparent Progress

Public GitHub, public roadmap, public demos

Broad User Interest

Hardware, AI, gaming, education communities

Scalable Potential

Educational platform, hardware products, consulting

Founder Discipline

Detailed specs, structured roadmap, consistent delivery

Organic Demand

Technical content that generates genuine interest