Compiler Onboarding

Skunk, Explained As A Pipeline

This booklet is meant to be printed, annotated, and read slowly. It starts from the top-level story of the compiler, then follows one tiny Skunk program through parsing, checking, layouts, IR generation, and native build.

Who It Is For

This booklet is for someone who is new to compiler architecture and to LLVM, and who wants to learn this codebase without being thrown into the deepest file first.

Best Way To Use It

Read one chapter, then open the linked file and inspect the exact function names mentioned there. This guide is a map, not a replacement for reading code.

How To Read This

The most important idea in this whole booklet is that Skunk is a pipeline. Each stage receives a program in one form and hands a more useful form to the next stage. If you always know which stage you are in, the compiler stops feeling like a pile of unrelated files.

The second important idea is that not every program exercises every pass equally. A tiny non-generic program will mostly glide through monomorphization. A generic one will make that pass much more interesting. That is normal.

Read next in code
  • main in src/main.rs
  • load_program in src/source.rs
  • prepare_program in src/monomorphize.rs
  • check in src/type_checker.rs
  • compile_to_executable in src/compiler.rs

The Whole Pipeline

The high-level path through the compiler is short enough to memorize. That is helpful, because it lets you classify almost every file by its role in the bigger machine.

Source Files
  -> One Loaded Program
  -> Parsed AST
  -> Prepared / Monomorphized Program
  -> Type-Checked Program
  -> LLVM IR
  -> Runtime + Clang
  -> Native Executable

Front End

Loading, grammar, AST construction, and semantic checks live mostly here.

Middle Preparation

Monomorphization makes generic programs more concrete before later passes.

Back End

Layouts, lowering, runtime linkage, and native build happen here.

Read next in code
  • main in src/main.rs
  • compile_to_llvm_ir and compile_to_executable in src/compiler.rs
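One way to keep the stage order in mind is to picture each stage as a function that consumes the previous stage's output. The sketch below is a toy model, not the real driver: the stage types, the function bodies, the run_pipeline helper, and the placeholder rule inside check are all hypothetical. Only the stage order and the names load_program, prepare_program, check, and compile_to_llvm_ir come from this guide, and their real signatures may differ.

```rust
// Toy model of the stage order. All types and bodies are hypothetical
// stand-ins; the real signatures in src/ may differ.

#[derive(Debug)]
struct LoadedProgram(String); // merged, parsed program (stand-in)

#[derive(Debug)]
struct PreparedProgram(String); // after monomorphization (stand-in)

#[derive(Debug)]
struct CheckedProgram(String); // after type checking (stand-in)

fn load_program(source: &str) -> Result<LoadedProgram, String> {
    if source.trim().is_empty() {
        return Err("load: empty source".to_string());
    }
    Ok(LoadedProgram(source.to_string()))
}

fn prepare_program(program: LoadedProgram) -> Result<PreparedProgram, String> {
    // A non-generic program passes through unchanged.
    Ok(PreparedProgram(program.0))
}

fn check(program: PreparedProgram) -> Result<CheckedProgram, String> {
    // Placeholder rule for illustration only, not a real Skunk rule.
    if !program.0.contains("function main") {
        return Err("check: no main function".to_string());
    }
    Ok(CheckedProgram(program.0))
}

fn compile_to_llvm_ir(program: CheckedProgram) -> Result<String, String> {
    Ok(format!("; LLVM IR for a program of {} bytes", program.0.len()))
}

// Each `?` stops the pipeline at the first failing stage.
fn run_pipeline(source: &str) -> Result<String, String> {
    let loaded = load_program(source)?;
    let prepared = prepare_program(loaded)?;
    let checked = check(prepared)?;
    compile_to_llvm_ir(checked)
}
```

Because every stage returns a Result, an error message always names the stage that produced it, which is the same habit this booklet recommends when reading the real code.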

Parsing And AST

Skunk parsing is split into two layers. src/grammar.pest describes which source forms are valid. src/ast.rs turns those grammar matches into the compiler's internal tree of Node values.

This matters because the rest of the compiler does not want to reason about raw strings. It wants to reason about named constructs like StructDeclaration, FunctionDeclaration, StructInitialization, and Access.

source text
  "Point { x: 20, y: 22 }"

grammar match
  recognized as a struct initialization

AST
  Node::StructInitialization {
      _type: Point,
      fields: [("x", 20), ("y", 22)]
  }

Read next in code
  • PestImpl::parse in src/ast.rs
  • create_ast in src/ast.rs
  • create_primary, create_access, and create_struct_init in src/ast.rs
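To make the "tree of Node values" idea concrete, here is a minimal sketch of what such a tree could look like in Rust. It is a simplified stand-in, not the definition in src/ast.rs: the variant name StructInitialization comes from this guide, while the IntLiteral variant, the field name type_name, and the helpers point_literal and field_names are assumptions made for illustration.

```rust
// Simplified stand-in for the compiler's Node type. Only the variant name
// StructInitialization comes from the guide; everything else is assumed.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    IntLiteral(i64),
    StructInitialization {
        type_name: String,
        fields: Vec<(String, Node)>,
    },
}

// Builds the tree for the source text `Point { x: 20, y: 22 }`.
fn point_literal() -> Node {
    Node::StructInitialization {
        type_name: "Point".to_string(),
        fields: vec![
            ("x".to_string(), Node::IntLiteral(20)),
            ("y".to_string(), Node::IntLiteral(22)),
        ],
    }
}

// Later passes pattern-match on named constructs instead of raw strings.
fn field_names(node: &Node) -> Vec<&str> {
    match node {
        Node::StructInitialization { fields, .. } => {
            fields.iter().map(|(name, _)| name.as_str()).collect()
        }
        _ => Vec::new(),
    }
}
```

The point of the match in field_names is that a later pass can ask "is this a struct initialization?" structurally, without re-scanning source text.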

Modules And Normalization

src/source.rs is where the compiler stops thinking in terms of "one file the user opened" and starts thinking in terms of "one program the compiler can analyze."

The source loader resolves imports, validates module declarations, detects cycles, and uses the module normalizer to rename private symbols when needed. That makes later global passes much simpler.

Mental model

This file turns many source files into one merged, safer program tree.

Read next in code
  • load_program in src/source.rs
  • ProgramLoader::load_file and ProgramLoader::module_path in src/source.rs
  • ModuleNormalizer::normalize in src/source.rs
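Cycle detection is the part of loading that is easiest to show in isolation. The sketch below is one common way to do it (a depth-first walk that tracks the modules currently being loaded) and is not taken from src/source.rs. The ImportGraph type, the visit and check_imports functions, and the error message format are all hypothetical.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical model: module name -> names of the modules it imports.
type ImportGraph = HashMap<String, Vec<String>>;

// Depth-first walk. `in_progress` holds the chain of modules currently being
// loaded; meeting one of them again means the imports form a cycle.
fn visit(
    module: &str,
    graph: &ImportGraph,
    in_progress: &mut Vec<String>,
    done: &mut HashSet<String>,
) -> Result<(), String> {
    if done.contains(module) {
        return Ok(()); // already fully loaded through another import path
    }
    if let Some(pos) = in_progress.iter().position(|m| m == module) {
        let mut cycle: Vec<String> = in_progress[pos..].to_vec();
        cycle.push(module.to_string());
        return Err(format!("import cycle: {}", cycle.join(" -> ")));
    }
    in_progress.push(module.to_string());
    if let Some(imports) = graph.get(module) {
        for import in imports {
            visit(import, graph, in_progress, done)?;
        }
    }
    in_progress.pop();
    done.insert(module.to_string());
    Ok(())
}

fn check_imports(root: &str, graph: &ImportGraph) -> Result<(), String> {
    visit(root, graph, &mut Vec::new(), &mut HashSet::new())
}
```

Keeping the in-progress chain as an ordered list, rather than a set, is what lets the error message spell out the full cycle.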

Monomorphization

Generics are comfortable for programmers and inconvenient for backends. Skunk's answer is a preparation pass in src/monomorphize.rs that turns generic templates into concrete specialized program pieces when needed.

The pass is easiest to understand if you think in terms of recipes and finished dishes. A generic function is a recipe. A monomorphized concrete function is one finished dish for one concrete set of type arguments.

Collect

Gather generic templates and concrete declarations.

Decide

Figure out which concrete instances are actually needed.

Emit

Produce a prepared program with concrete declarations ready for later passes.

Read next in code
  • prepare_program in src/monomorphize.rs
  • Monomorphizer::new and Monomorphizer::prepare in src/monomorphize.rs
  • apply_substitutions, specialized_struct_name, and specialized_function_name in src/monomorphize.rs
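The recipe-and-dish idea comes down to two small operations: replacing type parameters with concrete types, and giving each specialized copy a distinct name. The sketch below illustrates both under loud assumptions. The Type enum, the mangle helper, and the double-underscore naming scheme are hypothetical; the function names apply_substitutions and specialized_function_name are borrowed from this guide, but the real signatures and naming rules in src/monomorphize.rs may be quite different.

```rust
use std::collections::HashMap;

// Hypothetical type representation, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int,
    Param(String),            // a generic parameter such as T
    Named(String, Vec<Type>), // a named type applied to type arguments
}

// Replace every generic parameter that has a binding in `subst`.
fn apply_substitutions(ty: &Type, subst: &HashMap<String, Type>) -> Type {
    match ty {
        Type::Int => Type::Int,
        Type::Param(name) => subst.get(name).cloned().unwrap_or_else(|| ty.clone()),
        Type::Named(name, args) => Type::Named(
            name.clone(),
            args.iter().map(|arg| apply_substitutions(arg, subst)).collect(),
        ),
    }
}

// Assumed naming scheme: base name, two underscores, then the type arguments.
fn mangle(ty: &Type) -> String {
    match ty {
        Type::Int => "int".to_string(),
        Type::Param(name) => name.clone(),
        Type::Named(name, args) if args.is_empty() => name.clone(),
        Type::Named(name, args) => {
            let parts: Vec<String> = args.iter().map(mangle).collect();
            format!("{}__{}", name, parts.join("_"))
        }
    }
}

fn specialized_function_name(base: &str, type_args: &[Type]) -> String {
    let parts: Vec<String> = type_args.iter().map(mangle).collect();
    format!("{}__{}", base, parts.join("_"))
}
```

Whatever scheme the real pass uses, the property that matters is the same: two different sets of type arguments must never produce the same specialized name.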

Type Checking

The type checker is where the compiler shifts from "this parses" to "this is a legal Skunk program."

The public entry point is check. The most important recursive engine under it is resolve_type. It walks expressions, determines the type they produce, and validates whether the operations used are allowed.

One especially valuable helper in this file is resolve_access, because many language rules come together in access chains like self.x, ptr.*, slice[0], or window.draw_rect(...).

What type checking proves
  • The names used by the program exist.
  • The operations on those names make sense.
  • Assignments are legal.
  • Returns match declared function types.
  • Bounds and trait relationships are satisfied.

Read next in code
  • check in src/type_checker.rs
  • resolve_type, resolve_access, and is_assignable in src/type_checker.rs
  • GlobalScope::add and SymbolTables in src/type_checker.rs
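The shape of a recursive type resolver is easier to see on a toy expression language. The following sketch borrows the name resolve_type from this guide but nothing else: the Expr and Type enums, the flat HashMap scope, the Bool type, and the error messages are assumptions, and the real function in src/type_checker.rs handles far more cases.

```rust
use std::collections::HashMap;

// Hypothetical miniature type and expression model.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int,
    Bool,
}

#[derive(Debug)]
enum Expr {
    IntLiteral(i64),
    BoolLiteral(bool),
    Variable(String),
    Add(Box<Expr>, Box<Expr>),
}

// Walk the expression, compute the type it produces, and reject operations
// that are not allowed on the operand types.
fn resolve_type(expr: &Expr, scope: &HashMap<String, Type>) -> Result<Type, String> {
    match expr {
        Expr::IntLiteral(_) => Ok(Type::Int),
        Expr::BoolLiteral(_) => Ok(Type::Bool),
        Expr::Variable(name) => scope
            .get(name)
            .cloned()
            .ok_or_else(|| format!("unknown name `{}`", name)),
        Expr::Add(lhs, rhs) => {
            let left = resolve_type(lhs, scope)?;
            let right = resolve_type(rhs, scope)?;
            if left == Type::Int && right == Type::Int {
                Ok(Type::Int)
            } else {
                Err(format!("cannot add {:?} and {:?}", left, right))
            }
        }
    }
}
```

Two of the guarantees listed above show up directly here: the Variable arm proves that names exist, and the Add arm proves that the operation makes sense for its operands.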

LLVM, Layouts, And Runtime

The backend in src/compiler.rs is where language concepts become storage and instructions. It describes value types in its own internal vocabulary, LlvmType.

This file also contains the layout structures that describe how values live in memory: StructLayout, EnumLayout, TraitLayout, and TraitMethodLayout.

compile_to_llvm_ir emits textual LLVM IR. Then compile_to_executable writes the IR to disk and invokes clang along with the runtime support files.

Layouts

Describe memory shape so the backend knows where fields and payloads live.

Lowering

Translate statements and expressions into LLVM instructions.

Runtime Linkage

Pull in support code from runtime/ when the compiled program needs it.

Read next in code
  • LlvmType and llvm_type in src/compiler.rs
  • collect_struct_layouts, collect_enum_layouts, and collect_trait_layouts in src/compiler.rs
  • compile_statement, compile_expr_with_expected, compile_struct_literal, and coerce_expr in src/compiler.rs
  • compile_to_llvm_ir and compile_to_executable in src/compiler.rs

Worked Example: One Tiny Program Through The Compiler

The best way to make the pipeline feel real is to trace one small program through it. Here is the example used in Part 2 of this guide:

struct Point {
    x: int;
    y: int;
}

attach Point {
    function sum(self): int {
        return self.x + self.y;
    }
}

function main(): int {
    p: Point = Point { x: 20, y: 22 };
    return p.sum();
}

Step 1: Parse It

The parser recognizes a struct declaration, an attach declaration, and a main function. The method body becomes a nested expression tree rather than a flat string.

Step 2: Load It

Because this example has no imports, load_program has little visible work to do. But it still wraps the result as one coherent program node.

Step 3: Prepare It

Because this example is non-generic, monomorphization mostly passes it through. That is a useful lesson in itself: not every pass dramatically changes every program.

Step 4: Type-Check It

The checker proves that Point exists, the fields are legal, the struct literal initializes valid fields with assignable types, and p.sum() returns an int.

Step 5: Build Layouts

StructLayout("Point")
  field 0 -> x : i32
  field 1 -> y : i32
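The layout above can be modeled as an ordered list of fields, where a field's position in the list is its index. The sketch below is a hypothetical, cut-down version of the idea: the names StructLayout and LlvmType come from this guide, but their real definitions in src/compiler.rs are not shown here, and the methods field_index and llvm_type_definition, as well as the point_layout helper, are invented for illustration. The emitted text uses standard LLVM syntax for a named struct type; whether Skunk emits exactly this line is an assumption.

```rust
// Hypothetical, simplified versions of the backend's layout types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LlvmType {
    I32,
}

impl LlvmType {
    fn ir_name(self) -> &'static str {
        match self {
            LlvmType::I32 => "i32",
        }
    }
}

struct StructLayout {
    name: String,
    fields: Vec<(String, LlvmType)>, // declaration order = field index
}

impl StructLayout {
    // The index the backend would use to address a field by name.
    fn field_index(&self, field: &str) -> Option<usize> {
        self.fields.iter().position(|(name, _)| name.as_str() == field)
    }

    // Render the layout as an LLVM named struct type definition.
    fn llvm_type_definition(&self) -> String {
        let members: Vec<&str> = self.fields.iter().map(|(_, ty)| ty.ir_name()).collect();
        format!("%{} = type {{ {} }}", self.name, members.join(", "))
    }
}

fn point_layout() -> StructLayout {
    StructLayout {
        name: "Point".to_string(),
        fields: vec![
            ("x".to_string(), LlvmType::I32),
            ("y".to_string(), LlvmType::I32),
        ],
    }
}
```

With this model, an access such as self.y becomes "look up y, get index 1, address member 1 of the struct", which is the question a layout exists to answer.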

Step 6: Emit LLVM IR

The backend lowers the struct literal, method body, and return path into LLVM IR. You do not need to master LLVM syntax to understand the shape: build a value, access its fields, add them, and return the result.

Step 7: Link The Binary

Finally the compiler writes a .ll file and asks clang to produce a native executable, linking runtime support as needed.
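The final step can be pictured as building one clang command line. The sketch below uses only the Rust standard library and relies on the fact that clang accepts .ll files as inputs alongside C sources. The clang_command function, its parameters, and the argument order are hypothetical; the real compile_to_executable may pass additional flags, frameworks, or runtime files that this guide does not list.

```rust
use std::path::Path;
use std::process::Command;

// Build (but do not run) a clang invocation that compiles the emitted IR
// together with runtime support sources. The argument list is an assumption
// made for illustration.
fn clang_command(ir_path: &Path, runtime_sources: &[&Path], output: &Path) -> Command {
    let mut cmd = Command::new("clang");
    cmd.arg(ir_path);
    cmd.args(runtime_sources);
    cmd.arg("-o").arg(output);
    cmd
}
```

A caller would write the IR to ir_path first, then call status() on the returned Command and treat a non-zero exit code as a build failure. Separating "build the command" from "run the command" also makes the argument list easy to inspect in a test.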

Read next in code
  • create_struct_init and create_access in src/ast.rs
  • resolve_access and resolve_type in src/type_checker.rs
  • collect_struct_layouts, compile_struct_literal, and compile_expr_with_expected in src/compiler.rs

Extending Skunk

If Parts 1 and 2 teach you how to read the compiler, Part 3 teaches you how to change it. The key idea is to stop thinking of a feature as one edit and start thinking of it as a path through the pipeline.

Syntax Path

For small syntactic-sugar features, changes to the grammar and AST construction, plus tests, are often enough.

Semantic Path

Type checking becomes central when the feature changes meaning, validity rules, or inferred types.

Runtime Path

Backend lowering and native runtime support matter when the feature requires execution-time behavior.

Beginner feature checklist
  • Start with one tiny example program.
  • Decide whether the feature is syntactic sugar or a new semantic kind of thing.
  • Touch only the stages that actually need to know about it.
  • Add parser, type-checker, and compiler/runtime tests as needed.
  • Update docs and examples so the feature is teachable, not just implemented.

Read next in code
  • src/grammar.pest and src/ast.rs for syntax
  • src/type_checker.rs for meaning and rules
  • src/compiler.rs and runtime/ for execution behavior
  • Open Part 3 for the full extending guide

Recommended Reading Order

Read the compiler in this order if you want the architecture before the details:

  1. src/main.rs
  2. src/source.rs
  3. src/ast.rs
  4. src/type_checker.rs
  5. src/compiler.rs

Then go deeper with:

  1. src/grammar.pest
  2. src/monomorphize.rs
  3. src/interpreter.rs
  4. runtime/skunk_runtime.c
  5. runtime/skunk_window_runtime.m

How To Contribute Without Getting Lost

Do not try to understand every file before changing anything. Pick one feature, identify which stage first sees it, and trace only the stages that need to know about it.

A good beginner rhythm is:

  • Start with one tiny example program.
  • Find its syntax in the grammar and AST.
  • See how the type checker validates it.
  • See how the backend lowers it.
  • Add or update a focused test.

The markdown versions of this guide are here: Part 1, Part 2, and Part 3.