Performance Optimizations for Minga


This document catalogues concrete performance improvements that leverage BEAM VM internals, JIT compiler behaviour (OTP 25+), and Elixir/Erlang idioms. Each section identifies the current bottleneck, explains why it's slow on the BEAM, and proposes a fix.


Already Implemented

The following optimizations have been completed (see commit 8beec9d):

  • Gap Buffer: count_newlines uses :binary.matches/2 (Boyer-Moore, implemented in C) instead of String.graphemes/1 |> Enum.count/1.
  • Gap Buffer: multi-clause move/2, split from a single case into 4 function clauses for JIT jump-table optimization.
  • Gap Buffer: clear_line/2 + content_and_cursor/1 compound operations eliminate multiple GenServer round-trips; change_line went from N+5 GenServer calls to 1.
  • Motion: tuple indexing. All grapheme lists converted to tuples; elem/2 (O(1)) replaces Enum.at/2 (O(n)), turning O(n²) motions into O(n).
  • Motion: binary pattern matching for char classification. classify_char/1 with guards replaces the word_char? regex; hot helpers inlined with @compile {:inline, ...}.
  • Motion: multi-clause functions replacing cond. advance_word_forward/4, advance_word_end/4, and the bracket-scan helpers extracted as multi-clause functions.
  • Editor: content_and_cursor/1. 12 separate content() + cursor() GenServer call pairs replaced with a single round-trip.
  • Git: in-memory diffing. Git.Buffer caches HEAD content and diffs it against the current buffer with List.myers_difference/2 in pure Elixir; no git diff subprocess is spawned on edits. Git commands run only at buffer open and on explicit stage operations.
  • BEAM VM tuning (see below)
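
The in-memory diffing above can be illustrated with List.myers_difference/2 on line lists (a simplified sketch of the idea; the actual Git.Buffer logic handles hunk grouping and more):

```elixir
# Diff the cached HEAD lines against the current buffer lines, in pure Elixir.
head_lines = String.split("a\nb\nc", "\n")
buffer_lines = String.split("a\nB\nc", "\n")

# Returns a keyword list of :eq / :del / :ins runs:
# [eq: ["a"], del: ["b"], ins: ["B"], eq: ["c"]]
List.myers_difference(head_lines, buffer_lines)
```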

BEAM VM Tuning

The BEAM's defaults are tuned for web servers: many concurrent connections, high throughput, long-running processes. An editor is the opposite: single user, latency-sensitive, few processes, bursty allocations from the render loop, and long idle periods between keystrokes. Minga ships custom VM flags that address three areas:

Scheduler and thread configuration (rel/vm.args.eex, bin/minga)

Scheduler count (+S) and dirty IO threads (+SDio) are left at BEAM defaults. Reducing schedulers provides negligible benefit once busy-waiting is disabled (see below), and risks starving background work (agent streaming, LSP, git) under load.

| Flag | Default | Minga | Why |
|------|---------|-------|-----|
| +A | 1 | 4 | Async threads handle Port I/O (Zig renderer, tree-sitter parser). 4 gives headroom for both Ports plus file operations. |

Scheduler wake/sleep behavior

| Flag | Default | Minga | Why |
|------|---------|-------|-----|
| +sbwt | short | none | Disable busy-waiting. The editor is idle between keystrokes; busy-waiting burns CPU and battery for nothing. |
| +sbwtdcpu | short | none | Same for dirty CPU schedulers. |
| +sbwtdio | short | none | Same for dirty IO schedulers. |
| +swt | low | very_low | Wake schedulers faster when a keystroke arrives. Low latency matters more than throughput for an editor. |
| +swtdcpu | low | very_low | Same for dirty CPU schedulers. |
| +swtdio | low | very_low | Same for dirty IO schedulers. |

Memory allocators

| Flag | Default | Minga | Why |
|------|---------|-------|-----|
| +MBas | bf | aobf | Address-order best fit reduces fragmentation for bursty allocation patterns (render loop builds then discards IO lists every frame). |
| +Mea | (default) | min | Return unused memory carriers to the OS sooner. Editors should have a small idle footprint. |
| +MBacul | (default) | 0 | Abandon carriers more aggressively. Don't hold memory the editor isn't using. |
| +hmbs | 262144 | 32768 | Lower minimum binary virtual heap. Triggers binary GC sooner on processes that churn binaries (Editor, Buffer.Server during render and streaming). |
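
Assembled, the flags above correspond to a vm.args fragment roughly like this (a sketch for reference; rel/vm.args.eex in the repo is authoritative):

```
# Async threads for Port I/O
+A 4

# Disable scheduler busy-waiting (an idle editor should not spin)
+sbwt none
+sbwtdcpu none
+sbwtdio none

# Wake schedulers aggressively on input
+swt very_low
+swtdcpu very_low
+swtdio very_low

# Allocator tuning for bursty render-loop allocations
+MBas aobf
+Mea min
+MBacul 0
+hmbs 32768
```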

Per-process GC tuning (Elixir code)

Applied in init/1 of hot GenServer processes:

```elixir
Process.flag(:fullsweep_after, 20)   # Default is 65535; frequent full GC reclaims dead binary refs
Process.flag(:min_heap_size, 4096)   # Pre-allocate a larger heap; avoids repeated grow-and-GC cycles
```

Processes with this tuning:

  • Minga.Editor (render loop, state management)
  • Minga.Buffer.Server (gap buffer, binary churn on edits)
  • Minga.Port.Manager (binary render commands every frame; fullsweep only, no min_heap bump)

Development vs release

  • mix minga: VM flags aren't available after the BEAM starts. Use bin/minga wrapper script which sets ERL_FLAGS before launching Mix, or set ERL_FLAGS manually.
  • Release / Burrito: rel/vm.args.eex is compiled into the release and read by the Burrito launcher automatically. No user action needed.
  • Per-process GC tuning: Applied in Elixir code at process startup. Works in both dev and release.

Overriding flags

Any flag can be overridden at runtime. Flags in ERL_FLAGS are appended after the vm.args defaults, so later flags win:

```sh
# Try more aggressive scheduler reduction
ERL_FLAGS="+S 2:2" bin/minga

# Disable all custom flags (pass empty to override)
ERL_FLAGS=" " mix minga
```

Future work

  • Measure memory footprint with :erlang.memory/0 and OS RSS before/after.
  • Profile render latency via [render:content] log timings.

Agent Edit Performance

Agent tools (EditFile, WriteFile, ReadFile) are a distinct performance domain from human keystroke editing. An agent making 20 edits to a single file in a refactoring pass creates a different bottleneck profile than a human typing one character at a time.

Current path: filesystem round-trips

Each edit_file call does: File.read (4 syscalls) → String.split (find match) → String.replace → File.write (3 syscalls). For 20 edits to one file, that's 20 full read-modify-write cycles, ~140 syscalls, and 20 FileWatcher notifications that trigger "file changed on disk" checks.

The raw I/O isn't the bottleneck (OS page cache makes reads/writes effectively memory operations). The problems are:

  1. Redundant work per edit. Each File.read re-reads content the buffer already holds. Each File.write triggers a watcher event that may cause a buffer reload, competing with the next edit.
  2. No batching. 20 tool calls = 20 filesystem round-trips. Each one allocates a fresh binary for the full file content, does the string replace, then discards it.
  3. No incremental tree-sitter sync. The parser doesn't learn about individual edits. It gets a full reparse trigger after each file watcher notification.
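
The per-edit cycle described above amounts to something like this (a sketch; the real tool's matching and error handling go beyond it):

```elixir
defmodule EditSketch do
  # One read-modify-write cycle per edit: re-read the whole file,
  # allocate a fresh binary for the replaced content, write it all back.
  # Every call also fires a FileWatcher event on write.
  def edit_file(path, old_text, new_text) do
    content = File.read!(path)
    updated = String.replace(content, old_text, new_text)
    File.write!(path, updated)
  end
end
```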

Planned path: buffer-routed edits

Route through Buffer.Server.apply_text_edits/2: one GenServer call with a list of edits, one undo entry, one version bump, one EditDelta batch for tree-sitter. Zero filesystem events.
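
The batched shape can be sketched as a pure function over the content (apply_text_edits/2 and the {old, new} edit format here are assumptions about the planned API, not its final signature):

```elixir
defmodule BatchEditSketch do
  # Apply a whole list of edits in one pass: one binary in, one binary out,
  # no per-edit filesystem round-trip or watcher event.
  def apply_text_edits(content, edits) when is_binary(content) do
    Enum.reduce(edits, content, fn {old, new}, acc ->
      String.replace(acc, old, new)
    end)
  end
end
```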

| Metric | Filesystem (20 edits) | Buffer (20 edits, 1 batch) |
|--------|-----------------------|----------------------------|
| Syscalls | ~140 (7 per edit × 20) | 0 |
| Binary allocations | 40 (read + write per edit) | 1 (edit list) |
| FileWatcher events | 20 | 0 |
| Tree-sitter reparses | 20 (one per watcher notification) | 1 (incremental) |
| Undo entries | 0 (not on stack) | 1 |

Buffer fork vs git worktree

For multi-agent concurrent editing, the performance gap is dramatic:

| Operation | Git worktree | Buffer fork |
|-----------|--------------|-------------|
| Create | Seconds to minutes (clone directory, cold caches) | Microseconds (copy a struct into a new process) |
| Memory | Full checkout + separate _build + deps | Two binaries + a few integers |
| First build | Cold (recompile everything) | Shared _build and deps |
| Merge | git merge (disk-based, may need conflict resolution) | Three-way merge in memory, instant for non-overlapping changes |
| Cleanup | git worktree remove + git worktree prune | Process exits, garbage collected automatically |

See BUFFER-AWARE-AGENTS.md for the full design.


Table of Contents

  1. Gap Buffer: Eliminate Repeated Full-Text Materialization
  2. Gap Buffer: Replace Grapheme Lists with Binary Walking
  3. Gap Buffer: Use IOdata Instead of Binary Concatenation
  4. Motion Module: Avoid Temporary Document + Content Copies
  5. Editor: Batch Remaining Buffer.Server Mutations into Single Calls
  6. Editor: Pre-build Render Binaries with IOlists
  7. Undo/Redo: Structural Sharing via Zipper or Diff-Based Stack
  8. TextObject: Avoid Flattening Entire Buffer for Paren Matching
  9. Picker: Cache Downcased Labels and Use ETS for Large Lists
  10. Keymap Trie: Compile to a Persistent Map for JIT-Friendly Lookups
  11. Port Protocol: Leverage Sub-Binary References
  12. GC Pressure: Reduce Short-Lived Allocations in the Render Loop
  13. Process Architecture: Consider ETS for Shared Read-Only State
  14. JIT-Specific: Help the BEAM JIT Generate Better Native Code
  15. Benchmarking Strategy

1. Gap Buffer: Eliminate Repeated Full-Text Materialization

Problem

Nearly every query function in Document calls content/1, which does before <> after_, an O(n) binary concatenation that allocates a fresh copy of the entire buffer content. Functions like line_at/2, lines/3, and content_range/3 then immediately call String.split/2, which copies the content again into a list of lines.

For a 10,000-line file, every single cursor motion triggers:

  1. content/1 → allocate ~500 KB binary
  2. String.split("\n") → allocate a list of 10,000 binaries
  3. String.graphemes() → allocate a list of ~300,000 graphemes

This is the single largest performance bottleneck.

Fix

Cache line metadata on the struct. Maintain a list (or tuple) of {byte_offset, byte_length} entries per line, updated incrementally on insert/delete. Then line_at/2 can extract a sub-binary directly from before or after_ without materializing the full content:

```elixir
defstruct [:before, :after_, :cursor_line, :cursor_col, :line_count,
           :line_index]  # line_index: [{byte_offset, byte_length}]
```

For line extraction, use binary_part/3 on the appropriate half:

```elixir
@spec line_at(t(), non_neg_integer()) :: String.t() | nil
def line_at(%__MODULE__{} = buf, line_num) do
  case lookup_line(buf.line_index, line_num) do
    nil -> nil
    {offset, length} -> extract_line(buf.before, buf.after_, offset, length)
  end
end
```

This turns line_at/2 from O(n) to O(1) and eliminates the temporary allocation entirely. The JIT can optimize binary_part/3 calls into direct pointer arithmetic on the heap binary.
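
Building the index from scratch is cheap with :binary.matches/2; the incremental update on insert/delete is the harder part, and this sketch only shows full construction:

```elixir
defmodule LineIndexSketch do
  # Build [{byte_offset, byte_length}] per line from newline positions.
  # :binary.matches/2 finds all "\n" offsets with a C-level scan.
  def build(content) when is_binary(content) do
    newline_offsets = for {pos, _len} <- :binary.matches(content, "\n"), do: pos
    starts = [0 | Enum.map(newline_offsets, &(&1 + 1))]
    ends = newline_offsets ++ [byte_size(content)]
    Enum.zip_with(starts, ends, fn s, e -> {s, e - s} end)
  end
end
```

binary_part(content, offset, length) then recovers any line without touching the rest of the buffer.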

Impact

Critical. Every keystroke in the editor calls line_at or lines at least once (for rendering) and often 3–4 times (motion + render). With a 10K-line file, this alone would cut per-keystroke allocations by ~90%.


2. Gap Buffer: Replace Grapheme Lists with Binary Walking

Problem

Functions like pop_last_grapheme/1 and motion helpers still convert binaries into grapheme data structures. While motions now use tuples (O(1) indexing), the initial String.graphemes/1 call is still O(n) in both time and memory.

Fix

Use String.next_grapheme/1 or String.next_grapheme_size/1 to walk the binary in-place without materializing the full list. The BEAM JIT generates excellent native code for binary matching. It can process UTF-8 bytes with near-C performance using the bs_match JIT primitive.

For random access by grapheme index, the current tuple approach is good. The next step is eliminating the need for full-buffer grapheme conversion in motions by working on individual lines (enabled by the line index cache from optimization #1).
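
A sketch of list-free grapheme counting with String.next_grapheme/1 (String.length/1 already works this way internally; the same walking pattern generalizes to scanning motions over a single line):

```elixir
defmodule GraphemeWalkSketch do
  # Count graphemes by walking the binary in place; no intermediate list.
  def count(string), do: count(string, 0)

  defp count(string, acc) do
    case String.next_grapheme(string) do
      {_grapheme, rest} -> count(rest, acc + 1)
      nil -> acc
    end
  end
end
```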

Impact

High. Eliminates the remaining O(n) allocation in every motion call. Most impactful after the line index cache is implemented.


3. Gap Buffer: Use IOdata Instead of Binary Concatenation

Problem

insert_char/2 does before <> char which copies the entire before binary to append a single character. For rapid typing (e.g., pasting text), this is O(n²) in the size of before.

Fix

Store before as an iolist instead of a flat binary. Appending becomes O(1): [before | char]. Flatten to binary only when needed (content extraction, file save). The Port protocol's IO.iodata_to_binary/1 already handles iolists natively.

```elixir
# Insert becomes O(1); [iodata | binary] is an improper list, which is valid iodata
def insert_char(%__MODULE__{before: before} = buf, char) do
  %{buf | before: [before | char]}  # cursor bookkeeping elided
end

# Content materialization (only when needed)
def content(%__MODULE__{before: before, after_: after_}) do
  IO.iodata_to_binary([before | after_])
end
```

Caveat: This changes the internal representation. binary_part/3, byte_size/1, and pattern matching on before would need adjustment. Consider this a Phase 2 optimization after the line index cache.

Impact

Medium-High. Primarily affects insert-heavy workloads (typing, pasting). Transforms O(n) per character to O(1).


4. Motion Module: Avoid Temporary Document + Content Copies

Problem

Every motion function receives a Document.t() and calls Document.content/1 (binary concat), then String.graphemes/1 (tuple allocation), then String.split/2 (another list). The Editor's apply_motion/2 creates a temporary Document.new(content), so the content is still materialized and copied into a throwaway struct.

While content_and_cursor/1 reduced GenServer round-trips from 3 to 2, the temporary Document allocation and content copy remain.

Fix

Move motion execution into the Buffer.Server process via an apply_motion/2 GenServer call. The motion function runs inside the server where the gap buffer already lives (zero copies):

```elixir
# In Buffer.Server
def handle_call({:apply_motion, motion_fn}, _from, state) do
  new_pos = motion_fn.(state.document, Document.cursor(state.document))
  new_buf = Document.move_to(state.document, new_pos)
  {:reply, :ok, %{state | document: new_buf}}
end
```

This eliminates the content copy, the temporary Document, and reduces the remaining 2 GenServer calls to 1.

Impact

High. Eliminates all temporary allocations per motion. For a 50,000-line file, that's ~3 MB saved per keystroke.


5. Editor: Batch Remaining Buffer.Server Mutations into Single Calls

Problem

Several editor commands still issue multiple sequential GenServer calls to BufferServer. The change_line command was fixed (now uses clear_line/1), but others remain:

  • join_lines: 4+ GenServer calls (cursor, get_lines, move_to, delete_at, N × delete_at for whitespace, insert_char)
  • indent_line: 4 calls (cursor, move_to, insert_char, move_to)
  • dedent_line: 3+ calls
  • toggle_case: 4 calls (cursor, get_lines, delete_at, insert_char)

Fix

Add compound operations to BufferServer / Document:

```elixir
# In Buffer.Server
def handle_call({:join_lines, line}, _from, state) do
  {new_buf, _} = Document.join_lines(state.document, line)
  {:reply, :ok, push_undo(state, new_buf) |> mark_dirty()}
end
```

Similarly for toggle_case_at, indent_line, dedent_line.

Impact

Medium-High. Eliminates O(n) GenServer overhead for multi-step operations. Each GenServer call involves message copying, scheduling, and reply matching.


6. Editor: Pre-build Render Binaries with IOlists

Problem

do_render/1 builds render commands by concatenating strings with <> for padding (String.pad_trailing, String.duplicate) and joining. Each Protocol.encode_draw/4 call allocates a fresh binary.

The full render pipeline allocates hundreds of small binaries per frame that are immediately sent to the Port and become garbage.

Fix

Use iolists throughout the render pipeline. The Port's Port.command/2 accepts iolists natively, so there's no need to flatten to binary.

More impactful: in PortManager.handle_cast({:send_commands, ...}), skip the intermediate IO.iodata_to_binary(commands) flattening since Port.command/2 accepts iolists directly.

Note: Validate {:packet, 4} framing compatibility with iolists.

Impact

Medium. Saves one full-buffer-sized allocation per render frame. At 60 fps with a full-screen terminal, that's ~60 allocations/second of 10–50 KB each.


7. Undo/Redo: Structural Sharing via Zipper or Diff-Based Stack

Problem

Every mutation pushes a complete copy of the Document.t() struct onto the undo stack. The struct contains before and after_ binaries which together hold the full file content. For a 1 MB file, each keystroke adds ~1 MB to the undo stack. With @max_undo_stack of 1000, that's potentially ~1 GB of undo history.

The BEAM's garbage collector runs per-process, so this all lives in the Buffer.Server process heap and causes increasingly expensive GC pauses.

Fix

Store diffs instead of full snapshots:

```elixir
@type undo_entry :: {
  cursor_before :: position(),
  operations :: [{:insert, position(), String.t()} | {:delete, position(), String.t()}]
}
```

Each undo entry is typically a few bytes (the inserted/deleted text + cursor position). Undo replays the inverse operations; redo replays forward.
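
A sketch of the replay mechanics, using flat grapheme offsets instead of {line, col} positions for brevity (the real entry would carry positions and cursor state):

```elixir
defmodule UndoSketch do
  # Apply one operation to the content. Each operation is tiny
  # compared to a full snapshot of the buffer.
  def apply_op(content, {:insert, pos, text}) do
    {head, tail} = String.split_at(content, pos)
    head <> text <> tail
  end

  def apply_op(content, {:delete, pos, text}) do
    {head, rest} = String.split_at(content, pos)
    head <> String.slice(rest, String.length(text)..-1//1)
  end

  # Undo replays the inverse operation; redo replays forward.
  def invert({:insert, pos, text}), do: {:delete, pos, text}
  def invert({:delete, pos, text}), do: {:insert, pos, text}
end
```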

Impact

High for large files. Reduces undo stack memory from O(n × stack_size) to O(delta × stack_size). For a 1 MB file with 1000 undo entries, this could reduce memory from ~1 GB to ~1 MB.


8. TextObject: Avoid Flattening Entire Buffer for Paren Matching

Problem

find_delimited_pair/4 calls flatten_with_positions/1, which creates a list of {grapheme, {line, col}} tuples for the entire buffer. For a 10,000-line file with 50 chars/line, that's 500,000 tuples (~40 MB of list cells + tuple headers).

Then it does linear scans with Enum.at/2 on this flat list.

Fix

Scan backward/forward from the cursor position using binary walking on the raw content, tracking line/col as you go. This requires no allocation beyond the final result positions.

Alternatively, use the gap buffer's before and after_ directly: scan backward through before for the opening delimiter, forward through after_ for the closing one.
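
A sketch of the backward scan over before, assuming single-byte (ASCII) delimiters and byte positions (the real implementation must track line/col and handle multi-byte graphemes):

```elixir
defmodule DelimScanSketch do
  # Scan backward from the end of `before` for an unmatched opening
  # delimiter. Returns its byte offset, or nil. `depth` tracks nesting.
  def find_opener(before, open, close) do
    scan(before, byte_size(before) - 1, open, close, 0)
  end

  defp scan(_bin, pos, _open, _close, _depth) when pos < 0, do: nil

  defp scan(bin, pos, open, close, depth) do
    case binary_part(bin, pos, 1) do
      ^open when depth == 0 -> pos
      ^open -> scan(bin, pos - 1, open, close, depth - 1)
      ^close -> scan(bin, pos - 1, open, close, depth + 1)
      _other -> scan(bin, pos - 1, open, close, depth)
    end
  end
end
```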

Impact

High for large files. Eliminates a 40 MB allocation for paren matching on a 500K-character file.


9. Picker: Cache Downcased Labels and Use ETS for Large Lists

Problem

refilter/1 runs on every keystroke in the picker. For each item, it calls String.downcase/1 on the label and description. With 10,000 files in a project, that's 20,000 String.downcase/1 calls per keystroke.

The length/1 function is also called repeatedly on the same lists. length/1 is O(n) on linked lists.

Fix

  1. Pre-compute downcased labels at picker construction time.
  2. Cache the filtered count as a field instead of calling length/1.
  3. For very large candidate sets (>5000 items), consider ETS with match specs or a sorted index for prefix matching.
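
The precomputation in items 1 and 2 can be sketched like this (module and function names are illustrative, not the picker's actual API):

```elixir
defmodule PickerSketch do
  # Downcase each label once at picker construction, not on every keystroke.
  def build(labels) do
    Enum.map(labels, fn label -> {label, String.downcase(label)} end)
  end

  # Filtering compares against the cached downcased form only.
  def filter(prepared, query) do
    q = String.downcase(query)
    for {label, down} <- prepared, String.contains?(down, q), do: label
  end
end
```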

Impact

Medium. Eliminates 20K+ String.downcase/1 allocations per keystroke in the picker.


10. Keymap Trie: Compile to a Persistent Map for JIT-Friendly Lookups

Problem

The keymap trie is already efficient. Map.fetch/2 on small maps is fast. However, the trie is rebuilt from Defaults every time the editor starts. For the leader trie, Defaults.leader_trie() is called on every SPC keypress.

Fix

  1. Module-attribute compile-time construction: Build the trie at compile time using module attributes so it lives in the literal pool.

  2. Use :persistent_term for runtime-registered keybindings.
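
A minimal sketch of the :persistent_term option (the key shape is illustrative). Reads are constant-time with no copying onto the caller's heap, but each put triggers a global scan, so it only suits rarely-changing data like keybindings:

```elixir
# Store the trie once, at registration time.
:persistent_term.put({:keymap_sketch, :leader_trie}, %{"f" => %{"f" => :find_file}})

# Hot-path lookup: no GenServer call, no term copy.
trie = :persistent_term.get({:keymap_sketch, :leader_trie})
```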

Impact

Low. Polish optimization.


11. Port Protocol: Leverage Sub-Binary References

Problem

For render commands going out, ensure large text payloads in encode_draw/4 use sub-binaries of the original line text rather than copies.

Fix

Use binary_part/3 to create sub-binary references (zero-copy) when the source is a reference-counted binary.

Impact

Low. Micro-optimization for large text payloads in draw commands.


12. GC Pressure: Reduce Short-Lived Allocations in the Render Loop

Problem

The render loop allocates many short-lived binaries per frame that become garbage after Port.command/2 sends them.

Fix

  1. Pre-compute tilde row commands once and reuse them.
  2. Cache modeline template; only rebuild changed segments.
  3. Use @compile {:inline, [...]} for small render helpers.
  4. Consider :erlang.garbage_collect(self(), type: :minor) after render.
  5. Use Process.flag(:min_heap_size, n) on the Editor process.

Impact

Medium. Smooths out latency spikes from GC pauses during rapid typing.


13. Process Architecture: ETS for Shared Read-Only State

Status: Partially Complete

The three highest-contention GenServer stores have been migrated to ETS with read_concurrency: true:

  • Config.Options (#156): every render frame and keystroke read options via Agent.get; now a direct ETS lookup, ~4x faster per read.
  • Diagnostics (#155): gutter signs and minibuffer hints are read on every frame and were serialized behind LSP publish writes. Reads now bypass the GenServer entirely; writes still go through the GenServer for subscriber notifications.
  • Keymap.Active (#157): binding lookups on every keystroke went through Agent.get; now a direct ETS lookup.
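
The migrated pattern looks roughly like this (table name and option shapes illustrative):

```elixir
# At startup: a public, read-optimized table owned by the writer process.
:ets.new(:options_sketch, [:named_table, :set, :public, read_concurrency: true])
:ets.insert(:options_sketch, {:tab_width, 2})

# Hot path: any process reads directly, with no message passing or scheduling.
[{:tab_width, width}] = :ets.lookup(:options_sketch, :tab_width)
```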

Remaining

The Editor process still creates a temporary Document.new(content) for motions, copying the full content across processes. Best approach: move motions into the Buffer.Server process (see #4).

Impact

Medium-High. The ETS migrations eliminate all GenServer.call round-trips from the per-keystroke and per-frame hot paths (#159). The remaining buffer content copy is addressed by #4.


14. JIT-Specific: Help the BEAM JIT Generate Better Native Code

Several patterns remain that can help the JIT:

a. Avoid Enum for Small, Known-Size Collections

Enum functions add overhead from protocol dispatch and anonymous function calls. For small collections, use direct recursion or comprehensions:

```elixir
# Faster for small lists: list comprehension (JIT-optimized)
for {text, fg, bg, opts} <- segments, do: ...
```

b. Mark Hot Functions for Inlining

Additional hot functions in the render path and buffer operations could benefit from @compile {:inline, ...}.

Impact

Low-Medium. Individual micro-optimizations that compound in hot loops.


15. Benchmarking Strategy

Telemetry

The keystroke-to-render critical path is instrumented with :telemetry spans. Set :log_level_render to :debug to see per-stage timing in *Messages*. See CONTRIBUTING.md for the full event list, example output, and how to attach custom handlers for histograms or percentile tracking.

Tools

  • Benchee: micro-benchmarks for individual functions
  • :timer.tc/1: quick timing in IEx
  • :erlang.statistics(:reductions): BEAM work units (a proxy for CPU)
  • :erlang.process_info(pid, :memory): per-process heap size
  • :erlang.process_info(pid, :garbage_collection): GC stats
  • recon: production-ready process inspection
  • eflame / eflambe: flame graphs for BEAM processes
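
For example, :timer.tc/1 gives quick wall-clock numbers in IEx:

```elixir
# Returns {elapsed_microseconds, result} for any zero-arity function.
{micros, sum} = :timer.tc(fn -> Enum.sum(1..1_000_000) end)
IO.puts("Enum.sum took #{micros} µs (result: #{sum})")
```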

What to Measure

  1. Per-keystroke latency: Time from handle_info({:minga_input, ...}) to Port.command/2 completion. Target: < 1 ms for normal mode, < 5 ms for complex operations.

  2. Memory per buffer: :erlang.process_info(buf_pid, :memory) with files of 1K, 10K, 100K lines. Track growth over 1000 edits.

  3. GC pause frequency: Monitor {:garbage_collection, info} trace events on the Editor process during sustained typing.

  4. Allocation rate: Use :erlang.system_info(:alloc_util_allocators) before/after a burst of operations.

Existing Performance Test

The project already has test/perf/document_perf_test.exs. Extend this with benchmarks for:

  • Motion on 10K-line buffer (word_forward, paragraph, bracket match)
  • Render cycle with full viewport
  • Picker filtering with 5000 candidates
  • 1000 sequential inserts (typing simulation)

Priority Order

Implement optimizations in this order for maximum impact:

  1. #1 Line index cache (eliminates the #1 allocation source)
  2. #4 Move motions into Buffer.Server (eliminates cross-process copies)
  3. #5 Batch remaining mutations (eliminates O(n) GenServer calls)
  4. #7 Diff-based undo (memory)
  5. #2 Binary walking (allocation reduction)
  6. #8 TextObject scanning (large file paren matching)
  7. #14 JIT-specific patterns (polish)
  8. Everything else