This document catalogues concrete performance improvements that leverage BEAM VM internals, JIT compiler behaviour (OTP 25+), and Elixir/Erlang idioms. Each section identifies the current bottleneck, explains why it's slow on the BEAM, and proposes a fix.
Already Implemented
The following optimizations have been completed (see commit 8beec9d):
- Gap Buffer: `count_newlines` uses `:binary.matches/2` (Boyer-Moore, in C) instead of `String.graphemes/1 |> Enum.count/2`.
- Gap Buffer: multi-clause `move/2`, split from a single `case` into 4 function clauses for JIT jump-table optimization.
- Gap Buffer: `clear_line/2` + `content_and_cursor/1` compound operations that eliminate multiple GenServer round-trips. `change_line` went from N+5 GenServer calls to 1.
- Motion: tuple indexing; all grapheme lists converted to tuples. `elem/2` (O(1)) replaces `Enum.at/2` (O(n)), turning O(n²) motions into O(n).
- Motion: binary pattern matching for char classification. `classify_char/1` with guards replaces the `word_char?` regex. Hot helpers inlined with `@compile {:inline, ...}`.
- Motion: multi-clause functions replacing `cond`. `advance_word_forward/4`, `advance_word_end/4`, and bracket-scan helpers extracted as multi-clause functions.
- Editor: `content_and_cursor/1`. 12 separate `content()` + `cursor()` GenServer call pairs replaced with a single round-trip.
- Git: in-memory diffing. `Git.Buffer` caches HEAD content and diffs against the current buffer using `List.myers_difference/2` in pure Elixir. No `git diff` subprocess is spawned on edits; git commands run only at buffer open and explicit stage operations.
- BEAM VM tuning (see below)
BEAM VM Tuning
The BEAM's defaults are tuned for web servers: many concurrent connections, high throughput, long-running processes. An editor is the opposite: single user, latency-sensitive, few processes, bursty allocations from the render loop, and long idle periods between keystrokes. Minga ships custom VM flags that address three areas:
Scheduler and thread configuration (rel/vm.args.eex, bin/minga)
Scheduler count (+S) and dirty IO threads (+SDio) are left at BEAM defaults. Reducing schedulers provides negligible benefit once busy-waiting is disabled (see below), and risks starving background work (agent streaming, LSP, git) under load.
| Flag | Default | Minga | Why |
|---|---|---|---|
| `+A 4` | 1 | 4 | Async threads handle Port I/O (Zig renderer, tree-sitter parser). 4 gives headroom for both Ports plus file operations. |
Scheduler wake/sleep behavior
| Flag | Default | Minga | Why |
|---|---|---|---|
| `+sbwt none` | short | none | Disable busy-waiting. The editor is idle between keystrokes; busy-waiting burns CPU and battery for nothing. |
| `+sbwtdcpu none` | short | none | Same for dirty CPU schedulers. |
| `+sbwtdio none` | short | none | Same for dirty IO schedulers. |
| `+swt very_low` | low | very_low | Wake schedulers faster when a keystroke arrives. Low latency matters more than throughput for an editor. |
| `+swtdcpu very_low` | low | very_low | Same for dirty CPU schedulers. |
| `+swtdio very_low` | low | very_low | Same for dirty IO schedulers. |
Memory allocators
| Flag | Default | Minga | Why |
|---|---|---|---|
| `+MBas aobf` | bf | aobf | Address-order best fit reduces fragmentation for bursty allocation patterns (render loop builds then discards IO lists every frame). |
| `+Mea min` | (default) | min | Return unused memory carriers to the OS sooner. Editors should have a small idle footprint. |
| `+MBacul 0` | (default) | 0 | Abandon carriers more aggressively. Don't hold memory the editor isn't using. |
| `+hmbs 32768` | 262144 | 32768 | Lower minimum binary virtual heap. Triggers binary GC sooner on processes that churn binaries (Editor, Buffer.Server during render and streaming). |
Per-process GC tuning (Elixir code)
Applied in init/1 of hot GenServer processes:
```elixir
Process.flag(:fullsweep_after, 20)   # Default is 65535; frequent full GC reclaims dead binary refs
Process.flag(:min_heap_size, 4096)   # Pre-allocate larger heap; avoids repeated grow-and-GC cycles
```

Processes with this tuning:

- `Minga.Editor` (render loop, state management)
- `Minga.Buffer.Server` (gap buffer, binary churn on edits)
- `Minga.Port.Manager` (binary render commands every frame; fullsweep only, no min_heap bump)
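As a sketch of where these flags live, a hot GenServer can set them at the top of `init/1`. The module name here is illustrative, not Minga code:

```elixir
defmodule HotServer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # Aggressive fullsweep: reclaim dead refc-binary references quickly.
    Process.flag(:fullsweep_after, 20)
    # Larger initial heap (in words): fewer grow-and-GC cycles under bursty load.
    Process.flag(:min_heap_size, 4096)
    {:ok, %{}}
  end
end
```

Because these are per-process flags set by the process itself, they apply identically in dev and release builds, independent of any VM-level `ERL_FLAGS`.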
Development vs release
- `mix minga`: VM flags aren't available after the BEAM starts. Use the `bin/minga` wrapper script, which sets `ERL_FLAGS` before launching Mix, or set `ERL_FLAGS` manually.
- Release / Burrito: `rel/vm.args.eex` is compiled into the release and read by the Burrito launcher automatically. No user action needed.
- Per-process GC tuning: applied in Elixir code at process startup. Works in both dev and release.
Overriding flags
Any flag can be overridden at runtime. Flags in ERL_FLAGS are appended after the vm.args defaults, so later flags win:
```shell
# Try more aggressive scheduler reduction
ERL_FLAGS="+S 2:2" bin/minga

# Disable all custom flags (pass empty to override)
ERL_FLAGS=" " mix minga
```
Future work
- Measure memory footprint with `:erlang.memory/0` and OS RSS before/after.
- Profile render latency via `[render:content]` log timings.
Agent Edit Performance
Agent tools (EditFile, WriteFile, ReadFile) are a distinct performance domain from human keystroke editing. An agent making 20 edits to a single file in a refactoring pass creates a different bottleneck profile than a human typing one character at a time.
Current path: filesystem round-trips
Each `edit_file` call does: `File.read` (4 syscalls) → `String.split` (find match) → `String.replace` → `File.write` (3 syscalls). For 20 edits to one file, that's 20 full read-modify-write cycles, 140 syscalls, and 20 FileWatcher notifications that trigger "file changed on disk" checks.
The raw I/O isn't the bottleneck (OS page cache makes reads/writes effectively memory operations). The problems are:
- Redundant work per edit. Each `File.read` re-reads content the buffer already holds. Each `File.write` triggers a watcher event that may cause a buffer reload, competing with the next edit.
- No batching. 20 tool calls = 20 filesystem round-trips. Each one allocates a fresh binary for the full file content, does the string replace, then discards it.
- No incremental tree-sitter sync. The parser doesn't learn about individual edits; it gets a full reparse trigger after each file watcher notification.
Planned path: buffer-routed edits
Route through `Buffer.Server.apply_text_edits/2`: one GenServer call with a list of edits, one undo entry, one version bump, one `EditDelta` batch for tree-sitter. Zero filesystem events.
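To illustrate why a batch beats per-edit round-trips, here is a minimal sketch of applying a list of edits to a binary in one pass. The `{start, length, replacement}` byte-range shape and the `BatchEdit` module are assumptions for illustration, not Minga's actual `apply_text_edits/2` contract:

```elixir
defmodule BatchEdit do
  # Apply a batch of {start_offset, delete_length, replacement} byte-range
  # edits to a binary. Edits are applied back-to-front so earlier offsets
  # stay valid as later text shifts.
  @spec apply_edits(binary(), [{non_neg_integer(), non_neg_integer(), binary()}]) :: binary()
  def apply_edits(content, edits) do
    edits
    |> Enum.sort_by(fn {start, _len, _rep} -> start end, :desc)
    |> Enum.reduce(content, fn {start, len, replacement}, acc ->
      prefix = binary_part(acc, 0, start)
      suffix = binary_part(acc, start + len, byte_size(acc) - start - len)
      prefix <> replacement <> suffix
    end)
  end
end
```

One call amortizes the allocation and bookkeeping across all edits, which is the same shape of win the buffer-routed path gets at the GenServer level.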
| Metric | Filesystem (20 edits) | Buffer (20 edits, 1 batch) |
|---|---|---|
| Syscalls | ~140 (7 per edit × 20) | 0 |
| Binary allocations | 40 (read + write per edit) | 1 (edit list) |
| FileWatcher events | 20 | 0 |
| Tree-sitter reparses | 20 (one per watcher notification) | 1 (incremental) |
| Undo entries | 0 (not on stack) | 1 |
Buffer fork vs git worktree
For multi-agent concurrent editing, the performance gap is dramatic:
| Operation | Git worktree | Buffer fork |
|---|---|---|
| Create | Seconds to minutes (clone directory, cold caches) | Microseconds (copy a struct into a new process) |
| Memory | Full checkout + separate _build + deps | Two binaries + a few integers |
| First build | Cold (recompile everything) | Shared _build and deps |
| Merge | git merge (disk-based, may need conflict resolution) | Three-way merge in memory, instant for non-overlapping changes |
| Cleanup | git worktree remove + git worktree prune | Process exits, garbage collected automatically |
See BUFFER-AWARE-AGENTS.md for the full design.
Table of Contents
- Gap Buffer: Eliminate Repeated Full-Text Materialization
- Gap Buffer: Replace Grapheme Lists with Binary Walking
- Gap Buffer: Use IOdata Instead of Binary Concatenation
- Motion Module: Avoid Temporary Document + Content Copies
- Editor: Batch Remaining Buffer.Server Mutations into Single Calls
- Editor: Pre-build Render Binaries with IOlists
- Undo/Redo: Structural Sharing via Zipper or Diff-Based Stack
- TextObject: Avoid Flattening Entire Buffer for Paren Matching
- Picker: Cache Downcased Labels and Use ETS for Large Lists
- Keymap Trie: Compile to a Persistent Map for JIT-Friendly Lookups
- Port Protocol: Leverage Sub-Binary References
- GC Pressure: Reduce Short-Lived Allocations in the Render Loop
- Process Architecture: Consider ETS for Shared Read-Only State
- JIT-Specific: Help the BEAM JIT Generate Better Native Code
- Benchmarking Strategy
1. Gap Buffer: Eliminate Repeated Full-Text Materialization
Problem
Nearly every query function in `Document` calls `content/1`, which does `before <> after_`, an O(n) binary concatenation that allocates a fresh copy of the entire buffer content. Functions like `line_at/2`, `lines/3`, and `content_range/3` then immediately split that copy into a list of lines with `String.split/2`.
For a 10,000-line file, every single cursor motion triggers:
- `content/1` → allocate ~500 KB binary
- `String.split("\n")` → allocate a list of 10,000 binaries
- `String.graphemes()` → allocate a list of ~300,000 graphemes
This is the single largest performance bottleneck.
Fix
Cache line metadata on the struct. Maintain a list (or tuple) of `{byte_offset, grapheme_count}` per line, updated incrementally on insert/delete. Then `line_at/2` can extract a sub-binary directly from `before` or `after_` without materializing the full content:
```elixir
defstruct [:before, :after, :cursor_line, :cursor_col, :line_count,
           :line_index]  # line_index: [{byte_offset, byte_length}]
```

For line extraction, use `binary_part/3` on the appropriate half:
```elixir
@spec line_at(t(), non_neg_integer()) :: String.t() | nil
def line_at(%__MODULE__{} = buf, line_num) do
  case lookup_line(buf.line_index, line_num) do
    nil -> nil
    {offset, length} -> extract_line(buf.before, buf.after, offset, length)
  end
end
```

This turns `line_at/2` from O(n) to O(1) and eliminates the temporary allocation entirely. The JIT can optimize `binary_part/3` calls into direct pointer arithmetic on the heap binary.
Impact
Critical. Every keystroke in the editor calls `line_at` or `lines` at least once (for rendering) and often 3–4 times (motion + render). With a 10K-line file, this alone would cut per-keystroke allocations by ~90%.
2. Gap Buffer: Replace Grapheme Lists with Binary Walking
Problem
Functions like `pop_last_grapheme/1` and motion helpers still convert binaries into grapheme data structures. While motions now use tuples (O(1) indexing), the initial `String.graphemes/1` call is still O(n) in both time and memory.
Fix
Use `String.next_grapheme/1` or `String.next_grapheme_size/1` to walk the binary in place without materializing the full list. The BEAM JIT generates excellent native code for binary matching: it can process UTF-8 bytes with near-C performance using the `bs_match` JIT primitive.
For random access by grapheme index, the current tuple approach is good. The next step is eliminating the need for full-buffer grapheme conversion in motions by working on individual lines (enabled by the line index cache from optimization #1).
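A minimal sketch of the walking pattern, counting graphemes in constant memory instead of building a list first (illustrative helper, not Minga code):

```elixir
defmodule BinWalk do
  # Count graphemes by walking the binary with String.next_grapheme/1.
  # No intermediate list is allocated; memory use is constant.
  def grapheme_count(bin), do: count(bin, 0)

  defp count(bin, acc) do
    case String.next_grapheme(bin) do
      nil -> acc
      {_grapheme, rest} -> count(rest, acc + 1)
    end
  end
end
```

The same tail-recursive shape works for any scan that previously did `String.graphemes/1 |> Enum.something`: each step pattern-matches one grapheme off the front and recurses on the rest, which the JIT compiles to efficient `bs_match` code.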
Impact
High. Eliminates the remaining O(n) allocation in every motion call. Most impactful after the line index cache is implemented.
3. Gap Buffer: Use IOdata Instead of Binary Concatenation
Problem
`insert_char/2` does `before <> char`, which copies the entire `before` binary to append a single character. For rapid typing (e.g., pasting text), this is O(n²) in the size of `before`.
Fix
Store `before` as an iolist instead of a flat binary. Appending becomes O(1): `[before | char]`. Flatten to binary only when needed (content extraction, file save). The Port protocol's `IO.iodata_to_binary/1` already handles iolists natively.
```elixir
# Insert becomes O(1)
def insert_char(%__MODULE__{before: before} = buf, char) do
  %{buf | before: [before | char], ...}
end

# Content materialization (only when needed)
def content(%__MODULE__{before: before, after: after_}) do
  IO.iodata_to_binary([before | after_])
end
```

Caveat: this changes the internal representation. `binary_part/3`, `byte_size/1`, and pattern matching on `before` would need adjustment. Consider this a Phase 2 optimization after the line index cache.
Impact
Medium-High. Primarily affects insert-heavy workloads (typing, pasting). Transforms O(n) per character to O(1).
4. Motion Module: Avoid Temporary Document + Content Copies
Problem
Every motion function receives a `Document.t()` and calls `Document.content/1` (binary concat), then `String.graphemes/1` (tuple allocation), then `String.split/2` (another list). The Editor's `apply_motion/2` creates a temporary `Document.new(content)`, so the content is still materialized and copied into a throwaway struct.

While `content_and_cursor/1` reduced GenServer round-trips from 3 to 2, the temporary Document allocation and content copy remain.
Fix
Move motion execution into the `Buffer.Server` process via an `apply_motion/2` GenServer call. The motion function runs inside the server where the gap buffer already lives (zero copies):

```elixir
# In Buffer.Server
def handle_call({:apply_motion, motion_fn}, _from, state) do
  new_pos = motion_fn.(state.document, Document.cursor(state.document))
  new_buf = Document.move_to(state.document, new_pos)
  {:reply, :ok, %{state | document: new_buf}}
end
```

This eliminates the content copy, the temporary Document, and reduces the remaining 2 GenServer calls to 1.
Impact
High. Eliminates all temporary allocations per motion. For a 50,000-line file, that's ~3 MB saved per keystroke.
5. Editor: Batch Remaining Buffer.Server Mutations into Single Calls
Problem
Several editor commands still issue multiple sequential GenServer calls to `BufferServer`. The `change_line` command was fixed (it now uses `clear_line/1`), but others remain:

- `join_lines`: 4+ GenServer calls (cursor, get_lines, move_to, delete_at, N × delete_at for whitespace, insert_char)
- `indent_line`: 3 calls (cursor, move_to, insert_char, move_to)
- `dedent_line`: 3+ calls
- `toggle_case`: 4 calls (cursor, get_lines, delete_at, insert_char)
Fix
Add compound operations to `BufferServer` / `Document`:

```elixir
# In Buffer.Server
def handle_call({:join_lines, line}, _from, state) do
  {new_buf, _} = Document.join_lines(state.document, line)
  {:reply, :ok, push_undo(state, new_buf) |> mark_dirty()}
end
```

Similarly for `toggle_case_at`, `indent_line`, and `dedent_line`.
Impact
Medium-High. Eliminates O(n) GenServer overhead for multi-step operations. Each GenServer call involves message copying, scheduling, and reply matching.
6. Editor: Pre-build Render Binaries with IOlists
Problem
`do_render/1` builds render commands by concatenating strings with `<>` for padding (`String.pad_trailing`, `String.duplicate`) and joining. Each `Protocol.encode_draw/4` call allocates a fresh binary.
The full render pipeline allocates hundreds of small binaries per frame that are immediately sent to the Port and become garbage.
Fix
Use iolists throughout the render pipeline. `Port.command/2` accepts iolists natively, so there's no need to flatten to binary.

More impactful: in `PortManager.handle_cast({:send_commands, ...})`, skip the intermediate `IO.iodata_to_binary(commands)` flattening, since `Port.command/2` accepts iolists directly.

Note: validate `{:packet, 4}` framing compatibility with iolists.
Impact
Medium. Saves one full-buffer-sized allocation per render frame. At 60 fps with a full-screen terminal, that's ~60 allocations/second of 10–50 KB each.
7. Undo/Redo: Structural Sharing via Zipper or Diff-Based Stack
Problem
Every mutation pushes a complete copy of the `Document.t()` struct onto the undo stack. The struct contains the `before` and `after_` binaries, which together hold the full file content. For a 1 MB file, each keystroke adds ~1 MB to the undo stack. With a `@max_undo_stack` of 1000, that's potentially ~1 GB of undo history.
The BEAM's garbage collector runs per-process, so this all lives in the Buffer.Server process heap and causes increasingly expensive GC pauses.
Fix
Store diffs instead of full snapshots:
```elixir
@type undo_entry :: {
        cursor_before :: position(),
        operations :: [{:insert, position(), String.t()} | {:delete, position(), String.t()}]
      }
```

Each undo entry is typically a few bytes (the inserted/deleted text plus cursor position). Undo replays the inverse operations; redo replays forward.
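A sketch of the replay mechanics on a flat binary. The byte-offset operation shape and the `UndoSketch` module are illustrative assumptions, simplified from the line/col `position()` type used above:

```elixir
defmodule UndoSketch do
  # Forward application of a single operation.
  def apply_op(content, {:insert, offset, text}) do
    {pre, post} = split_at(content, offset)
    pre <> text <> post
  end

  def apply_op(content, {:delete, offset, text}) do
    {pre, rest} = split_at(content, offset)
    # Sanity check: the text we expect to delete is actually there.
    ^text = binary_part(rest, 0, byte_size(text))
    pre <> binary_part(rest, byte_size(text), byte_size(rest) - byte_size(text))
  end

  # Undo = apply the inverse; redo = apply forward again.
  def invert({:insert, offset, text}), do: {:delete, offset, text}
  def invert({:delete, offset, text}), do: {:insert, offset, text}

  defp split_at(bin, offset) do
    {binary_part(bin, 0, offset), binary_part(bin, offset, byte_size(bin) - offset)}
  end
end
```

Because each entry stores only the affected text, the undo stack's size is proportional to the edits made, not to the file.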
Impact
High for large files. Reduces undo stack memory from O(n × stack_size) to O(delta × stack_size). For a 1 MB file with 1000 undo entries, this could reduce memory from ~1 GB to ~1 MB.
8. TextObject: Avoid Flattening Entire Buffer for Paren Matching
Problem
`find_delimited_pair/4` calls `flatten_with_positions/1`, which creates a list of `{grapheme, {line, col}}` tuples for the entire buffer. For a 10,000-line file with 50 chars/line, that's 500,000 tuples (~40 MB of list cells plus tuple headers).

It then does linear scans with `Enum.at/2` on this flat list.
Fix
Scan backward/forward from the cursor position using binary walking on the raw content, tracking line/col as you go. This requires no allocation beyond the final result positions.
Alternatively, use the gap buffer's `before` and `after_` directly: scan backward through `before` for the opening delimiter and forward through `after_` for the closing one.
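A sketch of that two-direction scan over the gap halves, assuming single-byte ASCII delimiters; the `PairScan` module and its byte-index return shape are illustrative, not Minga's API:

```elixir
defmodule PairScan do
  # Find the enclosing delimiter pair around the gap (cursor).
  # Returns {index_in_before, index_in_after} or nil, tracking nesting depth.
  def find_pair(before, after_, open, close) do
    with {:ok, i} <- scan_back(before, byte_size(before) - 1, open, close, 0),
         {:ok, j} <- scan_fwd(after_, 0, open, close, 0) do
      {i, j}
    else
      _ -> nil
    end
  end

  defp scan_back(_bin, -1, _open, _close, _depth), do: :not_found

  defp scan_back(bin, i, open, close, depth) do
    case :binary.at(bin, i) do
      ^open when depth == 0 -> {:ok, i}
      ^open -> scan_back(bin, i - 1, open, close, depth - 1)
      ^close -> scan_back(bin, i - 1, open, close, depth + 1)
      _ -> scan_back(bin, i - 1, open, close, depth)
    end
  end

  defp scan_fwd(bin, i, _open, _close, _depth) when i >= byte_size(bin), do: :not_found

  defp scan_fwd(bin, i, open, close, depth) do
    case :binary.at(bin, i) do
      ^close when depth == 0 -> {:ok, i}
      ^close -> scan_fwd(bin, i + 1, open, close, depth - 1)
      ^open -> scan_fwd(bin, i + 1, open, close, depth + 1)
      _ -> scan_fwd(bin, i + 1, open, close, depth)
    end
  end
end
```

Only the final indices are allocated; the buffer halves are read in place. A production version would additionally convert byte indices to line/col positions and handle multi-byte graphemes.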
Impact
High for large files. Eliminates a 40 MB allocation for paren matching on a 500K-character file.
9. Picker: Cache Downcased Labels and Use ETS for Large Lists
Problem
`refilter/1` runs on every keystroke in the picker. For each item, it calls `String.downcase/1` on the label and description. With 10,000 files in a project, that's 20,000 `String.downcase/1` calls per keystroke.

`length/1` is also called repeatedly on the same lists, and `length/1` is O(n) on linked lists.
Fix
- Pre-compute downcased labels at picker construction time.
- Cache the filtered count as a field instead of calling `length/1`.
- For very large candidate sets (>5000 items), consider ETS with match specs or a sorted index for prefix matching.
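The pre-computation point can be sketched like this; the `PickerItems` module and its `{label, downcased}` pair shape are assumptions, not the picker's actual data model:

```elixir
defmodule PickerItems do
  # Downcase each label once, at picker construction time.
  def prepare(labels) do
    Enum.map(labels, fn label -> {label, String.downcase(label)} end)
  end

  # Per-keystroke filter: one downcase for the query, then cheap
  # binary containment checks against the cached downcased labels.
  def filter(prepared, query) do
    q = String.downcase(query)
    for {label, down} <- prepared, String.contains?(down, q), do: label
  end
end
```

This moves the 20K downcase calls from the per-keystroke path to a one-time setup cost.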
Impact
Medium. Eliminates 20K+ `String.downcase/1` allocations per keystroke in the picker.
10. Keymap Trie: Compile to a Persistent Map for JIT-Friendly Lookups
Problem
The keymap trie is already efficient; `Map.fetch/2` on small maps is fast. However, the trie is rebuilt from `Defaults` every time the editor starts, and for the leader trie, `Defaults.leader_trie()` is called on every SPC keypress.
Fix
- Module-attribute compile-time construction: build the trie at compile time using module attributes so it lives in the literal pool.
- Use `:persistent_term` for runtime-registered keybindings.
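A sketch of the `:persistent_term` variant; the `{:minga, :leader_trie}` key and the trie contents are illustrative assumptions:

```elixir
# Store the trie once (e.g. at startup or when keybindings change).
# :persistent_term stores the term globally; reads do not copy it
# onto the caller's heap, so lookups are cheap from any process.
:persistent_term.put({:minga, :leader_trie}, %{"f" => %{"f" => :find_file}})

# Lookup on keypress: no Agent/GenServer call, no rebuild.
trie = :persistent_term.get({:minga, :leader_trie})
:find_file = get_in(trie, ["f", "f"])
```

Note that `:persistent_term.put/2` is deliberately expensive (it can trigger a scan of all processes), so it fits rarely-changing data like keybindings, not per-keystroke state.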
Impact
Low. Polish optimization.
11. Port Protocol: Leverage Sub-Binary References
Problem
For render commands going out, ensure large text payloads in `encode_draw/4` use sub-binaries of the original line text rather than copies.
Fix
Use `binary_part/3` to create sub-binary references (zero-copy) when the source is a reference-counted binary.
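For example (values here are illustrative):

```elixir
# A binary larger than 64 bytes is reference-counted (stored off-heap).
# binary_part/3 on it yields a sub-binary that points into the original
# storage rather than copying the bytes.
line = String.duplicate("x", 100) <> "payload"
payload = binary_part(line, 100, 7)
```

The flip side: a sub-binary keeps the whole parent binary alive, so long-lived small slices of large lines should be copied out with `:binary.copy/1` instead.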
Impact
Low. Micro-optimization for large text payloads in draw commands.
12. GC Pressure: Reduce Short-Lived Allocations in the Render Loop
Problem
The render loop allocates many short-lived binaries per frame that become garbage after `Port.command/2` sends them.
Fix
- Pre-compute tilde row commands once and reuse them.
- Cache the modeline template; only rebuild changed segments.
- Use `@compile {:inline, [...]}` for small render helpers.
- Consider `:erlang.garbage_collect(self(), type: :minor)` after render.
- Use `Process.flag(:min_heap_size, n)` on the Editor process.
Impact
Medium. Smooths out latency spikes from GC pauses during rapid typing.
13. Process Architecture: ETS for Shared Read-Only State
Status: Partially Complete
The three highest-contention GenServer stores have been migrated to ETS with read_concurrency: true:
- ✅ `Config.Options` (#156): every render frame and keystroke read options via `Agent.get`. Now a direct ETS lookup. ~4x faster per read.
- ✅ `Diagnostics` (#155): gutter signs and minibuffer hints are read on every frame, previously serialized behind LSP publish writes. Reads now bypass the GenServer entirely; writes still go through the GenServer for subscriber notifications.
- ✅ `Keymap.Active` (#157): binding lookups on every keystroke went through `Agent.get`. Now a direct ETS lookup.
Remaining
The Editor process still creates a temporary `Document.new(content)` for motions, copying the full content across processes. Best approach: move motions into the `Buffer.Server` process (see #4).
Impact
Medium-High. The ETS migrations eliminate all `GenServer.call` round-trips from the per-keystroke and per-frame hot paths (#159). The remaining buffer content copy is addressed by #4.
14. JIT-Specific: Help the BEAM JIT Generate Better Native Code
Several patterns remain that can help the JIT:
a. Avoid Enum for Small, Known-Size Collections
`Enum` functions add overhead from protocol dispatch and anonymous function calls. For small collections, use direct recursion or comprehensions:
```elixir
# Faster for small lists: list comprehension (JIT-optimized)
for {text, fg, bg, opts} <- segments, do: ...
```

b. Mark Hot Functions for Inlining

Additional hot functions in the render path and buffer operations could benefit from `@compile {:inline, ...}`.
Impact
Low-Medium. Individual micro-optimizations that compound in hot loops.
15. Benchmarking Strategy
Telemetry
The keystroke-to-render critical path is instrumented with `:telemetry` spans. Set `:log_level_render` to `:debug` to see per-stage timing in `*Messages*`. See CONTRIBUTING.md for the full event list, example output, and how to attach custom handlers for histograms or percentile tracking.
Tools
- `Benchee`: micro-benchmarks for individual functions
- `:timer.tc/1`: quick timing in IEx
- `:erlang.statistics(:reductions)`: BEAM work units (proxy for CPU)
- `:erlang.process_info(pid, :memory)`: per-process heap size
- `:erlang.process_info(pid, :garbage_collection)`: GC stats
- `recon`: production-ready process inspection
- `eflame` / `eflambe`: flame graphs for BEAM processes
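For example, `:timer.tc/1` wraps a fun and returns `{microseconds, result}`, which is enough for quick comparisons in IEx:

```elixir
# Time a candidate hot path; :timer.tc/1 returns {elapsed_µs, result}.
{micros, sum} = :timer.tc(fn -> Enum.sum(1..1_000_000) end)
IO.puts("Enum.sum took #{micros} µs, result #{sum}")
```

For anything beyond a sanity check, prefer Benchee, which handles warmup, multiple runs, and statistical reporting.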
What to Measure
- Per-keystroke latency: time from `handle_info({:minga_input, ...})` to `Port.command/2` completion. Target: < 1 ms for normal mode, < 5 ms for complex operations.
- Memory per buffer: `:erlang.process_info(buf_pid, :memory)` with files of 1K, 10K, and 100K lines. Track growth over 1000 edits.
- GC pause frequency: monitor `{:garbage_collection, info}` trace events on the Editor process during sustained typing.
- Allocation rate: use `:erlang.system_info(:alloc_util_allocators)` before/after a burst of operations.
Existing Performance Test
The project already has `test/perf/document_perf_test.exs`. Extend it with benchmarks for:
- Motion on 10K-line buffer (word_forward, paragraph, bracket match)
- Render cycle with full viewport
- Picker filtering with 5000 candidates
- 1000 sequential inserts (typing simulation)
Priority Order
Implement optimizations in this order for maximum impact:
- #1 Line index cache (eliminates the #1 allocation source)
- #4 Move motions into Buffer.Server (eliminates cross-process copies)
- #5 Batch remaining mutations (eliminates O(n) GenServer calls)
- #7 Diff-based undo (memory)
- #2 Binary walking (allocation reduction)
- #8 TextObject scanning (large file paren matching)
- #14 JIT-specific patterns (polish)
- Everything else