Matching Brackets in Arm64-assembly: Complete Solution & Deep Dive Guide

Mastering Bracket Validation in Arm64 Assembly: A Zero-to-Hero Guide

Solving the matching brackets problem in Arm64 assembly requires implementing a stack data structure manually. The core algorithm iterates through the input string, pushing opening brackets onto the stack and popping them when a corresponding closing bracket is found. A perfectly balanced string results in an empty stack.

You’ve just been handed a critical task. You're the new programmer for the legendary Bracketeer™, an ancient but immensely powerful mainframe computer. Its architecture is rigid, its software unforgiving. The slightest syntax error, specifically an unbalanced bracket, brace, or parenthesis in its proprietary source code, causes a catastrophic system crash requiring a full, time-consuming reboot. The pressure is on. Your mission is to write a low-level validation utility in Arm64 assembly to prevent these crashes, ensuring that every piece of code fed to the Bracketeer™ is perfectly balanced. This isn't just about writing code; it's about taming a beast and ensuring the stability of a critical system, using nothing but the raw power of assembly language.

What is the Bracket Matching Problem?

The "Matching Brackets" problem, also known as the "Balanced Parentheses" or "Balanced Symbols" problem, is a classic challenge in computer science. It's a fundamental concept that forms the bedrock of parsing, syntax analysis, and compiler design. The goal is to determine if a sequence of various types of brackets is "well-formed" or "balanced."

The Core Rules of a Balanced Sequence

A string of brackets is considered balanced if it adheres to two simple yet strict rules:

Every opening bracket—(, [, or {—must have a corresponding closing bracket—), ], or }—of the same type.
The pairs of brackets must be correctly nested. An opening bracket that appears later in the string must be closed before an opening bracket that appeared earlier.

For example, {[()]} is balanced. The innermost () is a valid pair. The next level, [()], is also valid. Finally, the outermost {[()]} is a correct pair. Conversely, {[(])} is not balanced because the ( is closed by a ] before its matching ), violating the nesting rule.

In our scenario, the validator must also be capable of ignoring any characters that are not brackets, focusing solely on the structural integrity of the code's bracket syntax. For instance, in the string "int main() { return 0; }", the validator should only care about the () and {} pairs, ignoring all the other text.

Why Solve This in Arm64 Assembly?

Tackling this problem in a high-level language like Python or Java is relatively straightforward, thanks to built-in data structures like stacks and dynamic arrays. However, choosing to implement it in Arm64 assembly presents a unique set of challenges and learning opportunities that are invaluable for any serious programmer.

The Low-Level Perspective

Ultimate Performance: Assembly language provides direct, unabstracted control over the processor. For a utility that might scan millions of lines of code on the Bracketeer™, every clock cycle counts. An assembly implementation, when written correctly, will be significantly faster and more memory-efficient than any high-level equivalent.
No Safety Nets: In assembly, there are no built-in data structures. You can't just import a Stack class. You are responsible for managing memory directly, allocating space on the program stack, and manipulating pointers to simulate the behavior of a stack. This forces a deep understanding of how data structures actually work under the hood.
Understanding the ABI: Writing this function in Arm64 requires adherence to the Application Binary Interface (ABI), specifically the AAPCS64. You must understand how arguments are passed (in registers like x0, x1), how return values are handled (in x0), and which registers must be preserved across function calls.
Direct Hardware Interaction: You are writing instructions that the CPU executes directly. This level of control is essential for systems programming, embedded systems, and performance-critical applications where you need to squeeze every last drop of power from the hardware.

By solving this problem in Arm64, you're not just finding a solution; you're mastering the fundamental principles of computing, from memory management to CPU instruction sets. This knowledge is transferable and provides a powerful foundation for understanding how all software, regardless of the language it's written in, ultimately interacts with the machine. Explore our complete Arm64 Assembly language guide for more foundational concepts.

How to Approach the Solution: The Stack is Key

The most elegant and efficient way to solve the bracket matching problem is by using a Stack data structure. A stack operates on a Last-In, First-Out (LIFO) principle. The last item you add (push) to the stack is the first item you can remove (pop). This LIFO behavior perfectly mirrors the nesting requirement of brackets.

The Core Algorithm

The logic can be broken down into a simple set of steps:

Initialize an Empty Stack: In assembly, this means allocating a region of memory on the program's stack and setting up a pointer to track the "top" of our data stack.
Iterate Through the Input String: We process the string one character at a time, from beginning to end.
Decision Logic: For each character, we do one of the following:
- If it's an opening bracket ((, [, {), we push it onto our stack.
- If it's a closing bracket (), ], }), we check the top of the stack.
  - If the stack is empty, it means we have a closing bracket with no opener. The string is unbalanced.
  - If the stack is not empty, we peek at the top element. If it's the corresponding opening bracket, we pop it from the stack and continue.
  - If the top element does not match, the string is unbalanced.
- If the character is not a bracket, we simply ignore it and move to the next one.
Final Check: After iterating through the entire string, we check our stack one last time.
- If the stack is empty, it means every opening bracket found a matching closing bracket. The string is balanced.
- If the stack is not empty, it means there are leftover opening brackets that were never closed. The string is unbalanced.

ASCII Art Diagram: The LIFO Stack Logic

This diagram illustrates how a stack processes a balanced string like "{()}".

Input String: "{()}"

  ● Start Processing
  │
  ▼
┌───────────┐
│ Read '{'  │
└─────┬─────┘
      │
      ▼ Push '{'
  ┌─────────┐
  │ Stack:  │
  │   {     │
  └─────────┘
      │
      ▼
┌───────────┐
│ Read '('  │
└─────┬─────┘
      │
      ▼ Push '('
  ┌─────────┐
  │ Stack:  │
  │   (     │
  │   {     │
  └─────────┘
      │
      ▼
┌───────────┐
│ Read ')'  │
└─────┬─────┘
      │
      ▼ Peek: Top is '('. Match! Pop.
  ┌─────────┐
  │ Stack:  │
  │   {     │
  └─────────┘
      │
      ▼
┌───────────┐
│ Read '}'  │
└─────┬─────┘
      │
      ▼ Peek: Top is '{'. Match! Pop.
  ┌─────────┐
  │ Stack:  │
  │ (empty) │
  └─────────┘
      │
      ▼
  ● End of String. Stack is empty.
  │
  └─→ Result: BALANCED

ASCII Art Diagram: Full Algorithm Flow

This flow chart visualizes the complete decision-making process within the loop.

    ● Start
    │
    ▼
┌──────────────────┐
│ Initialize Stack │
│ & Pointers       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Loop: Read Char  │
└────────┬─────────┘
         │
         ▼
    ◆ Is Char EOF? ────────Yes───→ Is Stack Empty? ──Yes──→ ● Return 1 (Balanced)
         │ No                            │ No
         │                               └───────────────→ ● Return 0 (Unbalanced)
         ▼
    ◆ Is Char an Opening Bracket?
   ╱           ╲
  Yes           No
  │              │
  ▼              ▼
┌───────────┐   ◆ Is Char a Closing Bracket?
│ Push Char │  ╱           ╲
│ to Stack  │ Yes           No
└─────┬─────┘  │              │
      │        ▼              ▼
      │   ┌──────────────┐  (Ignore Char)
      │   │ Is Stack     │    │
      │   │ Empty?       ├──Yes──→ ● Return 0 (Unbalanced)
      │   └───────┬──────┘    │
      │           │ No        │
      │           ▼           │
      │   ┌──────────────┐    │
      │   │ Peek Stack.  │    │
      │   │ Match Found? ├──No───→ ● Return 0 (Unbalanced)
      │   └───────┬──────┘    │
      │           │ Yes       │
      │           ▼           │
      │   ┌──────────────┐    │
      │   │ Pop from Stack │    │
      │   └──────────────┘    │
      │           │           │
      └───────────┴─────┬─────┘
                        │
                        ▼
                  (Continue Loop)

The Arm64 Implementation: A Detailed Code Walkthrough

Now, let's dissect the Arm64 assembly code provided in the kodikra learning path. This solution cleverly uses the program stack not just for function calls but also as the data stack for storing brackets.


.text
.globl is_paired

/* extern int is_paired(const char *value);
 *
 * Checks if a string has balanced brackets.
 * Returns 1 for balanced, 0 for unbalanced.
 *
 * Registers:
 * x0: (Argument) Pointer to the input string. Return value (0 or 1).
 * x1: Working pointer for iterating through the string.
 * w2: Holds the current character (byte) being processed.
 * x3: Stack pointer for our bracket stack.
 * w4: Holds a character popped from the stack for comparison.
 */
is_paired:
    // Backup callee-saved registers if we were to use them
    // For this simple function, we only use caller-saved registers.

    // --- Phase 1: Calculate string length and allocate stack space ---
    mov     x1, x0              // x1 = start of string
.scan_len:
    ldrb    w2, [x1], #1        // Load byte from [x1] into w2, then increment x1
    cbnz    w2, .scan_len       // If w2 is not zero (null terminator), loop again
    sub     x1, x1, x0          // x1 = length of string (including null terminator)
    
    // Align stack allocation to 16 bytes for performance and ABI compliance
    add     x1, x1, #15
    and     x1, x1, #-16        // x1 = length rounded up to nearest multiple of 16
    sub     sp, sp, x1          // Allocate space on the main program stack
    mov     x3, sp              // x3 will be our dedicated stack pointer

    // --- Phase 2: Main processing loop ---
.main_loop:
    ldrb    w2, [x0], #1        // Load char from input string and advance pointer
    cbz     w2, .end_check      // If char is null terminator, jump to final check

    // Check for opening brackets
    cmp     w2, #'('
    b.eq    .push
    cmp     w2, #'['
    b.eq    .push
    cmp     w2, #'{'
    b.eq    .push

    // Check for closing brackets
    cmp     w2, #')'
    b.eq    .pop
    cmp     w2, #']'
    b.eq    .pop
    cmp     w2, #'}'
    b.eq    .pop

    b       .main_loop          // Not a bracket, ignore and continue loop

.push:
    strb    w2, [x3], #1        // Store the opening bracket onto our stack, increment stack pointer
    b       .main_loop

.pop:
    cmp     x3, sp              // Is our stack pointer at the base?
    b.eq    .unbalanced         // If so, stack is empty. Unbalanced.
    sub     x3, x3, #1          // Decrement stack pointer (move to top element)
    ldrb    w4, [x3]            // Load the char from the top of our stack into w4

    // Compare the popped char with the current closing bracket
    cmp     w2, #')'
    ccmp    w4, #'(', #0, eq    // If w2==')', then check if w4=='('. #0 flag means NZCV=0000
    b.ne    .unbalanced

    cmp     w2, #']'
    ccmp    w4, #'[', #0, eq    // If w2==']', then check if w4=='['.
    b.ne    .unbalanced

    cmp     w2, #'}'
    ccmp    w4, #'{', #0, eq    // If w2=='{', then check if w4=='}'.
    b.ne    .unbalanced

    b       .main_loop          // Match found, continue processing

.end_check:
    cmp     x3, sp              // After string ends, is our stack empty?
    b.eq    .balanced           // If yes, everything was balanced.
    // If not, fall through to unbalanced

.unbalanced:
    mov     w0, #0              // Set return value to 0 (false)
    add     sp, sp, x1          // Deallocate the stack space we reserved
    ret

.balanced:
    mov     w0, #1              // Set return value to 1 (true)
    add     sp, sp, x1          // Deallocate the stack space we reserved
    ret

Step-by-Step Explanation

Phase 1: Initialization and Stack Allocation

mov x1, x0: The address of the input string is passed in register x0. We copy this address to x1, which will be our working pointer to scan the string.
.scan_len loop: This loop determines the length of the string. ldrb w2, [x1], #1 loads a byte from the memory address in x1 into w2, and then increments x1 by 1 (post-index addressing). cbnz w2, .scan_len checks if the loaded byte is non-zero. If it is, it loops. This continues until the null terminator (\0) is found.
sub x1, x1, x0: After the loop, x1 points just past the end of the string. By subtracting the original start address (x0), we get the total length.
add x1, x1, #15 and and x1, x1, #-16: This is a common and efficient trick to round a number up to the nearest multiple of 16. The Arm64 ABI requires the stack pointer (sp) to be 16-byte aligned. Allocating a buffer on the stack with a size that is a multiple of 16 maintains this alignment and can improve performance.
sub sp, sp, x1: We move the main stack pointer down by the calculated size, effectively reserving a block of memory for our bracket stack.
mov x3, sp: We use register x3 as our dedicated stack pointer for the bracket stack. It starts at the base of the allocated region.

Phase 2: The Main Loop

.main_loop: This is the heart of the algorithm. ldrb w2, [x0], #1 reads the next character from the input string (using the original pointer x0) and advances the pointer.
cbz w2, .end_check: If the character is the null terminator, we've reached the end of the string and jump to the final validation step.
cmp and b.eq chains: The code checks if the character w2 is one of the six bracket types. If it's an opening bracket, it branches to .push. If it's a closing bracket, it branches to .pop. If it's none of these, it branches back to the top of .main_loop to process the next character.

The Push and Pop Logic

.push: strb w2, [x3], #1 stores the opening bracket (from w2) at the location pointed to by our stack pointer x3, and then increments x3. This is the "push" operation.
.pop: This section is more complex.
- cmp x3, sp: It first checks if our stack is empty by comparing our stack pointer x3 with the base address sp. If they are equal, it means we tried to pop from an empty stack (e.g., encountering a ) with no preceding (), which is an unbalanced condition.
- sub x3, x3, #1: If the stack is not empty, we decrement our stack pointer. Because we push with post-increment, the pointer is always one byte *past* the top element. So, we must decrement it first to get to the top element.
- ldrb w4, [x3]: We load the character at the top of the stack into register w4.
- cmp/ccmp blocks: This is a clever use of conditional comparison. For cmp w2, #')' followed by ccmp w4, #'(', #0, eq, the second instruction (ccmp) only executes its comparison if the first one was true (the condition code eq is met). If the current character is ), it then checks if the character from the stack is (. If they don't match, the flags are set, and b.ne .unbalanced will trigger. This pattern repeats for [] and {}.

Phase 3: Final Checks and Return

.end_check: When the end of the string is reached, we check if our stack is empty one last time with cmp x3, sp. If it is, every bracket was matched, and we branch to .balanced.
.unbalanced: Sets the return value register w0 to 0.
.balanced: Sets the return value register w0 to 1.
add sp, sp, x1: In both success and failure cases, we must clean up by deallocating the memory we reserved on the stack. We do this by adding back the size we originally subtracted.
ret: Returns control to the calling function.

An Alternative, More Optimized Approach

The provided solution is correct and demonstrates the concepts well. However, it performs two passes over the string: one to calculate the length for stack allocation, and a second to do the actual processing. For very large strings, this can be suboptimal. A more optimized approach would use a single pass.

This alternative strategy involves allocating a fixed-size buffer on the stack and checking for stack overflow, or dynamically growing the stack if needed (though that adds complexity). For most practical purposes, a reasonably large fixed-size buffer is sufficient and much faster.

Here's the logic for a single-pass version:

Subtract a fixed size from the stack pointer (e.g., 4KB) at the beginning of the function. This will be our bracket stack.
Check if the stack pointer x3 exceeds the allocated boundary on every push. If it does, the nesting is too deep for our buffer, and we can consider it an error.
Process the string in a single loop as before.
Deallocate the fixed-size buffer before returning.

This approach eliminates the initial .scan_len loop, making the function's execution time directly proportional to the string length in a single pass, which is theoretically more efficient.

Pros and Cons of the Assembly Approach

While powerful, writing this logic in assembly is a trade-off. It's crucial to understand when this approach is warranted.

Pros (Advantages of Arm64 Assembly)	Cons (Disadvantages of Arm64 Assembly)
Maximum Performance: Direct CPU control means no overhead from interpreters, virtual machines, or high-level abstractions. The code runs as fast as the hardware allows.	High Complexity & Verbosity: Simple operations like a loop or a comparison require multiple lines of code. The logic is much harder to read and understand compared to a high-level language.
Minimal Memory Footprint: You allocate exactly the memory you need. There is no garbage collector or other memory management overhead.	Error-Prone: Manual memory management, pointer arithmetic, and register allocation are difficult to get right. A small mistake can lead to segmentation faults or subtle, hard-to-find bugs.
Deep System Understanding: Forces you to learn about CPU architecture, memory layout, the ABI, and how programs actually execute.	Lack of Portability: This code is specific to the Arm64 architecture. It will not run on x86 or other processor architectures without a complete rewrite.
Ideal for Constrained Environments: Perfect for bootloaders, operating system kernels, device drivers, and embedded systems where resources are scarce.	Slow Development Time: Writing, debugging, and maintaining assembly code is significantly more time-consuming than using a language like Python or Go.

Frequently Asked Questions (FAQ)

What exactly is a stack and why is it perfect for this problem?

A stack is a fundamental data structure that follows the Last-In, First-Out (LIFO) principle. Think of it like a stack of plates: you add a new plate to the top, and you can only remove the topmost plate. This LIFO behavior naturally models the nesting of brackets. When you see an opening bracket, you "remember" it by pushing it onto the stack. When you see a closing bracket, you check if it matches the most recently remembered opening bracket, which is conveniently at the top of the stack.

Why is 16-byte stack alignment important in Arm64?

The Arm 64-bit Architecture Procedure Call Standard (AAPCS64) mandates that the stack pointer (sp) must be 16-byte aligned at any public interface (like the start and end of a function). Adhering to this standard ensures compatibility with other compiled code and system libraries. Furthermore, some advanced SIMD (Single Instruction, Multiple Data) instructions require this alignment for memory access and can perform much more efficiently when data is properly aligned.

Could this logic be extended to handle other character pairs like XML tags (e.g., <div>...</div>)?

Conceptually, yes, but the implementation would become much more complex. Instead of pushing single characters, you would need to push entire strings (the tag names) onto the stack. The matching logic would then involve string comparison instead of character comparison. While the underlying stack-based algorithm remains the same, the assembly code would need to handle dynamic string allocation, storage, and comparison, which is a significantly more involved task.

What happens if the input string is empty?

An empty string is considered balanced. Let's trace the code: 1. The .scan_len loop will immediately find the null terminator. The calculated length will be 1. 2. It will allocate 16 bytes on the stack. 3. The .main_loop starts, and ldrb w2, [x0], #1 immediately loads the null terminator. 4. cbz w2, .end_check will be true, so it jumps to .end_check. 5. In .end_check, cmp x3, sp will be true because nothing was ever pushed. 6. It branches to .balanced, sets w0 to 1, deallocates the stack, and returns. The result is correctly 1 (balanced).

Is recursion a viable alternative for solving this problem?

Yes, recursion can be used, but it's generally less efficient for this specific problem. A recursive solution would implicitly use the program's call stack to store state, which mirrors what we are doing explicitly with our data stack. However, function calls have overhead (saving registers, updating frame pointers), which makes an iterative, explicit stack-based approach faster and less prone to stack overflow errors with deeply nested input.

What tools are needed to compile and run this Arm64 assembly code?

To work with this code, you'll need a cross-compiler toolchain if you're on a non-ARM machine (like x86). The GNU toolchain is a common choice:

Assembler: aarch64-linux-gnu-as to assemble the .s file into an object file (.o).
Linker: aarch64-linux-gnu-ld or aarch64-linux-gnu-gcc to link the object file into an executable.
Emulator: QEMU (qemu-aarch64) to run the AArch64 executable on an x86 host.

If you are on an ARM-based machine like a Raspberry Pi 4, a Mac with Apple Silicon (M1/M2/M3), or an ARM-based cloud server, you can use the native tools like as and gcc directly.

Conclusion: From Theory to Machine Code

We've journeyed from a high-level problem statement—validating bracket syntax for a fictional mainframe—down to the bare-metal implementation in Arm64 assembly. This exercise, part of the exclusive kodikra.com curriculum, demonstrates that behind every elegant algorithm is a foundation of precise, low-level machine instructions. By manually managing a stack and manipulating registers, we gained not only a solution but also a profound appreciation for the efficiency and control that assembly language offers.

Mastering concepts like this at the assembly level builds an incredibly strong foundation, making you a more effective programmer in any language. You now understand the "why" behind the "how" of data structures and memory management. This is the kind of deep knowledge that separates a good programmer from a great one.

To continue your journey into low-level programming, explore the next module in our Arm64 Assembly learning path and deepen your understanding of this powerful architecture.

Disclaimer: The code and explanations are based on the Armv8-A architecture (AArch64) and the AAPCS64 calling convention. Behavior may vary on different architectures or with different ABIs.

Published by Kodikra — Your trusted Arm64-assembly learning resource.

kodikra

Search this blog