Dominoes in Arm64-assembly: Complete Solution & Deep Dive Guide

a 3d rendering of the word 4200 on a brick wall

The Complete Guide to Solving Domino Chains in Arm64 Assembly

Solving the domino chain puzzle in Arm64 Assembly is a classic computer science challenge that elegantly combines graph theory with low-level programming. It requires checking if a given set of dominoes can form a continuous, closed loop by treating domino pips as graph nodes and dominoes as edges, ultimately searching for an Eulerian circuit.

You've seen it in movies: a complex schematic on a screen, lines of cryptic text scrolling by, and a genius programmer saving the day. While Hollywood dramatizes it, the core of that scene often boils down to solving a complex logic puzzle at the most fundamental level of computing. It feels intimidating, like a language spoken only by the machine itself. Many developers spend their careers in the comfortable abstraction of high-level languages, never touching the bare metal.

But what if you could not only understand that language but also command it? Imagine taking a seemingly simple game like dominoes and building a solution from scratch in Arm64 assembly—the language that powers your phone, your laptop, and much of the modern cloud. This article is your guide to doing just that. We will demystify the process, transforming an abstract graph theory problem into concrete, efficient assembly code, proving that you too can master the art of low-level problem-solving.

What is the Domino Chaining Problem?

At its heart, the domino chaining problem is not about the game itself, but about connectivity and paths. The goal is to determine if a given collection of dominoes can be arranged in a single, continuous chain where the ends of the chain also match, forming a closed loop.

Let's define the rules more formally:

Each domino has two halves, each with a number of pips (typically 0 to 6). A domino like [2|1] is treated the same as [1|2].
A valid chain requires that the number of pips on one half of a domino must match the number of pips on the adjacent half of the next domino in the sequence.
For a closed loop, the number of pips on the free half of the first domino must match the number of pips on the free half of the last domino.

For example, given the dominoes [1|2], [2|3], and [3|1], a valid chain is [1|2] [2|3] [3|1]. The start (1) matches the end (1), forming a perfect loop.

The Graph Theory Connection: Eulerian Circuits

This problem is a perfect real-world application of a concept in graph theory known as an Eulerian Circuit (or Eulerian Tour). We can model the problem as a graph:

Vertices (Nodes): The possible pip values on a domino (0, 1, 2, 3, 4, 5, 6).
Edges: Each domino itself. A domino [A|B] represents an edge connecting vertex A and vertex B.

A valid domino chain that forms a closed loop is equivalent to finding an Eulerian circuit in this graph. An Eulerian circuit is a path that visits every single edge exactly once and ends at the starting vertex. A famous theorem by Leonhard Euler states that an undirected graph has an Eulerian circuit if and only if two conditions are met:

Connectivity: All vertices with a non-zero degree (i.e., all pip values that actually appear on the dominoes) must belong to a single connected component. You can't have two separate, unlinked groups of dominoes.
Vertex Degree: Every vertex in the graph must have an even degree. The "degree" of a vertex is the number of edges connected to it. In our case, it's the total count of each pip value across all dominoes.

Our assembly solution will focus on efficiently verifying these two conditions.

Why Use Arm64 Assembly for This Challenge?

In an age of high-level languages like Python and JavaScript, diving into assembly might seem like an academic exercise. However, learning to solve problems in Arm64 assembly provides unparalleled benefits that are highly relevant today, as ARM architecture dominates mobile computing and is rapidly expanding in servers and laptops (like Apple's M-series chips).

Unmatched Performance: Assembly gives you direct control over the CPU. For computationally intensive tasks, a hand-optimized assembly routine can outperform code generated by a compiler.
Deep Hardware Understanding: You learn exactly how the processor works—how data is moved between registers and memory, how arithmetic is performed, and how program flow is controlled. This knowledge makes you a better programmer in any language.
Minimal Resource Usage: Assembly code produces the smallest and most efficient binaries, which is critical for embedded systems, IoT devices, operating system kernels, and high-performance computing.
Problem-Solving Purity: Without the crutch of libraries and frameworks, you are forced to think about the algorithm and data structures from first principles, building everything from the ground up.

This domino challenge, part of the exclusive kodikra.com Arm64-assembly learning path, is designed to build these fundamental skills.

How the Algorithm Works: A Two-Pronged Attack

To verify if a domino chain is possible, we must check the two conditions for an Eulerian circuit: connectivity and even degrees. Our strategy involves two main data structures that we'll manage manually in memory.

Data Structures

Degree Count Array: A simple array of 7 bytes (or integers), where each index from 0 to 6 stores the number of times that pip value appears in our set of dominoes. For a domino [3|3], we would increment the count at index 3 by two.
Disjoint Set Union (DSU) Table: Also known as a Union-Find data structure, this is an elegant way to track the connected components of a graph. We use an array (let's call it table) of size 7. Initially, each element is its own parent: table[i] = i. When we process a domino [A|B], we "union" their sets, effectively merging them into a single connected component.

The Overall Algorithm Flow

Here is a high-level overview of the steps our can_chain function will execute.

    ● Start (Input: dominoes, count)
    │
    ▼
  ┌───────────────────────────┐
  │ Initialize degrees[7] = {0} │
  │ Initialize dsu_table[7] = {0,1,2,3,4,5,6} │
  └────────────┬──────────────┘
               │
               ▼
  ┌───────────────────────────┐
  │ Loop through each domino [A|B] │
  ├───────────────────────────┤
  │   degrees[A]++              │
  │   degrees[B]++              │
  │   Union(A, B) in dsu_table  │
  └────────────┬──────────────┘
               │
               ▼
    ◆ Any non-zero degree odd?
   ╱           ╲
  Yes           No
  │              │
  ▼              ▼
[Return false]  ┌──────────────────────────┐
               │ Check Connectivity       │
               │ (Are all dominoes in one set?) │
               └───────────┬──────────────┘
                           │
                           ▼
                      ◆ Connected?
                     ╱           ╲
                    Yes           No
                    │              │
                    ▼              ▼
               [Return true]  [Return false]
                    │
                    ▼
                  ● End

Step 1: Check Vertex Degrees

This is the most straightforward part. We iterate through all the dominoes. For each domino [A|B], we increment the counters for A and B in our degree array. After processing all dominoes, we scan this array. If we find any pip value that has a non-zero but odd count, we can immediately conclude that an Eulerian circuit is impossible and return false.

Step 2: Check Connectivity with Union-Find

The Union-Find algorithm is perfect for determining connectivity. It consists of two primary operations:

find(i): Finds the representative (or "root") of the set containing element i. It does this by traversing the parent pointers in our table array until it reaches an element that is its own parent. For optimization, we use "path compression," where we make every node on the find path point directly to the root.
union(i, j): Merges the sets containing elements i and j. It first finds the roots of both elements. If the roots are different, it makes one root the parent of the other.

As we iterate through the dominoes to count degrees, we simultaneously perform the union operation on the two pips of each domino. After the loop, all pips that are part of the same connected chain of dominoes will share the same root in our DSU table.

To perform the final connectivity check, we find the root of the very first pip we encountered. Then, we iterate through all other pips that appeared in the domino set and check if they all have the same root. If they do, the graph is connected.

Here is a conceptual flow for the Union-Find logic:

    ● Process Domino [A|B]
    │
    ▼
  ┌────────────┐
  │ root_A = find(A) │
  └──────┬─────┘
         │
         ▼
  ┌────────────┐
  │ root_B = find(B) │
  └──────┬─────┘
         │
         ▼
    ◆ root_A == root_B?
   ╱           ╲
  Yes           No
  │              │
  ▼              ▼
[Already in    ┌───────────────────┐
 same set]    │ dsu_table[root_A] = root_B │
[Do nothing]   │ (Merge the sets)    │
  │            └───────────┬───────┘
  └────────────┬───────────┘
               │
               ▼
             ● Done

Where the Logic is Implemented: A Detailed Arm64 Code Walkthrough

Now, let's translate this logic into a complete, working Arm64 assembly program. The code from the kodikra module provides the core Union-Find helper functions, but we will build the main can_chain function around them to create a full solution.

The function signature in C would look like this: int can_chain(size_t domino_count, const domino_t* dominoes); Where domino_t is a struct like { uint8_t a; uint8_t b; }.

In Arm64, the arguments are passed via registers:

x0: domino_count
x1: pointer to dominoes array

The return value (0 for false, 1 for true) will be in w0.

The Complete Assembly Code


/*
 * domino_t is a struct of two uint8_t.
 * In memory, it's just two adjacent bytes.
 *
 * Function signature:
 * int can_chain(size_t domino_count, const domino_t *dominoes)
 *
 * Arguments:
 * x0: domino_count
 * x1: dominoes (pointer)
 *
 * Return:
 * w0: 1 if a chain can be formed, 0 otherwise.
 */

.text
.globl can_chain

// Helper function to find the root of a set with path compression.
// int find_root(int number, uint8_t *table)
// w0: number (the pip value)
// x1: pointer to the DSU table
find_root:
    mov w9, w0          // w9 = original number
.find_loop:
    ldrb w8, [x1, w0]   // w8 = table[number]
    cmp w8, w0          // if (table[number] == number), it's the root
    beq .path_compression
    mov w0, w8          // number = table[number]
    b .find_loop

.path_compression:
    // w0 now holds the root.
    // Go back and set the parent of the original number to the root.
    strb w0, [x1, w9]   // table[original_number] = root
    ret                 // Return root in w0

can_chain:
    stp x19, x20, [sp, #-48]! // Push callee-saved registers
    stp x21, x22, [sp, #16]
    stp x29, x30, [sp, #32]   // Push frame pointer and link register
    mov x29, sp               // Set up frame pointer

    // Allocate stack space for our local arrays
    // 7 bytes for degrees, 7 bytes for dsu_table.
    // We'll allocate 16 bytes for 16-byte stack alignment.
    sub sp, sp, #16
    mov x20, sp               // x20 = pointer to degrees array
    add x21, sp, #8           // x21 = pointer to dsu_table (offset by 8 bytes)

    // Preserve original arguments
    mov x19, x0               // x19 = domino_count
    mov x22, x1               // x22 = dominoes pointer

    // --- Part 1: Initialization ---
    // Initialize degrees array to all zeros
    mov x1, x20               // Pointer to degrees
    mov w2, #7                // Count
    mov w3, wzr               // Value to store (zero)
.init_degrees_loop:
    subs w2, w2, #1
    strb w3, [x1], #1
    bne .init_degrees_loop

    // Initialize DSU table: table[i] = i
    mov x1, x21               // Pointer to dsu_table
    mov w2, #0                // i = 0
.init_dsu_loop:
    cmp w2, #7
    bge .process_dominoes
    strb w2, [x1, w2]
    add w2, w2, #1
    b .init_dsu_loop

.process_dominoes:
    // --- Part 2: Process Dominoes, Calculate Degrees, and Union Sets ---
    cbz x19, .check_degrees   // If domino_count is 0, skip to checks (will pass)

    mov x1, x22               // x1 = current domino pointer
    mov x3, x19               // x3 = loop counter (domino_count)

.main_loop:
    // Load a domino [A|B]
    ldrh w4, [x1], #2         // Load 2 bytes (A and B) into w4, and advance pointer
                              // B is in bits 15-8, A is in bits 7-0
    ubfx w5, w4, #0, #8       // w5 = A (extract 8 bits from bit 0)
    ubfx w6, w4, #8, #8       // w6 = B (extract 8 bits from bit 8)

    // Increment degrees
    ldrb w8, [x20, w5]        // Load degrees[A]
    add w8, w8, #1
    strb w8, [x20, w5]        // Store degrees[A]
    ldrb w8, [x20, w6]        // Load degrees[B]
    add w8, w8, #1
    strb w8, [x20, w6]        // Store degrees[B]

    // Union the sets for A and B
    // Find root of A
    mov w0, w5
    mov x1, x21
    bl find_root
    mov w9, w0                // w9 = root_A

    // Find root of B
    mov w0, w6
    mov x1, x21
    bl find_root
    mov w10, w0               // w10 = root_B

    // If roots are different, union them: table[root_A] = root_B
    cmp w9, w10
    beq .loop_continue
    strb w10, [x21, w9]

.loop_continue:
    subs x3, x3, #1
    bne .main_loop

.check_degrees:
    // --- Part 3: Check if all degrees are even ---
    mov x1, x20               // Pointer to degrees array
    mov w2, #7                // Loop counter

.degree_check_loop:
    subs w2, w2, #1
    ldrb w3, [x1], #1         // Load next degree value
    cbz w3, .degree_ok        // If degree is 0, it's fine
    tst w3, #1                // Test if the last bit is 1 (is it odd?)
    bne .fail                 // If odd, fail
.degree_ok:
    cmp w2, #0
    bne .degree_check_loop

.check_connectivity:
    // --- Part 4: Check if all pips are in one connected component ---
    cbz x19, .success         // If no dominoes, it's trivially a success

    // Find the root of the first pip that has a non-zero degree
    mov x1, x20               // Pointer to degrees
    mov w2, #-1               // w2 will hold the first pip's value
    mov w3, #0                // Loop counter i = 0
.find_first_pip_loop:
    cmp w3, #7
    bge .first_pip_found      // Should not happen if count > 0
    ldrb w4, [x1, w3]
    cbnz w4, .set_first_pip   // If degree[i] != 0, we found it
    add w3, w3, #1
    b .find_first_pip_loop
.set_first_pip:
    mov w2, w3                // w2 = first pip with non-zero degree

.first_pip_found:
    // Get the root of this first component
    mov w0, w2
    mov x1, x21
    bl find_root
    mov w9, w0                // w9 = root of the first component

    // Now, check all other pips with non-zero degree
    mov x1, x20               // Pointer to degrees
    mov w3, #0                // Loop counter i = 0
.connectivity_loop:
    cmp w3, #7
    bge .success              // If we checked all, success!
    ldrb w4, [x1, w3]
    cbz w4, .conn_loop_next   // Skip if degree is zero

    // Pip has non-zero degree, check its root
    mov w0, w3
    mov x1, x21
    bl find_root
    cmp w0, w9                // Is its root the same as the first component's root?
    bne .fail                 // If not, fail

.conn_loop_next:
    add w3, w3, #1
    b .connectivity_loop

.fail:
    mov w0, #0                // Return 0 (false)
    b .cleanup

.success:
    mov w0, #1                // Return 1 (true)

.cleanup:
    add sp, sp, #16           // Deallocate stack space
    ldp x29, x30, [sp, #32]   // Pop frame pointer and link register
    ldp x21, x22, [sp, #16]
    ldp x19, x20, [sp], #48   // Pop callee-saved registers and restore sp
    ret

Code Explanation

Function Prologue and Stack Setup


stp x19, x20, [sp, #-48]! // Push callee-saved registers
stp x21, x22, [sp, #16]
stp x29, x30, [sp, #32]   // Push frame pointer and link register
mov x29, sp               // Set up frame pointer

sub sp, sp, #16
mov x20, sp               // x20 = pointer to degrees array
add x21, sp, #8           // x21 = pointer to dsu_table

stp (Store Pair) is used to save registers that our function will modify (x19-x22, x29, x30) onto the stack. This is standard procedure to ensure we don't corrupt the caller's state.
We then allocate 16 bytes on the stack for our two 7-byte arrays (degrees and dsu_table). We use 16 bytes to maintain stack alignment.
Pointers to these arrays are stored in callee-saved registers x20 and x21 so they persist across function calls (like our call to find_root).

Initialization


// Initialize degrees array to all zeros
...
// Initialize DSU table: table[i] = i
...

These two loops set up our data structures. The first loop writes zero to every byte of the degrees array. The second loop initializes the DSU table by setting table[i] = i for i from 0 to 6, making each pip value its own parent initially.

Main Processing Loop


.main_loop:
    ldrh w4, [x1], #2         // Load 2 bytes (A and B) into w4, and advance pointer
    ubfx w5, w4, #0, #8       // w5 = A (extract 8 bits from bit 0)
    ubfx w6, w4, #8, #8       // w6 = B (extract 8 bits from bit 8)

ldrh (Load Register Halfword) is a clever optimization. It reads two bytes (16 bits) at once from the dominoes array into register w4. The post-index , #2 automatically moves the pointer x1 forward by two bytes, ready for the next iteration.
ubfx (Unsigned Bitfield Extract) efficiently pulls out the 8-bit values for pips A and B from the 16-bit register w4.


    // Increment degrees
    ldrb w8, [x20, w5]        // Load degrees[A]
    add w8, w8, #1
    strb w8, [x20, w5]        // Store degrees[A]
    ...

This section loads the current degree count for a pip, increments it, and stores it back. This is done for both pips A and B.


    // Union the sets for A and B
    mov w0, w5
    mov x1, x21
    bl find_root
    mov w9, w0                // w9 = root_A
    ...
    cmp w9, w10
    beq .loop_continue
    strb w10, [x21, w9]       // table[root_A] = root_B

Here, we call our find_root helper for both pips. The arguments are passed in w0 and x1. The results (the roots) are stored in w9 and w10. If the roots are different, we perform the union by setting the parent of one root to the other.

Final Checks and Return

The final sections of the code, .check_degrees and .check_connectivity, implement the verification logic described earlier.

The degree check uses tst w3, #1, a bitwise AND operation that doesn't store the result but sets the flags. If the last bit is 1, the number is odd, and the Zero flag will be clear, causing the bne .fail (Branch if Not Equal to zero) to be taken.
The connectivity check first finds a pip that actually exists in the input (degree > 0), finds its root, and then iterates through all other existing pips to ensure they share the same root.

Finally, the .fail and .success labels set the return value in w0 to 0 or 1, and the .cleanup section restores the saved registers and returns control to the caller.

Risks and Considerations: Assembly vs. High-Level Languages

While powerful, writing in assembly is not without its trade-offs. It's crucial to understand when it's the right tool for the job.

Aspect	Arm64 Assembly Approach	High-Level Language (e.g., Python) Approach
Performance	Extremely high. Direct CPU control, minimal overhead. Ideal for performance-critical code.	Lower. Interpreter/VM overhead, garbage collection, and abstractions add latency.
Development Time	Very slow. The code is verbose, and manual memory management is required. High cognitive load.	Very fast. Expressive syntax, rich standard libraries, and built-in data structures.
Portability	Low. Code is specific to the AArch64 instruction set. It will not run on x86 without an emulator.	High. The same Python code can run on any platform with a Python interpreter.
Readability & Maintainability	Low. Difficult for developers unfamiliar with assembly to understand and modify. Requires extensive commenting.	High. Code is generally self-documenting and easier to reason about for a larger team.
Memory Control	Absolute. You control every byte on the stack and in memory. This is powerful but also a source of bugs (e.g., buffer overflows).	Abstracted. Memory is managed automatically, which is safer but can sometimes lead to unpredictable performance.

Frequently Asked Questions (FAQ)

What exactly is an Eulerian Circuit?

An Eulerian Circuit (or Eulerian Tour) is a path in a graph that starts and ends at the same vertex and visits every edge exactly once. It's named after Leonhard Euler, who first solved the famous "Seven Bridges of Königsberg" problem, which laid the foundation of graph theory. For our dominoes, this means creating a closed loop using all the dominoes without repeating any.

Why is checking for even vertex degrees so important?

Every time you "enter" a vertex (a pip value) through one half of a domino, you must "leave" it through another half of a different domino. This means that for a continuous path, pips must come in pairs. An odd degree means a pip is a "dead end"—you can enter it one more time than you can leave it (or vice versa), breaking the continuous chain.

What is the Union-Find (Disjoint Set Union) algorithm used for here?

Its purpose is to efficiently track which pips are connected to each other. Initially, every pip is in its own set. When we process a domino [A|B], we merge the sets containing A and B. By the end, if all pips that appeared in the dominoes are in the same single set, we know the graph is connected. It's much faster than other graph traversal methods like DFS or BFS for this specific task.

Can this code handle dominoes with identical halves, like [5|5]?

Yes, absolutely. A domino like [5|5] is treated as a self-loop on the vertex '5'. When processing it, our code will correctly increment the degree count for pip '5' by two (degrees[5] += 2), which maintains its evenness. The Union-Find operation will try to union '5' with '5', which correctly results in no change as it's already in the same set with itself.

How would you modify this to find the actual chain, not just check if it's possible?

To construct the actual chain, you would need a more complex algorithm, typically Hierholzer's algorithm. After verifying that an Eulerian circuit exists (using the logic in this article), you would start at a vertex, follow an unused edge to a new vertex, and continue until you return to the start, forming a circuit. If some edges in the graph remain unused, you'd find a vertex on your current circuit that has an unused edge, start a new tour from there, and splice the new tour into the main one.

Is Arm64 assembly difficult to learn?

It has a steeper learning curve than high-level languages because it requires you to manage registers, memory, and control flow manually. However, the Arm64 instruction set is known for being relatively clean and consistent (a RISC architecture), making it more approachable than some other assembly languages. The key is to start with simple problems and build up, which is the goal of the kodikra learning path.

What tools do I need to compile and run this Arm64 assembly code?

You'll need an assembler and a linker. On a Linux or macOS system with developer tools, you can use Clang or the GNU toolchain.
To assemble: as dominoes.s -o dominoes.o
To link: ld dominoes.o -o dominoes
If you are on an x86 machine, you will need a cross-compiler (e.g., aarch64-linux-gnu-gcc) and an emulator like QEMU to run the resulting executable.

Conclusion: From Theory to Bare Metal

We have successfully journeyed from a high-level logic puzzle to a complete, optimized Arm64 assembly solution. By translating the principles of graph theory—specifically, the conditions for an Eulerian circuit—into low-level code, we've not only solved the domino chaining problem but also gained a profound appreciation for how algorithms are executed by the processor.

You've seen how to manage memory on the stack, how to manipulate data in registers, and how to implement a sophisticated data structure like Union-Find from scratch. This is the power of assembly: it removes all layers of abstraction, giving you ultimate control and a deeper understanding of computation. While you may not write your next web application in assembly, the insights gained here will make you a more effective and knowledgeable programmer in any language.

Technology Version Disclaimer: The code provided is written for the Armv8-A (AArch64) instruction set architecture. It can be assembled using standard toolchains like Clang or GCC/binutils. Syntax and directives may vary slightly with different assemblers.

Ready to continue your journey into the world of low-level programming? Explore our full Arm64-assembly learning path to tackle more exciting challenges, or deepen your foundational knowledge with our complete Arm64-assembly guide.

Published by Kodikra — Your trusted Arm64-assembly learning resource.

kodikra

Search this blog