Dominoes in Arm64-assembly: Complete Solution & Deep Dive Guide
The Complete Guide to Solving Domino Chains in Arm64 Assembly
Solving the domino chain puzzle in Arm64 Assembly is a classic computer science challenge that elegantly combines graph theory with low-level programming. It requires checking if a given set of dominoes can form a continuous, closed loop by treating domino pips as graph nodes and dominoes as edges, ultimately searching for an Eulerian circuit.
You've seen it in movies: a complex schematic on a screen, lines of cryptic text scrolling by, and a genius programmer saving the day. While Hollywood dramatizes it, the core of that scene often boils down to solving a complex logic puzzle at the most fundamental level of computing. It feels intimidating, like a language spoken only by the machine itself. Many developers spend their careers in the comfortable abstraction of high-level languages, never touching the bare metal.
But what if you could not only understand that language but also command it? Imagine taking a seemingly simple game like dominoes and building a solution from scratch in Arm64 assembly—the language that powers your phone, your laptop, and much of the modern cloud. This article is your guide to doing just that. We will demystify the process, transforming an abstract graph theory problem into concrete, efficient assembly code, proving that you too can master the art of low-level problem-solving.
What is the Domino Chaining Problem?
At its heart, the domino chaining problem is not about the game itself, but about connectivity and paths. The goal is to determine if a given collection of dominoes can be arranged in a single, continuous chain where the ends of the chain also match, forming a closed loop.
Let's define the rules more formally:
- Each domino has two halves, each with a number of pips (typically 0 to 6). A domino like
[2|1]is treated the same as[1|2]. - A valid chain requires that the number of pips on one half of a domino must match the number of pips on the adjacent half of the next domino in the sequence.
- For a closed loop, the number of pips on the free half of the first domino must match the number of pips on the free half of the last domino.
For example, given the dominoes [1|2], [2|3], and [3|1], a valid chain is [1|2] [2|3] [3|1]. The start (1) matches the end (1), forming a perfect loop.
The Graph Theory Connection: Eulerian Circuits
This problem is a perfect real-world application of a concept in graph theory known as an Eulerian Circuit (or Eulerian Tour). We can model the problem as a graph:
- Vertices (Nodes): The possible pip values on a domino (0, 1, 2, 3, 4, 5, 6).
- Edges: Each domino itself. A domino
[A|B]represents an edge connecting vertexAand vertexB.
A valid domino chain that forms a closed loop is equivalent to finding an Eulerian circuit in this graph. An Eulerian circuit is a path that visits every single edge exactly once and ends at the starting vertex. A famous theorem by Leonhard Euler states that an undirected graph has an Eulerian circuit if and only if two conditions are met:
- Connectivity: All vertices with a non-zero degree (i.e., all pip values that actually appear on the dominoes) must belong to a single connected component. You can't have two separate, unlinked groups of dominoes.
- Vertex Degree: Every vertex in the graph must have an even degree. The "degree" of a vertex is the number of edges connected to it. In our case, it's the total count of each pip value across all dominoes.
Our assembly solution will focus on efficiently verifying these two conditions.
Why Use Arm64 Assembly for This Challenge?
In an age of high-level languages like Python and JavaScript, diving into assembly might seem like an academic exercise. However, learning to solve problems in Arm64 assembly provides unparalleled benefits that are highly relevant today, as ARM architecture dominates mobile computing and is rapidly expanding in servers and laptops (like Apple's M-series chips).
- Unmatched Performance: Assembly gives you direct control over the CPU. For computationally intensive tasks, a hand-optimized assembly routine can outperform code generated by a compiler.
- Deep Hardware Understanding: You learn exactly how the processor works—how data is moved between registers and memory, how arithmetic is performed, and how program flow is controlled. This knowledge makes you a better programmer in any language.
- Minimal Resource Usage: Assembly code produces the smallest and most efficient binaries, which is critical for embedded systems, IoT devices, operating system kernels, and high-performance computing.
- Problem-Solving Purity: Without the crutch of libraries and frameworks, you are forced to think about the algorithm and data structures from first principles, building everything from the ground up.
This domino challenge, part of the exclusive kodikra.com Arm64-assembly learning path, is designed to build these fundamental skills.
How the Algorithm Works: A Two-Pronged Attack
To verify if a domino chain is possible, we must check the two conditions for an Eulerian circuit: connectivity and even degrees. Our strategy involves two main data structures that we'll manage manually in memory.
Data Structures
- Degree Count Array: A simple array of 7 bytes (or integers), where each index from 0 to 6 stores the number of times that pip value appears in our set of dominoes. For a domino
[3|3], we would increment the count at index 3 by two. - Disjoint Set Union (DSU) Table: Also known as a Union-Find data structure, this is an elegant way to track the connected components of a graph. We use an array (let's call it
table) of size 7. Initially, each element is its own parent:table[i] = i. When we process a domino[A|B], we "union" their sets, effectively merging them into a single connected component.
The Overall Algorithm Flow
Here is a high-level overview of the steps our can_chain function will execute.
● Start (Input: dominoes, count)
│
▼
┌───────────────────────────┐
│ Initialize degrees[7] = {0} │
│ Initialize dsu_table[7] = {0,1,2,3,4,5,6} │
└────────────┬──────────────┘
│
▼
┌───────────────────────────┐
│ Loop through each domino [A|B] │
├───────────────────────────┤
│ degrees[A]++ │
│ degrees[B]++ │
│ Union(A, B) in dsu_table │
└────────────┬──────────────┘
│
▼
◆ Any non-zero degree odd?
╱ ╲
Yes No
│ │
▼ ▼
[Return false] ┌──────────────────────────┐
│ Check Connectivity │
│ (Are all dominoes in one set?) │
└───────────┬──────────────┘
│
▼
◆ Connected?
╱ ╲
Yes No
│ │
▼ ▼
[Return true] [Return false]
│
▼
● End
Step 1: Check Vertex Degrees
This is the most straightforward part. We iterate through all the dominoes. For each domino [A|B], we increment the counters for A and B in our degree array. After processing all dominoes, we scan this array. If we find any pip value that has a non-zero but odd count, we can immediately conclude that an Eulerian circuit is impossible and return false.
Step 2: Check Connectivity with Union-Find
The Union-Find algorithm is perfect for determining connectivity. It consists of two primary operations:
find(i): Finds the representative (or "root") of the set containing elementi. It does this by traversing the parent pointers in ourtablearray until it reaches an element that is its own parent. For optimization, we use "path compression," where we make every node on the find path point directly to the root.union(i, j): Merges the sets containing elementsiandj. It first finds the roots of both elements. If the roots are different, it makes one root the parent of the other.
As we iterate through the dominoes to count degrees, we simultaneously perform the union operation on the two pips of each domino. After the loop, all pips that are part of the same connected chain of dominoes will share the same root in our DSU table.
To perform the final connectivity check, we find the root of the very first pip we encountered. Then, we iterate through all other pips that appeared in the domino set and check if they all have the same root. If they do, the graph is connected.
Here is a conceptual flow for the Union-Find logic:
● Process Domino [A|B]
│
▼
┌────────────┐
│ root_A = find(A) │
└──────┬─────┘
│
▼
┌────────────┐
│ root_B = find(B) │
└──────┬─────┘
│
▼
◆ root_A == root_B?
╱ ╲
Yes No
│ │
▼ ▼
[Already in ┌───────────────────┐
same set] │ dsu_table[root_A] = root_B │
[Do nothing] │ (Merge the sets) │
│ └───────────┬───────┘
└────────────┬───────────┘
│
▼
● Done
Where the Logic is Implemented: A Detailed Arm64 Code Walkthrough
Now, let's translate this logic into a complete, working Arm64 assembly program. The code from the kodikra module provides the core Union-Find helper functions, but we will build the main can_chain function around them to create a full solution.
The function signature in C would look like this:
int can_chain(size_t domino_count, const domino_t* dominoes);
Where domino_t is a struct like { uint8_t a; uint8_t b; }.
In Arm64, the arguments are passed via registers:
x0:domino_countx1: pointer todominoesarray
w0.
The Complete Assembly Code
/*
* domino_t is a struct of two uint8_t.
* In memory, it's just two adjacent bytes.
*
* Function signature:
* int can_chain(size_t domino_count, const domino_t *dominoes)
*
* Arguments:
* x0: domino_count
* x1: dominoes (pointer)
*
* Return:
* w0: 1 if a chain can be formed, 0 otherwise.
*/
.text
.globl can_chain
// Helper function to find the root of a set with path compression.
// int find_root(int number, uint8_t *table)
// w0: number (the pip value)
// x1: pointer to the DSU table
find_root:
mov w9, w0 // w9 = original number
.find_loop:
ldrb w8, [x1, w0] // w8 = table[number]
cmp w8, w0 // if (table[number] == number), it's the root
beq .path_compression
mov w0, w8 // number = table[number]
b .find_loop
.path_compression:
// w0 now holds the root.
// Go back and set the parent of the original number to the root.
strb w0, [x1, w9] // table[original_number] = root
ret // Return root in w0
can_chain:
stp x19, x20, [sp, #-48]! // Push callee-saved registers
stp x21, x22, [sp, #16]
stp x29, x30, [sp, #32] // Push frame pointer and link register
mov x29, sp // Set up frame pointer
// Allocate stack space for our local arrays
// 7 bytes for degrees, 7 bytes for dsu_table.
// We'll allocate 16 bytes for 16-byte stack alignment.
sub sp, sp, #16
mov x20, sp // x20 = pointer to degrees array
add x21, sp, #8 // x21 = pointer to dsu_table (offset by 8 bytes)
// Preserve original arguments
mov x19, x0 // x19 = domino_count
mov x22, x1 // x22 = dominoes pointer
// --- Part 1: Initialization ---
// Initialize degrees array to all zeros
mov x1, x20 // Pointer to degrees
mov w2, #7 // Count
mov w3, wzr // Value to store (zero)
.init_degrees_loop:
subs w2, w2, #1
strb w3, [x1], #1
bne .init_degrees_loop
// Initialize DSU table: table[i] = i
mov x1, x21 // Pointer to dsu_table
mov w2, #0 // i = 0
.init_dsu_loop:
cmp w2, #7
bge .process_dominoes
strb w2, [x1, w2]
add w2, w2, #1
b .init_dsu_loop
.process_dominoes:
// --- Part 2: Process Dominoes, Calculate Degrees, and Union Sets ---
cbz x19, .check_degrees // If domino_count is 0, skip to checks (will pass)
mov x1, x22 // x1 = current domino pointer
mov x3, x19 // x3 = loop counter (domino_count)
.main_loop:
// Load a domino [A|B]
ldrh w4, [x1], #2 // Load 2 bytes (A and B) into w4, and advance pointer
// B is in bits 15-8, A is in bits 7-0
ubfx w5, w4, #0, #8 // w5 = A (extract 8 bits from bit 0)
ubfx w6, w4, #8, #8 // w6 = B (extract 8 bits from bit 8)
// Increment degrees
ldrb w8, [x20, w5] // Load degrees[A]
add w8, w8, #1
strb w8, [x20, w5] // Store degrees[A]
ldrb w8, [x20, w6] // Load degrees[B]
add w8, w8, #1
strb w8, [x20, w6] // Store degrees[B]
// Union the sets for A and B
// Find root of A
mov w0, w5
mov x1, x21
bl find_root
mov w9, w0 // w9 = root_A
// Find root of B
mov w0, w6
mov x1, x21
bl find_root
mov w10, w0 // w10 = root_B
// If roots are different, union them: table[root_A] = root_B
cmp w9, w10
beq .loop_continue
strb w10, [x21, w9]
.loop_continue:
subs x3, x3, #1
bne .main_loop
.check_degrees:
// --- Part 3: Check if all degrees are even ---
mov x1, x20 // Pointer to degrees array
mov w2, #7 // Loop counter
.degree_check_loop:
subs w2, w2, #1
ldrb w3, [x1], #1 // Load next degree value
cbz w3, .degree_ok // If degree is 0, it's fine
tst w3, #1 // Test if the last bit is 1 (is it odd?)
bne .fail // If odd, fail
.degree_ok:
cmp w2, #0
bne .degree_check_loop
.check_connectivity:
// --- Part 4: Check if all pips are in one connected component ---
cbz x19, .success // If no dominoes, it's trivially a success
// Find the root of the first pip that has a non-zero degree
mov x1, x20 // Pointer to degrees
mov w2, #-1 // w2 will hold the first pip's value
mov w3, #0 // Loop counter i = 0
.find_first_pip_loop:
cmp w3, #7
bge .first_pip_found // Should not happen if count > 0
ldrb w4, [x1, w3]
cbnz w4, .set_first_pip // If degree[i] != 0, we found it
add w3, w3, #1
b .find_first_pip_loop
.set_first_pip:
mov w2, w3 // w2 = first pip with non-zero degree
.first_pip_found:
// Get the root of this first component
mov w0, w2
mov x1, x21
bl find_root
mov w9, w0 // w9 = root of the first component
// Now, check all other pips with non-zero degree
mov x1, x20 // Pointer to degrees
mov w3, #0 // Loop counter i = 0
.connectivity_loop:
cmp w3, #7
bge .success // If we checked all, success!
ldrb w4, [x1, w3]
cbz w4, .conn_loop_next // Skip if degree is zero
// Pip has non-zero degree, check its root
mov w0, w3
mov x1, x21
bl find_root
cmp w0, w9 // Is its root the same as the first component's root?
bne .fail // If not, fail
.conn_loop_next:
add w3, w3, #1
b .connectivity_loop
.fail:
mov w0, #0 // Return 0 (false)
b .cleanup
.success:
mov w0, #1 // Return 1 (true)
.cleanup:
add sp, sp, #16 // Deallocate stack space
ldp x29, x30, [sp, #32] // Pop frame pointer and link register
ldp x21, x22, [sp, #16]
ldp x19, x20, [sp], #48 // Pop callee-saved registers and restore sp
ret
Code Explanation
Function Prologue and Stack Setup
stp x19, x20, [sp, #-48]! // Push callee-saved registers
stp x21, x22, [sp, #16]
stp x29, x30, [sp, #32] // Push frame pointer and link register
mov x29, sp // Set up frame pointer
sub sp, sp, #16
mov x20, sp // x20 = pointer to degrees array
add x21, sp, #8 // x21 = pointer to dsu_table
stp(Store Pair) is used to save registers that our function will modify (x19-x22,x29,x30) onto the stack. This is standard procedure to ensure we don't corrupt the caller's state.- We then allocate 16 bytes on the stack for our two 7-byte arrays (
degreesanddsu_table). We use 16 bytes to maintain stack alignment. - Pointers to these arrays are stored in callee-saved registers
x20andx21so they persist across function calls (like our call tofind_root).
Initialization
// Initialize degrees array to all zeros
...
// Initialize DSU table: table[i] = i
...
These two loops set up our data structures. The first loop writes zero to every byte of the degrees array. The second loop initializes the DSU table by setting table[i] = i for i from 0 to 6, making each pip value its own parent initially.
Main Processing Loop
.main_loop:
ldrh w4, [x1], #2 // Load 2 bytes (A and B) into w4, and advance pointer
ubfx w5, w4, #0, #8 // w5 = A (extract 8 bits from bit 0)
ubfx w6, w4, #8, #8 // w6 = B (extract 8 bits from bit 8)
ldrh(Load Register Halfword) is a clever optimization. It reads two bytes (16 bits) at once from the dominoes array into registerw4. The post-index, #2automatically moves the pointerx1forward by two bytes, ready for the next iteration.ubfx(Unsigned Bitfield Extract) efficiently pulls out the 8-bit values for pips A and B from the 16-bit registerw4.
// Increment degrees
ldrb w8, [x20, w5] // Load degrees[A]
add w8, w8, #1
strb w8, [x20, w5] // Store degrees[A]
...
This section loads the current degree count for a pip, increments it, and stores it back. This is done for both pips A and B.
// Union the sets for A and B
mov w0, w5
mov x1, x21
bl find_root
mov w9, w0 // w9 = root_A
...
cmp w9, w10
beq .loop_continue
strb w10, [x21, w9] // table[root_A] = root_B
Here, we call our find_root helper for both pips. The arguments are passed in w0 and x1. The results (the roots) are stored in w9 and w10. If the roots are different, we perform the union by setting the parent of one root to the other.
Final Checks and Return
The final sections of the code, .check_degrees and .check_connectivity, implement the verification logic described earlier.
- The degree check uses
tst w3, #1, a bitwise AND operation that doesn't store the result but sets the flags. If the last bit is 1, the number is odd, and the Zero flag will be clear, causing thebne .fail(Branch if Not Equal to zero) to be taken. - The connectivity check first finds a pip that actually exists in the input (degree > 0), finds its root, and then iterates through all other existing pips to ensure they share the same root.
.fail and .success labels set the return value in w0 to 0 or 1, and the .cleanup section restores the saved registers and returns control to the caller.
Risks and Considerations: Assembly vs. High-Level Languages
While powerful, writing in assembly is not without its trade-offs. It's crucial to understand when it's the right tool for the job.
| Aspect | Arm64 Assembly Approach | High-Level Language (e.g., Python) Approach |
|---|---|---|
| Performance | Extremely high. Direct CPU control, minimal overhead. Ideal for performance-critical code. | Lower. Interpreter/VM overhead, garbage collection, and abstractions add latency. |
| Development Time | Very slow. The code is verbose, and manual memory management is required. High cognitive load. | Very fast. Expressive syntax, rich standard libraries, and built-in data structures. |
| Portability | Low. Code is specific to the AArch64 instruction set. It will not run on x86 without an emulator. | High. The same Python code can run on any platform with a Python interpreter. |
| Readability & Maintainability | Low. Difficult for developers unfamiliar with assembly to understand and modify. Requires extensive commenting. | High. Code is generally self-documenting and easier to reason about for a larger team. |
| Memory Control | Absolute. You control every byte on the stack and in memory. This is powerful but also a source of bugs (e.g., buffer overflows). | Abstracted. Memory is managed automatically, which is safer but can sometimes lead to unpredictable performance. |
Frequently Asked Questions (FAQ)
What exactly is an Eulerian Circuit?
An Eulerian Circuit (or Eulerian Tour) is a path in a graph that starts and ends at the same vertex and visits every edge exactly once. It's named after Leonhard Euler, who first solved the famous "Seven Bridges of Königsberg" problem, which laid the foundation of graph theory. For our dominoes, this means creating a closed loop using all the dominoes without repeating any.
Why is checking for even vertex degrees so important?
Every time you "enter" a vertex (a pip value) through one half of a domino, you must "leave" it through another half of a different domino. This means that for a continuous path, pips must come in pairs. An odd degree means a pip is a "dead end"—you can enter it one more time than you can leave it (or vice versa), breaking the continuous chain.
What is the Union-Find (Disjoint Set Union) algorithm used for here?
Its purpose is to efficiently track which pips are connected to each other. Initially, every pip is in its own set. When we process a domino [A|B], we merge the sets containing A and B. By the end, if all pips that appeared in the dominoes are in the same single set, we know the graph is connected. It's much faster than other graph traversal methods like DFS or BFS for this specific task.
Can this code handle dominoes with identical halves, like [5|5]?
Yes, absolutely. A domino like [5|5] is treated as a self-loop on the vertex '5'. When processing it, our code will correctly increment the degree count for pip '5' by two (degrees[5] += 2), which maintains its evenness. The Union-Find operation will try to union '5' with '5', which correctly results in no change as it's already in the same set with itself.
How would you modify this to find the actual chain, not just check if it's possible?
To construct the actual chain, you would need a more complex algorithm, typically Hierholzer's algorithm. After verifying that an Eulerian circuit exists (using the logic in this article), you would start at a vertex, follow an unused edge to a new vertex, and continue until you return to the start, forming a circuit. If some edges in the graph remain unused, you'd find a vertex on your current circuit that has an unused edge, start a new tour from there, and splice the new tour into the main one.
Is Arm64 assembly difficult to learn?
It has a steeper learning curve than high-level languages because it requires you to manage registers, memory, and control flow manually. However, the Arm64 instruction set is known for being relatively clean and consistent (a RISC architecture), making it more approachable than some other assembly languages. The key is to start with simple problems and build up, which is the goal of the kodikra learning path.
What tools do I need to compile and run this Arm64 assembly code?
You'll need an assembler and a linker. On a Linux or macOS system with developer tools, you can use Clang or the GNU toolchain.
To assemble: as dominoes.s -o dominoes.o
To link: ld dominoes.o -o dominoes
If you are on an x86 machine, you will need a cross-compiler (e.g., aarch64-linux-gnu-gcc) and an emulator like QEMU to run the resulting executable.
Conclusion: From Theory to Bare Metal
We have successfully journeyed from a high-level logic puzzle to a complete, optimized Arm64 assembly solution. By translating the principles of graph theory—specifically, the conditions for an Eulerian circuit—into low-level code, we've not only solved the domino chaining problem but also gained a profound appreciation for how algorithms are executed by the processor.
You've seen how to manage memory on the stack, how to manipulate data in registers, and how to implement a sophisticated data structure like Union-Find from scratch. This is the power of assembly: it removes all layers of abstraction, giving you ultimate control and a deeper understanding of computation. While you may not write your next web application in assembly, the insights gained here will make you a more effective and knowledgeable programmer in any language.
Technology Version Disclaimer: The code provided is written for the Armv8-A (AArch64) instruction set architecture. It can be assembled using standard toolchains like Clang or GCC/binutils. Syntax and directives may vary slightly with different assemblers.
Ready to continue your journey into the world of low-level programming? Explore our full Arm64-assembly learning path to tackle more exciting challenges, or deepen your foundational knowledge with our complete Arm64-assembly guide.
Published by Kodikra — Your trusted Arm64-assembly learning resource.
Post a Comment