Mastering Prime Numbers: A Deep Dive into the Sieve of Eratosthenes with Arm64 Assembly
The Sieve of Eratosthenes is a highly efficient, ancient algorithm for finding all prime numbers up to a specified limit. Implementing it in Arm64 assembly unlocks unparalleled performance by directly manipulating memory and leveraging the raw power of modern CPU architectures, making it a perfect benchmark.
Imagine stumbling upon a treasure trove of computer parts at a garage sale. You meticulously assemble different combinations, creating unique custom machines. But how do you measure their true power? You need a benchmark—a task that pushes the CPU and memory to their absolute limits. This is where the Sieve of Eratosthenes, implemented in the bare-metal language of Arm64 assembly, becomes your ultimate tool. It's not just an algorithm; it's a rite of passage into the world of high-performance computing. This guide will walk you through every step, from theory to a line-by-line code breakdown, transforming you from a novice to an expert in low-level optimization.
What Is the Sieve of Eratosthenes Algorithm?
The Sieve of Eratosthenes is an elegant and surprisingly simple algorithm for finding prime numbers. Named after the Greek mathematician Eratosthenes of Cyrene, its method is analogous to sifting sand through a sieve to leave only the pebbles behind. In our case, we "sift" out composite (non-prime) numbers, leaving only the primes.
The core idea is to perform a series of eliminations. You start with a list of consecutive integers from 2 up to your desired limit. You then iteratively mark the multiples of each prime number, starting with the first prime, 2.
The Step-by-Step Process:
- Create a List: Generate a list of consecutive integers from 2 to your limit, `n`. Initially, assume all of them are prime. A boolean array (or a byte array in our assembly implementation) is perfect for this, where an index `i` corresponds to the number `i`, and its value (true/false) indicates primality.
- Start with the First Prime: The first number in your list, 2, is a prime.
- Mark the Multiples: Go through the list and "cross out" or "mark" all multiples of 2 (4, 6, 8, etc.) as not prime.
- Move to the Next Unmarked Number: The next number in the list that hasn't been marked is 3. This must be a prime.
- Repeat the Marking: Now, mark all multiples of 3 (6, 9, 12, etc.) as not prime. Note that some, like 6, will already be marked.
- Continue the Process: The next unmarked number is 5 (since 4 was marked). Repeat the process, marking all multiples of 5. Continue this until you reach a prime whose square is greater than your limit, `n`. Any number remaining unmarked in the list is a prime number.
This visual, iterative process is what makes the algorithm so efficient. Instead of testing each number for divisibility (a slow process), it eliminates entire sets of composite numbers in single passes.
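For comparison with the assembly to come, the steps above can be sketched in a few lines of Python. This is a high-level reference implementation of the classic algorithm, not the benchmark code itself:

```python
def sieve_of_eratosthenes(limit):
    # Step 1: assume every number from 0..limit is prime.
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False   # 0 and 1 are not prime by definition
    p = 2
    while p * p <= limit:               # stop once p^2 exceeds the limit
        if is_prime[p]:
            # Steps 3 and 5: cross out every multiple of p.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    # Whatever is still marked "prime" really is prime.
    return [i for i in range(2, limit + 1) if is_prime[i]]

print(sieve_of_eratosthenes(20))  # → [2, 3, 5, 7, 11, 13, 17, 19]
```

The whole algorithm fits in a dozen lines here; the rest of this article is about what it takes to express the same idea directly in machine instructions.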
Conceptual Flow Diagram
Here is a simplified visual representation of the Sieve algorithm's logic for finding primes up to a small limit, like 20.
● Start (Limit = 20)
│
▼
┌───────────────────────────────┐
│ Create boolean array `isPrime` │
│ of size 21, all set to `true` │
└──────────────┬────────────────┘
│
▼
◆ p = 2 (First prime)
│
▼
┌───────────────────────────────┐
│ Mark multiples of 2 as `false` │
│ (4, 6, 8, 10, 12, 14, 16, 18, 20) │
└──────────────┬────────────────┘
│
▼
◆ p = 3 (Next unmarked number)
│
▼
┌───────────────────────────────┐
│ Mark multiples of 3 as `false` │
│ (6, 9, 12, 15, 18) │
└──────────────┬────────────────┘
│
▼
◆ p = 5 (Next unmarked, as 4 is marked)
│ p*p > 20? Yes (5*5 = 25 > 20)
│ (No multiples of 5 left to mark; sieving is done)
│
▼
┌───────────────────────────────┐
│ Collect all numbers `i` where │
│ `isPrime[i]` is still `true` │
└──────────────┬────────────────┘
│
▼
● End (Primes: 2, 3, 5, 7, 11, 13, 17, 19)
Why Use Arm64 Assembly for This Task?
In an age of high-level languages like Python and JavaScript, choosing to write in assembly might seem like an archaic decision. However, for tasks that demand the absolute peak of performance and efficiency, assembly language is not just relevant; it's essential. The Arm64 (or AArch64) architecture, which powers everything from Apple's M-series chips to the world's most powerful supercomputers and countless mobile devices, is a prime candidate for such low-level optimization.
The Core Advantages
- Unmatched Performance: Assembly gives you direct, one-to-one control over the processor's instructions. There is no abstraction layer, no interpreter, and no garbage collector. Every instruction you write is exactly what the CPU executes. For an algorithm like the Sieve, which involves tight loops and massive memory manipulation, this translates to blistering speed.
- Direct Memory Manipulation: The Sieve algorithm's performance is heavily tied to how quickly it can access and modify its boolean "sieve" array. In Arm64 assembly, you can use instructions like `stp` (Store Pair) to write 16 bytes at a time, or `strb` (Store Byte) for precise marking. This fine-grained control over memory access patterns is impossible to achieve in most other languages.
- Full Register Control: High-level language compilers do a good job of register allocation, but they aren't perfect. By writing assembly, you decide exactly which values live in which of the 31 general-purpose registers (`x0`-`x30`). This allows you to keep critical loop counters, pointers, and limits in the fastest possible storage, avoiding slower memory access.
- Minimal Code Size: An assembly program, when written well, results in a tiny executable file. There are no large runtimes or libraries to bundle. This is crucial for embedded systems, bootloaders, and performance-critical kernels where every byte counts.
- A Deeper Understanding: Writing assembly forces you to understand how a computer truly works at a fundamental level. You learn about the CPU pipeline, memory hierarchy, calling conventions, and system calls. This knowledge makes you a better programmer, even when you return to high-level languages.
For our benchmarking scenario, these advantages are paramount. We want to measure the raw capability of the hardware, and using Arm64 assembly ensures that our measurement is not clouded by the overhead of a language runtime or an inefficient compiler optimization. We are speaking the CPU's native tongue.
How the Arm64 Assembly Implementation Works: A Detailed Code Walkthrough
Now, let's dissect the Arm64 assembly code from the kodikra learning path module. This implementation is a masterclass in efficiency, leveraging specific features of the AArch64 instruction set. We will break it down section by section, explaining the purpose of each instruction and the role of every register.
The code's primary goal is to take a pointer to an output array (primes) and a numeric limit, and fill the array with all prime numbers up to that limit.
The Complete Source Code
```asm
.text
.globl sieve

/*
 * extern size_t sieve(uint64_t* primes, uint64_t limit);
 *
 * Register usage according to AArch64 calling convention:
 *   x0: primes (pointer to output array), later used as return value (count)
 *   x1: limit
 *
 * Local register usage:
 *   x2: limit_rounded_up (size of sieve array on stack)
 *   x3: loop counter / pointer for initialization
 *   x4: fill value (-1, representing all bits set to 1)
 *   x5: stack pointer base / loop counter for sieving
 *   x6: inner loop counter / address offset
 *   x7: temporary register
 *   x8: prime count
 *   wzr: zero register (used for storing 0)
 */
sieve:
    // --- 1. Prologue & Stack Allocation ---
    add     x2, x1, #16
    and     x2, x2, #-16        // limit, rounded up to the next multiple of 16 for alignment
    mov     x3, sp
    sub     sp, sp, x2          // Allocate space on the stack for the sieve array

    // --- 2. Initialize Sieve Array to all 'true' (0xff) ---
    mov     x4, #-1             // -1 is 0xffffffffffffffff, a repeating pattern of 1s
    mov     x5, sp              // x5 holds the base address of our sieve array
.fill:
    stp     x4, x4, [x3, #-16]! // Store 16 bytes (two 64-bit values) of 1s, pre-decrement
    cmp     x3, x5
    bne     .fill               // Loop until the entire allocated space is filled

    // --- 3. Mark 0 and 1 as not prime ---
    mov     x3, #0
    strb    wzr, [x5]           // Mark 0 as not prime
    strb    wzr, [x5, #1]       // Mark 1 as not prime

    // --- 4. Main Sieving Loop ---
    mov     x5, #2              // Start sieving from p = 2
.outer_loop_test:
    cmp     x5, x1              // while (p <= limit)
    bgt     .collect            // If p > limit, we are done sieving
    ldrb    w7, [sp, x5]        // Load the byte for number 'p'
    cbz     w7, .outer_loop_increment // If is_prime[p] is 0 (false), skip to next p
    // p is prime, now mark its multiples
    add     x6, x5, x5          // Start marking from 2*p
.inner_loop:
    cmp     x6, x1              // while (multiple <= limit)
    bgt     .outer_loop_increment
    strb    wzr, [sp, x6]       // Mark is_prime[multiple] as 0 (false)
    add     x6, x6, x5          // Go to the next multiple
    b       .inner_loop
.outer_loop_increment:
    add     x5, x5, #1
    b       .outer_loop_test

    // --- 5. Collect Primes ---
.collect:
    mov     x5, #2              // Start checking from 2
    mov     x8, #0              // x8 will be our prime counter
.collect_loop:
    cmp     x5, x1              // while (i <= limit)
    bgt     .epilogue
    ldrb    w7, [sp, x5]        // Load the byte for number 'i'
    cbz     w7, .collect_loop_increment // If is_prime[i] is 0, skip it
    // It's a prime, store it in the output array
    str     x5, [x0], #8        // Store prime number, and post-increment pointer
    add     x8, x8, #1          // Increment prime count
.collect_loop_increment:
    add     x5, x5, #1
    b       .collect_loop

    // --- 6. Epilogue & Return ---
.epilogue:
    add     sp, sp, x2          // Deallocate stack space
    mov     x0, x8              // Move the prime count to the return register x0
    ret
```
Section 1: Prologue & Stack Allocation
Every function in assembly needs to manage its own space on the stack. This section prepares a temporary "sieve" array directly on the stack.
- `add x2, x1, #16`: Takes the `limit` (in `x1`) and adds 16. This is a preliminary step for rounding.
- `and x2, x2, #-16`: A clever trick for rounding up to the nearest multiple of 16. In binary, `-16` is `...11110000`. Performing a bitwise AND with this value clears the last four bits, rounding the number down to a multiple of 16. Because we added 16 beforehand, the net effect is rounding up, with room to spare for the byte at index `limit`. This keeps the stack pointer 16-byte aligned, as the AArch64 procedure call standard requires.
- `mov x3, sp`: Stores the current stack pointer (the top of the stack before allocation) in `x3`.
- `sub sp, sp, x2`: Allocates the required space on the stack by moving the stack pointer down. Our sieve array now exists in the memory between the new `sp` and the old value stored in `x3`.
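To see the rounding trick concretely, here is a quick Python check of the expression those two instructions compute, `(limit + 16) & -16`. This is just a model for verification, not part of the assembly:

```python
def round_up_16(limit):
    # Mirrors: add x2, x1, #16 ; and x2, x2, #-16
    # Adding 16 first guarantees room for indices 0..limit, then the AND
    # with -16 (...11110000 in binary) clears the low 4 bits.
    return (limit + 16) & -16

for limit in (1, 15, 16, 17, 100):
    size = round_up_16(limit)
    assert size % 16 == 0        # 16-byte aligned, as AAPCS64 requires
    assert size >= limit + 1     # room for the byte at index `limit`

print([round_up_16(n) for n in (1, 15, 16, 17, 100)])  # → [16, 16, 32, 32, 112]
```

For `limit = 100` (the value the C wrapper below uses), the function therefore reserves 112 bytes on the stack.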
Section 2: Initialize Sieve Array
We need to initialize our array, assuming all numbers are prime. We'll use any non-zero byte to represent "prime" (the fill below writes `0xFF`) and 0 for "not prime". This loop fills the memory very quickly.

- `mov x4, #-1`: Moves the value -1 into register `x4`. In 64-bit two's complement, -1 is represented as all bits set to 1 (`0xFFFFFFFFFFFFFFFF`).
- `mov x5, sp`: Stores the base address of our newly allocated sieve array in `x5` for a comparison later.
- `.fill:`: A label marking the beginning of our initialization loop.
- `stp x4, x4, [x3, #-16]!`: A highly efficient instruction. STP stands for "Store Pair"; it stores two 64-bit registers (here, `x4` twice) into memory. The address is calculated from `x3` with an offset of -16, and the `!` signifies "pre-index": `x3` is updated to this new address, which is then used for the store. In one instruction we write 16 bytes of all 1s and move our pointer, making this loop extremely fast.
- `cmp x3, x5`: Compares our moving pointer (`x3`) with the base of the array (`x5`).
- `bne .fill`: "Branch if Not Equal". If `x3` hasn't reached the bottom of the allocated space yet, jump back to the `.fill` label and continue.
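The fill loop's pre-indexed walk from the old stack pointer down to the new one can be modelled in Python. The names `x3` and `x5` below are just labels mirroring the registers:

```python
# Model of the fill loop: start at the old top of stack (x3) and store
# 16 bytes of 0xFF per iteration, pre-decrementing, until we reach the
# base of the allocation (x5, the new sp).
size = 48                    # x2: allocation size, a multiple of 16
memory = bytearray(size)     # the freshly allocated stack region (zeroed here)
x3 = size                    # old stack pointer (one past the end)
x5 = 0                       # new sp: base of the sieve array
while x3 != x5:              # cmp x3, x5 / bne .fill
    x3 -= 16                            # pre-index: [x3, #-16]! updates x3 first
    memory[x3:x3 + 16] = b"\xff" * 16   # stp x4, x4 writes 16 bytes of 1s

print(memory == bytearray(b"\xff" * size))  # → True
```

Because the size is always a multiple of 16, the pointer lands exactly on the base and the loop terminates cleanly; that is why the prologue's rounding matters.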
Section 3: Mark 0 and 1 as Not Prime
By definition, 0 and 1 are not prime numbers. We explicitly mark them as such.
- `mov x3, #0`: This instruction is not strictly necessary here and might be a leftover from a different logic structure. The key instructions are the next two.
- `strb wzr, [x5]`: STRB means "Store Byte". It takes a byte from a register and stores it in memory. `wzr` is the special "zero register" (the 32-bit version of `xzr`), which always reads as 0. We are storing a byte with value 0 at the base address of our sieve (`[x5]`), marking the number 0 as not prime.
- `strb wzr, [x5, #1]`: Similarly, this stores a zero byte at the address `x5 + 1`, marking the number 1 as not prime.
Section 4: Main Sieving Loop
This is the heart of the algorithm, where we iterate through numbers and mark their multiples.
● Start Sieving (p = 2)
│
▼
┌───────────────────┐
│ .outer_loop_test │
│ Is p <= limit? │
└─────────┬─────────┘
│ Yes
▼
◆ Is isPrime[p] true?
╱ (ldrb w7, [sp, x5]) ╲
No (cbz) Yes
│ │
│ ▼
│ ┌──────────────────────┐
│ │ .inner_loop │
│ │ Mark multiples of p │
│ │ (2*p, 3*p, ...) │
│ └──────────┬───────────┘
│ │
└─────────────────┬─────────────┘
│
▼
┌─────────────────────────────────┐
│ .outer_loop_increment │
│ p = p + 1 │
└─────────────────┬───────────────┘
│
└─> (Branch back to .outer_loop_test)
- `mov x5, #2`: We use `x5` as our prime candidate counter, `p`. We start with 2.
- `.outer_loop_test:`: The start of the outer loop.
- `cmp x5, x1` / `bgt .collect`: Compares our current prime candidate `p` (in `x5`) with the `limit` (in `x1`). If `p` is greater than the limit, the sieving process is done, and we branch to the `.collect` section.
- `ldrb w7, [sp, x5]`: LDRB means "Load Byte". It loads a single byte from memory at address `sp + x5` into the 32-bit register `w7`. This checks if the number `p` is currently marked as prime.
- `cbz w7, .outer_loop_increment`: CBZ stands for "Compare and Branch if Zero". If the byte we loaded is 0 (meaning `p` was already marked as not prime), we skip the inner loop entirely and jump to the increment step.
- `add x6, x5, x5`: If `p` is prime, we prepare for the inner loop. We initialize our multiple counter `x6` to `p + p`, or `2*p`.
- `.inner_loop:`: The start of the inner loop for marking multiples.
- `cmp x6, x1` / `bgt .outer_loop_increment`: Checks if the current multiple (in `x6`) has exceeded the limit. If so, we're done with this `p` and branch to the outer loop's increment step.
- `strb wzr, [sp, x6]`: Marks the current multiple as not prime by storing a zero byte at its corresponding index in the sieve array.
- `add x6, x6, x5`: Increments the multiple by `p` (e.g., from `2*p` to `3*p`, then `4*p`, etc.).
- `b .inner_loop`: Unconditionally branches back to the start of the inner loop.
- `.outer_loop_increment:`: Increments our prime candidate `p` by one.
- `b .outer_loop_test`: Jumps back to the start of the outer loop to test the next number.
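Translating this control flow line for line into Python gives a faithful model of what the assembly does. Note that, like the assembly, it starts marking at `2*p` and runs the outer loop all the way to `limit` rather than stopping at `p*p`:

```python
def sieve_asm_model(limit):
    # Byte array of flags, as in the assembly: non-zero byte = "prime".
    flags = bytearray(b"\x01" * (limit + 1))
    flags[0] = flags[1] = 0            # strb wzr, [x5] / strb wzr, [x5, #1]
    p = 2                              # mov x5, #2
    while p <= limit:                  # .outer_loop_test: cmp / bgt .collect
        if flags[p]:                   # ldrb w7, [sp, x5] / cbz
            m = p + p                  # add x6, x5, x5 (start at 2*p)
            while m <= limit:          # .inner_loop: cmp / bgt
                flags[m] = 0           # strb wzr, [sp, x6]
                m += p                 # add x6, x6, x5
        p += 1                         # .outer_loop_increment
    return [i for i in range(2, limit + 1) if flags[i]]  # .collect

print(sieve_asm_model(20))  # → [2, 3, 5, 7, 11, 13, 17, 19]
```

Every branch label in the assembly corresponds to one line of this model, which makes it a useful cross-check when debugging the real routine.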
Section 5: Collect Primes
After the sieve is complete, we iterate through our boolean array one last time to gather all the numbers that are still marked as prime.
- `mov x5, #2`: We reuse `x5` as our loop counter `i`, starting from 2.
- `mov x8, #0`: We use `x8` to count the number of primes found. It's initialized to 0.
- `.collect_loop:`: The start of the collection loop.
- `cmp x5, x1` / `bgt .epilogue`: Checks if our counter `i` has exceeded the limit. If so, we branch to the function's end.
- `ldrb w7, [sp, x5]`: Loads the primality flag for the number `i`.
- `cbz w7, .collect_loop_increment`: If the flag is 0, the number is not prime, so we skip it and jump to the increment step.
- `str x5, [x0], #8`: This is a key instruction. STR stores a full 64-bit register. It stores the prime number `i` (from `x5`) into the memory location pointed to by `x0` (our output `primes` array). The `, #8` at the end signifies "post-index": the pointer in `x0` is automatically incremented by 8 bytes (the size of a `uint64_t`) after the store. This elegantly prepares `x0` for the next prime.
- `add x8, x8, #1`: Increments our prime counter.
- `.collect_loop_increment:`: Increments our counter `i` by one and branches back to the start of the collection loop.
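The post-indexed store is the neat part of this loop. Here is a small Python model of the collection pass, using hypothetical flag values for the numbers 0 through 7, with `out` standing in for the moving `x0` pointer:

```python
# Primality flags for 0..7 (0 and 1 not prime; 2, 3, 5, 7 prime).
flags = bytearray(b"\x00\x00\x01\x01\x00\x01\x00\x01")
primes = [0] * 8       # pretend this is the caller's uint64_t buffer
out = 0                # x0: output pointer (modelled as an index)
count = 0              # x8: prime counter
for i in range(2, 8):          # .collect_loop
    if flags[i]:               # ldrb / cbz
        primes[out] = i        # str x5, [x0], #8 ...
        out += 1               # ... post-index advances the pointer
        count += 1             # add x8, x8, #1

print(primes[:count], count)   # → [2, 3, 5, 7] 4
```

In the assembly, advancing the pointer and storing happen in a single instruction, which is exactly the kind of micro-efficiency that makes hand-written loops tight.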
Section 6: Epilogue & Return
Finally, we clean up the stack and return the result.
- `add sp, sp, x2`: Deallocates the memory we used on the stack by moving the stack pointer back to its original position.
- `mov x0, x8`: The AArch64 calling convention dictates that the return value is placed in register `x0`. We move our prime count from `x8` into `x0`.
- `ret`: Returns control to the calling function.
Potential Optimizations and Considerations
While the provided solution is robust and efficient, several classic optimizations for the Sieve of Eratosthenes can push performance even further. Understanding these is key to mastering low-level programming.
1. Start Marking from p-squared
A crucial insight is that when marking multiples of a prime p, all smaller composite numbers have already been marked by smaller primes. For example, when we get to p=5, the numbers 2*5=10, 3*5=15, and 4*5=20 have already been marked by primes 2 and 3. The first multiple we truly need to mark is p*p.
Implementing this would change `add x6, x5, x5` (start from `2*p`) to `mul x6, x5, x5` (start from `p*p`), and the outer loop's test could then stop once `p*p` exceeds the limit. This significantly reduces redundant `strb` operations.
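In a high-level sketch (hypothetical function name), the change amounts to two lines: the inner loop's starting point and the outer loop's bound.

```python
def sieve_p_squared(limit):
    # Sketch of the p-squared optimization: start marking at p*p (in the
    # assembly, `mul x6, x5, x5` instead of `add x6, x5, x5`) and stop the
    # outer loop once p*p exceeds the limit.
    flags = bytearray(b"\x01" * (limit + 1))
    flags[0] = flags[1] = 0
    p = 2
    while p * p <= limit:              # was: while p <= limit
        if flags[p]:
            for m in range(p * p, limit + 1, p):   # was: range(2 * p, ...)
                flags[m] = 0
        p += 1
    return [i for i in range(2, limit + 1) if flags[i]]
```

The output is identical to the unoptimized version; only the amount of redundant marking work changes.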
2. Skip Even Numbers
After handling the prime number 2, all other even numbers are composite. We can optimize by creating a sieve array that only represents odd numbers. This halves the memory requirement and halves the number of iterations in the main loop. The mapping would be sieve_index = (number - 1) / 2. This adds some complexity to the address calculation but can yield substantial gains for large limits.
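A Python sketch of the odd-only layout (hypothetical helper, using the `(number - 1) / 2` mapping described above):

```python
def sieve_odds_only(limit):
    # Represent only odd numbers: index i <-> number 2*i + 1,
    # i.e. sieve_index = (number - 1) // 2. Halves memory and iterations.
    if limit < 2:
        return []
    size = (limit - 1) // 2 + 1          # flags for 1, 3, 5, ..., <= limit
    flags = bytearray(b"\x01" * size)
    flags[0] = 0                         # 1 is not prime
    p = 3
    while p * p <= limit:
        if flags[(p - 1) // 2]:
            # Step 2*p keeps the multiples odd (odd * odd = odd).
            for m in range(p * p, limit + 1, 2 * p):
                flags[(m - 1) // 2] = 0
        p += 2                           # only consider odd candidates
    # 2 is handled specially; every other prime comes from the odd flags.
    return [2] + [2 * i + 1 for i in range(1, size) if flags[i]]

print(sieve_odds_only(20))  # → [2, 3, 5, 7, 11, 13, 17, 19]
```

In assembly, the extra index arithmetic costs a subtract and a shift per access, which is usually well worth the halved memory footprint for large limits.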
3. Bit Packing
The current implementation uses one byte (8 bits) to store a single boolean flag. This is wasteful. We can "pack" 8 flags into a single byte, where each bit represents a number's primality. This reduces memory usage by a factor of 8, which improves cache performance dramatically. Accessing a specific bit requires bitwise operations (and, or, lsl, lsr) to isolate and modify the correct bit within a byte, which adds instruction overhead but is often a net win due to better memory locality.
Pros & Cons: Arm64 Assembly vs. High-Level Language (Python)
To put the choice of language in perspective, let's compare the trade-offs.
| Aspect | Arm64 Assembly | Python |
|---|---|---|
| Performance | Extremely high. Direct CPU control, no overhead. The absolute performance ceiling. | Significantly slower due to interpreter overhead, dynamic typing, and garbage collection. Can be improved with NumPy, but still slower. |
| Memory Usage | Minimal and precisely controlled. Memory is allocated on the stack and deallocated manually. Ideal for memory-constrained systems. | High. Python objects have significant overhead. A boolean list uses much more memory than a raw byte array. |
| Development Time | Very long. The learning curve is steep, and the code is verbose and hard to debug. Requires deep architectural knowledge. | Very short. The code is concise, readable, and easy to write and debug. The algorithm can be expressed in a few lines. |
| Portability | None. The code is specific to the Arm64 architecture and will not run on x86 or other CPUs. | Extremely high. Python code runs on any platform with a Python interpreter (Windows, macOS, Linux, etc.). |
| Best For | CPU/memory benchmarking, OS kernels, device drivers, embedded systems, and performance-critical library functions. | Rapid prototyping, data science, web development, scripting, and applications where developer productivity is more important than raw performance. |
Frequently Asked Questions (FAQ)
- 1. How do you compile and run this Arm64 assembly code?
- On a system with an Arm64 architecture (like a Raspberry Pi 4, an Apple Silicon Mac, or a Linux server) and the GNU toolchain, you would use the assembler (`as`) and the linker (`ld`). You would also need a C file to call your assembly function. Example C wrapper (`main.c`):

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

extern size_t sieve(uint64_t* primes, uint64_t limit);

int main() {
    uint64_t limit = 100;
    uint64_t* primes = malloc(limit * sizeof(uint64_t));
    size_t count = sieve(primes, limit);
    printf("Found %zu primes up to %" PRIu64 ":\n", count, limit);
    for (size_t i = 0; i < count; ++i) {
        printf("%" PRIu64 " ", primes[i]);
    }
    printf("\n");
    free(primes);
    return 0;
}
```

Compilation commands:

```sh
# Assemble the .s file into an object file
as sieve.s -o sieve.o
# Compile the C wrapper
gcc -c main.c -o main.o
# Link them together to create the executable
gcc main.o sieve.o -o sieve_runner
# Run the program
./sieve_runner
```

- 2. What does the `.globl sieve` directive do?
- The `.globl` (or `.global`) directive makes a symbol visible to the linker. In this case, `.globl sieve` tells the assembler that the `sieve` label should be accessible from other object files. This is what allows our C code to find and call the assembly function after linking.
- 3. Why is the stack space allocated as a multiple of 16?
- The Arm AArch64 Procedure Call Standard (AAPCS64) mandates that the stack pointer (`sp`) must be 16-byte aligned at all public function interfaces. Maintaining this alignment is crucial for performance, especially when using instructions like `stp`/`ldp` that access 16 bytes at once. It also ensures compatibility with system libraries and exception handling mechanisms.
- 4. What is the AArch64 calling convention?
- It's a standard set of rules for how functions pass arguments and return values. For AArch64:
  - The first eight integer/pointer arguments are passed in registers `x0` through `x7`.
  - The return value is passed back in register `x0`.
  - Registers `x0`-`x17` are "caller-saved": a function can use them freely but must assume they will be overwritten by any function it calls (`x18` is reserved as a platform register).
  - Registers `x19`-`x28` are "callee-saved": a function must save their original values before using them and restore them before returning. `x29` (the frame pointer) and `x30` (the link register) must likewise be preserved.

  Our `sieve` function only needs `x0` and `x1` for the input arguments and `x0` for the return value.
- 5. Can this code run on an Intel/AMD (x86-64) processor?
- No. Assembly language is architecture-specific. Arm64 and x86-64 have completely different instruction sets, register names, and calling conventions. To run this algorithm on an Intel or AMD CPU, it would need to be completely rewritten in x86-64 assembly.
- 6. Is the Sieve of Eratosthenes the most efficient prime-finding algorithm?
- For finding all primes up to a certain limit `n`, it is one of the most efficient. Its time complexity is approximately O(n log log n). More advanced algorithms like the Sieve of Atkin are theoretically faster with a complexity of O(n), but they are also significantly more complex to implement. For most practical purposes and limits, a well-optimized Sieve of Eratosthenes is extremely competitive.
Conclusion: The Power of Bare Metal
You've journeyed from a simple concept—sifting for prime numbers—to a deep, line-by-line understanding of its implementation on a modern, high-performance Arm64 processor. This exploration within the kodikra.com exclusive Arm64-assembly curriculum demonstrates that true optimization lies at the intersection of a clever algorithm and a profound understanding of the underlying hardware. By choosing assembly, you trade development speed for ultimate control, achieving performance that is simply out of reach for higher-level languages.
Whether you are benchmarking custom hardware, writing a high-performance computing library, or developing an operating system kernel, the principles learned here are invaluable. You now have the knowledge to read, write, and reason about low-level code, making you a more capable and insightful programmer across any language or platform.
Technology Disclaimer: The code and concepts discussed are based on the ARMv8-A architecture (AArch64). While assembly instructions are generally stable, always consult the official ARM architecture reference manual for the specific processor you are targeting for the most accurate and up-to-date information.
Ready for your next challenge? Continue your journey on our Arm64-assembly learning path and discover more advanced topics in systems programming.
Published by Kodikra — Your trusted Arm64-assembly learning resource.