Sieve in Arm64-assembly: Complete Solution & Deep Dive Guide


Mastering Prime Numbers: A Deep Dive into the Sieve of Eratosthenes with Arm64 Assembly

The Sieve of Eratosthenes is a highly efficient, ancient algorithm for finding all prime numbers up to a specified limit. Implementing it in Arm64 assembly unlocks unparalleled performance by directly manipulating memory and leveraging the raw power of modern CPU architectures, making it a perfect benchmark.

Imagine stumbling upon a treasure trove of computer parts at a garage sale. You meticulously assemble different combinations, creating unique custom machines. But how do you measure their true power? You need a benchmark—a task that pushes the CPU and memory to their absolute limits. This is where the Sieve of Eratosthenes, implemented in the bare-metal language of Arm64 assembly, becomes your ultimate tool. It's not just an algorithm; it's a rite of passage into the world of high-performance computing. This guide will walk you through every step, from theory to a line-by-line code breakdown, transforming you from a novice to an expert in low-level optimization.


What Is the Sieve of Eratosthenes Algorithm?

The Sieve of Eratosthenes is an elegant and surprisingly simple algorithm for finding prime numbers. Named after the Greek mathematician Eratosthenes of Cyrene, its method is analogous to sifting sand through a sieve to leave only the pebbles behind. In our case, we "sift" out composite (non-prime) numbers, leaving only the primes.

The core idea is to perform a series of eliminations. You start with a list of consecutive integers from 2 up to your desired limit. You then iteratively mark the multiples of each prime number, starting with the first prime, 2.

The Step-by-Step Process:

  1. Create a List: Generate a list of consecutive integers from 2 to your limit, n. Initially, assume all of them are prime. A boolean array (or a byte array in our assembly implementation) is perfect for this, where an index i corresponds to the number i, and its value (true/false) indicates primality.
  2. Start with the First Prime: The first number in your list, 2, is a prime.
  3. Mark the Multiples: Go through the list and "cross out" or "mark" all multiples of 2 (4, 6, 8, etc.) as not prime.
  4. Move to the Next Unmarked Number: The next number in the list that hasn't been marked is 3. This must be a prime.
  5. Repeat the Marking: Now, mark all multiples of 3 (6, 9, 12, etc.) as not prime. Note that some, like 6, will already be marked.
  6. Continue the Process: The next unmarked number is 5 (since 4 was marked). Repeat the process, marking all multiples of 5. Continue this until you reach a prime whose square is greater than your limit, n. Any number remaining unmarked in the list is a prime number.

This visual, iterative process is what makes the algorithm so efficient. Instead of testing each number for divisibility (a slow process), it eliminates entire sets of composite numbers in single passes.

Conceptual Flow Diagram

Here is a simplified visual representation of the Sieve algorithm's logic for finding primes up to a small limit, like 20.

    ● Start (Limit = 20)
    │
    ▼
  ┌───────────────────────────────┐
  │ Create boolean array `isPrime` │
  │ of size 21, all set to `true`  │
  └──────────────┬────────────────┘
                 │
                 ▼
    ◆ p = 2 (First prime)
    │
    ▼
  ┌───────────────────────────────┐
  │ Mark multiples of 2 as `false` │
  │ (4, 6, 8, 10, 12, 14, 16, 18, 20) │
  └──────────────┬────────────────┘
                 │
                 ▼
    ◆ p = 3 (Next unmarked number)
    │
    ▼
  ┌───────────────────────────────┐
  │ Mark multiples of 3 as `false` │
  │ (6, 9, 12, 15, 18)             │
  └──────────────┬────────────────┘
                 │
                 ▼
    ◆ p = 5 (Next unmarked, as 4 is marked)
    │ p*p > 20? Yes (5*5 = 25), so sieving stops
    │
    │
    ▼
  ┌───────────────────────────────┐
  │ Collect all numbers `i` where  │
  │ `isPrime[i]` is still `true`    │
  └──────────────┬────────────────┘
                 │
                 ▼
    ● End (Primes: 2, 3, 5, 7, 11, 13, 17, 19)

Why Use Arm64 Assembly for This Task?

In an age of high-level languages like Python and JavaScript, choosing to write in assembly might seem like an archaic decision. However, for tasks that demand the absolute peak of performance and efficiency, assembly language is not just relevant; it's essential. The Arm64 (or AArch64) architecture, which powers everything from Apple's M-series chips to the world's most powerful supercomputers and countless mobile devices, is a prime candidate for such low-level optimization.

The Core Advantages

  • Unmatched Performance: Assembly gives you direct, one-to-one control over the processor's instructions. There is no abstraction layer, no interpreter, and no garbage collector. Every instruction you write is exactly what the CPU executes. For an algorithm like the Sieve, which involves tight loops and massive memory manipulation, this translates to blistering speed.
  • Direct Memory Manipulation: The Sieve algorithm's performance is heavily tied to how quickly it can access and modify its boolean "sieve" array. In Arm64 assembly, you can use instructions like stp (Store Pair) to write 16 bytes at a time, or strb (Store Byte) for precise marking. This fine-grained control over memory access patterns is difficult to achieve in most other languages.
  • Full Register Control: High-level language compilers do a good job of register allocation, but they aren't perfect. By writing assembly, you decide exactly which values live in which of the 31 general-purpose registers (x0-x30). This allows you to keep critical loop counters, pointers, and limits in the fastest possible storage, avoiding slower memory access.
  • Minimal Code Size: An assembly program, when written well, results in a tiny executable file. There are no large runtimes or libraries to bundle. This is crucial for embedded systems, bootloaders, and performance-critical kernels where every byte counts.
  • A Deeper Understanding: Writing assembly forces you to understand how a computer truly works at a fundamental level. You learn about the CPU pipeline, memory hierarchy, calling conventions, and system calls. This knowledge makes you a better programmer, even when you return to high-level languages.

For our benchmarking scenario, these advantages are paramount. We want to measure the raw capability of the hardware, and using Arm64 assembly ensures that our measurement is not clouded by the overhead of a language runtime or an inefficient compiler optimization. We are speaking the CPU's native tongue.


How the Arm64 Assembly Implementation Works: A Detailed Code Walkthrough

Now, let's dissect the Arm64 assembly code from the kodikra learning path module. This implementation is a masterclass in efficiency, leveraging specific features of the AArch64 instruction set. We will break it down section by section, explaining the purpose of each instruction and the role of every register.

The code's primary goal is to take a pointer to an output array (primes) and a numeric limit, and fill the array with all prime numbers up to that limit.

The Complete Source Code


.text
.globl sieve

/*
 * extern size_t sieve(uint64_t* primes, uint64_t limit);
 *
 * Register usage according to AArch64 calling convention:
 * x0: primes (pointer to output array), later used as return value (count)
 * x1: limit
 *
 * Local register usage:
 * x2: limit_rounded_up (size of sieve array on stack)
 * x3: loop counter / pointer for initialization
 * x4: fill value (-1, representing all bits set to 1)
 * x5: stack pointer base / loop counter for sieving
 * x6: inner loop counter / address offset
 * x7: temporary register
 * x8: prime count
 * wzr: zero register (used for storing 0)
 */
sieve:
    // --- 1. Prologue & Stack Allocation ---
    add     x2, x1, #16
    and     x2, x2, #-16    // limit+1 rounded up to a multiple of 16 (flags for indices 0..limit)
    mov     x3, sp
    sub     sp, sp, x2      // Allocate space on the stack for the sieve array

    // --- 2. Initialize Sieve Array to all 'true' (0xff) ---
    mov     x4, #-1         // -1 is 0xffffffffffffffff, a repeating pattern of 1s
    mov     x5, sp          // x5 holds the base address of our sieve array
.fill:
    stp     x4, x4, [x3, #-16]! // Store 16 bytes (two 64-bit values) of 1s, pre-decrement
    cmp     x3, x5
    bne     .fill           // Loop until the entire allocated space is filled

    // --- 3. Mark 0 and 1 as not prime ---
    mov     x3, #0
    strb    wzr, [x5]       // Mark 0 as not prime
    strb    wzr, [x5, #1]   // Mark 1 as not prime

    // --- 4. Main Sieving Loop ---
    mov     x5, #2          // Start sieving from p = 2
.outer_loop_test:
    cmp     x5, x1          // while (p <= limit)
    bgt     .collect        // If p > limit, we are done sieving

    ldrb    w7, [sp, x5]    // Load the byte for number 'p'
    cbz     w7, .outer_loop_increment // If is_prime[p] is 0 (false), skip to next p

    // p is prime, now mark its multiples
    add     x6, x5, x5      // Start marking from 2*p
.inner_loop:
    cmp     x6, x1          // while (multiple <= limit)
    bgt     .outer_loop_increment
    strb    wzr, [sp, x6]   // Mark is_prime[multiple] as 0 (false)
    add     x6, x6, x5      // Go to the next multiple
    b       .inner_loop

.outer_loop_increment:
    add     x5, x5, #1
    b       .outer_loop_test

    // --- 5. Collect Primes ---
.collect:
    mov     x5, #2          // Start checking from 2
    mov     x8, #0          // x8 will be our prime counter
.collect_loop:
    cmp     x5, x1          // while (i <= limit)
    bgt     .epilogue
    
    ldrb    w7, [sp, x5]    // Load the byte for number 'i'
    cbz     w7, .collect_loop_increment // If is_prime[i] is 0, skip it

    // It's a prime, store it in the output array
    str     x5, [x0], #8    // Store prime number, and post-increment pointer
    add     x8, x8, #1      // Increment prime count

.collect_loop_increment:
    add     x5, x5, #1
    b       .collect_loop

    // --- 6. Epilogue & Return ---
.epilogue:
    add     sp, sp, x2      // Deallocate stack space
    mov     x0, x8          // Move the prime count to the return register x0
    ret

Section 1: Prologue & Stack Allocation

Every function in assembly needs to manage its own space on the stack. This section prepares a temporary "sieve" array directly on the stack.

  • add x2, x1, #16: Takes the limit (in x1) and adds 16. The array needs limit + 1 bytes (one flag for each index from 0 to limit), and the extra padding sets up the rounding in the next instruction.
  • and x2, x2, #-16: This is a clever trick for rounding. In binary, -16 is ...11110000, so a bitwise AND with it clears the last four bits, rounding the value down to a multiple of 16. Combined with the add above, the net effect is to round limit + 1 up to the next multiple of 16. This ensures our stack pointer remains 16-byte aligned, as required by the AArch64 procedure call standard.
  • mov x3, sp: Stores the current stack pointer (top of the stack before allocation) in x3.
  • sub sp, sp, x2: Allocates the required space on the stack by moving the stack pointer down. Our sieve array now exists in the memory between the new sp and the old value stored in x3.

Section 2: Initialize Sieve Array

We need to initialize our array, assuming all numbers are prime. We'll use a non-zero byte (0xff, as written by the fill loop below) to represent "prime" and 0 for "not prime". This loop fills the memory very quickly.

  • mov x4, #-1: Moves the value -1 into register x4. In 64-bit two's complement, -1 is represented as all bits being set to 1 (0xFFFFFFFFFFFFFFFF).
  • mov x5, sp: Stores the base address of our newly allocated sieve array into x5 for a comparison later.
  • .fill:: A label marking the beginning of our initialization loop.
  • stp x4, x4, [x3, #-16]!: This is a highly efficient instruction. STP stands for "Store Pair". It stores two 64-bit registers (in this case, x4 and x4 again) into memory. The address is calculated from x3 with an offset of -16. The ! at the end signifies "pre-index," meaning x3 is updated to this new address before the store happens. In one go, we write 16 bytes of all 1s and move our pointer, making this loop extremely fast.
  • cmp x3, x5: Compares our moving pointer (x3) with the base of the array (x5).
  • bne .fill: "Branch if Not Equal". If x3 hasn't reached the bottom of the allocated space yet, it jumps back to the .fill label to continue.

Section 3: Mark 0 and 1 as Not Prime

By definition, 0 and 1 are not prime numbers. We explicitly mark them as such.

  • mov x3, #0: This instruction is not strictly necessary here and might be a leftover from a different logic structure. The key instructions are the next two.
  • strb wzr, [x5]: STRB means "Store Byte". It takes a byte from a register and stores it in memory. wzr is the special "zero register" (the 32-bit version of xzr), which always reads as 0. We are storing a byte with value 0 at the base address of our sieve ([x5]), marking the number 0 as not prime.
  • strb wzr, [x5, #1]: Similarly, this stores a zero byte at the address x5 + 1, marking the number 1 as not prime.

Section 4: Main Sieving Loop

This is the heart of the algorithm, where we iterate through numbers and mark their multiples.

    ● Start Sieving (p = 2)
    │
    ▼
  ┌───────────────────┐
  │ .outer_loop_test  │
  │ Is p <= limit?    │
  └─────────┬─────────┘
            │ Yes
            ▼
    ◆ Is isPrime[p] true?
   ╱ (ldrb w7, [sp, x5]) ╲
  No (cbz)              Yes
  │                      │
  │                      ▼
  │                    ┌──────────────────────┐
  │                    │ .inner_loop          │
  │                    │ Mark multiples of p  │
  │                    │ (2*p, 3*p, ...)      │
  │                    └──────────┬───────────┘
  │                               │
  └─────────────────┬─────────────┘
                    │
                    ▼
  ┌─────────────────────────────────┐
  │ .outer_loop_increment           │
  │ p = p + 1                       │
  └─────────────────┬───────────────┘
                    │
                    └─> (Branch back to .outer_loop_test)

  • mov x5, #2: We use x5 as our prime candidate counter, p. We start with 2.
  • .outer_loop_test:: The start of the outer loop.
  • cmp x5, x1 / bgt .collect: Compares our current prime candidate p (in x5) with the limit (in x1). If p is greater than the limit, the sieving process is done, and we branch to the .collect section.
  • ldrb w7, [sp, x5]: LDRB means "Load Byte". It loads a single byte from memory at address sp + x5 into the 32-bit register w7. This checks if the number p is currently marked as prime.
  • cbz w7, .outer_loop_increment: CBZ stands for "Compare and Branch if Zero". If the byte we loaded is 0 (meaning p was already marked as not prime), we skip the inner loop entirely and jump to the increment step.
  • add x6, x5, x5: If p is prime, we prepare for the inner loop. We initialize our multiple-counter x6 to p + p, or 2*p.
  • .inner_loop:: The start of the inner loop for marking multiples.
  • cmp x6, x1 / bgt .outer_loop_increment: Checks if the current multiple (in x6) has exceeded the limit. If so, we're done with this p and branch to the outer loop's increment step.
  • strb wzr, [sp, x6]: Marks the current multiple as not prime by storing a zero byte at its corresponding index in the sieve array.
  • add x6, x6, x5: Increments the multiple by p (e.g., from 2*p to 3*p, then 4*p, etc.).
  • b .inner_loop: Unconditionally branches back to the start of the inner loop.
  • .outer_loop_increment:: Increments our prime candidate p by one.
  • b .outer_loop_test: Jumps back to the start of the outer loop to test the next number.

Section 5: Collect Primes

After the sieve is complete, we iterate through our boolean array one last time to gather all the numbers that are still marked as prime.

  • mov x5, #2: We reuse x5 as our loop counter i, starting from 2.
  • mov x8, #0: We use x8 to count the number of primes found. It's initialized to 0.
  • .collect_loop:: The start of the collection loop.
  • cmp x5, x1 / bgt .epilogue: Checks if our counter i has exceeded the limit. If so, we branch to the function's end.
  • ldrb w7, [sp, x5]: Loads the primality flag for the number i.
  • cbz w7, .collect_loop_increment: If the flag is 0, the number is not prime, so we skip it and jump to the increment step.
  • str x5, [x0], #8: This is a key instruction. STR stores a full 64-bit register. It stores the prime number i (from x5) into the memory location pointed to by x0 (our output primes array). The , #8 at the end signifies "post-index," meaning the pointer in x0 is automatically incremented by 8 bytes (the size of a uint64_t) after the store. This elegantly prepares x0 for the next prime.
  • add x8, x8, #1: Increments our prime counter.
  • .collect_loop_increment:: Increments our counter i by one and branches back to the start of the collection loop.

Section 6: Epilogue & Return

Finally, we clean up the stack and return the result.

  • add sp, sp, x2: Deallocates the memory we used on the stack by moving the stack pointer back to its original position.
  • mov x0, x8: The AArch64 calling convention dictates that the return value is placed in register x0. We move our prime count from x8 into x0.
  • ret: Returns control to the calling function.

Potential Optimizations and Considerations

While the provided solution is robust and efficient, several classic optimizations for the Sieve of Eratosthenes can push performance even further. Understanding these is key to mastering low-level programming.

1. Start Marking from p-squared

A crucial insight is that when marking multiples of a prime p, all smaller composite numbers have already been marked by smaller primes. For example, when we get to p=5, the numbers 2*5=10, 3*5=15, and 4*5=20 have already been marked by primes 2 and 3. The first multiple we truly need to mark is p*p.

Implementing this would replace add x6, x5, x5 (start from 2*p) with mul x6, x5, x5 (start from p*p), significantly reducing redundant `strb` operations. The outer loop's termination test can then also be tightened to stop once p*p exceeds the limit.

2. Skip Even Numbers

After handling the prime number 2, all other even numbers are composite. We can optimize by creating a sieve array that only represents odd numbers. This halves the memory requirement and halves the number of iterations in the main loop. The mapping would be sieve_index = (number - 1) / 2. This adds some complexity to the address calculation but can yield substantial gains for large limits.

3. Bit Packing

The current implementation uses one byte (8 bits) to store a single boolean flag. This is wasteful. We can "pack" 8 flags into a single byte, where each bit represents a number's primality. This reduces memory usage by a factor of 8, which improves cache performance dramatically. Accessing a specific bit requires bitwise operations (and, or, lsl, lsr) to isolate and modify the correct bit within a byte, which adds instruction overhead but is often a net win due to better memory locality.

Pros & Cons: Arm64 Assembly vs. High-Level Language (Python)

To put the choice of language in perspective, let's compare the trade-offs.

| Aspect | Arm64 Assembly | Python |
| --- | --- | --- |
| Performance | Extremely high. Direct CPU control, no overhead: the absolute performance ceiling. | Significantly slower due to interpreter overhead, dynamic typing, and garbage collection. Can be improved with NumPy, but still slower. |
| Memory usage | Minimal and precisely controlled. Memory is allocated on the stack and deallocated manually. Ideal for memory-constrained systems. | High. Python objects have significant overhead; a boolean list uses far more memory than a raw byte array. |
| Development time | Very long. The learning curve is steep, and the code is verbose and hard to debug. Requires deep architectural knowledge. | Very short. The code is concise, readable, and easy to write and debug; the algorithm fits in a few lines. |
| Portability | None. The code is specific to the Arm64 architecture and will not run on x86 or other CPUs. | Extremely high. Python runs on any platform with an interpreter (Windows, macOS, Linux, etc.). |
| Best for | CPU/memory benchmarking, OS kernels, device drivers, embedded systems, and performance-critical library functions. | Rapid prototyping, data science, web development, scripting, and applications where developer productivity outweighs raw performance. |

Frequently Asked Questions (FAQ)

1. How do you compile and run this Arm64 assembly code?

On a system with an Arm64 architecture (like a Raspberry Pi 4, an Apple Silicon Mac, or a Linux server) and the GNU toolchain, you would use the assembler (as) and the linker (ld). You would also need a C file to call your assembly function.

Example C wrapper (main.c):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

extern size_t sieve(uint64_t* primes, uint64_t limit);

int main() {
    uint64_t limit = 100;
    uint64_t* primes = malloc(limit * sizeof(uint64_t));
    if (primes == NULL) return 1;
    size_t count = sieve(primes, limit);
    printf("Found %zu primes up to %" PRIu64 ":\n", count, limit);
    for (size_t i = 0; i < count; ++i) {
        printf("%" PRIu64 " ", primes[i]);
    }
    printf("\n");
    free(primes);
    return 0;
}

Compilation commands:

# Assemble the .s file into an object file
as sieve.s -o sieve.o

# Compile the C wrapper
gcc -c main.c -o main.o

# Link them together to create the executable
gcc main.o sieve.o -o sieve_runner

# Run the program
./sieve_runner

2. What does the .globl sieve directive do?

The .globl (or .global) directive makes a symbol visible to the linker. In this case, .globl sieve tells the assembler that the sieve label should be accessible from other object files. This is what allows our C code to find and call the assembly function after linking.

3. Why is the stack space allocated as a multiple of 16?

The Arm AArch64 Procedure Call Standard (AAPCS64) mandates that the stack pointer (sp) must be 16-byte aligned whenever it is used to access memory. Maintaining this alignment also helps performance when using instructions like stp/ldp that access 16 bytes at once, and it ensures compatibility with system libraries and exception handling mechanisms.

4. What is the AArch64 calling convention?

It's a standard set of rules for how functions pass arguments and return values. For AArch64:

  • The first eight integer/pointer arguments are passed in registers x0 through x7.
  • The return value is passed back in register x0.
  • Registers x0-x17 are "caller-saved" (a function can use them freely but must assume they will be overwritten by any function it calls); x18 is reserved as the platform register on some systems.
  • Registers x19-x28 are "callee-saved" (a function must save their original values before using them and restore them before returning), while x29 (the frame pointer) and x30 (the link register) must also be preserved.

Our code correctly uses x0 and x1 for the input arguments and x0 for the return value.

5. Can this code run on an Intel/AMD (x86-64) processor?

No. Assembly language is architecture-specific. Arm64 and x86-64 have completely different instruction sets, register names, and calling conventions. To run this algorithm on an Intel or AMD CPU, it would need to be completely rewritten in x86-64 assembly.

6. Is the Sieve of Eratosthenes the most efficient prime-finding algorithm?

For finding all primes up to a certain limit n, it is one of the most efficient. Its time complexity is approximately O(n log log n). More advanced algorithms like the Sieve of Atkin are theoretically faster with a complexity of O(n), but they are also significantly more complex to implement. For most practical purposes and limits, a well-optimized Sieve of Eratosthenes is extremely competitive.

Conclusion: The Power of Bare Metal

You've journeyed from a simple concept—sifting for prime numbers—to a deep, line-by-line understanding of its implementation on a modern, high-performance Arm64 processor. This exploration within the kodikra.com exclusive Arm64-assembly curriculum demonstrates that true optimization lies at the intersection of a clever algorithm and a profound understanding of the underlying hardware. By choosing assembly, you trade development speed for ultimate control, achieving performance that is simply out of reach for higher-level languages.

Whether you are benchmarking custom hardware, writing a high-performance computing library, or developing an operating system kernel, the principles learned here are invaluable. You now have the knowledge to read, write, and reason about low-level code, making you a more capable and insightful programmer across any language or platform.

Technology Disclaimer: The code and concepts discussed are based on the ARMv8-A architecture (AArch64). While assembly instructions are generally stable, always consult the official ARM architecture reference manual for the specific processor you are targeting for the most accurate and up-to-date information.

Ready for your next challenge? Continue your journey on our Arm64-assembly learning path and discover more advanced topics in systems programming.


Published by Kodikra — Your trusted Arm64-assembly learning resource.