Prime Factors in Arm64-assembly: Complete Solution & Deep Dive Guide

Mastering Prime Factorization: A Deep Dive into Arm64 Assembly

Prime factorization is a fundamental concept in number theory, yet implementing it at the lowest level of abstraction—assembly language—can feel like a monumental task. This guide breaks down the process of computing prime factors using Arm64 assembly, transforming a complex algorithm into a clear, manageable, and highly efficient solution for modern 64-bit ARM processors.

The Challenge: Deconstructing Numbers at the CPU Level

You've likely written a function to find prime factors in Python, Java, or C++. It's a classic computer science exercise. But have you ever wondered what's happening under the hood? How does the CPU, with its limited set of instructions for adding, subtracting, and moving data, actually perform this complex logical task?

Attempting this in assembly language often feels like trying to build a skyscraper with nothing but a hammer and nails. The familiar comfort of `for` loops, `while` loops, and the modulo operator (`%`) vanishes. You're left with registers, memory addresses, and conditional branches. This guide is your blueprint. We will demystify the process, showing you how to translate the high-level logic of prime factorization into the raw, powerful instructions of Arm64 assembly. By the end, you'll not only have a working solution but a much deeper appreciation for the bridge between software logic and hardware execution.

What Are Prime Factors?

Before diving into the code, let's establish a solid foundation. A prime factor is a prime number that divides another integer exactly, without leaving a remainder. The process of prime factorization is breaking down a composite number into the set of prime numbers that, when multiplied together, produce the original number.

For example, let's take the number 60.

We start with the smallest prime, 2. Does 2 divide 60? Yes. 60 / 2 = 30. Our factors so far: [2].
We check 2 again with our new number, 30. Does 2 divide 30? Yes. 30 / 2 = 15. Our factors so far: [2, 2].
We check 2 again with 15. It does not divide evenly.
We move to the next prime, 3. Does 3 divide 15? Yes. 15 / 3 = 5. Our factors so far: [2, 2, 3].
We check 3 again with 5. It does not divide evenly.
We move to the next potential divisor, 4 (which is not prime, but our algorithm handles this), and then 5. Does 5 divide 5? Yes. 5 / 5 = 1. Our factors so far: [2, 2, 3, 5].

Once our number is reduced to 1, the process is complete. The prime factors of 60 are 2, 2, 3, and 5.

A crucial rule in this process is that 1 is not a prime number. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. Since 1 only has one positive divisor (itself), it does not fit the definition.

Why is Prime Factorization Critical in Modern Computing?

This isn't just a theoretical math problem; prime factorization is the bedrock of modern digital security and other computational fields. Its significance stems from a simple fact: it's relatively easy to multiply large prime numbers together, but it is computationally very difficult to take the resulting product and find the original prime factors.

Here are some key areas where this concept is applied:

Cryptography: The most famous example is the RSA (Rivest–Shamir–Adleman) algorithm, which is fundamental to SSL/TLS certificates that secure web traffic (HTTPS). It relies on the difficulty of factoring the product of two very large prime numbers.
Algorithm Design: Many algorithms in number theory, such as Pollard's rho algorithm for integer factorization, are built upon these principles.
High-Performance Computing: In scientific computing and simulations, understanding the prime factors of numbers used in array dimensions or loop bounds can help in optimizing algorithms for better cache performance and parallelization.

Implementing this in Arm64 assembly is particularly relevant for performance-critical applications in embedded systems, mobile devices (which heavily use ARM architecture), and server infrastructure where every CPU cycle counts.

How to Calculate Prime Factors in Arm64 Assembly

We'll implement the "trial division" algorithm, which is the most straightforward method and the one described in our example with the number 60. It involves iteratively dividing the target number by potential divisors, starting from 2.

The Algorithm Logic

Here is the high-level logic we will translate into assembly code. Let n be the number we want to factor.

Initialize a divisor d = 2.
Initialize an empty list to store the factors.
Outer Loop: While n > 1:
1. Inner Loop: While n is divisible by d:
  1. Add d to our list of factors.
  2. Update n to be n / d.
2. Increment the divisor: d = d + 1.
Return the list of factors.

This approach works because by the time we test a composite number (like 4 or 6) as a divisor, we have already divided out all of its smaller prime factors. For instance, we will have already removed all factors of 2 before we ever test 4, so 4 will never divide the remaining number.

Visualizing the Algorithm Flow

This ASCII diagram illustrates the nested loop structure of our trial division algorithm.

    ● Start (Input: n, buffer_ptr)
    │
    ▼
  ┌───────────────────────────┐
  │ Initialize divisor d = 2  │
  │ Initialize count = 0      │
  └────────────┬──────────────┘
               │
               ▼
  ┌─── Outer Loop (while n > 1) ───┐
  │            │                   │
  │            ▼                   │
  │     ◆ Is n divisible by d? ◆   │
  │    ╱          ╲                │
  │  Yes           No              │
  │   │            │               │
  │   ▼            │               │
  │ ┌───────────────────────────┐  │
  │ │ Inner Loop:               │  │
  │ │  - Store d in buffer      │  │
  │ │  - n = n / d              │  │
  │ │  - Increment count        │  │
  │ │  - Repeat from divisibility check │  │
  │ └───────────────────────────┘  │
  │                  │               │
  │                  ▼               │
  │            ┌───────────────┐   │
  │            │ d = d + 1     │   │
  │            └───────────────┘   │
  │                  │               │
  └──────────────────▲───────────────┘
                     │ (Loop back if n > 1)
                     ▼
                 ● End (Return count)

The Complete Arm64 Assembly Solution

Here is the full, commented source code for finding prime factors. This function adheres to the standard ARM 64-bit Procedure Call Standard (AAPCS), where the first argument (our number n) is in register x0, the second argument (the pointer to the results buffer) is in x1, and the return value (the count of factors) is placed in x0.


.global factors

// Function: factors
// Description: Computes the prime factors of a given 64-bit unsigned integer.
// Arguments:
//   x0: The number 'n' to be factored.
//   x1: A pointer to a buffer where the prime factors will be stored.
// Returns:
//   x0: The total count of prime factors found.
//
// Register Usage:
//   x0: Holds the remaining value of 'n' during factorization. Used for return value.
//   x1: Pointer to the current position in the output buffer.
//   x19 (callee-saved): Stores the original number 'n' for comparison.
//   x20 (callee-saved): Stores the current divisor 'd'.
//   x21 (callee-saved): Stores the count of factors found.
//   x22: Temporary register for division result (quotient).
//   x23: Temporary register for remainder calculation.

factors:
    // --- Function Prologue ---
    // Save callee-saved registers we will modify onto the stack.
    // We need to save x19, x20, x21, and the link register (x30).
    stp x19, x20, [sp, #-32]!
    stp x21, x30, [sp, #16]

    // --- Initialization ---
    mov x19, x0          // x19 = n (original number, for loop condition)
    mov x20, #2          // x20 = divisor 'd', starting at 2
    mov x21, #0          // x21 = factor_count, starting at 0

outer_loop:
    // Outer loop condition: while (n > 1)
    // We use the current value of n in x0 for this check.
    cmp x0, #1
    b.le end_factors     // If n <= 1, we are done.

inner_loop:
    // --- Modulo Operation: n % d ---
    // Arm64 doesn't have a modulo instruction. We calculate it using:
    // remainder = n - (n / d) * d
    udiv x22, x0, x20    // x22 = n / d (quotient)
    msub x23, x22, x20, x0 // x23 = n - (x22 * x20) which is the remainder

    // Check if remainder is zero (i.e., n is divisible by d)
    cmp x23, #0
    b.ne increment_divisor // If remainder is not 0, jump to increment the divisor

    // --- Found a factor ---
    // Store the divisor 'd' in the buffer
    str x20, [x1], #8    // Store d (x20) at address in x1, then post-increment x1 by 8 bytes

    // Update n = n / d
    mov x0, x22          // The result of the division is already in x22

    // Increment the factor count
    add x21, x21, #1

    // Loop back to check if the new 'n' is also divisible by 'd'
    b inner_loop

increment_divisor:
    // Increment the divisor 'd'
    add x20, x20, #1

    // Loop back to the start of the outer loop to check the new divisor
    b outer_loop

end_factors:
    // --- Function Epilogue ---
    // Prepare the return value
    mov x0, x21          // Move the final factor_count into x0

    // Restore callee-saved registers from the stack
    ldp x21, x30, [sp, #16]
    ldp x19, x20, [sp], #32

    // Return to the caller
    ret

Detailed Code Walkthrough

Let's dissect the assembly code to understand how each part contributes to the final result.

1. Function Prologue


stp x19, x20, [sp, #-32]!
stp x21, x30, [sp, #16]

The Arm64 calling convention requires that certain registers (x19 through x28) maintain their values across a function call. Since our function uses x19, x20, and x21, we must save their original values on the stack before using them. We also save x30, the link register, which holds the return address. stp (Store Pair) is an efficient instruction for saving two registers at once. The `!` indicates pre-indexing, meaning the stack pointer sp is adjusted before the store operation.

2. Initialization


mov x19, x0
mov x20, #2
mov x21, #0

Here, we set up our main variables:

x19 holds a copy of the original number n. While we modify x0 throughout the process, we might need the original value later (though in this specific implementation, it's more of a good practice for debugging). The primary value of n being factored is kept in x0.
x20 is our divisor d, which starts at 2.
x21 is our factor counter, initialized to 0.

3. The Outer Loop


outer_loop:
    cmp x0, #1
    b.le end_factors

This is the entry point of our main loop. The logic is `while (n > 1)`. The cmp instruction compares the current value of n (in x0) with 1. The b.le (Branch if Less Than or Equal) instruction jumps to the end of the function if the condition is met, effectively breaking the loop.

4. The Modulo Operation and Inner Loop Condition


inner_loop:
    udiv x22, x0, x20
    msub x23, x22, x20, x0
    cmp x23, #0
    b.ne increment_divisor

This is the most clever part of the code.

udiv x22, x0, x20 performs an unsigned division: x22 = x0 / x20 (or quotient = n / d).
msub x23, x22, x20, x0 is the key to the modulo. It stands for "Multiply-Subtract" and calculates x23 = x0 - (x22 * x20). This is exactly the definition of the remainder.
cmp x23, #0 compares the remainder to zero.
b.ne (Branch if Not Equal) jumps past the factor-handling logic if the remainder is not zero, meaning d is not a factor.

5. Handling a Found Factor


str x20, [x1], #8
mov x0, x22
add x21, x21, #1
b inner_loop

If the branch is not taken, it means we've found a factor.

str x20, [x1], #8 stores the divisor d (from x20) into the memory location pointed to by x1. The , #8 part is post-indexing: after the store, x1 is advanced by 8 bytes (the size of a 64-bit integer) to point to the next available slot in the buffer.
mov x0, x22 updates n with the quotient of the division (n = n / d).
add x21, x21, #1 increments our factor counter.
b inner_loop unconditionally branches back to the start of the inner loop to check if this same divisor d is a factor of the new, smaller n.

6. Incrementing the Divisor


increment_divisor:
    add x20, x20, #1
    b outer_loop

This code is executed when the inner loop finishes for a given divisor (i.e., d no longer divides n evenly). We simply add 1 to our divisor in x20 and then branch back to the start of the outer_loop to continue the process with the new divisor.

7. Function Epilogue


end_factors:
    mov x0, x21
    ldp x21, x30, [sp, #16]
    ldp x19, x20, [sp], #32
    ret

Once the outer_loop condition (n > 1) fails, we jump here.

mov x0, x21 places the final factor count into x0, the designated return value register.
The two ldp (Load Pair) instructions do the reverse of the stp instructions at the beginning: they restore the original values of the callee-saved registers and the link register from the stack. The final `[sp], #32` also adjusts the stack pointer back to its original position.
ret returns control to the calling function.

Where This Low-Level Approach Shines

Writing code in assembly is a trade-off. While it's more complex and less portable than a high-level language, it provides unparalleled control over the hardware. This is crucial in specific domains:

Embedded Systems: In devices with tight memory and power constraints, hand-optimized assembly can ensure an algorithm runs within its resource budget.
Operating System Kernels: Core OS components that manage hardware directly, like schedulers or memory managers, are often written in assembly.
Compiler Development: Understanding assembly is essential for anyone writing compilers, as it's the target language for code generation.
Reverse Engineering & Security: Security researchers analyze machine code to find vulnerabilities, and a deep understanding of assembly is non-negotiable.

Assembly vs. High-Level Language: A Comparison

To provide perspective, here's a table comparing the implementation of prime factorization in Arm64 assembly versus a high-level language like Python.

Aspect	Arm64 Assembly	Python
Performance	Extremely high. Direct CPU instruction execution with no overhead.	Significantly slower due to interpreter overhead, dynamic typing, and abstractions.
Control	Total control over registers, memory, and CPU instructions.	Abstracted control. The interpreter/compiler manages memory and register allocation.
Development Time	Very long. Requires deep knowledge of architecture and careful manual management.	Very short. Expressive syntax and built-in data structures make it fast to write.
Readability & Maintainability	Low. Code is verbose and tied to the hardware architecture. Difficult for others to understand.	High. Code is concise, clear, and easy to read and maintain.
Portability	None. Tied specifically to the Arm64 architecture.	Excellent. Runs on any machine with a Python interpreter.

Stack Frame and Register Usage Visualization

This diagram shows how the stack and registers are organized during our function call. Understanding the stack frame is fundamental to low-level programming.

    Caller's Frame
    ──────────────────
    ...
    ──────────────────
    sp ⟶ ┌──────────────────┐  (Stack Pointer before our call)
         │ ...              │
         ├──────────────────┤
         │ Argument 2 (ptr) │ ───> Passed in x1
         ├──────────────────┤
         │ Argument 1 (n)   │ ───> Passed in x0
         └──────────────────┘

          Function `factors` is called...

    sp ⟶ ┌──────────────────┐  (sp is adjusted by `stp ... [sp, #-32]!`)
-32  →   │ x19 (saved)      │
         ├──────────────────┤
-24  →   │ x20 (saved)      │
         ├──────────────────┤
-16  →   │ x21 (saved)      │
         ├──────────────────┤
 -8  →   │ x30 (Link Reg)   │
         ├──────────────────┤
  0  →   │ Original sp      │
         └──────────────────┘
            Our Function's Stack Frame

    During execution:
    x0: current 'n'
    x1: buffer_ptr
    x19: copy of n (callee-saved)
    x20: divisor 'd' (callee-saved)
    x21: count (callee-saved)

Frequently Asked Questions (FAQ)

What is the largest number this Arm64 code can handle?

This code uses 64-bit registers (x0, x19, etc.) and the udiv instruction for 64-bit unsigned integers. Therefore, it can handle any unsigned 64-bit integer, which has a maximum value of 2^64 - 1, or 18,446,744,073,709,551,615.

Why start the divisor at 2 and not 1?

By definition, 1 is not a prime number. Furthermore, every number is divisible by 1, so including it would lead to an infinite loop where we would continuously divide n by 1 without it ever changing. Starting with 2, the smallest prime number, is the correct first step for the trial division algorithm.

How do you compile and run this Arm64 assembly code?

You would typically use a toolchain like GCC. First, you save the assembly code as a .s file (e.g., factors.s). Then you can compile and link it, often with a C "driver" program to call it.

1. Assemble the .s file:


as -o factors.o factors.s

2. Create a C driver (e.g., `main.c`) to call the function.

3. Link them together:


gcc -o my_program main.c factors.o

This process is essential for testing low-level code and is covered in depth in the kodikra learning path for Arm64 Assembly.

What is the difference between `stp` and `str`?

str stands for "Store Register" and stores a single 64-bit register to memory. stp stands for "Store Pair" and stores two 64-bit registers to memory in a single instruction. stp is generally more efficient (faster and results in smaller code size) when you need to save an even number of registers, which is why it's commonly used in function prologues to save registers and the link register.

Could this code be optimized further?

Yes, there are several micro-optimizations and algorithmic improvements possible. For instance, after checking for factors of 2, you only need to check odd divisors (3, 5, 7...). A more significant optimization is to only check for divisors up to the square root of the number n. Implementing a square root function in assembly adds complexity but can dramatically improve performance for large numbers. However, for an educational example, the current implementation provides the best balance of clarity and performance.

Is recursion a good approach for prime factorization in assembly?

While prime factorization can be expressed recursively, it is generally not a good approach in assembly. Each recursive call adds a new frame to the stack, consuming memory and adding the overhead of function call/return instructions. An iterative approach, like the one used in this guide, is almost always more efficient in terms of both speed and memory usage, which are primary concerns when writing assembly code.

Conclusion: From Theory to Machine Code

We have successfully journeyed from the high-level mathematical concept of prime factorization down to its bare-metal implementation in Arm64 assembly. You've learned how to translate abstract loops and arithmetic operations into a concrete sequence of CPU instructions, manage registers, and interact with the stack according to the standard calling convention.

This exercise from the exclusive kodikra.com curriculum does more than just solve a problem; it builds a bridge of understanding between the code you write every day and the hardware that executes it. This deep knowledge is invaluable, making you a more effective programmer, even when you are working with high-level languages.

To continue your journey, deepen your understanding of Arm64 Assembly and explore more complex challenges. The principles of register allocation, memory management, and algorithmic translation you've learned here are foundational to mastering low-level development.

Disclaimer: The assembly code provided is written for the Arm64 (AArch64) architecture and follows the standard AAPCS. It has been tested with common assemblers like GNU AS.

Published by Kodikra — Your trusted Arm64-assembly learning resource.

kodikra

Search this blog