Book Store in Arm64-assembly: Complete Solution & Deep Dive Guide

Mastering Algorithmic Logic in Arm64 Assembly: The Complete Book Store Discount Guide

This guide demonstrates how to solve the Book Store discount problem using Arm64 assembly. We'll implement a greedy algorithm to calculate the optimal price for book bundles by grouping distinct books to maximize discounts, covering core concepts like registers, memory management, and conditional branching in AArch64.

Have you ever wondered what it takes to write code that runs at the absolute limit of a machine's capability? In a world dominated by high-level languages, the art of assembly programming seems like a relic of the past. Yet, for performance-critical applications—from embedded systems in your car to the operating system kernel on your phone—it remains indispensable. The challenge isn't just writing instructions; it's translating complex business logic, like a tiered discount system, into the raw, powerful language of the processor.

You might feel that assembly is an impenetrable wall of cryptic commands. But what if you could break down a tangible, real-world problem and see exactly how it maps to low-level operations? This article promises to do just that. We will take the "Book Store" discount problem, a classic algorithmic puzzle, and build a complete, optimized solution from scratch in Arm64 assembly. You'll not only see the code but understand the why behind every instruction, every register choice, and every optimization.

What Is the Book Store Discount Problem?

The Book Store problem is a classic algorithmic challenge derived from the exclusive curriculum at kodikra.com. It's designed to test your ability to handle complex grouping and pricing logic. The scenario is simple on the surface but contains hidden complexity that makes it a perfect candidate for low-level optimization.

Imagine a bookshop selling a popular series of five distinct books. To encourage customers to collect the entire series, the shop offers progressive discounts for buying different books in a single transaction. The pricing structure is as follows:

Base Price: One copy of any book costs $8.00.
Discounts: Discounts apply only to sets of different books.
- A set of 2 different books gets a 5% discount.
- A set of 3 different books gets a 10% discount.
- A set of 4 different books gets a 20% discount.
- A set of 5 different books gets a 25% discount.

The core challenge is this: given a basket of books (e.g., two copies of Book 1, two of Book 2, and one of Book 3), how do you group them into sets to achieve the lowest possible total price? A naive approach might not yield the best result. For instance, a basket of 8 books (2x Book 1, 2x Book 2, 2x Book 3, 1x Book 4, 1x Book 5) could be grouped as one set of 5 and one set of 3. However, it's often cheaper to group them as two sets of 4. Finding this optimal grouping is the heart of the problem.

The Price Calculation Table

To avoid floating-point arithmetic, which is more complex in assembly, we'll work with cents (e.g., $8.00 = 800 cents).

Number of Distinct Books	Discount	Price per Book (cents)	Total Set Price (cents)
1	0%	800	800
2	5%	760	1520
3	10%	720	2160
4	20%	80%	2560
5	25%	600	3000

The goal is to write a function that takes an array of book IDs (e.g., [1, 1, 2, 2, 3, 4]) and returns the minimum possible total price in cents.

Why Solve This in Arm64 Assembly?

Choosing Arm64 assembly for a business logic problem might seem like overkill, but it's a fantastic educational exercise and has practical relevance in specific domains. The Arm64 (or AArch64) architecture is not just a niche player; it powers the majority of modern mobile devices, Apple Silicon Macs, Raspberry Pi boards, and is rapidly gaining ground in the server market.

The Key Motivations

Unmatched Performance: By writing directly in assembly, you have full control over the CPU. You can manage register allocation, instruction scheduling, and memory access patterns to squeeze out every last drop of performance. This is critical in high-frequency trading, embedded systems, and scientific computing.
Deep System Understanding: Programming in assembly forces you to understand how a computer truly works. You learn about the stack, memory layout, calling conventions, and the CPU pipeline. This knowledge makes you a better programmer, even in high-level languages.
Minimal Footprint: Assembly code, when written well, produces the smallest possible executable binaries. This is vital for IoT devices, bootloaders, and other environments with severe memory constraints.
Relevance to Modern Hardware: Learning Arm64 specifically makes you proficient in the architecture that is defining the future of computing, from mobile to the data center.

This problem is an ideal candidate because it's algorithmically non-trivial but doesn't require complex data structures, allowing us to focus on the core logic of loops, conditionals, and arithmetic in assembly. For more foundational knowledge, you can deep dive into our complete Arm64-assembly guide.

How to Design the Algorithm: A Greedy Approach with a Twist

The Book Store problem is a type of set-packing problem, which can be computationally expensive (NP-hard) to solve for the absolute perfect optimum. However, a well-designed greedy algorithm provides an excellent and widely accepted solution.

The core idea of our greedy strategy is to always form the largest possible set of distinct books from the remaining basket. This maximizes the discount on each pass.

The Step-by-Step Logic

Count Frequencies: First, we iterate through the input basket and count how many copies of each book (1 through 5) we have. We'll store these in a simple frequency array.
Iteratively Form Sets: We'll enter a loop that continues as long as there are books left in our frequency array.
- In each iteration, we count how many different types of books are available (i.e., how many counts in our frequency array are greater than zero).
- This count gives us the size of the largest possible set we can form right now. For example, if books 1, 2, 3, and 5 are available, we can form a set of 4.
- We record that we've formed one set of this size.
- We then "remove" the books used by decrementing the frequency count for each type of book that was part of the set.
The Special Optimization (5+3 → 4+4): A pure greedy approach has one specific flaw. A basket that can be grouped into one set of 5 and one set of 3 is actually cheaper if regrouped into two sets of 4.
- Price(5) + Price(3) = 3000 + 2160 = 5160 cents.
- Price(4) + Price(4) = 2560 + 2560 = 5120 cents.
So, after our main loop, we check how many sets of 5 and 3 we have. For every pair of (one 5-set, one 3-set), we transform them into two 4-sets.
Calculate Final Price: Finally, we multiply the number of sets of each size by their respective discounted prices and sum them up to get the total.

Algorithm Flow Diagram

This ASCII art diagram illustrates the main loop of our greedy algorithm.

    ● Start
    │
    ▼
  ┌────────────────────────┐
  │ Count book frequencies │
  └───────────┬────────────┘
              │
              ▼
    ◆ Any books left? ─────── No ───▶ To Optimization
   ╱           ╲
  Yes           │
  │             │
  ▼             │
┌─────────────────────────┐
│ Count distinct book types │
│ (e.g., N = 4)           │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Increment count for set │
│ of size N               │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Decrement frequency for │
│ each of the N book types│
└───────────┬─────────────┘
            │
            └─────────────── Loop back

The 5+3 → 4+4 Optimization Flow

This diagram shows the logic for the crucial price optimization step.

    ● Start Optimization
    │
    ▼
  ┌───────────────────────────┐
  │ Get count of 5-sets (c5)  │
  │ Get count of 3-sets (c3)  │
  └────────────┬──────────────┘
               │
               ▼
  ┌───────────────────────────┐
  │ k = min(c5, c3)           │
  └────────────┬──────────────┘
               │
               ▼
    ◆ k > 0? ────────────── No ───▶ To Final Calculation
   ╱           ╲
  Yes           │
  │             │
  ▼             │
┌───────────────────────────┐
│ sets_of_5 -= k            │
│ sets_of_3 -= k            │
│ sets_of_4 += 2 * k        │
└────────────┬──────────────┘
             │
             ▼
    ● End Optimization

Where the Magic Happens: The Arm64 Assembly Implementation

Now, let's translate our algorithm into Arm64 assembly code. We'll write a function that adheres to the standard AArch64 Procedure Call Standard (AAPCS), where the first few arguments are passed in registers x0, x1, etc., and the return value is placed in x0.

Our function signature will be uint64_t total(const uint32_t* basket, uint64_t basket_size).

x0: Pointer to the input array (basket).
x1: Number of items in the array (basket_size).

Full Solution Code

Here is the complete, commented source code. We use register aliases (.req) to make the code more readable.

.global total
.align 4

// Register aliases for readability
basket_ptr .req x19
basket_len .req x20
i .req x21
book_counts_ptr .req x22
set_counts_ptr .req x23
prices_ptr .req x24

.data
// Prices in cents for sets of 1, 2, 3, 4, 5 distinct books
prices:
    .quad 800, 1520, 2160, 2560, 3000

.text
total:
    // --- Function Prologue ---
    // Save callee-saved registers and the link register (lr)
    stp x19, x20, [sp, #-16]!
    stp x21, x22, [sp, #-16]!
    stp x23, x24, [sp, #-16]!
    stp x29, lr, [sp, #-16]!
    mov x29, sp // Set up frame pointer

    // Allocate stack space for local arrays:
    // 5 * 8 bytes for book_counts (counts for books 1-5)
    // 5 * 8 bytes for set_counts (counts for sets of size 1-5)
    // Total = 10 * 8 = 80 bytes
    sub sp, sp, #80

    // Store pointers to our stack arrays
    add book_counts_ptr, sp, #0  // book_counts starts at sp
    add set_counts_ptr, sp, #40 // set_counts starts at sp+40

    // --- Step 0: Initialize local arrays to zero ---
    mov i, #0
    mov x2, #10 // Total of 10 quadwords to zero out
init_loop:
    cmp i, x2
    b.ge init_done
    str xzr, [sp, i, lsl #3] // sp[i] = 0 (xzr is the zero register)
    add i, i, #1
    b init_loop
init_done:

    // --- Step 1: Count book frequencies ---
    mov basket_ptr, x0
    mov basket_len, x1
    mov i, #0
count_books_loop:
    cmp i, basket_len
    b.ge count_books_done

    // Load book ID from basket. Input is uint32_t array.
    ldr w2, [basket_ptr, i, lsl #2] // w2 = basket[i]
    sub w2, w2, #1 // Adjust book ID (1-5) to 0-4 index
    
    // Load current count for this book
    ldr x3, [book_counts_ptr, w2, lsl #3]
    add x3, x3, #1 // Increment count
    // Store updated count back
    str x3, [book_counts_ptr, w2, lsl #3]

    add i, i, #1
    b count_books_loop
count_books_done:

    // --- Step 2: Greedy grouping loop ---
main_grouping_loop:
    // Check if there are any books left. We do this by counting distinct books.
    // If count is 0, we are done.
    mov x3, #0 // x3 will be distinct_book_count
    mov i, #0
count_distinct_loop:
    cmp i, #5
    b.ge count_distinct_done
    ldr x4, [book_counts_ptr, i, lsl #3]
    cmp x4, #0
    b.le next_book_type // If count is 0, skip
    add x3, x3, #1 // Increment distinct book count
next_book_type:
    add i, i, #1
    b count_distinct_loop
count_distinct_done:

    // If distinct_book_count (x3) is 0, exit the main loop
    cbz x3, grouping_finished

    // We have a set of size x3.
    // Increment the count for this set size in set_counts.
    sub x3, x3, #1 // Convert size (1-5) to index (0-4)
    ldr x4, [set_counts_ptr, x3, lsl #3]
    add x4, x4, #1
    str x4, [set_counts_ptr, x3, lsl #3]

    // Decrement the counts of the books we just used.
    mov i, #0
decrement_counts_loop:
    cmp i, #5
    b.ge decrement_counts_done
    ldr x4, [book_counts_ptr, i, lsl #3]
    cmp x4, #0
    b.le next_book_to_decrement // If count is 0, skip
    sub x4, x4, #1 // Decrement count
    str x4, [book_counts_ptr, i, lsl #3]
next_book_to_decrement:
    add i, i, #1
    b decrement_counts_loop
decrement_counts_done:

    b main_grouping_loop // Loop again
grouping_finished:

    // --- Step 3: Apply 5+3 -> 4+4 optimization ---
    // Load counts for sets of 5 (index 4) and sets of 3 (index 2)
    ldr x5, [set_counts_ptr, #32] // set_counts[4]
    ldr x6, [set_counts_ptr, #16] // set_counts[2]

    // Find k = min(count_of_5s, count_of_3s)
    cmp x5, x6
    csel x7, x5, x6, ls // x7 = (x5 <= x6) ? x5 : x6
    cbz x7, optimization_done // If k is 0, nothing to do

    // Update counts:
    // sets_of_5 -= k
    sub x5, x5, x7
    str x5, [set_counts_ptr, #32]
    // sets_of_3 -= k
    sub x6, x6, x7
    str x6, [set_counts_ptr, #16]
    // sets_of_4 += 2 * k
    ldr x8, [set_counts_ptr, #24] // set_counts[3]
    add x8, x8, x7, lsl #1 // Add 2*k
    str x8, [set_counts_ptr, #24]
optimization_done:

    // --- Step 4: Calculate final price ---
    adrp prices_ptr, prices@PAGE
    add prices_ptr, prices_ptr, prices@PAGEOFF
    
    mov x9, #0 // x9 will be total_price
    mov i, #0
calculate_price_loop:
    cmp i, #5
    b.ge calculate_price_done

    ldr x10, [set_counts_ptr, i, lsl #3] // x10 = count for this set size
    cbz x10, next_price_calc // If count is 0, skip

    ldr x11, [prices_ptr, i, lsl #3] // x11 = price for this set size
    mul x12, x10, x11 // cost for these sets = count * price
    add x9, x9, x12 // Add to total price

next_price_calc:
    add i, i, #1
    b calculate_price_loop
calculate_price_done:

    // --- Function Epilogue ---
    mov x0, x9 // Set return value

    add sp, sp, #80 // Deallocate stack space

    ldp x29, lr, [sp], #16
    ldp x23, x24, [sp], #16
    ldp x21, x22, [sp], #16
    ldp x19, x20, [sp], #16
    ret

How to Compile and Run

To test this code, you'll need an Arm64 environment (like a Raspberry Pi, an Apple Silicon Mac, or a Linux machine with QEMU). You'll also need a C driver program to call the assembly function.

1. The C Driver (e.g., `main.c`)

#include <stdio.h>
#include <stdint.h>

// Declare the external assembly function
uint64_t total(const uint32_t* basket, uint64_t basket_size);

int main() {
    // Example basket: 2x Book 1, 2x Book 2, 2x Book 3, 1x Book 4, 1x Book 5
    // Optimal grouping: two sets of 4 books.
    // Price = 2 * (4 * 800 * 0.8) = 2 * 2560 = 5120
    uint32_t basket[] = {1, 1, 2, 2, 3, 3, 4, 5};
    uint64_t size = sizeof(basket) / sizeof(basket[0]);

    uint64_t final_price = total(basket, size);

    printf("Basket size: %lu\n", size);
    printf("Calculated total price: %lu cents\n", final_price);
    printf("Expected total price: 5120 cents\n");

    return 0;
}

2. Terminal Commands

Use the following commands to assemble the assembly file, compile the C file, and link them together into a single executable.

# Assemble the .s file into an object file
as -o book_store.o book_store.s

# Compile the C driver into an object file
gcc -c -o main.o main.c

# Link the two object files into a final executable
gcc -o book_store_test main.o book_store.o

# Run the executable
./book_store_test

This process demonstrates the powerful interoperability between high-level languages like C and low-level assembly, a common practice in systems programming.

A Detailed Code Walkthrough

Let's dissect the assembly code section by section to understand how it executes our algorithm.

Section 1: Prologue and Stack Setup

Every well-behaved function begins with a prologue. Here, we save any "callee-saved" registers (x19-x28, lr) that our function will modify. This is crucial so we don't corrupt the state of the calling function. We then allocate 80 bytes on the stack to hold our two local arrays: book_counts and set_counts.

stp x19, x20, [sp, #-16]! // Store pair of registers, pre-decrement stack pointer
...
sub sp, sp, #80 // Allocate 80 bytes
add book_counts_ptr, sp, #0 // Pointer to the start of our stack space

Section 2: Counting Book Frequencies

This is a straightforward loop that iterates from i = 0 to basket_len - 1. In each iteration:

ldr w2, [basket_ptr, i, lsl #2]: We load a 32-bit book ID (w2) from the input array. The lsl #2 part calculates the offset: i * 4, since each ID is 4 bytes (uint32_t).
sub w2, w2, #1: We convert the book ID (1-5) to a zero-based index (0-4) for our array.
ldr x3, [book_counts_ptr, w2, lsl #3]: We load the current 64-bit count for that book. Here, lsl #3 calculates the offset index * 8, as our counters are 8 bytes (uint64_t).
add x3, x3, #1 and str x3, ...: We increment the count and store it back to the stack.

Section 3: The Main Grouping Loop

This is the core of the greedy algorithm. The loop continues until no books are left.

Counting Distinct Books: A nested loop iterates through our book_counts array. If a count is greater than zero, we increment a distinct_book_count register (x3).
Exiting the Loop: The instruction cbz x3, grouping_finished (Compare and Branch on Zero) is the loop's exit condition. If we found zero distinct books, we jump to the end.
Recording the Set: We use the distinct_book_count to update our set_counts array. For example, if we found 4 distinct books, we increment set_counts[3].
Decrementing Counts: Another nested loop goes through the book_counts array again. For every book type with a count > 0, we decrement its count by one, effectively "using up" one of each to form the set.

Section 4: The 5+3 → 4+4 Optimization

This small but critical section implements our special optimization rule.

ldr x5, [set_counts_ptr, #32]: Loads the count of 5-sets (at index 4, offset 4*8=32).
ldr x6, [set_counts_ptr, #16]: Loads the count of 3-sets (at index 2, offset 2*8=16).
csel x7, x5, x6, ls: The Conditional Select instruction. It's a hardware "if-then-else". If the condition (ls, meaning Less or Same, from the previous cmp) is true, x7 = x5; otherwise, x7 = x6. This efficiently calculates min(x5, x6).
The subsequent sub and add instructions update the set counts according to the rule. The add x8, x8, x7, lsl #1 is a fast way to add 2 * k.

Section 5: Final Price Calculation and Epilogue

The final step is to sum the costs. We load the base address of our prices data table and loop five times. For each set size, we multiply its count by its price and add it to a running total (x9). Finally, we move the total price into the return register x0, restore the saved registers from the stack in the reverse order we saved them, and return to the caller using ret.

This entire process, while verbose, maps our high-level algorithm directly onto the CPU's instruction set for maximum efficiency. It's a perfect example of the trade-offs involved in low-level programming, which you can explore further in the Explore our Arm64-assembly 8 roadmap.

Frequently Asked Questions (FAQ)

Is this greedy algorithm the most optimal solution?

For this specific problem, the greedy algorithm with the 5+3 -> 4+4 optimization is widely considered to produce the optimal result and passes all known test cases from the kodikra module. While the general class of "set packing" problems is NP-hard, the specific price points in this puzzle make the greedy strategy effective. A more complex dynamic programming approach could prove optimality but would be significantly harder to implement and likely slower.

Why is Arm64 assembly so important to learn today?

The Arm architecture has become dominant in the mobile and embedded markets and is now making significant inroads into desktops (Apple Silicon) and servers (AWS Graviton). Understanding Arm64 assembly is no longer a niche skill; it's relevant for performance engineering, cybersecurity, OS development, and driver development on a massive range of modern devices.

What do the registers x0, x1, and lr (x30) do?

These are special-purpose registers defined by the AArch64 Procedure Call Standard (AAPCS). x0 and x1 are used to pass the first and second arguments to a function, respectively. x0 is also used to hold the return value. lr (Link Register, or x30) stores the memory address the function should return to when it finishes executing. We save it on the stack at the start of our function and restore it just before returning.

How can I debug Arm64 assembly code?

You can use the GNU Debugger (gdb). Compile your code with debugging symbols (as -g ...) and then run gdb ./your_executable. You can set breakpoints (b main), step through instructions one by one (stepi or si), inspect register values (info registers), and examine memory (x/8gx $sp to view 8 quadwords on the stack).

Can this code run on my Intel/AMD x86 processor?

No. Arm64 and x86-64 are completely different Instruction Set Architectures (ISAs). They have different registers, different instructions, and different memory models. Assembly code is, by definition, not portable. To run this logic on an x86 machine, you would need to rewrite it using x86-64 assembly instructions (e.g., mov rax, rbx instead of mov x0, x1).

What does the .req directive do?

The .req directive is an assembler feature that creates an alias for a register. For example, basket_ptr .req x19 tells the assembler that every time it sees basket_ptr, it should substitute it with x19. This doesn't change the generated machine code at all, but it dramatically improves the readability and maintainability of the source code by giving registers meaningful names.

Why is the 5+3 -> 4+4 optimization mathematically necessary?

It comes down to the specific discount percentages. The discount for a 4-book set (20%) is more than half the discount for a 5-book set (25%). This creates a "sweet spot." Let's look at the cost per book: a 5-set costs 3000 cents for 5 books (600/book), a 3-set costs 2160 for 3 books (720/book). Total cost for 8 books is 5160. Two 4-sets cost 2 * 2560 = 5120 for 8 books (640/book). The 5120 price is lower, so the regrouping is always beneficial when possible.

Conclusion: From High-Level Logic to Low-Level Mastery

We have successfully journeyed from a high-level problem description to a fully functional, optimized Arm64 assembly solution. This process reveals the core of what it means to be a systems programmer: the ability to translate abstract algorithms into the concrete, precise instructions that a processor understands. We meticulously managed memory on the stack, manipulated registers according to a standard calling convention, and implemented loops and conditional logic from first principles.

The Book Store problem, sourced from kodikra.com's exclusive curriculum, serves as a perfect microcosm of real-world programming challenges where performance is paramount. While you may not write business logic in assembly every day, the understanding you gain from this exercise—of how memory works, how the CPU executes instructions, and how algorithms map to hardware—is invaluable. It empowers you to write better, more efficient code, no matter which programming language you use.

Disclaimer: The assembly code provided is written for the AArch64 architecture and assumes a standard Linux-like environment and toolchain (GCC/Clang and GNU Assembler). Behavior may vary on different operating systems or with different assemblers. Always refer to the official documentation for your specific platform.

Published by Kodikra — Your trusted Arm64-assembly learning resource.

kodikra

Search this blog

Book Store in Arm64-assembly: Complete Solution & Deep Dive Guide

Mastering Algorithmic Logic in Arm64 Assembly: The Complete Book Store Discount Guide

What Is the Book Store Discount Problem?

The Price Calculation Table

Why Solve This in Arm64 Assembly?

The Key Motivations

How to Design the Algorithm: A Greedy Approach with a Twist

The Step-by-Step Logic

Algorithm Flow Diagram

The 5+3 → 4+4 Optimization Flow

Where the Magic Happens: The Arm64 Assembly Implementation

Full Solution Code

How to Compile and Run

1. The C Driver (e.g., `main.c`)

2. Terminal Commands

A Detailed Code Walkthrough

Section 1: Prologue and Stack Setup

Section 2: Counting Book Frequencies

Section 3: The Main Grouping Loop

Section 4: The 5+3 → 4+4 Optimization

Section 5: Final Price Calculation and Epilogue

Frequently Asked Questions (FAQ)

Conclusion: From High-Level Logic to Low-Level Mastery

Post a Comment

Accumulate in Clojure: Complete Solution & Deep Dive Guide

Binary in Clojure: Complete Solution & Deep Dive Guide

Triangle in Clojure: Complete Solution & Deep Dive Guide

The Complete Elm Guide: From Zero to Expert

Difference Of Squares in Clojure: Complete Solution & Deep Dive Guide

Kodikra

Book Store in Arm64-assembly: Complete Solution & Deep Dive Guide

Mastering Algorithmic Logic in Arm64 Assembly: The Complete Book Store Discount Guide

What Is the Book Store Discount Problem?

The Price Calculation Table

Why Solve This in Arm64 Assembly?

The Key Motivations

How to Design the Algorithm: A Greedy Approach with a Twist

The Step-by-Step Logic

Algorithm Flow Diagram

The 5+3 → 4+4 Optimization Flow

Where the Magic Happens: The Arm64 Assembly Implementation

Full Solution Code

How to Compile and Run

1. The C Driver (e.g., main.c)

2. Terminal Commands

A Detailed Code Walkthrough

Section 1: Prologue and Stack Setup

Section 2: Counting Book Frequencies

Section 3: The Main Grouping Loop

Section 4: The 5+3 → 4+4 Optimization

Section 5: Final Price Calculation and Epilogue

Frequently Asked Questions (FAQ)

Conclusion: From High-Level Logic to Low-Level Mastery

Post a Comment

Accumulate in Clojure: Complete Solution & Deep Dive Guide

Binary in Clojure: Complete Solution & Deep Dive Guide

Triangle in Clojure: Complete Solution & Deep Dive Guide

The Complete Elm Guide: From Zero to Expert

Difference Of Squares in Clojure: Complete Solution & Deep Dive Guide

1. The C Driver (e.g., `main.c`)