Phone Number in Arm64-assembly: Complete Solution & Deep Dive Guide
From Chaos to Clean: Mastering Phone Number Parsing in Arm64 Assembly
A comprehensive guide to sanitizing and validating North American Numbering Plan (NANP) phone numbers using low-level Arm64 assembly. This tutorial breaks down string manipulation, character filtering, and rule-based validation, providing a complete solution from zero to hero for handling complex user input at the hardware level.
You've just been hired at a top-tier communications company, and your first task is a baptism by fire. The user database is a mess of phone numbers in every conceivable format: `(123) 456-7890`, `123.456.7890`, `+1 123 456 7890`, and countless other variations. Your mission, should you choose to accept it, is to build a hyper-efficient function that can tame this chaos, converting valid numbers into a clean, standardized format and rejecting the invalid ones. But there's a catch: for maximum performance, you must do it in Arm64 assembly. This guide will walk you through that exact process, turning you into a low-level data sanitation expert.
What is Phone Number Validation and Why is it Critical?
Phone number validation is the process of verifying that a string of characters represents a plausible, dialable phone number that adheres to a specific numbering plan. For this task, we focus on the North American Numbering Plan (NANP), the system governing phone numbers in the United States, Canada, and several other territories. A clean, validated number is essential for system reliability, ensuring that SMS messages are sent, calls are connected, and databases remain consistent.
The Rules of the NANP Game
To write our validation logic, we first need to understand the rules. A number is considered valid under NANP if it meets these criteria:
- Length: The number must contain either 10 digits or 11 digits.
- Country Code (11-digit numbers): If the number has 11 digits, the first digit must be `1`. Any other leading digit makes it invalid.
- Area Code (Digits 1-3): The first digit of the 10-digit number (or the second digit of an 11-digit number) cannot be `0` or `1`. This is often referred to as the `NXX` rule, where `N` is any digit from 2-9.
- Exchange Code (Digits 4-6): Similarly, the first digit of the exchange code (the 4th digit of the 10-digit number) also cannot be `0` or `1`.
- Ignored Characters: Punctuation like parentheses `()`, hyphens `-`, dots `.`, plus signs `+`, and spaces should be completely ignored during validation.
Our assembly function must systematically strip these ignored characters and then apply the validation rules to the remaining sequence of digits.
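Before dropping down to assembly, it can help to pin the rules above down in a few lines of C. The following is a minimal sketch, not part of the kodikra solution: the name `nanp_digits_valid` is hypothetical, and the function assumes its input has already been stripped down to digits only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: checks the NANP rules on a string that is assumed
 * to contain digits only (punctuation already stripped). */
bool nanp_digits_valid(const char *digits)
{
    size_t len = strlen(digits);

    if (len == 11) {
        if (digits[0] != '1')
            return false;   /* 11 digits require the country code 1 */
        digits++;           /* examine the remaining 10 digits */
    } else if (len != 10) {
        return false;       /* only 10 or 11 digits are accepted */
    }

    /* Area code and exchange code must both start with a digit 2-9.
     * Because the input is digits only, ">= '2'" is sufficient. */
    return digits[0] >= '2' && digits[3] >= '2';
}
```

For example, `nanp_digits_valid("12234567890")` accepts the number, while `nanp_digits_valid("1234567890")` rejects it because the area code starts with `1`.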
Why Use Arm64 Assembly for String Manipulation?
In a world dominated by high-level languages like Python or JavaScript, reaching for assembly might seem like overkill. For certain applications, however, it can be a reasonable choice. Arm64 (also called AArch64, the 64-bit execution state introduced with the ARMv8-A architecture) powers the vast majority of modern smartphones and tablets and, increasingly, servers and laptops. Writing assembly for it gives you direct control over registers, memory access, and instruction selection.
Pros and Cons of Using Assembly for this Task
| Aspect | Pros (Using Arm64 Assembly) | Cons (Risks & Considerations) |
|---|---|---|
| Performance | Direct control over registers and memory leads to minimal overhead and maximum speed. No runtime or garbage collector involved. | Significantly more complex to write and debug. A poorly written assembly function can be slower than a compiler-optimized high-level one. |
| Memory Usage | Extremely low memory footprint. You allocate and manage only what is absolutely necessary. | Manual memory management is error-prone. Buffer overflows and memory leaks are real risks if not handled carefully. |
| Portability | Full access to Arm64-specific features such as post-indexed addressing and NEON. | Code is not portable to other architectures like x86-64 without a complete rewrite. |
| Learning Value | Provides a deep understanding of how computers process data at a fundamental level, from memory access to CPU branching. | Steep learning curve. Requires knowledge of the A64 instruction set, registers, and calling conventions. |
Within kodikra.com's Arm64-assembly path, this module is designed to build foundational skills in string processing, conditional logic, and memory management, skills that transfer to any low-level programming context.
How It Works: A Line-by-Line Code Walkthrough
Our approach involves an in-place modification of the input string. We'll use two pointers: a "read" pointer that scans the original string, and a "write" pointer that places cleaned digits at the beginning of that same string. This is a highly efficient technique that avoids allocating new memory.
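The two-pointer idea can be sketched in C as follows. This is an illustration under assumptions rather than a translation of the kodikra code: the helper name `keep_digits_in_place` is hypothetical, and unlike the assembly it silently drops every non-digit instead of rejecting letters.

```c
#include <stddef.h>

/* Hypothetical helper: compacts the digits of `s` to the front of the
 * same buffer and returns how many digits were kept. */
size_t keep_digits_in_place(char *s)
{
    const char *read = s;   /* scans every input byte */
    char *write = s;        /* next free slot for a kept digit */
    char c;

    while ((c = *read++) != '\0') {
        if (c >= '0' && c <= '9')
            *write++ = c;   /* write never overtakes read, so nothing unread is lost */
    }
    *write = '\0';          /* terminate the compacted string */
    return (size_t)(write - s);
}
```

The property that makes this safe is that the write pointer can only advance when the read pointer has already advanced, so the output never overwrites input that has not been examined yet.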
Let's dissect the complete solution provided in the kodikra module.
The Assembly Code Solution
.text
.globl clean
/* extern void clean(char *str); */
/*
* Register Allocation:
* x0: Input string pointer (char *str). Also used for final length calculation.
* x1: Write pointer. Points to the next location for a clean digit.
* x2: Read pointer. Scans the original string.
* w3: Holds the current byte (character) being read.
* x4: Holds the length of the cleaned digit string (also reused as a loop counter).
* w5: Temporary register for character validation.
*/
clean:
mov x1, x0 // Initialize write pointer (x1) to start of string (x0)
mov x2, x0 // Initialize read pointer (x2) to start of string (x0)
.read:
ldrb w3, [x2], #1 // Load byte from [read_ptr], then increment read_ptr
cbz w3, .validate // If byte is null terminator (0), jump to validation
// Filter out non-digit punctuation and spaces
cmp w3, #' '
beq .read
cmp w3, #'('
beq .read
cmp w3, #')'
beq .read
cmp w3, #'+'
beq .read
cmp w3, #'-'
beq .read
cmp w3, #'.'
beq .read
// Check if the character is a digit ('0' through '9')
cmp w3, #'0'
blt .invalid_char // If less than '0', it's an invalid character
cmp w3, #'9'
bgt .invalid_char // If greater than '9', it's an invalid character
// If it's a digit, write it to the clean section of the string
strb w3, [x1], #1 // Store byte at [write_ptr], then increment write_ptr
b .read
.invalid_char:
// Letters or unsupported punctuation invalidate the entire number
mov x6, x0 // x0 is still the original start here; .set_invalid expects it in x6
b .set_invalid
.validate:
mov x6, x0 // Save the original start in x6; x0 may be advanced below
// Calculate the length of the cleaned digit string
sub x4, x1, x0 // length = write_ptr - start_ptr
// Rule: Must be 10 or 11 digits
cmp x4, #10
beq .check_10_digits
cmp x4, #11
beq .check_11_digits
b .set_invalid // If not 10 or 11, it's invalid
.check_11_digits:
ldrb w5, [x0] // Load the first digit
cmp w5, #'1' // Rule: If 11 digits, first must be '1'
bne .set_invalid // If not '1', invalid
add x0, x0, #1 // It is '1', so effectively skip it for next checks
// Fall through to check the remaining 10 digits
.check_10_digits:
// Rule: Area code (1st digit) cannot be '0' or '1'
ldrb w5, [x0]
cmp w5, #'1'
ble .set_invalid // If <= '1', invalid
// Rule: Exchange code (4th digit) cannot be '0' or '1'
ldrb w5, [x0, #3]
cmp w5, #'1'
ble .set_invalid // If <= '1', invalid
// If all checks pass, it's a valid number
b .valid
.set_invalid:
// Overwrite the string with ten '0's to signify an invalid number.
// This writes 11 bytes in total, so the caller's buffer must be at least that large.
mov x0, x6 // Restore the original start (x0 may have skipped a leading '1')
mov x1, x0 // Reset write pointer to the start
mov w3, #'0'
mov x4, #10 // Loop 10 times
.invalid_loop:
strb w3, [x1], #1
subs x4, x4, #1
bne .invalid_loop
b .final_terminate
.valid:
// For an 11-digit number, x0 skipped the leading '1', so shift the 10 digits down by one byte.
// For a 10-digit number, x0 equals x6 and each byte is simply copied onto itself.
// A manual forward loop is safe because the destination never runs ahead of the source,
// and it keeps the function a leaf: no call is made, so the return address in x30 stays intact.
mov x2, x0 // Read from the (potentially advanced) start
mov x1, x6 // Write to the original start
mov x4, #10 // Copy exactly 10 digits
.copy_loop:
ldrb w3, [x2], #1
strb w3, [x1], #1
subs x4, x4, #1
bne .copy_loop
mov x0, x6 // Point x0 back at the original start for termination
// Fall through to terminate
.final_terminate:
// Null-terminate the final string
strb wzr, [x0, #10] // Write a null byte at the 11th position
ret
Phase 1: Initialization and the Filtering Loop
The function begins by setting up two pointers in registers x1 (write pointer) and x2 (read pointer). Both are initialized to the start of the string, which is passed in register x0 according to the ARM64 calling convention.
The core logic resides in the .read loop. Let's visualize its flow:
          ● Start Loop (.read)
          │
          ▼
┌───────────────────┐
│ ldrb w3, [x2], #1 │  (Read char, advance read ptr)
└─────────┬─────────┘
          │
          ▼
   ◆ Is char \0 ? ◆
     ╱          ╲
   Yes           No
    │             │
    ▼             ▼
.validate    ◆ Is char punctuation? ◆
(End Loop)     ╱  (e.g., '(', '-', ' ')  ╲
             Yes                          No
              │                            │
              ▼                            ▼
      .read (Loop back)          ◆ Is char a digit? ◆
                                   ╱   ('0'-'9')   ╲
                                 Yes                No
                                  │                  │
                                  ▼                  ▼
                        ┌───────────────────┐  .invalid_char
                        │ strb w3, [x1], #1 │  (Error out)
                        └─────────┬─────────┘
                                  │  (Write char, advance write ptr)
                                  ▼
                          .read (Loop back)
The ldrb w3, [x2], #1 instruction is the engine of this loop. It loads a single byte from the memory address in x2 into w3 (the lower 32 bits of x3), zero-extending it, and then post-increments x2 by 1. This reads the string one character at a time without a separate add instruction.
If the character is the null terminator (checked with cbz w3, .validate), the filtering is done, and we jump to the validation phase. Otherwise, a series of cmp/beq pairs checks for and skips over any allowed punctuation. If a character is neither punctuation nor a digit (like a letter), we branch to .invalid_char to fail fast.
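The three-way decision made for each byte can be written out in C as a small classifier. This is a sketch with hypothetical names (`char_class`, `classify`); it takes an `unsigned char` to mirror the zero-extension performed by ldrb, and it assumes the caller has already tested for the null terminator, as the assembly does with cbz.

```c
/* Hypothetical names for the three outcomes of the filter. */
enum char_class { CH_SKIP, CH_DIGIT, CH_INVALID };

/* Mirrors the cmp/beq chain. The null terminator is assumed to be
 * handled by the caller before this function is reached. */
enum char_class classify(unsigned char c)
{
    switch (c) {
    case ' ': case '(': case ')': case '+': case '-': case '.':
        return CH_SKIP;     /* allowed punctuation is dropped */
    default:
        break;
    }
    if (c >= '0' && c <= '9')
        return CH_DIGIT;    /* kept and written to the output */
    return CH_INVALID;      /* letters, other symbols, and bytes >= 0x80 */
}
```

Treating the byte as unsigned matters: with a signed `char`, bytes of 0x80 and above would compare as negative and fall below `'0'` rather than above `'9'`, although both paths end in rejection here.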
Phase 2: The Validation Gauntlet
Once all digits are collected at the start of the string, the .validate section begins. The first step is to calculate the number of digits we found: sub x4, x1, x0. This subtracts the start address (x0) from the final write pointer address (x1) to get the length.
The validation logic follows a strict top-down flow, checking each NANP rule in sequence.
             ● Start Validation (.validate)
             │
             ▼
    ┌──────────────────┐
    │ length = x1 - x0 │
    └────────┬─────────┘
             │
             ▼
     ◆ length == 11? ◆
        ╱             ╲
      Yes              No
        │              │
        ▼              ▼
┌───────────────┐  ◆ length == 10? ◆
│ ldrb w5, [x0] │     ╱           ╲
└───────┬───────┘   Yes            No
        │             │            │
        ▼             │            ▼
 ◆ w5 == '1'? ◆       │      .set_invalid
    ╱      ╲          │
  Yes       No        │
   │        │         │
   │        ▼         │
   │   .set_invalid   │
   ▼                  │
add x0, x0, #1        │
(Skip '1')            │
   │                  │
   └─────────┬────────┘
             │
             ▼
      .check_10_digits
             │
             ▼
    ┌──────────────────┐
    │ ldrb w5, [x0]    │  (Check Area Code)
    └────────┬─────────┘
             │
             ▼
       ◆ w5 > '1'? ◆
         ╱           ╲
       Yes             No
         │              │
         ▼              ▼
┌──────────────────┐  .set_invalid
│ ldrb w5, [x0,#3] │  (Check Exchange Code)
└────────┬─────────┘
         │
         ▼
   ◆ w5 > '1'? ◆
     ╱           ╲
   Yes             No
     │              │
     ▼              ▼
  .valid       .set_invalid
If the length is 11, we check if the first digit is '1'. If it is, we cleverly advance our "effective" start pointer (x0) by one, so the subsequent checks for the area and exchange codes work on the remaining 10 digits seamlessly. If any rule fails, the code branches to .set_invalid.
Phase 3: The Final Verdict
If the number passes all checks, we branch to .valid. Here, we ensure the final output is the clean, 10-digit number by copying the 10 digits from our (potentially advanced) start pointer back to the original start of the string. This handles the case where an 11-digit number was given. Finally, we must null-terminate the string. strb wzr, [x0, #10] writes a zero byte (wzr is the zero register) at the 11th position (index 10), marking the end of our 10-digit string.
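If the number is invalid, the .set_invalid block is executed. It overwrites the start of the buffer with ten zeros ("0000000000") and null-terminates it, providing a clear, standardized indicator of an invalid number. Because this writes 11 bytes regardless of how short the input was, the caller must pass a buffer of at least 11 bytes.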
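The two outcomes of this final phase can be modeled in C as below. It is a sketch under assumptions: `finalize_number` is a hypothetical name, `buf` is assumed to hold `len` cleaned digits with room for at least 11 bytes, and `valid` stands for the result of the rule checks from Phase 2.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: turns the cleaned digits in `buf` into the final
 * 10-character result. `buf` must have room for at least 11 bytes. */
void finalize_number(char *buf, size_t len, int valid)
{
    if (!valid) {
        memset(buf, '0', 10);       /* ten ASCII zeros mark an invalid number */
    } else if (len == 11) {
        memmove(buf, buf + 1, 10);  /* drop the leading '1'; the regions overlap */
    }
    buf[10] = '\0';                 /* the result is always exactly 10 characters */
}
```

`memmove` is used rather than `memcpy` because the source and destination overlap when the leading `1` is dropped, and `memcpy` has undefined behavior for overlapping regions.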
How to Compile and Test the Code
To test this assembly function, you can't run it in isolation. It needs to be called from a higher-level program, like C. This demonstrates the real-world use of assembly: writing performance-critical functions that are linked into a larger application.
Step 1: Create a C Wrapper (e.g., main.c)
This C program will define some test strings, call our assembly function, and print the results.
#include <stdio.h>
#include <string.h>
// Declare the external assembly function
extern void clean(char *str);
void test_number(const char* description, char* number) {
printf("Original: %s (%s)\n", number, description);
clean(number);
printf("Cleaned: %s\n\n", number);
}
int main() {
// We need mutable strings, so we use char arrays
char num1[] = "(223) 456-7890";
test_number("Valid 10-digit", num1);
char num2[] = "223.456.7890";
test_number("Valid with dots", num2);
char num3[] = "1 (223) 456-7890";
test_number("Valid 11-digit with country code", num3);
char num4[] = "223-456-7890-123";
test_number("Invalid length (too long)", num4);
char num5[] = "123-456-7890";
test_number("Invalid area code (starts with 1)", num5);
char num6[] = "223-056-7890";
test_number("Invalid exchange code (starts with 0)", num6);
char num7[] = "223-ABC-7890";
test_number("Invalid with letters", num7);
return 0;
}
Step 2: Assemble and Link
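Save the assembly code as `phone_number.s`, then open your terminal and run the following commands. They assume a Linux environment with the GNU toolchain running natively on an Arm64 machine. On an x86-64 host you would need a cross toolchain (for example, the `aarch64-linux-gnu-` prefixed `as` and `gcc`) plus an emulator or an Arm64 device to run the result.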
# Assemble the Arm64 code into an object file
as -o phone_number.o phone_number.s
# Compile the C wrapper into an object file
gcc -c -o main.o main.c
# Link the two object files together to create the final executable
gcc -o phone_number_cleaner phone_number.o main.o
# Run the executable
./phone_number_cleaner
This process demonstrates the powerful modularity of software development, where you can mix and match languages to use the best tool for each part of the job. The complete process is a core concept taught in the Arm64 Assembly learning roadmap on kodikra.com.
Frequently Asked Questions (FAQ)
- Why is `ldrb` used instead of `ldr`?
- `ldrb` stands for "Load Register Byte". It loads a single byte (8 bits) from memory, which is perfect for processing ASCII characters in a C-style string. The `ldr` instruction would load a full word (4 bytes) or double word (8 bytes), depending on whether the destination is a `w` or an `x` register, which is not what we want when iterating character by character.
- What is the purpose of the `cbz` instruction?
- `cbz` means "Compare and Branch on Zero". It compares a register's value to zero and branches to a label if it is zero. In our code, `cbz w3, .validate` checks if the loaded character is the null terminator (`\0`), which has an ASCII value of 0. It does the work of a separate `cmp w3, #0` followed by a `beq .validate` in a single instruction.
- How does this code handle Unicode or UTF-8 characters?
- It doesn't. This implementation is designed specifically for ASCII strings, where one character is one byte. Handling multi-byte UTF-8 characters would require significantly more complex logic to identify character boundaries and would likely be much slower. For this specific problem (NANP phone numbers), assuming ASCII is a safe and efficient constraint.
- What does the `[x2], #1` syntax mean in the `ldrb` instruction?
- This is called "post-indexed addressing mode". It tells the CPU to first use the address currently in register `x2` to load the data, and after the load is complete, update the value in `x2` by adding `1` to it. It's a single instruction that accomplishes both reading and advancing the pointer.
- Could SIMD instructions (NEON) be used to optimize this?
- Yes, for very large batches of numbers, SIMD (Single Instruction, Multiple Data) could provide a substantial speedup. You could load 16 bytes at once into a NEON register and use vector instructions to compare all 16 bytes against multiple punctuation characters simultaneously. However, this adds significant complexity and is often only worthwhile when processing very large amounts of data.
- What is the `wzr` register used for?
- `wzr` (or `xzr` for the 64-bit version) is the "Zero Register". It is a hardwired register that always reads as zero and discards any writes to it. Using `strb wzr, ...` writes a null byte (`\0`) to memory to terminate a string without first loading zero into a general-purpose register.
Conclusion: From Low-Level Logic to High-Level Value
You have successfully journeyed from a chaotic mess of user-provided strings to clean, validated, and standardized phone numbers, all using the raw power of Arm64 assembly. This exercise from the kodikra.com curriculum is more than just a string manipulation task; it's a deep dive into the fundamentals of computing. You've mastered memory pointers, conditional branching, ASCII character encoding, and the Application Binary Interface (ABI) that allows assembly and C to work together.
While you may not write data sanitation routines in assembly every day, the understanding you've gained is invaluable. It provides a solid foundation for debugging, performance optimization, and understanding what's truly happening under the hood of the high-level languages you use. This knowledge is what separates a good programmer from a great one.
Disclaimer: The assembly code provided is written for the A64 instruction set (ARMv8-A) and assumes a standard Linux-like environment and calling convention. It may require modification for other operating systems or ABIs.
Ready to tackle the next challenge? Explore the full Arm64 Assembly learning path or dive deeper into other low-level languages on the kodikra.com platform.
Published by Kodikra — Your trusted Arm64-assembly learning resource.