Run Length Encoding in C: Complete Solution & Deep Dive Guide

a close up of a computer screen with code on it

The Ultimate Guide to Run-Length Encoding in C: From Zero to Hero

Run-Length Encoding (RLE) is a foundational lossless data compression algorithm in C that reduces file size by replacing consecutive sequences of identical characters with a count and a single character. This technique is ideal for data with high redundancy, such as simple images or text files.

Ever stared at a massive text file or a slowly loading image and wondered how it could be made smaller without losing a single pixel or character? You've stumbled upon the core problem of data compression, a field dedicated to shrinking data to save space and speed up transmission. The frustration of dealing with bulky data is a common pain point for developers and users alike.

This guide promises to be your definitive resource for mastering one of the most intuitive compression algorithms: Run-Length Encoding (RLE). We will demystify RLE and provide you with the practical C programming skills to implement it from scratch. By the end, you'll not only understand the theory but will have built your own fully functional encoder and decoder, giving you a powerful tool for your software engineering arsenal.


What is Run-Length Encoding? The Core Concept

Run-Length Encoding, or RLE, is a form of lossless data compression. The term "lossless" is crucial; it means that after compressing and then decompressing the data, you get back the exact original data with zero information lost. This is in contrast to "lossy" compression (like JPEG for images or MP3 for audio), where some non-essential information is discarded to achieve much higher compression ratios.

The algorithm's logic is beautifully simple: it finds "runs" of consecutive, identical data elements (like characters in a string) and replaces them with a single instance of the data element and a count of its occurrences. It's like saying "twelve W's" instead of "W, W, W, W, W, W, W, W, W, W, W, W."

Consider the classic example:

Original String: "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB" (53 characters)
Encoded String:  "12WB12W3B24WB" (13 characters)

In this case, the 53-character string is compressed down to just 13 characters, a significant saving. The encoded string is read as: "twelve W's, one B, twelve W's, three B's, twenty-four W's, one B." The original data can be perfectly reconstructed from this compressed form.

The Encoding Process Visualized

Here is a simplified visual flow of how the encoding algorithm processes an input string. It scans for sequences, counts them, and appends the result to a new string.

    ● Start with Input String ("AAABBC")
    │
    ▼
  ┌──────────────────┐
  │ Initialize empty │
  │ result string    │
  └────────┬─────────┘
           │
           ▼
    Scan string from left (char 'A')
           │
           ├─▶ Found a run of 'A'
           │   Count = 3
           ▼
  ┌──────────────────┐
  │ Append "3A" to   │
  │ result           │
  └────────┬─────────┘
           │
           ▼
    Continue scan (char 'B')
           │
           ├─▶ Found a run of 'B'
           │   Count = 2
           ▼
  ┌──────────────────┐
  │ Append "2B" to   │
  │ result           │
  └────────┬─────────┘
           │
           ▼
    Continue scan (char 'C')
           │
           ├─▶ Found a run of 'C'
           │   Count = 1
           ▼
  ┌──────────────────┐
  │ Append "C" to    │
  │ result (skip '1')│
  └────────┬─────────┘
           │
           ▼
    ● End of String. Final Result: "3A2BC"

Why Use RLE (and When to Avoid It)?

While modern compression algorithms like Lempel-Ziv (used in ZIP) or Huffman Coding are more powerful for general-purpose use, RLE holds its own due to its sheer simplicity and speed. It shines in specific scenarios but can be counterproductive in others.

Implementing RLE in a low-level language like C is a fantastic exercise. It forces you to engage directly with fundamental concepts like pointers, dynamic memory allocation with malloc() and free(), and manual string manipulation. C provides the raw power and control needed to build an efficient, memory-aware implementation without the overhead of higher-level abstractions.

Pros and Cons of Run-Length Encoding

Understanding the trade-offs is key to being an effective engineer. Not every tool is right for every job. Here’s a breakdown of where RLE excels and where it falls short.

Pros (Advantages) Cons (Disadvantages)
Computationally Simple: The algorithm is extremely fast and requires very little processing power, making it suitable for real-time systems or devices with limited resources. Inefficient with Non-Repetitive Data: If the data has few or no runs, RLE can actually increase the file size. The string "ABCDE" would become "1A1B1C1D1E", doubling its length.
Easy to Implement: The logic for both encoding and decoding is straightforward, making it a great introductory algorithm for data compression. Limited Compression Ratio: Compared to more complex algorithms like Huffman or LZW, RLE typically achieves a much lower compression ratio on diverse data.
Lossless: Guarantees perfect reconstruction of the original data, which is critical for text, executable code, or scientific data. Sensitive to Data Order: Shuffling the order of data can completely destroy the runs and negate any compression benefits.
Excellent for Specific Data Types: Highly effective for simple graphical formats (like icons, early BMPs), fax transmissions, and data with long, monotonous sequences. No Contextual Understanding: RLE only looks at adjacent identical characters. It cannot identify larger patterns or statistical redundancies in the data.

How to Implement RLE Encoding in C

Now, let's dive into the practical implementation. We'll build a C function, run_length_encode, that takes a standard C string (const char*) as input and returns a dynamically allocated string (char*) containing the encoded data.

The core logic involves iterating through the input string while keeping track of the current character and its consecutive count. When the character changes or the end of the string is reached, we append the count and the character to our result buffer.

The C Code for Encoding

Here is a complete, well-commented C function for RLE encoding. Pay close attention to the memory management and the logic for converting integer counts to string representations.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Function to perform Run-Length Encoding
char *run_length_encode(const char *text) {
    if (text == NULL || *text == '\0') {
        // Handle null or empty input string by returning an empty, allocated string.
        char *empty_result = malloc(1);
        if (empty_result) {
            empty_result[0] = '\0';
        }
        return empty_result;
    }

    size_t text_len = strlen(text);
    // Allocate memory for the worst-case scenario (e.g., "ABCDE" -> "1A1B1C1D1E").
    // Each char could become a number + char. We'll use a generous buffer.
    // A more precise calculation would be harder, so we oversize and reallocate later if needed.
    // For a 32-bit system, a count can be up to 10 digits. Let's assume max 11 chars (10 digits + char) per run.
    size_t result_capacity = text_len * 2 + 1; // A safe starting point
    char *result = malloc(result_capacity);
    if (!result) {
        return NULL; // Memory allocation failed
    }
    result[0] = '\0'; // Start with an empty string

    size_t result_len = 0;
    
    for (size_t i = 0; i < text_len; ) {
        char current_char = text[i];
        size_t count = 1;
        size_t j = i + 1;
        
        // Count consecutive occurrences of the current character
        while (j < text_len && text[j] == current_char) {
            count++;
            j++;
        }

        // Buffer to hold the string representation of the count
        // 21 bytes is enough for a 64-bit number + null terminator
        char count_str[21]; 
        
        if (count > 1) {
            // Convert count to string
            int written = snprintf(count_str, sizeof(count_str), "%zu", count);
            
            // Check if there's enough space in the result buffer
            if (result_len + written + 1 + 1 > result_capacity) {
                result_capacity = (result_len + written + 2) * 2;
                char* new_result = realloc(result, result_capacity);
                if (!new_result) {
                    free(result);
                    return NULL; // Reallocation failed
                }
                result = new_result;
            }
            
            // Append the count and the character
            strcat(result, count_str);
            result_len += written;
        }

        // Append the character itself
        // Check for space one last time
        if (result_len + 1 + 1 > result_capacity) {
             result_capacity = (result_len + 2) * 2;
             char* new_result = realloc(result, result_capacity);
             if (!new_result) {
                 free(result);
                 return NULL;
             }
             result = new_result;
        }

        result[result_len] = current_char;
        result[result_len + 1] = '\0';
        result_len++;

        // Move the main loop index to the next new character
        i = j;
    }

    return result;
}

Code Walkthrough: Encoding Logic Explained

  1. Function Signature: The function char *run_length_encode(const char *text) takes a read-only string text and is expected to return a new string on the heap. The caller is responsible for calling free() on the returned pointer.
  2. Edge Case Handling: We first check for a NULL or empty input string. If so, we return a dynamically allocated empty string to maintain a consistent API.
  3. Memory Allocation: This is the trickiest part in C. We don't know the final size of the encoded string beforehand. We make an educated guess for the initial size (text_len * 2 + 1). This covers the worst-case scenario where every character is unique (e.g., "ABC" -> "A2B..."), but we still need to be prepared for multi-digit counts. We use malloc to allocate this initial buffer. A more robust solution would be to `realloc` as needed, which is included in this implementation.
  4. Outer Loop: The main for loop iterates through the input string. The index i is not incremented in the loop header; instead, we manually advance it by setting i = j at the end of each run.
  5. Inner Loop (Counting Runs): Inside the loop, an inner while loop starts from the next character (j = i + 1) and counts how many times current_char repeats consecutively.
  6. Appending the Count:
    • If the count is greater than 1, we need to append it to the result.
    • We use snprintf to safely convert the size_t count into a string (count_str). snprintf is preferred over sprintf as it prevents buffer overflows.
    • We then use strcat to append this number string to our result buffer.
  7. Appending the Character: After appending the count (if any), we append the actual character. We do this by direct array access result[result_len] = current_char; and manually null-terminating the string. This is more efficient than using strcat for a single character.
  8. Advancing the Index: We set i = j. This crucial step jumps the main loop's index past the run we just processed, ensuring we start the next iteration at the beginning of a new run.
  9. Returning the Result: Finally, the function returns the pointer to the newly created, null-terminated encoded string.

How to Implement RLE Decoding in C

Decoding is the reverse process. The function will take the compressed string and expand it back to its original form. The logic involves parsing numbers and then appending the subsequent character that many times.

For example, when the decoder sees "12W", it needs to: 1. Parse the number "12". 2. Read the character "W". 3. Append "W" to the result string twelve times.

The Decoding Process Visualized

This flow diagram illustrates how the decoder parses the compressed string, separating numbers from characters to reconstruct the original data.

    ● Start with Encoded String ("3A2BC")
    │
    ▼
  ┌──────────────────┐
  │ Initialize empty │
  │ decoded string   │
  └────────┬─────────┘
           │
           ▼
    Scan string from left...
           │
           ├─▶ Found a number: '3'
           │   Parse full number: 3
           │   Read next char: 'A'
           ▼
  ┌──────────────────┐
  │ Append 'A' three │
  │ times to result  │
  └────────┬─────────┘
           │
           ▼
    Continue scan...
           │
           ├─▶ Found a number: '2'
           │   Parse full number: 2
           │   Read next char: 'B'
           ▼
  ┌──────────────────┐
  │ Append 'B' two   │
  │ times to result  │
  └────────┬─────────┘
           │
           ▼
    Continue scan...
           │
           ├─▶ Found a character: 'C' (no preceding number)
           │   Assume count is 1
           ▼
  ┌──────────────────┐
  │ Append 'C' one   │
  │ time to result   │
  └────────┬─────────┘
           │
           ▼
    ● End of String. Final Result: "AAABBC"

The C Code for Decoding

This function, run_length_decode, requires careful parsing of the input string to distinguish between digits that form a count and the characters that need to be repeated.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

// Function to perform Run-Length Decoding
char *run_length_decode(const char *text) {
    if (text == NULL || *text == '\0') {
        char *empty_result = malloc(1);
        if (empty_result) {
            empty_result[0] = '\0';
        }
        return empty_result;
    }

    // We must calculate the required size for the decoded string first
    // to avoid expensive reallocations in a loop.
    size_t decoded_len = 0;
    for (size_t i = 0; text[i] != '\0'; ) {
        size_t count = 1;
        size_t num_len = 0;

        if (isdigit(text[i])) {
            char *end_ptr;
            count = strtoul(&text[i], &end_ptr, 10);
            num_len = end_ptr - &text[i];
        }
        
        decoded_len += count;
        i += num_len + 1; // Move past the number and the character
    }

    // Now allocate the exact memory needed
    char *result = malloc(decoded_len + 1);
    if (!result) {
        return NULL; // Allocation failed
    }
    result[0] = '\0';
    
    size_t result_idx = 0;
    for (size_t i = 0; text[i] != '\0'; ) {
        size_t count = 1;
        size_t num_len = 0;
        char char_to_repeat;

        if (isdigit(text[i])) {
            char *end_ptr;
            count = strtoul(&text[i], &end_ptr, 10);
            num_len = end_ptr - &text[i];
            char_to_repeat = text[i + num_len];
        } else {
            char_to_repeat = text[i];
        }

        // Append the character 'count' times
        for (size_t k = 0; k < count; k++) {
            result[result_idx++] = char_to_repeat;
        }

        i += num_len + 1; // Move to the start of the next run
    }
    
    result[result_idx] = '\0'; // Null-terminate the final string
    return result;
}

Code Walkthrough: Decoding Logic Explained

  1. Pre-calculating Size: Unlike encoding, for decoding we can determine the exact size of the final string *before* building it. We perform a "dry run" pass over the compressed string, parsing the counts and summing them up to get decoded_len. This allows us to do a single, precise malloc, which is far more efficient than repeated realloc calls.
  2. Parsing Numbers:
    • The loop iterates through the compressed string. At each position, it checks if the character is a digit using isdigit() from <ctype.h>.
    • If it's a digit, we use strtoul (String to Unsigned Long) to parse the complete number. strtoul is robust; it reads all consecutive digits and tells us where it stopped parsing via the end_ptr. This elegantly handles multi-digit counts like "12" or "100".
    • If the character is not a digit, we assume a default count of 1.
  3. Building the Result:
    • After parsing the count and identifying the char_to_repeat, a simple inner for loop appends the character to our result buffer the required number of times.
    • We use an index result_idx to keep track of our position in the result buffer, which is more efficient than repeated calls to strcat.
  4. Advancing the Index: The outer loop index i is advanced by the length of the number string (num_len) plus one for the character itself, correctly positioning it for the next run.
  5. Null Termination: After the loop finishes, we place a null terminator \0 at the end of the constructed string.

Putting It All Together: A Complete C Program

To make these functions usable, let's wrap them in a main function that demonstrates the full encode-decode cycle. This program will take a sample string, encode it, print the result, then decode it back and verify that it matches the original.

Compiling and Running the Code

Save the complete code below as rle.c. You can compile and run it from your terminal using a C compiler like GCC.

Terminal Commands:

gcc rle.c -o rle -std=c11 -Wall
./rle

The -std=c11 flag ensures we're using a modern C standard, and -Wall enables all compiler warnings, which is good practice for catching potential bugs.

Full Program (rle.c)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

// ... (Paste the run_length_encode function here) ...
char *run_length_encode(const char *text) {
    if (text == NULL || *text == '\0') {
        char *empty_result = malloc(1);
        if (empty_result) { empty_result[0] = '\0'; }
        return empty_result;
    }
    size_t text_len = strlen(text);
    size_t result_capacity = text_len * 2 + 1;
    char *result = malloc(result_capacity);
    if (!result) { return NULL; }
    result[0] = '\0';
    size_t result_len = 0;
    for (size_t i = 0; i < text_len; ) {
        char current_char = text[i];
        size_t count = 1;
        size_t j = i + 1;
        while (j < text_len && text[j] == current_char) {
            count++;
            j++;
        }
        char count_str[21];
        if (count > 1) {
            int written = snprintf(count_str, sizeof(count_str), "%zu", count);
            if (result_len + written + 1 + 1 > result_capacity) {
                result_capacity = (result_len + written + 2) * 2;
                char* new_result = realloc(result, result_capacity);
                if (!new_result) { free(result); return NULL; }
                result = new_result;
            }
            strcat(result, count_str);
            result_len += written;
        }
        if (result_len + 1 + 1 > result_capacity) {
             result_capacity = (result_len + 2) * 2;
             char* new_result = realloc(result, result_capacity);
             if (!new_result) { free(result); return NULL; }
             result = new_result;
        }
        result[result_len] = current_char;
        result[result_len + 1] = '\0';
        result_len++;
        i = j;
    }
    return result;
}


// ... (Paste the run_length_decode function here) ...
char *run_length_decode(const char *text) {
    if (text == NULL || *text == '\0') {
        char *empty_result = malloc(1);
        if (empty_result) { empty_result[0] = '\0'; }
        return empty_result;
    }
    size_t decoded_len = 0;
    for (size_t i = 0; text[i] != '\0'; ) {
        size_t count = 1;
        size_t num_len = 0;
        if (isdigit(text[i])) {
            char *end_ptr;
            count = strtoul(&text[i], &end_ptr, 10);
            num_len = end_ptr - &text[i];
        }
        decoded_len += count;
        i += num_len + 1;
    }
    char *result = malloc(decoded_len + 1);
    if (!result) { return NULL; }
    result[0] = '\0';
    size_t result_idx = 0;
    for (size_t i = 0; text[i] != '\0'; ) {
        size_t count = 1;
        size_t num_len = 0;
        char char_to_repeat;
        if (isdigit(text[i])) {
            char *end_ptr;
            count = strtoul(&text[i], &end_ptr, 10);
            num_len = end_ptr - &text[i];
            char_to_repeat = text[i + num_len];
        } else {
            char_to_repeat = text[i];
        }
        for (size_t k = 0; k < count; k++) {
            result[result_idx++] = char_to_repeat;
        }
        i += num_len + 1;
    }
    result[result_idx] = '\0';
    return result;
}

int main() {
    const char *original_text = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB";
    printf("Original Text:    \"%s\"\n", original_text);
    printf("Original Length:  %zu\n\n", strlen(original_text));

    // Encode the text
    char *encoded_text = run_length_encode(original_text);
    if (encoded_text) {
        printf("Encoded Text:     \"%s\"\n", encoded_text);
        printf("Encoded Length:   %zu\n\n", strlen(encoded_text));

        // Decode the text
        char *decoded_text = run_length_decode(encoded_text);
        if (decoded_text) {
            printf("Decoded Text:     \"%s\"\n", decoded_text);
            printf("Decoded Length:   %zu\n\n", strlen(decoded_text));

            // Verification
            if (strcmp(original_text, decoded_text) == 0) {
                printf("SUCCESS: Original and decoded texts match!\n");
            } else {
                printf("FAILURE: Texts do not match.\n");
            }

            // Clean up memory
            free(decoded_text);
        }
        free(encoded_text);
    }

    return 0;
}

Expected Output:

Original Text:    "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB"
Original Length:  53

Encoded Text:     "12WB12W3B24WB"
Encoded Length:   13

Decoded Text:     "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWWWB"
Decoded Length:   53

SUCCESS: Original and decoded texts match!

Frequently Asked Questions (FAQ)

1. What happens if RLE makes the data larger?
This is a common and important scenario. If the input data has no repeating characters (e.g., "abcdefg"), RLE will expand it (e.g., "abcdefg" becomes "Abcdefg", assuming single runs are not prefixed with '1'). A professional implementation often includes a check: if the compressed size is greater than or equal to the original size, the original data is stored instead, with a flag indicating it is uncompressed.
2. How does this RLE implementation handle numbers in the original text?
The provided decoding implementation assumes the input text does not contain digits that could be misinterpreted as counts. For example, encoding "AAA11BB" would produce "3A212B". The decoder would misinterpret "212B". To handle arbitrary data, a more complex scheme is needed, such as using an escape character to differentiate between literal numbers and run counts.
3. Is Run-Length Encoding still relevant today?
Yes, but in niche applications. While it's not used for general-purpose file compression like .zip or .gz, it is still a component within more complex standards. It's used in formats like TIFF, BMP, and PCX for image data, and was fundamental to fax machine (ITU T.4 standard) compression. Its primary modern relevance is as an excellent educational tool for learning about compression, algorithms, and low-level data manipulation.
4. Can RLE be used for binary data, not just text?
Absolutely. The core concept is data-agnostic. You can apply RLE to any stream of data, including binary files. Instead of characters, you would be counting runs of identical bytes. This is common in image compression where you might have long runs of the same color pixel (byte value).
5. What's the main difference between RLE and Huffman Coding?
RLE compresses based on the physical sequence of data (runs of identical values). Huffman Coding is a statistical method that compresses based on the frequency of data. It assigns shorter codes to more frequent characters and longer codes to less frequent ones, regardless of where they appear in the data. Huffman is generally more effective for text and other diverse data types.
6. How does your code handle counts greater than 9?
The provided code handles multi-digit counts correctly. The encoder uses snprintf to convert any integer (like 12, 100, etc.) into a string. The decoder uses strtoul, which is designed to parse multi-digit numbers from a string until it hits a non-digit character. This makes the implementation robust for long runs.
7. Are there any standard library functions in C for RLE?
No, the C standard library does not include built-in functions for Run-Length Encoding or other high-level compression algorithms. This is by design, as C aims to be a lean, portable language. Implementing algorithms like RLE is a task left to the programmer, making it a perfect learning module from the kodikra C curriculum.

Conclusion: From Theory to Practice

You have now journeyed from the fundamental theory of Run-Length Encoding to a complete, practical implementation in C. We've seen that RLE, despite its simplicity, is a powerful illustration of lossless compression, forcing us to tackle essential C programming concepts like dynamic memory management, string manipulation, and algorithmic logic.

While you may reach for more advanced libraries for production-level compression needs, the knowledge gained here is invaluable. Understanding how to manipulate data at this low level is a hallmark of a skilled C programmer. You now have a solid foundation to explore more complex algorithms and to appreciate the trade-offs involved in data compression.

To continue your journey, consider exploring other algorithms or diving deeper into systems programming. Check out the complete kodikra learning path for C to discover your next challenge.

Disclaimer: The C code provided in this article is written for clarity and educational purposes, conforming to the C11/C17 standard. For production systems, always add more extensive error handling, bounds checking, and security considerations.


Published by Kodikra — Your trusted C learning resource.