Variable Length Quantity in C: Complete Solution & Deep Dive Guide



Variable Length Quantity (VLQ) is a universal encoding scheme that represents arbitrarily large integers using a flexible number of bytes. This method is highly efficient for data streams containing many small numbers, saving significant space compared to fixed-width integers, making it crucial for formats like MIDI and Protobuf.


The Hidden Cost of Wasted Bytes

Imagine you're designing a protocol for a real-time music application. You need to send millions of musical notes, timings, and velocity values over a network every second. Most of these values are small—a note's pitch might be 60, its velocity 100. Using a standard 32-bit integer (4 bytes) for each of these feels like hiring a massive shipping container to transport a single shoebox. It works, but it's incredibly wasteful.

This inefficiency multiplies quickly. Wasted bytes lead to higher bandwidth costs, slower transmission times, and larger storage footprints. You've hit a common but frustrating roadblock in software engineering: how do you represent data, especially numerical data, in the most compact way possible without losing information? You need a smarter, more flexible approach.

This is precisely the problem that Variable Length Quantity (VLQ) encoding was designed to solve. It’s a clever technique that allows you to use only one byte for small numbers, two for slightly larger ones, and so on, adapting dynamically to the magnitude of the data. This guide will take you from the core theory of VLQ to a complete, production-ready implementation in C, transforming how you think about data serialization.


What Exactly is Variable Length Quantity (VLQ)?

Variable Length Quantity (VLQ) is a form of universal code used to encode integers into a sequence of bytes. Its primary characteristic is that the number of bytes used to represent an integer is not fixed; it depends on the magnitude of the integer itself. Small integers use fewer bytes, while larger integers use more.

The magic behind VLQ lies in how it uses the bits within each byte. In a standard 8-bit byte, VLQ reserves the most significant bit (MSB) as a special flag called the continuation bit. The remaining 7 bits are used to store the actual data, or the "payload."

  • If the continuation bit (MSB) is 1, it signals that more bytes follow as part of the current integer.
  • If the continuation bit (MSB) is 0, it signals that this is the final byte for the current integer.
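As a minimal sketch of these two rules, the helpers below test the continuation bit and extract the payload of a single byte. The names vlq_has_more and vlq_payload and the two macros are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, used for illustration only. */
#define VLQ_CONTINUATION_BIT 0x80u /* 1000 0000: "more bytes follow" flag */
#define VLQ_PAYLOAD_MASK     0x7Fu /* 0111 1111: the 7 data bits */

/* Returns true when another byte of the same integer follows this one. */
static bool vlq_has_more(uint8_t byte) {
    return (byte & VLQ_CONTINUATION_BIT) != 0;
}

/* Returns the 7-bit payload with the continuation bit stripped. */
static uint8_t vlq_payload(uint8_t byte) {
    return (uint8_t)(byte & VLQ_PAYLOAD_MASK);
}
```

For the byte 0x81, vlq_has_more returns true and vlq_payload returns 1. For 0x09, no more bytes follow and the payload is 9.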

This simple rule allows us to chain bytes together to represent numbers far larger than what a single byte could normally hold (which is typically 0-255). This technique is a cornerstone of data compression and serialization in many well-known technologies, including:

  • MIDI (Musical Instrument Digital Interface) files: Used to encode event timings in music sequences.
  • Google Protocol Buffers (Protobuf): Employs a variant called "Varints" for efficient data serialization.
  • Source Maps in Web Development: Used to map compiled code (like minified JavaScript) back to its original source.
  • DWARF Debugging Format: Uses a similar encoding called LEB128 (Little-Endian Base 128).

Why Should You Use VLQ Encoding?

The primary motivation for using VLQ is space efficiency, especially in datasets with a skewed distribution of numbers. Most computer systems default to fixed-size integers like int32_t (4 bytes) or int64_t (8 bytes). This is fine when your numbers are consistently large, but it's highly inefficient if your data consists mostly of small values.

Consider a system logging user IDs. If the first million users have IDs from 1 to 1,000,000, storing each ID as a 64-bit integer is overkill. The number 127, which can be represented with 7 bits, would still occupy 8 bytes in memory or on disk. With VLQ, it would only take up one byte.

VLQ shines because it offers a "pay-as-you-go" model for data representation. You only use the bytes you absolutely need. This leads to smaller file sizes, faster network transmission, and reduced storage costs, which are critical advantages in high-performance and large-scale systems.
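To make the "pay-as-you-go" claim concrete, here is a small sketch that computes how many VLQ bytes a 32-bit value needs, so the total can be compared with a fixed 4 bytes per value. The function names are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of VLQ bytes needed for a 32-bit value. */
static size_t vlq_encoded_size(uint32_t value) {
    size_t size = 1;           /* every value needs at least one byte */
    while (value > 0x7F) {     /* more than 7 bits are still left to store */
        value >>= 7;
        size++;
    }
    return size;
}

/* Total VLQ bytes for an array; compare with count * sizeof(uint32_t). */
static size_t vlq_total_size(const uint32_t *values, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += vlq_encoded_size(values[i]);
    }
    return total;
}
```

For the four values 60, 100, 127, and 1000000, vlq_total_size reports 6 bytes, against 16 bytes for four uint32_t values.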


How Does the VLQ Algorithm Work? A Deep Dive

Understanding VLQ requires thinking in terms of bitwise operations. The process is split into two distinct parts: encoding (converting an integer to bytes) and decoding (converting bytes back to an integer).

The Encoding Process Explained

Encoding transforms a standard integer into a VLQ byte sequence. The core idea is to break the number down into 7-bit chunks, starting from the least significant bits, and then write those chunks out with the most significant chunk first.

Let's encode the number 137 (binary 10001001):

  1. Extract the lowest 7 bits: 137 & 0x7F = 9 (binary 0001001). This is the least significant chunk, so it will become the final byte of the sequence. Its continuation bit stays 0, giving 0x09.
  2. Shift the number: 137 >> 7 = 1. The value is not yet zero, so at least one more chunk is needed.
  3. Extract the next 7 bits: 1 & 0x7F = 1. This chunk will be followed by another byte in the output, so we set its continuation bit: 1 | 0x80 = 0x81.
  4. Shift again: 1 >> 7 = 0. Nothing is left, so the loop ends.
  5. Assemble the sequence: The chunks were generated least significant first, but VLQ stores the most significant chunk first, so we reverse them. The final sequence is [0x81, 0x09].
  6. Verify: ((0x81 & 0x7F) << 7) | (0x09 & 0x7F) = (1 << 7) | 9 = 128 + 9 = 137.

Now a larger number, 2097151 (0x1FFFFF):

  1. 0x1FFFFF & 0x7F = 0x7F. This first chunk becomes the final byte, so it gets no continuation bit: 0x7F. Shift: 0x1FFFFF >> 7 = 0x3FFF.
  2. 0x3FFF & 0x7F = 0x7F. Set the continuation bit: 0x7F | 0x80 = 0xFF. Shift: 0x3FFF >> 7 = 0x7F.
  3. 0x7F & 0x7F = 0x7F. Set the continuation bit: 0x7F | 0x80 = 0xFF. Shift: 0x7F >> 7 = 0, so we stop.
  4. Reversing the chunks gives the final sequence [0xFF, 0xFF, 0x7F].

        ● Start with Integer (e.g., 2097151)
        │
        ▼
      ┌──────────────────┐
      │ Loop while N > 0 │
      └────────┬─────────┘
               │
               ├─ Loop 1: N = 2097151
               │   │
               │   ├─> Extract 7 bits: N & 0x7F  (payload = 127)
               │   │
               │   ├─> Shift N >> 7              (N becomes 16383)
               │   │
               │   └─> First chunk = LAST output byte. NO continuation bit.
               │       (Chunk 1 = 0x7F)
               │
               ├─ Loop 2: N = 16383
               │   │
               │   ├─> Extract 7 bits: N & 0x7F  (payload = 127)
               │   │
               │   ├─> Shift N >> 7              (N becomes 127)
               │   │
               │   └─> Set continuation bit: payload | 0x80
               │       (Chunk 2 = 0xFF)
               │
               ├─ Loop 3: N = 127
               │   │
               │   ├─> Extract 7 bits: N & 0x7F  (payload = 127)
               │   │
               │   ├─> Shift N >> 7              (N becomes 0)
               │   │
               │   └─> Set continuation bit: payload | 0x80
               │       (Chunk 3 = 0xFF)
               │
               ▼
      ┌────────────────────────────┐
      │ Reverse to MSB-first order │
      │ [0xFF, 0xFF, 0x7F]         │
      └────────────────────────────┘
               │
               ▼
        ● End
    

    The Decoding Process Explained

    Decoding reverses the process. We read bytes one by one, strip the continuation bit, and assemble the 7-bit payloads back into a single integer.

    Let's decode the byte sequence [0xFF, 0xFF, 0x7F]:

    1. Initialize result: Start with a result variable set to 0.
    2. Read the first byte (0xFF): Check its MSB. Is 0xFF & 0x80 non-zero? Yes. This means more bytes will follow.
    3. Process the payload: Get the 7-bit payload: 0xFF & 0x7F gives 127 (or 0x7F). Add this to our result. Result is now 127.
    4. Read the second byte (0xFF): Check its MSB. It's also set. This is not the last byte.
    5. Shift and combine: Before adding the new payload, we must shift our current result left by 7 bits to make room. Result becomes 127 << 7, which is 16256.
    6. Process the payload: Get the 7-bit payload: 0xFF & 0x7F gives 127. Add it to the shifted result: 16256 | 127 = 16383. Result is now 16383.
    7. Read the third byte (0x7F): Check its MSB. Is 0x7F & 0x80 non-zero? No. This is the final byte.
    8. Shift and combine: Shift our current result left by 7 bits: 16383 << 7 = 2097024.
    9. Process the final payload: Get the 7-bit payload: 0x7F & 0x7F gives 127. Add it to the result: 2097024 | 127 = 2097151.
    10. Final Result: The decoded integer is 2097151.

        ● Start with Byte Stream [0xFF, 0xFF, 0x7F]
        │
        ▼
      ┌───────────────────────┐
      │ Initialize Result = 0 │
      └─────────┬─────────────┘
                │
                ▼
        ◆ Read Byte (0xFF). MSB is 1? ── Yes
       ╱        │
      │         ▼
      │       ┌─────────────────────────────────────────┐
      │       │ Get payload: 0xFF & 0x7F = 127          │
      │       │ Shift Result << 7: 0 << 7 = 0           │
      │       │ Combine: Result |= 127                  │
      │       │ (Result is now 127)                     │
      │       └─────────────────────────────────────────┘
      │         │
      └─────────┤
                ▼
        ◆ Read Byte (0xFF). MSB is 1? ── Yes
       ╱        │
      │         ▼
      │       ┌─────────────────────────────────────────┐
      │       │ Get payload: 0xFF & 0x7F = 127          │
      │       │ Shift Result << 7: 127 << 7 = 16256     │
      │       │ Combine: Result |= 127                  │
      │       │ (Result is now 16383)                   │
      │       └─────────────────────────────────────────┘
      │         │
      └─────────┤
                ▼
        ◆ Read Byte (0x7F). MSB is 1? ── No (This is the last byte)
                  │
                  ▼
                ┌─────────────────────────────────────────┐
                │ Get payload: 0x7F & 0x7F = 127          │
                │ Shift Result << 7: 16383 << 7 = 2097024 │
                │ Combine: Result |= 127                  │
                │ (Result is now 2097151)                 │
                └─────────────────────────────────────────┘
                  │
                  ▼
        ● Final Integer: 2097151
    

    Implementing VLQ in C: A Complete Solution

    Now, let's translate this logic into robust C code. This solution, part of the exclusive kodikra.com C learning path, provides functions for both encoding and decoding, complete with error handling.

    We'll use types from <stdint.h> like uint32_t and uint8_t to ensure our integer sizes are explicit and portable.

    
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdbool.h>
    
    // Define a max buffer size for safety in our example
    #define MAX_VLQ_BYTES 5
    
    /**
     * @brief Encodes a 32-bit unsigned integer into a VLQ byte sequence.
     *
     * @param value The integer to encode.
     * @param buffer A pointer to the output buffer, which must have room for
     *               at least MAX_VLQ_BYTES bytes.
     * @return The number of bytes written to the buffer, or -1 on error.
     */
    int vlq_encode(uint32_t value, uint8_t *buffer) {
        uint8_t temp[MAX_VLQ_BYTES];
        int count = 0;
    
        // A value of 0 should be encoded as a single byte 0x00
        if (value == 0) {
            buffer[0] = 0x00;
            return 1;
        }
    
        // Process the value while it's greater than 0
        while (value > 0) {
            // Extract the lowest 7 bits
            uint8_t byte = value & 0x7F;
            value >>= 7;
    
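            // Chunks are produced least significant first but written out most
            // significant first. Only the first chunk produced (the final output
            // byte) keeps a clear continuation bit; every later chunk gets it set.
            if (count > 0) {
                byte |= 0x80;
            }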
            
            // Store the byte in a temporary array
            if (count < MAX_VLQ_BYTES) {
                temp[count] = byte;
            } else {
                // This should not happen for uint32_t, but good practice
                return -1; // Error: overflow
            }
            count++;
        }
    
        // The bytes were generated in reverse order (LSB first), so we reverse them
        // to get the correct MSB-first order for the final output.
        for (int i = 0; i < count; i++) {
            buffer[i] = temp[count - 1 - i];
        }
    
        return count;
    }
    
    /**
     * @brief Decodes a VLQ byte sequence into a 32-bit unsigned integer.
     *
     * @param buffer A pointer to the input buffer containing VLQ bytes.
     * @param length The number of bytes available in the buffer.
     * @param decoded_value A pointer to store the resulting integer.
     * @return The number of bytes read from the buffer, or -1 on error.
     */
    int vlq_decode(const uint8_t *buffer, size_t length, uint32_t *decoded_value) {
        *decoded_value = 0;
        int bytes_read = 0;
    
        for (size_t i = 0; i < length; ++i) {
            bytes_read++;
            uint8_t byte = buffer[i];
            
            // Check for potential overflow before shifting
            // If the top 7 bits of the result are already set, shifting left by 7
            // will cause an overflow on a 32-bit integer.
            if ((*decoded_value & 0xFE000000) != 0) {
                return -1; // Error: Overflow
            }
    
            // Shift the existing value left by 7 to make room for the new payload
            *decoded_value <<= 7;
            
            // Add the 7-bit payload from the current byte
            *decoded_value |= (byte & 0x7F);
    
            // Check the continuation bit. If it's 0, we are done.
            if ((byte & 0x80) == 0) {
                return bytes_read;
            }
        }
    
        // If we exit the loop, it means the sequence was incomplete
        // (last byte still had continuation bit set)
        return -1; // Error: Incomplete sequence
    }
    
    // Main function to demonstrate usage
    int main(void) {
        uint8_t buffer[MAX_VLQ_BYTES];
        uint32_t original_values[] = {0x0, 0x7F, 0x80, 0x2000, 0x3FFF, 0x4000, 0x1FFFFF, 0x200000, 0xFFFFFFFF};
        int num_values = sizeof(original_values) / sizeof(original_values[0]);
    
        for (int i = 0; i < num_values; i++) {
            uint32_t value = original_values[i];
            
            // --- Encoding ---
            int bytes_written = vlq_encode(value, buffer);
            if (bytes_written > 0) {
                printf("Encoded 0x%X -> %d bytes: [ ", value, bytes_written);
                for (int j = 0; j < bytes_written; j++) {
                    printf("0x%02X ", buffer[j]);
                }
                printf("]\n");
    
                // --- Decoding ---
                uint32_t decoded_value;
                int bytes_read = vlq_decode(buffer, bytes_written, &decoded_value);
                if (bytes_read > 0) {
                    printf("Decoded back to: 0x%X. Matched: %s\n\n", decoded_value, (value == decoded_value) ? "Yes" : "No");
                } else {
                    printf("Decoding failed with error code: %d\n\n", bytes_read);
                }
            } else {
                printf("Encoding failed for 0x%X\n\n", value);
            }
        }
    
        return 0;
    }
    

    Code Walkthrough: Deconstructing the C Implementation

    A solid understanding of the code is crucial. Let's break down the logic of our vlq_encode and vlq_decode functions.

    The `vlq_encode` Function

    1. Temporary Buffer: The logic naturally produces the least significant byte first. Instead of complex pointer arithmetic, we use a simple temporary array temp to store the bytes as they are generated.
    2. Handling Zero: The value 0 is a special case. The main loop condition while (value > 0) would not execute, so we handle it explicitly: 0 is encoded as a single byte 0x00.
    3. The Main Loop: The while loop continues as long as there are bits left in our number.
      • uint8_t byte = value & 0x7F;: This line uses a bitwise AND with the mask 0x7F (binary 01111111) to isolate the lowest 7 bits of value. This is our payload.
      • value >>= 7;: We then shift value right by 7 bits, effectively discarding the bits we just processed and preparing the next 7-bit chunk for the next iteration.
      • if (count > 0) { byte |= 0x80; }: This is the key step. The first chunk we produce is the least significant one, which ends up as the final byte of the output, so it is stored without a continuation bit. Every later chunk precedes another byte in the output, so we set its continuation bit (MSB) using a bitwise OR with 0x80 (binary 10000000).
    4. Reversing the Bytes: After the loop, the temp array holds the chunks in least-significant-first order (e.g., for 0x80 it would be [0x00, 0x81]). The standard VLQ representation is most-significant-first. The final for loop reverses the bytes from temp into the output buffer, resulting in the correct sequence (e.g., [0x81, 0x00]).
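As an alternative to the temporary array, the encoder can count the 7-bit groups first and then write them straight into the output in most-significant-first order. The sketch below is one possible variant, not the implementation shown above. The name vlq_encode_direct is hypothetical, and the caller is assumed to supply room for up to 5 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: MSB-first VLQ encoder that writes directly into the output.
 * Assumes `out` has room for up to 5 bytes. Returns the bytes written. */
static size_t vlq_encode_direct(uint32_t value, uint8_t *out) {
    assert(out != NULL);

    /* Pass 1: count the 7-bit groups needed (at least one, even for 0). */
    size_t groups = 1;
    for (uint32_t rest = value >> 7; rest != 0; rest >>= 7) {
        groups++;
    }

    /* Pass 2: emit groups from most significant to least significant. */
    for (size_t i = 0; i < groups; i++) {
        unsigned shift = (unsigned)(7 * (groups - 1 - i));
        uint8_t byte = (uint8_t)((value >> shift) & 0x7F);
        if (i + 1 < groups) {
            byte |= 0x80; /* not the final byte: set the continuation bit */
        }
        out[i] = byte;
    }
    return groups;
}
```

This trades the reversal loop for a second pass over the value. Which version reads better is largely a matter of taste, since both touch at most 5 groups for a 32-bit input.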

    The `vlq_decode` Function

    1. Initialization: We start with a decoded_value of 0.
    2. The Main Loop: The loop iterates through the input byte stream.
      • Overflow Check: The line if ((*decoded_value & 0xFE000000) != 0) is a critical safety check. A 32-bit value needs at most 5 VLQ bytes (4 + 7 + 7 + 7 + 7 bits). Before we shift left by 7, we check whether any of the top 7 bits of the current result are already set. If they are, the shift would push them out of the 32-bit value, so we abort with an error.
      • *decoded_value <<= 7;: We shift the accumulated result to the left by 7 bits. This makes space for the 7-bit payload from the byte we are about to process.
      • *decoded_value |= (byte & 0x7F);: We take the current byte, strip its continuation bit using & 0x7F, and use a bitwise OR to combine this payload with our shifted result.
      • if ((byte & 0x80) == 0): We check the continuation bit. If it's 0, this is the last byte in the sequence. We return the number of bytes we've successfully processed.
    3. Error Handling: If the loop finishes without ever finding a byte with a 0 continuation bit, it means the sequence is malformed or incomplete. In this case, we return -1 to signal an error.
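Because the decoder reports how many bytes it consumed, several integers can be packed back to back and read in sequence. The sketch below illustrates that pattern along with both error paths. To stay self-contained it carries its own compact decoder, decode_one, written in the spirit of vlq_decode. Both function names are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact MSB-first decoder in the spirit of vlq_decode.
 * Returns bytes consumed, or -1 on overflow or an incomplete sequence. */
static int decode_one(const uint8_t *buf, size_t len, uint32_t *value) {
    uint32_t result = 0;
    for (size_t i = 0; i < len; i++) {
        if ((result & 0xFE000000u) != 0) {
            return -1;               /* shifting by 7 would drop set bits */
        }
        result = (result << 7) | (buf[i] & 0x7Fu);
        if ((buf[i] & 0x80u) == 0) { /* final byte of this integer */
            *value = result;
            return (int)(i + 1);
        }
    }
    return -1;                       /* ran out of input mid-integer */
}

/* Hypothetical helper: decodes up to max_out integers from a packed stream.
 * Returns the number of integers decoded, or -1 if the stream is malformed. */
static int decode_stream(const uint8_t *buf, size_t len,
                         uint32_t *out, size_t max_out) {
    size_t offset = 0;
    size_t count = 0;
    while (offset < len && count < max_out) {
        int used = decode_one(buf + offset, len - offset, &out[count]);
        if (used < 0) {
            return -1;
        }
        offset += (size_t)used;      /* the next integer starts here */
        count++;
    }
    return (int)count;
}
```

Given the bytes [0x7F, 0x81, 0x00, 0xFF, 0xFF, 0x7F], decode_stream yields 127, 128, and 2097151. A lone 0x81 (continuation bit set, nothing after it) or the overlong sequence [0xFF, 0xFF, 0xFF, 0xFF, 0x7F] makes it return -1.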

    Compiling and Running Your VLQ Code

    You can compile and run this C code using a standard C compiler like GCC. Save the code as vlq_main.c and execute the following commands in your terminal.

    
    # Compile the C code
    gcc -o vlq_program vlq_main.c -std=c11 -Wall
    
    # Run the executable
    ./vlq_program
    

    The output will show the original numbers, their VLQ-encoded byte representation, and the confirmation that they decode back to the original value, demonstrating a successful round trip.

    
    Encoded 0x0 -> 1 bytes: [ 0x00 ]
    Decoded back to: 0x0. Matched: Yes
    
    Encoded 0x7F -> 1 bytes: [ 0x7F ]
    Decoded back to: 0x7F. Matched: Yes
    
    Encoded 0x80 -> 2 bytes: [ 0x81 0x00 ]
    Decoded back to: 0x80. Matched: Yes
    
    Encoded 0x2000 -> 2 bytes: [ 0xC0 0x00 ]
    Decoded back to: 0x2000. Matched: Yes
    
    Encoded 0x3FFF -> 2 bytes: [ 0xFF 0x7F ]
    Decoded back to: 0x3FFF. Matched: Yes
    
    Encoded 0x4000 -> 3 bytes: [ 0x81 0x80 0x00 ]
    Decoded back to: 0x4000. Matched: Yes
    
    Encoded 0x1FFFFF -> 3 bytes: [ 0xFF 0xFF 0x7F ]
    Decoded back to: 0x1FFFFF. Matched: Yes
    
    Encoded 0x200000 -> 4 bytes: [ 0x81 0x80 0x80 0x00 ]
    Decoded back to: 0x200000. Matched: Yes
    
    Encoded 0xFFFFFFFF -> 5 bytes: [ 0x8F 0xFF 0xFF 0xFF 0x7F ]
    Decoded back to: 0xFFFFFFFF. Matched: Yes
    

    The Pros and Cons of Variable Length Quantity

    Like any technology, VLQ is not a silver bullet. It's essential to understand its trade-offs to know when to use it effectively.

    Pros (Advantages)

      • Significant Space Savings: Highly efficient for data streams where small integers are far more common than large ones. A number like 100 takes 1 byte instead of 4 or 8.
      • Flexibility: Can represent integers of any size without being constrained by fixed types like int32_t or int64_t. The format naturally expands as needed.
      • No Endianness Issues: Since data is processed byte by byte based on the continuation bit, VLQ is immune to the big-endian vs. little-endian problems that plague fixed-size integer serialization.
      • Self-Terminating: The encoding itself defines the boundary of each number via the continuation bit, eliminating the need for separate length prefixes.

    Cons (Disadvantages)

      • Processing Overhead: Encoding and decoding require bitwise operations (shifts, ANDs, ORs), which are computationally more expensive than simply reading a fixed-size integer from memory.
      • Not Random Access Friendly: To read the Nth integer in a VLQ stream, you must decode all N-1 integers before it to know where it starts. This makes it unsuitable for data that requires frequent random access.
      • Inefficient for Large Numbers: For data where all numbers are consistently large (e.g., all values are close to 2^30), VLQ can be less efficient. A 32-bit number might take up to 5 bytes, whereas a fixed int32_t always takes 4.
      • Vulnerability to Malformed Data: An incomplete sequence (e.g., a stream that ends on a byte with the continuation bit set) or an overly long sequence can cause decoding errors or even security vulnerabilities if not handled carefully.
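The random-access limitation can be seen in a short sketch. To find where the value at a given position starts, the code has to walk every preceding byte and watch the continuation bits. The function name vlq_offset_of is hypothetical, and the sketch does not check that the value at the returned offset is complete.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of the value at position `index`
 * (0-based) in a packed VLQ stream, or -1 if the stream has fewer values. */
static long vlq_offset_of(const uint8_t *buf, size_t len, size_t index) {
    size_t seen = 0; /* complete values skipped so far */
    for (size_t i = 0; i < len; i++) {
        if (seen == index) {
            return (long)i;
        }
        if ((buf[i] & 0x80) == 0) {
            seen++; /* a clear continuation bit ends one value */
        }
    }
    return -1;
}
```

With fixed-width integers the same lookup is a single multiplication (index * 4). Here the cost grows with the number of bytes before the target.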

    Frequently Asked Questions (FAQ) about VLQ

    1. What is the maximum number of VLQ bytes a 32-bit integer can require?
    A 32-bit integer requires a maximum of 5 bytes. Since each byte carries 7 bits of payload, 4 bytes can carry 28 bits (4 * 7). To represent all 32 bits, you need a 5th byte. For example, 0xFFFFFFFF encodes to [0x8F, 0xFF, 0xFF, 0xFF, 0x7F].
    2. Is VLQ the same as LEB128?
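    They are very similar but not identical. LEB128 (Little-Endian Base 128) is the variant used in DWARF and WebAssembly. The core concept of a 7-bit payload and a continuation bit is the same. The primary difference is the order of the 7-bit groups: the VLQ described here (as in MIDI) writes the most significant group first, while LEB128 writes the least significant group first. For example, 137 is [0x81, 0x09] in VLQ but [0x89, 0x01] in unsigned LEB128.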
    3. Why not just use a compression algorithm like Gzip?
    General-purpose compression algorithms like Gzip or Zlib work by finding repeated patterns in large blocks of data. VLQ works at the level of individual integers. You can (and often should) use both. Use VLQ to efficiently serialize your data stream first, and then apply Gzip to the entire stream for further compression.
    4. How do I handle negative numbers with VLQ?
    The standard VLQ described here is for unsigned integers. To handle signed integers, a common technique is Zigzag encoding (used by Protocol Buffers). Zigzag maps signed integers to unsigned integers in a way that small negative numbers become small positive numbers, preserving the efficiency of VLQ. For example, 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, and so on.
    5. Is the implementation provided in this guide production-ready?
    The implementation is robust and includes essential error handling for overflow and incomplete sequences, making it a very strong foundation for production use. For mission-critical applications, it should be subjected to rigorous testing, including fuzz testing with malformed byte streams, to ensure it integrates safely into your system.
    6. What are the future trends related to this type of encoding?
    As data volumes continue to explode, efficient serialization formats are more critical than ever. We're seeing a trend towards schema-based formats like Protobuf, Avro, and FlatBuffers, all of which use variable-length integer encoding as a core optimization. The rise of WebAssembly also solidifies the importance of LEB128, a close cousin of VLQ, for compact binary code representation on the web. Expect to see these techniques become standard practice in cloud-native and IoT applications where bandwidth and storage are premium resources.
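The Zigzag mapping mentioned in question 4 can be sketched in a few lines for 32-bit values. The function names are assumptions made for this example. The result of zigzag_encode would then be passed to an unsigned VLQ encoder.

```c
#include <stdint.h>

/* Maps signed to unsigned so small magnitudes stay small:
 * 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... */
static uint32_t zigzag_encode(int32_t n) {
    /* The shift is done on the unsigned value to avoid undefined behavior.
     * The XOR mask is all ones for negative n and zero otherwise. */
    return ((uint32_t)n << 1) ^ (n < 0 ? UINT32_MAX : 0u);
}

/* Inverse mapping. The final cast assumes the usual two's complement
 * conversion, which is implementation-defined in C11. */
static int32_t zigzag_decode(uint32_t z) {
    /* (z >> 1) restores the magnitude bits; (0u - (z & 1u)) is all ones
     * when the low bit marks a negative number. */
    return (int32_t)((z >> 1) ^ (0u - (z & 1u)));
}
```

With this mapping, -1 becomes 1 and fits in a single VLQ byte, whereas reinterpreting -1 directly as 0xFFFFFFFF would need five.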

    Conclusion: A Powerful Tool for Modern Developers

    Variable Length Quantity is more than just an academic curiosity; it is a practical, powerful, and widely-used technique for intelligent data representation. By trading a small amount of CPU overhead for potentially massive gains in space efficiency, VLQ provides an elegant solution to a common engineering problem. Mastering its implementation in a low-level language like C not only equips you with a valuable tool but also deepens your understanding of bitwise operations, data structures, and the fundamental trade-offs in software design.

    Whether you are building the next big music streaming service, designing a high-performance database, or simply trying to optimize a network protocol, the principles of VLQ will serve you well. It's a testament to the fact that sometimes, the most effective solutions are found by carefully considering every single bit.

    To continue your journey, explore more advanced C topics in the complete kodikra C language guide or move on to the next challenge in your kodikra learning path.

    Disclaimer: The C code in this article is written against the C11 standard. Bitwise operations are fundamental and stable, but always compile with a modern toolchain (like GCC 11+ or Clang 14+) for best results and diagnostics.


    Published by Kodikra — Your trusted C learning resource.