KV Store Serialize/Deserialize
Asked at: OpenAI
DESCRIPTION
Implement a key-value store that can serialize multiple string key-value pairs into a single string and deserialize it back. Both keys and values can contain any characters including delimiters, newlines, and special characters. For example, serializing {"key:1": "val=ue", "k\ney": "v"} should produce a string that can be deserialized back to the exact same dictionary.
Input:
Output:
Explanation: Serialization is the process of converting a dictionary into a string, and deserialization is the process of converting a string back into a dictionary.
Constraints:
- Keys and values are strings that can contain any character (including delimiters, newlines, quotes)
- Serialization must be reversible - deserialize(serialize(data)) == data
- Must handle empty strings and empty dictionaries
Understanding the Problem
The core challenge is choosing a delimiter strategy that won't conflict with arbitrary string content. Simple delimiters like : or , fail when those characters appear in keys or values. Escaping (like \:) works but adds complexity and requires careful handling. The ideal approach eliminates delimiter conflicts entirely.
Building Intuition
A naive approach using delimiters like key:value,key2:value2 breaks when data contains those characters. Length-prefixed encoding solves this by storing length:data for each string - for example, 3:a:b unambiguously means a 3-character string a:b. This format is self-describing and handles any character content.
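To see the ambiguity concretely, here is a small illustrative sketch (not part of the reference solution) contrasting a naive delimiter format with a length-prefixed one:

# Naive delimiter-based encoding: ambiguous when data contains the delimiters.
naive = "key:1:val=ue"            # is the key "key" or "key:1"? There is no way to tell.

# Length-prefixed encoding: the length tells the parser exactly how much to read.
encoded = "5:key:16:val=ue"       # 5 chars -> "key:1", then 6 chars -> "val=ue"

length_end = encoded.index(":")                          # first ':' terminates the length field
length = int(encoded[:length_end])                       # 5
data = encoded[length_end + 1:length_end + 1 + length]   # "key:1"
assert data == "key:1"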
This pattern appears in real systems like Redis protocol (RESP), netstrings, and HTTP chunked encoding. Understanding length-prefixed serialization is essential for building robust data interchange formats that don't require escaping. It's a fundamental technique for handling untrusted or arbitrary binary data.
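For reference, netstrings apply exactly this idea, with a trailing comma as a frame terminator. A minimal encoder sketch, assuming the standard netstring layout of length, colon, data, comma:

def to_netstring(data: bytes) -> bytes:
    # Length in ASCII decimal, ':', the raw bytes, then a trailing ','.
    return str(len(data)).encode() + b":" + data + b","

assert to_netstring(b"hello world!") == b"12:hello world!,"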
Implementation
Serialization Function
The serialize function converts a dictionary into a length-prefixed string format. For each key-value pair, encode as keyLen:keyData:valueLen:valueData. For example, {"ab": "xyz"} becomes 2:ab3:xyz. Iterate through the dictionary, calculate string lengths using len(), and concatenate the formatted segments. Handle empty dictionaries by returning an empty string.
def serialize(kv_dict):
    # An empty dictionary serializes to an empty string.
    if not kv_dict:
        return ""
    result = []
    for key, value in kv_dict.items():
        key_str = str(key)
        value_str = str(value)
        # Each pair is encoded as <keyLen>:<key><valueLen>:<value>.
        segment = f"{len(key_str)}:{key_str}{len(value_str)}:{value_str}"
        result.append(segment)
    return "".join(result)
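A quick check with the dictionary from the problem description (assuming the serialize function above):

data = {"key:1": "val=ue", "k\ney": "v"}
encoded = serialize(data)
# '5:key:16:val=ue4:k\ney1:v' -- embedded colons and newlines need no escaping
print(encoded)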
Deserialization Function
The deserialize function parses the encoded string back into a dictionary. Use a position pointer to track the current parsing location. Read characters until : to extract the length integer, then extract exactly that many characters as data. Repeat this process twice per key-value pair (once for key, once for value). For example, parsing 2:ab3:xyz reads 2, extracts ab, reads 3, extracts xyz, yielding {"ab": "xyz"}.
def deserialize(s):
    kv = {}
    pos = 0
    while pos < len(s):
        # Read the key: length up to ':', then exactly that many characters.
        colon = s.index(':', pos)
        key_len = int(s[pos:colon])
        pos = colon + 1
        key = s[pos:pos + key_len]
        pos += key_len
        # Read the value the same way.
        colon = s.index(':', pos)
        val_len = int(s[pos:colon])
        pos = colon + 1
        val = s[pos:pos + val_len]
        pos += val_len
        kv[key] = val
    return kv
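A round-trip check (assuming the two functions above), including the empty-string edge case:

data = {"key:1": "val=ue", "k\ney": "v", "": ""}
assert deserialize(serialize(data)) == data  # an empty key/value is encoded as "0:" and survives the round trip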
Length Parsing Helper
Create a helper function read_length_and_data(s, pos) that extracts one length-prefixed segment. Starting at position pos, scan forward to find :, parse the integer length, then extract the substring of that exact length. Return both the extracted data and the new position. This encapsulates the core parsing logic and makes deserialization cleaner by handling one segment at a time.
def read_length_and_data(s, pos):
    # Scan to the ':' that terminates the length field.
    colon_pos = s.index(':', pos)
    length = int(s[pos:colon_pos])
    data_start = colon_pos + 1
    data_end = data_start + length
    data = s[data_start:data_end]
    # Return the extracted segment and the position just past it.
    return data, data_end
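With the helper, deserialize collapses to a short loop. A possible refactor with the same behavior as the version above:

def deserialize(s):
    kv = {}
    pos = 0
    while pos < len(s):
        # One segment for the key, one for the value.
        key, pos = read_length_and_data(s, pos)
        value, pos = read_length_and_data(s, pos)
        kv[key] = value
    return kv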
Follow-up Challenge
A common follow-up asks how you would handle large KV stores that exceed memory or payload limits by implementing chunked persistence with a fixed upload size.
Chunked Serialization with Metadata Tracking
The core challenge shifts from serializing the entire KV store as one blob to splitting it into fixed-size chunks (e.g., 1KB each) for upload or storage. This matters because real systems often have size constraints on network packets, API payloads, or storage blocks. You'll need a way to track how many chunks exist so deserialization knows when to stop reading.
The key insight is introducing a metadata file that stores critical information like total_chunks count before writing the actual data chunks. During serialization, you write the metadata first, then split your serialized KV data into 1KB chunks as separate files or blocks. During deserialization, you read the metadata to know exactly how many chunks to fetch and reassemble.
Important considerations include:
- Handling the last chunk, which will likely be smaller than 1KB
- Deciding whether to split on byte boundaries (simple) or respect character boundaries to avoid corrupting multi-byte UTF-8 sequences
- Choosing a naming scheme for chunk files (e.g., data_chunk_0, data_chunk_1) that preserves order
- Ensuring atomic writes of metadata so you never have orphaned chunks
import json

CHUNK_SIZE = 1024  # fixed 1KB upload size

def serialize_chunked(kv_store, base_path):
    # Serialize the whole store, then split the bytes into fixed-size chunks.
    serialized = json.dumps(kv_store)
    data_bytes = serialized.encode('utf-8')
    total_chunks = (len(data_bytes) + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division
    # Write the metadata first so the restore path knows how many chunks to fetch.
    metadata = {'total_chunks': total_chunks}
    with open(f'{base_path}_metadata.json', 'w') as f:
        json.dump(metadata, f)
    for i in range(total_chunks):
        chunk = data_bytes[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        with open(f'{base_path}_chunk_{i}', 'wb') as f:
            f.write(chunk)

def deserialize_chunked(base_path):
    # Read the metadata, then reassemble the chunks in order before decoding.
    with open(f'{base_path}_metadata.json', 'r') as f:
        metadata = json.load(f)
    data_bytes = b''
    for i in range(metadata['total_chunks']):
        with open(f'{base_path}_chunk_{i}', 'rb') as f:
            data_bytes += f.read()
    return json.loads(data_bytes.decode('utf-8'))
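A quick round trip on disk (assuming the two chunked functions above; the temporary directory is only for illustration):

import os
import tempfile

kv_store = {"key:1": "val=ue", "k\ney": "v"}
with tempfile.TemporaryDirectory() as tmp:
    base_path = os.path.join(tmp, "kv")
    serialize_chunked(kv_store, base_path)
    assert deserialize_chunked(base_path) == kv_store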
What We've Learned
- Pattern: Length-prefixed encoding (`length:data`) eliminates delimiter conflicts and handles arbitrary content safely
- Use Case: Essential for serialization protocols (Redis RESP, netstrings), binary data encoding, and robust data interchange
- Implementation: Track position carefully during parsing - read length, extract exact substring, advance position
- Edge Cases: Handle empty strings (`0:`), empty dictionaries, and validate lengths before extraction
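The last bullet is the one most often skipped. A defensive variant of the parsing helper (an optional hardening sketch, not part of the reference solution) might validate lengths like this:

def read_length_and_data_checked(s, pos):
    colon_pos = s.index(':', pos)      # raises ValueError if no length terminator is found
    length = int(s[pos:colon_pos])     # raises ValueError if the length field is not an integer
    data_end = colon_pos + 1 + length
    if length < 0 or data_end > len(s):
        # Truncated or corrupt input should fail loudly rather than return partial data.
        raise ValueError(f"corrupt segment at position {pos}")
    return s[colon_pos + 1:data_end], data_end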
Problems to Practice
- Medium: This problem involves parsing and encoding/decoding strings with special characters and nested structures, similar to serializing/deserializing a KV store. Both require careful string manipulation and handling of delimiters or special characters to preserve data integrity.
- Medium: Building a Trie data structure requires implementing insert and search methods similar to a KV store's put/get operations. Both involve designing efficient data structures for string storage and retrieval with proper handling of string keys.
- Medium: Word Break involves parsing strings and determining valid segmentations, which relates to the deserialization challenge of splitting encoded strings back into key-value pairs. Both require careful string boundary detection and validation.
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
- Early October, 2025: OpenAI, Staff
- Early October, 2025: OpenAI, Senior
- Early October, 2025: OpenAI, Staff. Candidate note: "Follow-up: 1KB chunk-size-based uploads. Need to make sure a mechanism (perhaps another metadata file?) exists to persist the total number of chunks uploaded so the restore process can read all the chunks correctly."