KV Store Serialize/Deserialize
Asked at: OpenAI
DESCRIPTION
Implement a key-value store that can serialize multiple string key-value pairs into a single string and deserialize it back. Both keys and values can contain any characters including delimiters, newlines, and special characters. For example, serializing {"key:1": "val=ue", "k\ney": "v"} should produce a string that can be deserialized back to the exact same dictionary.
Input:
Output:
Explanation: Serialization is the process of converting a dictionary into a string, and deserialization is the process of converting a string back into a dictionary.
Constraints:
- Keys and values are strings that can contain any character (including delimiters, newlines, quotes)
- Serialization must be reversible - deserialize(serialize(data)) == data
- Must handle empty strings and empty dictionaries
Understanding the Problem
The core challenge is choosing a delimiter strategy that won't conflict with arbitrary string content. Simple delimiters like : or , fail when those characters appear in keys or values. Escaping (like \:) works but adds complexity and requires careful handling. The ideal approach eliminates delimiter conflicts entirely.
Building Intuition
A naive approach using delimiters like key:value,key2:value2 breaks when data contains those characters. Length-prefixed encoding solves this by storing length:data for each string - for example, 3:a:b unambiguously means a 3-character string a:b. This format is self-describing and handles any character content.
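To see the ambiguity concretely, here is a small illustrative sketch (not part of the reference solution) contrasting a naive delimiter format with a length-prefixed one:

# Naive delimiter-based encoding: ambiguous when data contains the delimiters.
naive = "key:1:val=ue"            # is the key "key" or "key:1"? There is no way to tell.

# Length-prefixed encoding: the length tells the parser exactly how much to read.
encoded = "5:key:16:val=ue"       # 5 chars -> "key:1", then 6 chars -> "val=ue"

length_end = encoded.index(":")                          # first ':' terminates the length field
length = int(encoded[:length_end])                       # 5
data = encoded[length_end + 1:length_end + 1 + length]   # "key:1"
assert data == "key:1"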
This pattern appears in real systems like Redis protocol (RESP), netstrings, and HTTP chunked encoding. Understanding length-prefixed serialization is essential for building robust data interchange formats that don't require escaping. It's a fundamental technique for handling untrusted or arbitrary binary data.
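For reference, netstrings apply exactly this idea, with a trailing comma as a frame terminator. A minimal encoder sketch, assuming the standard netstring layout of length, colon, data, comma:

def to_netstring(data: bytes) -> bytes:
    # Length in ASCII decimal, ':', the raw bytes, then a trailing ','.
    return str(len(data)).encode() + b":" + data + b","

assert to_netstring(b"hello world!") == b"12:hello world!,"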
Implementation
Serialization Function
The serialize function converts a dictionary into a length-prefixed string format. For each key-value pair, encode as keyLen:keyData:valueLen:valueData. For example, {"ab": "xyz"} becomes 2:ab3:xyz. Iterate through the dictionary, calculate string lengths using len(), and concatenate the formatted segments. Handle empty dictionaries by returning an empty string.
def serialize(kv_dict):
    # An empty dictionary serializes to an empty string.
    if not kv_dict:
        return ""
    result = []
    for key, value in kv_dict.items():
        key_str = str(key)
        value_str = str(value)
        # Each pair is encoded as <keyLen>:<key><valueLen>:<value>.
        segment = f"{len(key_str)}:{key_str}{len(value_str)}:{value_str}"
        result.append(segment)
    return "".join(result)
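A quick check with the dictionary from the problem description (assuming the serialize function above):

data = {"key:1": "val=ue", "k\ney": "v"}
encoded = serialize(data)
# '5:key:16:val=ue4:k\ney1:v' -- embedded colons and newlines need no escaping
print(encoded)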
Deserialization Function
The deserialize function parses the encoded string back into a dictionary. Use a position pointer to track the current parsing location. Read characters until : to extract the length integer, then extract exactly that many characters as data. Repeat this process twice per key-value pair (once for key, once for value). For example, parsing 2:ab3:xyz reads 2, extracts ab, reads 3, extracts xyz, yielding {"ab": "xyz"}.
def deserialize(s):
    kv = {}
    pos = 0
    while pos < len(s):
        # Read the key: length up to ':', then exactly that many characters.
        colon = s.index(':', pos)
        key_len = int(s[pos:colon])
        pos = colon + 1
        key = s[pos:pos + key_len]
        pos += key_len
        # Read the value the same way.
        colon = s.index(':', pos)
        val_len = int(s[pos:colon])
        pos = colon + 1
        val = s[pos:pos + val_len]
        pos += val_len
        kv[key] = val
    return kv
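A round-trip check (assuming the two functions above), including the empty-string edge case:

data = {"key:1": "val=ue", "k\ney": "v", "": ""}
assert deserialize(serialize(data)) == data  # an empty key/value is encoded as "0:" and survives the round trip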
Length Parsing Helper
Create a helper function read_length_and_data(s, pos) that extracts one length-prefixed segment. Starting at position pos, scan forward to find :, parse the integer length, then extract the substring of that exact length. Return both the extracted data and the new position. This encapsulates the core parsing logic and makes deserialization cleaner by handling one segment at a time.
def read_length_and_data(s, pos):
    # Scan to the ':' that terminates the length field.
    colon_pos = s.index(':', pos)
    length = int(s[pos:colon_pos])
    data_start = colon_pos + 1
    data_end = data_start + length
    data = s[data_start:data_end]
    # Return the extracted segment and the position just past it.
    return data, data_end
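With the helper, deserialize collapses to a short loop. A possible refactor with the same behavior as the version above:

def deserialize(s):
    kv = {}
    pos = 0
    while pos < len(s):
        # One segment for the key, one for the value.
        key, pos = read_length_and_data(s, pos)
        value, pos = read_length_and_data(s, pos)
        kv[key] = value
    return kv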
Follow-up Challenge
A common follow-up asks how you would handle large KV stores that exceed memory or payload limits by implementing chunked persistence with a fixed upload size.
Chunked Serialization with Metadata Tracking
The core challenge shifts from serializing the entire KV store as one blob to splitting it into fixed-size chunks (e.g., 1KB each) for upload or storage. This matters because real systems often have size constraints on network packets, API payloads, or storage blocks. You'll need a way to track how many chunks exist so deserialization knows when to stop reading.
The key insight is introducing a metadata file that stores critical information like total_chunks count before writing the actual data chunks. During serialization, you write the metadata first, then split your serialized KV data into 1KB chunks as separate files or blocks. During deserialization, you read the metadata to know exactly how many chunks to fetch and reassemble.
Important considerations include:
- Handling the last chunk, which will likely be smaller than 1KB
- Deciding whether to split on byte boundaries (simple) or respect character boundaries to avoid corrupting multi-byte UTF-8 sequences
- Choosing a naming scheme for chunk files (e.g., data_chunk_0, data_chunk_1) that preserves order
- Ensuring atomic writes of metadata so you never have orphaned chunks
import json

CHUNK_SIZE = 1024  # fixed 1KB upload size

def serialize_chunked(kv_store, base_path):
    # Serialize the whole store, then split the bytes into fixed-size chunks.
    serialized = json.dumps(kv_store)
    data_bytes = serialized.encode('utf-8')
    total_chunks = (len(data_bytes) + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division
    # Write the metadata first so the restore path knows how many chunks to fetch.
    metadata = {'total_chunks': total_chunks}
    with open(f'{base_path}_metadata.json', 'w') as f:
        json.dump(metadata, f)
    for i in range(total_chunks):
        chunk = data_bytes[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        with open(f'{base_path}_chunk_{i}', 'wb') as f:
            f.write(chunk)

def deserialize_chunked(base_path):
    # Read the metadata, then reassemble the chunks in order before decoding.
    with open(f'{base_path}_metadata.json', 'r') as f:
        metadata = json.load(f)
    data_bytes = b''
    for i in range(metadata['total_chunks']):
        with open(f'{base_path}_chunk_{i}', 'rb') as f:
            data_bytes += f.read()
    return json.loads(data_bytes.decode('utf-8'))
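A quick round trip on disk (assuming the two chunked functions above; the temporary directory is only for illustration):

import os
import tempfile

kv_store = {"key:1": "val=ue", "k\ney": "v"}
with tempfile.TemporaryDirectory() as tmp:
    base_path = os.path.join(tmp, "kv")
    serialize_chunked(kv_store, base_path)
    assert deserialize_chunked(base_path) == kv_store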
What We've Learned
- Pattern: Length-prefixed encoding (`length:data`) eliminates delimiter conflicts and handles arbitrary content safely
- Use Case: Essential for serialization protocols (Redis RESP, netstrings), binary data encoding, and robust data interchange
- Implementation: Track position carefully during parsing - read length, extract exact substring, advance position
- Edge Cases: Handle empty strings (`0:`), empty dictionaries, and validate lengths before extraction
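The last bullet is the one most often skipped. A defensive variant of the parsing helper (an optional hardening sketch, not part of the reference solution) might validate lengths like this:

def read_length_and_data_checked(s, pos):
    colon_pos = s.index(':', pos)      # raises ValueError if no length terminator is found
    length = int(s[pos:colon_pos])     # raises ValueError if the length field is not an integer
    data_end = colon_pos + 1 + length
    if length < 0 or data_end > len(s):
        # Truncated or corrupt input should fail loudly rather than return partial data.
        raise ValueError(f"corrupt segment at position {pos}")
    return s[colon_pos + 1:data_end], data_end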
Problems to Practice
- Medium: This problem involves parsing and encoding/decoding strings with special characters and nested structures, similar to serializing/deserializing a KV store. Both require careful string manipulation and handling of delimiters or special characters to preserve data integrity.
- Medium: Building a Trie data structure requires implementing insert and search methods similar to a KV store's put/get operations. Both involve designing efficient data structures for string storage and retrieval with proper handling of string keys.
- Medium: Word Break involves parsing strings and determining valid segmentations, which relates to the deserialization challenge of splitting encoded strings back into key-value pairs. Both require careful string boundary detection and validation.
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
- Early October, 2025: OpenAI, Staff
- Early October, 2025: OpenAI, Senior
- Early October, 2025: OpenAI, Staff. Candidate note: "Follow-up: 1KB chunk-size-based uploads. Need to make sure a mechanism (perhaps another metadata file?) exists to persist the total number of chunks uploaded so the restore process can read all the chunks correctly."