Limited Time Offer:Up to 20% off Hello Interview Premium
Up to 20% off Hello Interview Premium 🎉
Hello Interview
Your Dashboard
System Design
Code
Low Level Design
Behavioral
AI Coding
New
ML System Design
Salary Negotiation
Interview Guides
Blog
System Design
Low Level Design
AI Coding
Behavioral
New
Interview Questions
Success Stories
System Design
Low-Level Design
New
Ask The Community
Discord
Mock Interviews
1:1 Mentorship
Refer a Friend
Pricing
Sign in / Sign up
Search
⌘K
Pricing

Tutor

KV Store Serialize/Deserialize

Asked at:

OpenAI


DESCRIPTION

Implement a key-value store that can serialize multiple string key-value pairs into a single string and deserialize it back. Both keys and values can contain any characters including delimiters, newlines, and special characters. For example, serializing {"key:1": "val=ue", "k\ney": "v"} should produce a string that can be deserialized back to the exact same dictionary.

Input:

{"name": "John:Doe", "city": "New,York"}

Output:

Serialize(dict): "Serialized string" 
Deserialize("Serialized string"): original dict


Explanation: Serialization is the process of converting a dictionary into a string, and deserialization is the process of converting a string back into a dictionary.

Constraints:

  • Keys and values are strings that can contain any character (including delimiters, newlines, quotes)
  • Serialization must be reversible - deserialize(serialize(data)) == data
  • Must handle empty strings and empty dictionaries

Understanding the Problem

The core challenge is choosing a delimiter strategy that won't conflict with arbitrary string content. Simple delimiters like : or , fail when those characters appear in keys/values. Escaping (like \:) works but adds complexity and requires careful handling. Thus an ideal approach would be if we could eliminate delimiter conflicts entirely.

Building Intuition

A naive approach using delimiters like key:value,key2:value2 breaks when data contains those characters. Length-prefixed encoding solves this by storing length:data for each string - for example, 3:a:b unambiguously means a 3-character string a:b. This format is self-describing and handles any character content.

This pattern appears in real systems like Redis protocol (RESP), netstrings, and HTTP chunked encoding. Understanding length-prefixed serialization is essential for building robust data interchange formats that don't require escaping. It's a fundamental technique for handling untrusted or arbitrary binary data.

Common Pitfalls

Using simple delimiters - fails when data contains delimiter characters; use length-prefixing instead

Forgetting to handle empty strings - 0: represents an empty string, must parse correctly

Integer overflow on length parsing - validate length values before allocating/reading data

Off-by-one errors in substring extraction - carefully track position after reading each length:data pair

Implementation

Serialization Function

The serialize function converts a dictionary into a length-prefixed string format. For each key-value pair, encode as keyLen:keyData:valueLen:valueData. For example, {"ab": "xyz"} becomes 2:ab3:xyz. Iterate through the dictionary, calculate string lengths using len(), and concatenate the formatted segments. Handle empty dictionaries by returning an empty string.

Solution
Python
Language
def serialize(kv_dict):
if not kv_dict:
return ""
result = []
for key, value in kv_dict.items():
key_str = str(key)
value_str = str(value)
segment = f"{len(key_str)}:{key_str}{len(value_str)}:{value_str}"
result.append(segment)
return "".join(result)
Deserialization Function

The deserialize function parses the encoded string back into a dictionary. Use a position pointer to track the current parsing location. Read characters until : to extract the length integer, then extract exactly that many characters as data. Repeat this process twice per key-value pair (once for key, once for value). For example, parsing 2:ab3:xyz reads 2, extracts ab, reads 3, extracts xyz, yielding {"ab": "xyz"}.

Solution
Python
Language
def deserialize(s):
kv = {}
pos = 0
while pos < len(s):
colon = s.index(':', pos)
key_len = int(s[pos:colon])
pos = colon + 1
key = s[pos:pos + key_len]
pos += key_len
colon = s.index(':', pos)
val_len = int(s[pos:colon])
pos = colon + 1
val = s[pos:pos + val_len]
pos += val_len
kv[key] = val
return kv
Length Parsing Helper

Create a helper function read_length_and_data(s, pos) that extracts one length-prefixed segment. Starting at position pos, scan forward to find :, parse the integer length, then extract the substring of that exact length. Return both the extracted data and the new position. This encapsulates the core parsing logic and makes deserialization cleaner by handling one segment at a time.

Solution
Python
Language
def read_length_and_data(s, pos):
colon_pos = s.index(':', pos)
length = int(s[pos:colon_pos])
data_start = colon_pos + 1
data_end = data_start + length
data = s[data_start:data_end]
return data, data_end

Follow-up Challenge

A common follow-up question to this problem is how you'd handle large KV stores that exceed memory limits by implementing chunked persistence with a fixed upload size.

Chunked Serialization with Metadata Tracking

The core challenge shifts from serializing the entire KV store as one blob to splitting it into fixed-size chunks (e.g., 1KB each) for upload or storage. This matters because real systems often have size constraints on network packets, API payloads, or storage blocks. You'll need a way to track how many chunks exist so deserialization knows when to stop reading.

The key insight is introducing a metadata file that stores critical information like total_chunks count before writing the actual data chunks. During serialization, you write the metadata first, then split your serialized KV data into 1KB chunks as separate files or blocks. During deserialization, you read the metadata to know exactly how many chunks to fetch and reassemble.

Important considerations include:

  • Handling the last chunk which will likely be smaller than 1KB
  • Deciding whether to split on byte boundaries (simple) or respect character boundaries to avoid corrupting multi-byte UTF-8 sequences
  • Choosing a naming scheme for chunk files (e.g., data_chunk_0, data_chunk_1) that preserves order
  • Ensuring atomic writes of metadata so you never have orphaned chunks
Solution
Python
Language
import json
import os
CHUNK_SIZE = 1024
def serialize_chunked(kv_store, base_path):
serialized = json.dumps(kv_store)
data_bytes = serialized.encode('utf-8')
total_chunks = (len(data_bytes) + CHUNK_SIZE - 1) // CHUNK_SIZE
metadata = {'total_chunks': total_chunks}
with open(f'{base_path}_metadata.json', 'w') as f:
json.dump(metadata, f)
for i in range(total_chunks):
chunk = data_bytes[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
with open(f'{base_path}_chunk_{i}', 'wb') as f:
f.write(chunk)
def deserialize_chunked(base_path):
with open(f'{base_path}_metadata.json', 'r') as f:
metadata = json.load(f)
data_bytes = b''
for i in range(metadata['total_chunks']):
with open(f'{base_path}_chunk_{i}', 'rb') as f:
data_bytes += f.read()
return json.loads(data_bytes.decode('utf-8'))

What We've Learned

  • Pattern: Length-prefixed encoding (`length:data`) eliminates delimiter conflicts and handles arbitrary content safely
  • Use Case: Essential for serialization protocols (Redis RESP, netstrings), binary data encoding, and robust data interchange
  • Implementation: Track position carefully during parsing - read length, extract exact substring, advance position
  • Edge Cases: Handle empty strings (`0:`), empty dictionaries, and validate lengths before extraction

Problems to Practice

Decode String

medium

Stack

This problem involves parsing and encoding/decoding strings with special characters and nested structures, similar to serializing/deserializing a KV store. Both require careful string manipulation and handling of delimiters or special characters to preserve data integrity.

Practice
Implement Trie Methods

medium

Trie

Building a Trie data structure requires implementing insert and search methods similar to a KV store's put/get operations. Both involve designing efficient data structures for string storage and retrieval with proper handling of string keys.

Practice
Word Break

medium

Dynamic Programming

Word Break involves parsing strings and determining valid segmentations, which relates to the deserialization challenge of splitting encoded strings back into key-value pairs. Both require careful string boundary detection and validation.

Practice

Question Timeline

See when this question was last asked and where, including any notes left by other candidates.

Company
​
Level
All Regions
Region

Mid May, 2026

OpenAI

Staff

Early May, 2026

OpenAI

Senior

1. No timestamps and history of values for a given key. 2. For persistence of the serialized data, a mock S3 Bucket class was given. 3. It was important to talk about the KV store lifecycle: can we call serialization twice in a row? And deserialization? Why? And then I'd need to update the code according to my reasoning. So, better not to wait till these questions pop up and talk about these cases proactively.

Late April, 2026

OpenAI

Mid-level

Get Premium to View All 30+ Reports

Your account is free and you can post anonymously if you choose.

Hello Interview Premium

Recent interview questions
System Design Guided Practice
Exclusive content
Learn More
Questions
Meta SWE Interview QuestionsAmazon SWE Interview QuestionsGoogle SWE Interview QuestionsOpenAI SWE Interview QuestionsEngineering Manager (EM) Interview Questions
Learn
Learn System DesignLearn DSALearn BehavioralLearn ML System DesignLearn Low Level DesignGuided Practice
Links
FAQPricingGift PremiumHello Interview Premium
Legal
Terms and ConditionsPrivacy PolicySecurity
Contact
About UsProduct Support

7511 Greenwood Ave North Unit #4238 Seattle WA 98103


© 2026 Optick Labs Inc. All rights reserved.