Create a Database of 3000 Lines with 0 Dependencies | Blog

AljwadhJanuary 19, 2025

0 1,589 5 minutes read

01. Complex systems are built from simple ideas

Complex software such as databases, compilers, and browsers are treated like black boxes. You use it every day as a usersbut you may not understand them as a programmeralthough they are nothing but code. Why?

They have little in common with the day-to-day work of programmers.
Their code bases are so large, so debilitating.

But that doesn’t mean you can’t learn these things. People make small, toy versions of these things for learning purposes, such as C in 4 Functions, Translators in the Making, Web Browser Engineering. It turns out that if you minimize the uninteresting details and focus on the essentials, you can recreate seemingly complex software and learn its main ideas.

For compilers, some of the main ideas include:

Parsing objects with recursion.
Represent a program as a tree and simulate it.
Represents a program as a list of instructions with gotos.
Stack machine or register machine.

None of these ideas are complicated. So a compiler is a practical and useful exercise for programmers.

02. What are the core ideas of databases?

I built a small database of 3000 lines
from scratch Go to learn the core ideas of databases. It’s not complicated if you approach it one way.

2.1 Loss of power atomicity

A database stores data on disk, but why not just use files? Why is a filesystem not a database? Because databases eliminate the common source of data corruption: partially written files due to crashes or power outages.

Video games warn you not to turn off the power while saving, because a partially written save file will destroy all your progress. Building a database will teach you how to eliminate this concern. The solution is a add-only log with checksum for each object.

Append-only means that writes do not destroy existing data.
The checksum means that partial writes are detected and discarded, which increases the atomic power loss.

╔═══╤════╦═══╤════╦══╌╌╌══╦═══╤════╗
║sum│data║sum│data║more...║sum│data║
╚═══╧════╩═══╧════╩══╌╌╌══╩═══╧════╝
                          ╰───┬────╯
                    possible partial write

These are the first 2 ideas. But they are not enough to create a database.

2.2 Index data with data structures

Logs have uses for systems like Kafka, but you can’t just put everything in a log and forget about it. You need to find things in it, so you need data structures like B + trees, LSM-trees, hashtables. That’s where I started, I built a B+tree of 366 lines.

But if you just put a data structure on disk, you have a partial write problem. Is it possible to apply log-based ideas to data structures? There are 2 ideas:

The first idea is to create an append-only data structure, like a log. There are ways to create a B+tree append-only, also called a
copy-on-write B+tree. A copy-on-write tree does not overwrite existing nodes, it creates new nodes from leaf to root for the entire path. New nodes can be added like a log.

    d           d         D*
   / \         / \       / \
  b   e  ──►  b   e  +  B*  e
 / \         / \       / \
a   c       a   c     a   C*
            original  updated

Another idea is to use a log and a main data structure:

Before updating the main data structure, save the intended updates to the log. Each log table contains a “write this data to that position” list.
Then apply the updates to the main data structure.
When the database starts, it always uses the last log table to fix a possible partial write to the main data structure, regardless of its state.

I chose the copy-on-write B+ tree because it does not require a log, which reduces the LoC. Transferring the B+ tree to disk involves interfacing with the filesystem, including fsync(). At 601 lines, I have an append-only KV store on disk.

2.3 Recycle and reuse unused space

I need to take care of the append-only result because the storage is not infinite. I just added a free recycle list of unused B+tree nodes. At 731 lines, I have a functional KV store.

                     first_item
                         ↓
list_head ─► ( next |    xxxxx )
                ↓
             ( next | xxxxxxxx )
                ↓
list_tail ─► ( NULL | xxxx     )
                         ↑
                     last_item

2.4 Relational DB in KV

An SQL query can be arbitrarily complex, but it boils down to 2 data structure operations: point query and range query. So I started with B+tree data structures. Databases are built on data structures, not some theory about relationships.

I added a layer on top of KV to store records with primary keys and columns. Each record is encoded as a KV pair. LoC went up to 1107. But still only KV. So I added 2 more parts:

Range question, 1294 lines.
Secondary index, 1438 lines.

I’m not bothering with SQL at this point because SQL is just a user interface. I’m building APIs for databases called a storage engine.

2.5 Concurrent control

I can ignore concurrency and make everything serialized. But using a copy-on-write B+ tree means that readers get snapshot isolation for free, because updates don’t break old versions. So the readers will not be blocked by the writer.

I went one step further to allow write transactions to run concurrently, and then merge the updates at the end of the transaction. This requires identifying and rejecting conflicting updates. In 1702 lines, I have a transactional interface.

2.6 SQL-like query language

My user interface is just a set of APIs to operate the database. The next step is the query language. I chose an SQL-like syntax. I’m only implementing simple queries with a point/range query, so no query planner.

SQL parsing is not a database topic, but a compiler topic. This is done by using recursion in a certain way, which is hardly an idea if you know it.

A query language is not just about data, it can perform calculations, such as
SELECT a + b FROM c. So I need a translator.

Then I need a lot of glue code between the query language and the database. The lines of code increased from 1702 to 2795, most of them uninteresting.

03. Database of 3000 lines, incrementally

This is my minimalist attempt to create a database from scratch. Each step adds an important feature while minimizing uninteresting details.

366	B+tree data structure.
601	Just add KV.
731	Practical KV with free list.
1107	KV tables.
1294	Range questions.
1438	Second indexes.
1461	Transaction interfaces.
1702	Concurrency control.
2795	Like SQL query language.

The lessons are like a book

I made it a BOOK so you can follow my steps. The first half of the book is free online and covers the most interesting parts of the storage engine, which can be used as a KV on its own.

3000 lines is not much, but it is an effective exercise. The first edition only took me 2 months of free time. The book is updated every time I learn something new, so this is the second edition.

https://build-your-own.org/byo_logo_256.png

2025-01-16 16:59:00

AljwadhJanuary 19, 2025

0 1,589 5 minutes read

Create a Database of 3000 Lines with 0 Dependencies | Blog

01. Complex systems are built from simple ideas

02. What are the core ideas of databases?

2.1 Loss of power atomicity

2.2 Index data with data structures

2.3 Recycle and reuse unused space

2.4 Relational DB in KV

2.5 Concurrent control

2.6 SQL-like query language

03. Database of 3000 lines, incrementally

The lessons are like a book

Aljwadh

Leave a Reply Cancel reply

Elon Musk agrees with Tweet saying Americans aren’t smart enough for tech jobs

Apple Allows Support for Satellite T-Mobile and Starlink in the iPhone

Lamar Kendrick will appear in Synth Riders experience on Apple Pro vision

The 2024 Movie Monster State of the Union

Thousands of people are evacuating in LA as wildfires and extreme winds hit Southern California

Republican Congressman Jim Jordan asks Big Tech if Biden was sent to A.I.

Ryan Reynolds and Andrew Garfield Are Game to Return as Deadpool and Spider-Man

Your Dishwasher Is Gross. Here’s How to Clean It

Apple Music expands its live radio offerings with three new stations

Ready Player Me’s Player Zero sees momentum for Web3 collectible avatars

The 33 Best Shows on Apple TV+ Right Now (December 2024)

01. Complex systems are built from simple ideas

02. What are the core ideas of databases?

2.1 Loss of power atomicity

2.2 Index data with data structures

2.3 Recycle and reuse unused space

2.4 Relational DB in KV

2.5 Concurrent control

2.6 SQL-like query language

03. Database of 3000 lines, incrementally

The lessons are like a book

Aljwadh

This trillion dollar Artificial Intelligence (AI) Chip Stock is up 110% in the last year. Could A Stock Split Be On The Horizon?

NZ-WU19 vs NGR-WU19 Dream11 Prediction Today Match 11 U19 Women's World Cup T20 2025

Related Articles

FIELDCRAFT, SUBLIVAL, AND HEART: ALT.ANYMOUS.MESSAGE NEWSGROUP

Github – Directvt / VTM: Text-based desktop

Microsoft’s new dragons are AI Assistant for Healthcare

Or is the COBOL default in 1875-05-20 for corrupts or missing dates?

Leave a Reply Cancel reply

Republican Congressman Jim Jordan asks Big Tech if Biden was sent to A.I.

Ryan Reynolds and Andrew Garfield Are Game to Return as Deadpool and Spider-Man

Your Dishwasher Is Gross. Here’s How to Clean It

Apple Music expands its live radio offerings with three new stations

Ready Player Me’s Player Zero sees momentum for Web3 collectible avatars

The 33 Best Shows on Apple TV+ Right Now (December 2024)