Technical posts by Arif Khan: B+ tree data structure explained

It seems there is a lot of confusion among most of developers on data structure used to create indexes in database systems.

Most of the people think indexes in database systems are created using B trees, in reality, B+ tree is the structure used for indexes.

Tables in database systems are only logical structures, in my opinion, it is very important to know the physical data structure used to store data, developers can write efficient code when they know about structure used to store data.

I thought to put a blog post using typical sample data usually stored in database tables, and index it using B+ tree. This post doesn't have code, intention is to provide high level explanation on B+ tree.

Let's take sample data and start building index using B+ tree.

Above is the sample employee data, Employee ID (EID) is the key in data set.
B+ tree is combination of B tree (Index set) and Sequential data set (this is where data pages are sequentially stored).

I'll introduce properties of B tree as we make progress in index creation.

Follow these steps carefully.
1) Insert data for EID (Employee ID) 12 ...

Allocate a block to store employee data, assume this block is 8 KB in size.
Save record for Eid=12 in the allocated block as shown below.

This 8 KB block is called data page, which is "+" part of B+ tree.
Let's assume each record in our sample data takes 4 KB, which means maximum two records can be stored in a data page.

Now is the time to introduce important property of a data page.
Data page is also a node with some capacity order, capacity order is the minimum number of records that can be stored in a node. In our sample, data page has capacity order of one, minimum one record and maximum two records can be stored.
If capacity order of a node is "x" , then minimum number of records in a node are "x" and maximum are "2x" and node cannot be empty.

We need a data structure that points us to correct data page when index key is used, this data structure is B tree. Let's create a B tree with index key 12 and connect it to data page.

Allocate a block of 8 KB in size and store ONLY index key (12) in it, this is a index node, 8 KB block can store many values as it contains only keys in the index node.

Here is a B+ tree with index node and data page. Note: Blue color index node is also 8 KB in size. Index node in blue color in the below image is a single node B tree and data page is "+" part of B+ tree. Left pointer points to data page containing data for employees with ID less than or equal to 12.

To keep it simple, our index node has capacity order of two, which means minimum two keys and maximum four keys in a node, except root node.
In case of B tree, root node is allowed to have single value, all non-root nodes should have number of keys at-least equivalent to it's capacity order (two in our case for index node).

In our sample, capacity order of data page is one and capacity order of index node is two.

2)Insert record with EID = 1 from our sample set into B+ tree.

Record with EID=1 goes into existing page , there is no change to index pointer.
Data page becomes full with this insertion.

3) Insert next two records with ID's 14 & 13 sequentially.
Here is how tree looks like after two insertions, I'm showing in a single image, insertions happens one after another.

Image of the data page is reduced due to space constraints, othercolumns section represents all columns except key from our sample for each employee.

Existing data page was filled with insertion in step (2), new data page is created to add new records.
These two data pages are linked sequentially in a sorted order of the index key.
Index node is updated with right side pointer pointing to data page containing records with keys greater than 12.

4) Insert data for EID =4.
To keep the data ordered in data pages, record with EID=4 should go into first page, first page is already full, it need a split. When a page split happens, existing records are split into two halves and higher half is moved into new page.

Here is how it looks after splitting first page and new insertion.

Record with EID=12 is moved to new page and EID=4 data is inserted into first page.
Key values in index are adjusted to point to correct data pages.
Pointer to the left of 4 points to data page containing keys less than or equal to 4.
Pointer to the right of 4 and left of 12 points to data page containing keys greater than 4 and less than or equal to 12.
Pointer to the right of 12 points to data page containing keys greater than 12.

5) Follow similar process and insert data for EID's 6,7,8,3 & 5 sequentially from our sample set.
Here is how B+ tree looks after these insertions...

At this stage, root node is full.
Let's see what happens when new record is added.

6) Insert data for EID = 2
Data for EID=2 should go into first page, which is full.
Page split happens, data for EID=3 is moved to new page.

With the addition of new data page, root node should be adjusted, root node at this stage is full.
Capacity order of root node is two, which means it can contain only up to four values.
Above image shows how root node looks like if it can contain five values.
Root node should be split, middle key 5 in green should be made as root, key 5 goes up , keys 2 & 3 which are smaller than 5 should be to the left of root (5) and keys 7 & 12 should be to the right of root (5).

Here is how it looks with page split on root node.

In the above image, nodes enclosed in a orange color triangle forms B tree, which is also called Index set.
Data pages linked sequentially in the logical order of index keys is called Sequential set.
Combination of Index set(B tree) and sequential set of data pages is called B+ tree.

This completes creation of B+ tree using our sample data. If you look at it, tree was built Bottom-Up.
When a tree is implemented, pointers are nothing but the addresses of storage blocks on a disk (you can imagine these as offsets in a file).

In a relational database systems terminology, when someone says leaf level pages of a clustered index are data pages,it means, data page from a sequential set of B+ tree, as we seen in our sample tree.

Let's see how we can use the tree to query data, here are couple of queries...

Fetch single record
Q1) What's the SSN, Address,City and State of employee with EID = 8?
A) Start at root node, 8 is higher than key (5) in the root node. Take pointer to the right of key 5 in the root node and go one level down to the node with keys 7 & 12. EID=8 is greater than 7 and less than 12, take pointer to the right of 7 and go to page containing records for EID's 8 & 12. Load this page into memory and look for record with EID=8, fetch the required columns (data) from the record and return those to client.

Fetch multiple records using range scan.
Q2)What's the SSN, Address,City and State of employees with EID's between 4 and 12 (inclusive)?
A) There are multiple ways to fetch this data, let's use most commonly used access path.

Range scan :
Starting value in the range is 4, access page containing record for EID=4, sequentially load pages starting from page containing EID=4 until we hit a page with EID greater than 12, pages are sequentially ordered, we can stop page load once we see a page with key greater than 12.

In our sample, start at root node, starting value in the range 4 is smaller than key 5 in the root node, take pointer to the left of 5 and go one level down to the index node containing keys 2 & 3. Take pointer to the right of 3 and reach page containing data for EID=4, start loading data pages sequentially until you see page with value greater than 12 (i.e. last page in our sample tree). Process the pages loaded into memory by searching for records with EID's between 4 & 12 and return those to client.

Read ahead operation: Here I would like to mention about read ahead operation, most of the relational databases fetches referenced page and also some adjacent pages along with referenced one. This will avoid additional trips to disk in case of range scan and improves performance, provided data is not fragmented, i.e. physical and logical ordering of data pages are same. You cannot take advantage of read ahead when data is fragmented.

Read ahead operation in relational databases is similar to spatial locality used while accessing data from main memory.

Some other characteristics of a B+ tree.

Index node is also a block of keys, number of keys that can fit in a index block depends on size of node, it's a 8 KB block in our sample. Many keys can fit in a index page and occurrence of page splits in index nodes is less when compared to data pages. This property keeps height of B+ tree short and under control.
Presence of key in index node doesn't guarantee presence of data in data pages.When we search for EID=0 in our sample B+ tree, index set takes us to first data page, but first data page doesn't contain EID=0. You need to load the data page by traversing thru the tree and search for key you are interested in the data page.
New page that gets created due to page split may be allocated far away from the page that's being split, but get's connected in a logical order. This causes random reads and affects performance.
Keeping data pages physically adjacent to each other in a logical order of keys improves performance.
B+ tree enables faster range scans.
Most of the database systems connects index nodes and data pages using doubly linked list.
If index set is in memory, loading data page from disk will be faster operation. Keeping root node of index in memory improves performance, root node is probed in all operations.
B+ tree is a Bottom-Up tree.

There are various ways to create a system that enables faster access of data from secondary storage, B+ tree is commonly used data structure in the implementation of RDBMS systems.
You can create your own storage system that suits your needs.
Here is a white paper on a storage system called Haystack implemented by Facebook to store their photos.
http://www.stanford.edu/class/cs240/readings/haystack.pdf
Their implementation is simple, I liked it so much and read it probably three times.

You don't need to use relational databases if it doesn't suit your application, create your own storage system, which can be as simple as a file divided into blocks with an index and you have full control to tweak as it suits your needs.

If you like this post, please do not forget to share it on social networks by clicking buttons below.

Friday, January 10, 2014

B+ tree data structure explained