Wendy Chen
CPSC 426 Distributed Systems, Fall 2016
The following service description is adapted from the lab specification:
Service description
To add durability, the graph service logs each transaction to a persistent disk. While the graph is still stored in main memory as it was in lab1, each mutating client request is now logged. The client can also checkpoint the graph, which serializes the graph and dumps it to another partition of the persistent disk. When the graph is checkpointed, the log can be erased as well. Generation numbers are attached to both the checkpoint and the log in order to distinguish which transactions are not yet contained in a checkpoint. Because a checkpoint may exist, the server first tries to load the graph from the checkpoint when it starts, and only then replays any transactions stored in the log.
Starting with lab2, Google Cloud is used to host the graph service. The graph service runs on a Linux VM with an extra persistent disk attached.
The graph service has the same API described in lab1, with additional error codes. Mutating client requests (add_node, add_edge, remove_node, remove_edge) return 507 if the log is full, indicating that a checkpoint is necessary.
The checkpoint command is as follows:
Function | Arguments | Return |
---|---|---|
checkpoint | n/a | 200 on success; 507 if there is insufficient space for the checkpoint |
To start the server, the user should run:
./cs426_graph_server <port> <devfile>
The user may specify any port they desire, but the device file should be /dev/sdb.
To format the disk (increment the generation number), the user should run:
./cs426_graph_server -f <port> <devfile>
The -r flag is also supported; it randomizes the disk, which effectively erases the log and the checkpoint. The -r flag is provided only for testing purposes.
Compilation Details
The current Makefile targets Linux, so that the graph service can run on the VMs created on Google Cloud. This lab continues to use Mongoose.
Implementation Details
Introducing durability involves logging and checkpointing on a persistent disk. These new functions can be found in log.cpp and checkpoint.cpp. The mutating functions in api.cpp are updated to log a transaction before calling a graph method.
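As a rough sketch of that ordering (the function names and signatures here are illustrative, not the actual api.cpp interfaces):

```cpp
#include <cstdint>

// Placeholder declarations -- the real definitions live in graph.cpp / log.cpp.
struct Graph { bool add_node(uint64_t id); };
bool log_transaction(uint32_t opcode, uint64_t node_a, uint64_t node_b);

// Sketch of the write-ahead ordering used by the mutating handlers:
// the transaction is logged before the in-memory graph is touched.
int handle_add_node(Graph& graph, uint64_t node_id) {
  if (!log_transaction(/*opcode=*/0, node_id, /*node_b=*/0))
    return 507;              // log is full: a checkpoint is required
  graph.add_node(node_id);   // apply the mutation only after it is durable
  return 200;
}
```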
This lab deals with three main new areas: reading and writing to a disk, recording transactions in a log, and serializing the graph for checkpointing.
To read and write to the persistent disk, we divide it into smaller blocks. Because we want to store both log transactions and a checkpoint, we partition the disk into two sections. The block that begins each section is referred to as a superblock, and holds header information about the log or checkpoint itself. Our disk size is 10GB, and we divide it into blocks of size 4KB; the last 2GB are reserved for the checkpoint. To access the disk, we mmap it into memory.
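A minimal sketch of how the device might be mapped and partitioned, assuming the whole disk is mmapped at once (the constants follow the layout above; whether the real code maps it this way is an implementation choice):

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr uint64_t kBlockSize      = 4096;                        // 4KB blocks
constexpr uint64_t kDiskSize       = 10ULL * 1024 * 1024 * 1024;  // 10GB device
constexpr uint64_t kCheckpointSize = 2ULL * 1024 * 1024 * 1024;   // last 2GB
constexpr uint64_t kLogSize        = kDiskSize - kCheckpointSize; // log section

// Map the whole device: log block i starts at base + i * kBlockSize, and the
// checkpoint section (with its own superblock) starts at base + kLogSize.
char* map_device(const char* devfile) {
  int fd = open(devfile, O_RDWR);
  if (fd < 0) return nullptr;
  void* base = mmap(nullptr, kDiskSize, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  close(fd);  // the mapping remains valid after the fd is closed
  return base == MAP_FAILED ? nullptr : static_cast<char*>(base);
}
```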
For the log, the information in the superblock takes the form:
byte 0: u64 checksum
byte 8: u32 current_generation
byte 12: u32 start_of_log
byte 16: u32 size_of_log
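Expressed as a packed C++ struct (the struct name is illustrative; the field widths and offsets follow the layout above):

```cpp
#include <cstdint>

// On-disk layout of the log superblock (20 bytes at the start of its 4KB block).
struct __attribute__((packed)) LogSuperblock {
  uint64_t checksum;            // byte 0
  uint32_t current_generation;  // byte 8
  uint32_t start_of_log;        // byte 12
  uint32_t size_of_log;         // byte 16
};
static_assert(sizeof(LogSuperblock) == 20, "must match the on-disk layout");
```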
Each 4KB block containing log entries will have the structure:
byte 0: u64 checksum
byte 8: u32 generation_number
byte 12: u32 num_entries
byte 16: first 20-byte entry
byte 36: second 20-byte entry
...
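The per-block header can be sketched the same way; at 16 bytes it leaves room for (4096 - 16) / 20 = 204 entries per block:

```cpp
#include <cstdint>

// Header at the start of each 4KB log block; 20-byte entries follow at byte 16.
struct __attribute__((packed)) LogBlockHeader {
  uint64_t checksum;           // byte 0
  uint32_t generation_number;  // byte 8
  uint32_t num_entries;        // byte 12
};
static_assert(sizeof(LogBlockHeader) == 16, "entries must start at byte 16");
```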
Each log entry will take the form:
byte 0: u32 opcode
byte 4: u64 node_a_id
byte 12: u64 node_b_id
The opcodes are as follows: 0 for add_node, 1 for add_edge, 2 for remove_node, 3 for remove_edge.
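And the 20-byte entry itself, with the opcodes as an enum (names are illustrative):

```cpp
#include <cstdint>

enum Opcode : uint32_t {
  kAddNode    = 0,
  kAddEdge    = 1,
  kRemoveNode = 2,
  kRemoveEdge = 3,
};

// One 20-byte log entry; node_b_id is presumably ignored for the node operations.
struct __attribute__((packed)) LogEntry {
  uint32_t opcode;     // byte 0
  uint64_t node_a_id;  // byte 4
  uint64_t node_b_id;  // byte 12
};
static_assert(sizeof(LogEntry) == 20, "must match the on-disk layout");
```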
The functions in log.cpp read and write the persistent disk and record log entries. When the current block is full, they also update the superblock and allocate a new block for the log.
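A rough sketch of that append path, assuming the structs sketched above and a pointer to the start of the log section (the real logic in log.cpp may differ, e.g. in how the checksum is computed and when the superblock is updated):

```cpp
#include <cstdint>
#include <cstring>

constexpr uint32_t kEntriesPerBlock = (4096 - 16) / 20;  // 204

// Append one entry to the tail block of the log; returns false when the log
// section is out of blocks, which the API layer reports as a 507.
bool log_append(char* log_base, uint32_t num_log_blocks,
                uint32_t& tail_block, const LogEntry& entry) {
  auto* hdr = reinterpret_cast<LogBlockHeader*>(log_base + tail_block * 4096ULL);
  if (hdr->num_entries == kEntriesPerBlock) {
    if (tail_block + 1 >= num_log_blocks) return false;  // log full: checkpoint
    ++tail_block;                                        // allocate the next block
    hdr = reinterpret_cast<LogBlockHeader*>(log_base + tail_block * 4096ULL);
    std::memset(hdr, 0, 4096);
    // (the real code would also set the generation number here and grow
    //  size_of_log in the superblock)
  }
  std::memcpy(reinterpret_cast<char*>(hdr) + 16 + hdr->num_entries * 20,
              &entry, sizeof(entry));
  hdr->num_entries += 1;
  hdr->checksum = 0;  // placeholder: recompute the block checksum here
  return true;
}
```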
Checkpointing functions are located in checkpoint.cpp. The checkpoint superblock contains the following information:
byte 0: u64 checksum
byte 8: u32 generation
byte 12: num_blocks
byte 16: serial_size
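As a packed struct; note that the layout above gives no widths for num_blocks or serial_size, so u32 (implied by the byte-12/byte-16 offsets) and u64 are assumed here:

```cpp
#include <cstdint>

struct __attribute__((packed)) CheckpointSuperblock {
  uint64_t checksum;     // byte 0
  uint32_t generation;   // byte 8
  uint32_t num_blocks;   // byte 12: presumably the number of 4KB checkpoint blocks
  uint64_t serial_size;  // byte 16: presumably the byte size of the serialized graph
};
```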
The graph is serialized as a stream of unsigned 64-bit integers. First, we write the total number of nodes in the graph. Then, for each node, we write its id, followed by the number of neighbors it has, followed by the ids of each of its neighbors.
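A sketch of that encoding over an adjacency-set representation (the Graph alias here is a stand-in for the lab1 graph class):

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Graph = std::unordered_map<uint64_t, std::unordered_set<uint64_t>>;

// Encoding: [node_count], then per node: [node_id] [neighbor_count] [neighbor_id ...].
std::vector<uint64_t> serialize_graph(const Graph& graph) {
  std::vector<uint64_t> out;
  out.push_back(graph.size());                 // total number of nodes
  for (const auto& [node, neighbors] : graph) {
    out.push_back(node);                       // the node itself
    out.push_back(neighbors.size());           // how many neighbors it has
    for (uint64_t neighbor : neighbors)
      out.push_back(neighbor);                 // each neighbor's id
  }
  return out;  // this stream is what gets written to the checkpoint section
}
```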
Test script
The provided test script (mytest.sh) tests that reading and writing to the disk across multiple 4KB blocks works. To test the durability of the graph, the user should manually restart the server, then check that the graph has persisted by querying it with read requests.
Links:
Home: Demo video and Abstract
First iteration: Local in-memory graph
Second iteration: Durable on-disk graph
Third iteration: Chain-replicated graph
Fourth iteration: Partitioned graph
Source code: Each graph is stored under its appropriate branch (lab1, lab2, lab3, lab4)