Marco Serafini

COMPSCI 677 - Spring 20

Project 3

This project is a group project and is due on Apr 24th, 11:59 pm.


Project goals

The goal of this project is to teach you:

  1. How to use replication and caching to improve the performance of a distributed service.
  2. How to package application components as portable docker containers.
  3. How to make a replicated service fault tolerant: failure detection, failover, and recovery.


Project description

Pygmy.com has three NEW books in its catalog:

  1. How to finish Project 3 on time
  2. Why theory classes are so hard
  3. Spring in the Pioneer Valley

These books were added to the catalog during a spring break sale that turned out to be a big success. However, with the growing popularity of the book store, buyers began to complain about the time the system takes to process their requests. You are tasked with rearchitecting the Pygmy.com online store that you built in Lab 2 so that it can handle this higher workload.

This project is based on Project 2. All the requirements of Project 2 still apply to Project 3. This assignment has three required parts and an optional extra-credit part.


Part 1: Replication and caching

In this part, we will add replication and caching to improve request processing latency. While the front-end server in Lab 2 was a very simple component, in this part we will add two types of functionality to the front-end node. First, we will add an in-memory cache that caches the results of recent requests to the servers. Second, assume that both the order and catalog servers are replicated: their code and their database files are replicated on multiple machines (for this part, assume two replicas each for the order and catalog servers). To deal with this replication, the front-end node needs to implement a load balancing algorithm that takes each incoming request and sends it to one of the replicas. You may use any load balancing algorithm, such as round-robin or least-loaded, and can do load balancing on a per-request basis or a per user-session basis.

The front-end node is NOT replicated. It receives all incoming requests from clients. Upon receiving a request, it first checks the in-memory cache to see if the request can be satisfied from the cache. The cache stores the results of recently accessed lookup queries (i.e., book id, number of items in stock, and cost) locally. When a new query request comes in, the front-end server checks the cache before it forwards the request to the catalog server. Note that caching is only useful for read requests (queries to the catalog); write requests, which are basically orders or update requests to the catalog, must be processed by the order or catalog servers rather than the cache.

You can implement the cache in one of two ways: it can run as a separate component from the front-end server, in which case you will need to use REST calls to get and put items from and to the cache; or the in-memory cache can be integrated into the front-end server process, in which case internal function calls are used to get and put items into the cache.
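
To make the read path concrete, here is a minimal Python sketch of how the front-end might combine the cache lookup with round-robin load balancing. The replica addresses, the /query/<id> endpoint, and the cache capacity are placeholders, not part of the assignment.

    # Minimal sketch of the front-end read path, assuming the cache lives inside
    # the front-end process. Replica addresses, the /query/<id> endpoint, and the
    # cache capacity are placeholders for whatever your Lab 2 design uses.
    import itertools
    from collections import OrderedDict

    import requests

    CATALOG_REPLICAS = ["http://127.0.0.1:5001", "http://127.0.0.1:5002"]
    replica_cycle = itertools.cycle(CATALOG_REPLICAS)   # round-robin load balancing

    CACHE_CAPACITY = 100
    cache = OrderedDict()   # book_id -> query result, kept in LRU order

    def query_book(book_id):
        # Serve from the in-memory cache if possible.
        if book_id in cache:
            cache.move_to_end(book_id)          # refresh its LRU position
            return cache[book_id]
        # Cache miss: forward the query to the next catalog replica.
        replica = next(replica_cycle)
        result = requests.get(f"{replica}/query/{book_id}").json()
        # Insert the result, evicting the least recently used entry if full.
        cache[book_id] = result
        if len(cache) > CACHE_CAPACITY:
            cache.popitem(last=False)
        return result

Integrating the cache into the front-end process, as in this sketch, avoids an extra REST hop; if you instead run the cache as a separate component, the dictionary operations above become REST get and put calls.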

Cache consistency needs to be addressed whenever a database entry is updated by buy requests or by the arrival of new stocks of books. To ensure strong consistency guarantees, you should implement a server-push technique in which the backend replicas send invalidate requests to the in-memory cache prior to making any writes to their database files. The invalidate request causes the data for that item to be removed from the cache. Feel free to add other caching features, such as a limit on the number of items in the cache, which will then need a cache replacement policy such as LRU to replace older items with newer ones.

The replicas should also use an internal protocol to ensure that any writes to their database are also performed at the other replica to keep them in sync with one another.
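
The two paragraphs above together define the write path. A possible sketch in Python, assuming hypothetical /invalidate and /replicate REST endpoints (you are free to design your own interface):

    # Sketch of the write path at a catalog replica. The /invalidate and
    # /replicate endpoints and the addresses below are placeholders.
    import requests

    FRONT_END = "http://127.0.0.1:5000"     # front-end node holding the cache
    PEER_REPLICA = "http://127.0.0.1:5002"  # the other catalog replica

    def decrement_stock_in_db(book_id):
        # Placeholder for your existing Lab 2 logic that updates the database file.
        pass

    def handle_buy(book_id, propagate=True):
        # 1. Server-push invalidation: remove the stale entry from the cache
        #    before this replica modifies its database file.
        requests.delete(f"{FRONT_END}/invalidate/{book_id}")
        # 2. Apply the write to this replica's local database file.
        decrement_stock_in_db(book_id)
        # 3. Propagate the write to the other replica so the two stay in sync.
        #    The peer handles it with propagate=False to avoid an endless loop.
        if propagate:
            requests.post(f"{PEER_REPLICA}/replicate",
                          json={"op": "buy", "book_id": book_id})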

Like in Lab 2, all components use REST APIs to communicate with one another.


Part 2: Dockerize your application

Pygmy.com has decided to use container technology to make it easy to deploy its application on its servers. From our virtualization lectures, we learned about OS containers and docker. Docker is a tool that enables you to package each application component as a portable docker container. Your goal is to package each of your components: the front-end server, the cache server (if it is a separate micro-service from the front-end server), the catalog server, and the order server as docker containers. There are many tools available on the Internet to take the code for each component and package it as a container image. You are free to use any such tool to create your docker container images (also feel free to use piazza to discuss specifics of a tool or share your ideas).
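
One practical detail when containerizing the components is that they can no longer rely on hard-coded addresses for one another. A common approach, sketched below in Python with made-up variable names, is to read the addresses of the other components from environment variables so that the same container image can be wired up when it is started:

    import os

    # Addresses of the backend replicas, injected at container start time;
    # the variable names and default values here are only examples.
    CATALOG_REPLICAS = os.environ.get(
        "CATALOG_REPLICAS", "http://catalog1:5001,http://catalog2:5001").split(",")
    ORDER_REPLICAS = os.environ.get(
        "ORDER_REPLICAS", "http://order1:5002,http://order2:5002").split(",")
    LISTEN_PORT = int(os.environ.get("PORT", "5000"))

The image can then be configured at run time, for example with docker run -e CATALOG_REPLICAS=... -p 5000:5000 <your-frontend-image>.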

Once you have created a docker version of your app, you should also upload it to the lab 3 github repository (in a docker directory for lab 3), so that you can use docker to directly download it from github and deploy it on any machine. It is not necessary to test the dockerized version on EdLab, but it should at least run on your local machine.


Part 3: Fault tolerance

In this part of the lab, we will make the application fault tolerant. To do so, you need to add new functionality to the front-end node to detect backend node failures. We assume that the order or catalog server replicas can fail at any time and that failures are crash failures (not Byzantine failures). The front-end node needs to use heartbeat messages, or any other method of your choosing, to detect such failures. Upon detecting a failure, the front-end node then uses its load balancing algorithm to forward all subsequent requests to the other, non-faulty replica.
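
One possible way to structure the failure detector is a background heartbeat thread in the front-end that periodically probes each replica; the /ping endpoint below is a placeholder that your servers would need to expose.

    # Sketch of heartbeat-based failure detection in the front-end. The /ping
    # endpoint and replica addresses are placeholders.
    import threading
    import time

    import requests

    replica_alive = {
        "http://127.0.0.1:5001": True,
        "http://127.0.0.1:5002": True,
    }

    def heartbeat_loop(interval=1.0, timeout=0.5):
        while True:
            for replica in replica_alive:
                try:
                    requests.get(f"{replica}/ping", timeout=timeout)
                    replica_alive[replica] = True
                except requests.exceptions.RequestException:
                    # No response within the timeout: treat the replica as crashed
                    # and stop routing requests to it until it responds again.
                    replica_alive[replica] = False
            time.sleep(interval)

    threading.Thread(target=heartbeat_loop, daemon=True).start()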

Be sure to track requests that were in progress when the failure occurred; these in-progress requests should be re-issued to the non-faulty replica to get a proper response. Finally, you also need to implement a recovery process where the failed replica is restarted after a crash and then uses a resync method to synchronize its database state with the non-faulty replica so that the two are again in sync.
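
A simple way to realize the resync step, assuming a hypothetical internal /dump endpoint on the non-faulty replica, is sketched below; you are free to design a more incremental protocol.

    # Resync step that a restarted replica could run before serving requests
    # again. The /dump endpoint is a placeholder for an internal call that
    # returns the peer's current database contents.
    import requests

    PEER_REPLICA = "http://127.0.0.1:5002"

    def write_local_db(snapshot):
        # Placeholder: persist the snapshot however your catalog stores its data.
        pass

    def resync_from_peer():
        # Fetch the peer's authoritative state and overwrite the local database,
        # so writes that happened while this replica was down are not lost.
        snapshot = requests.get(f"{PEER_REPLICA}/dump").json()
        write_local_db(snapshot)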


Part 4 (optional extra credit): Consensus using RAFT

This part is optional and may be attempted for extra credit. It can take significant effort, and you should attempt it only if the rest of your lab is complete and in good shape. Assume that the order server is replicated on three nodes. Implement the RAFT consensus protocol, which uses state machine replication, so that all replicas order incoming writes and apply them to the database in the same order. This ensures that race conditions do not occur where concurrent incoming orders arrive at two different replicas and get applied at the replicas in different orders. You will need to implement an on-disk log as part of RAFT's state machine replication. You will further need to show that the failure of an order replica does not prevent the others from making progress, since a majority of the replicas (2 out of 3) are still up.
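
As a starting point for the on-disk log, a Python sketch of the log-append step (only this step, not a full RAFT implementation) might look like the following, assuming one JSON entry per line in a local log file:

    # Only the append-to-disk step of RAFT's replicated log, not a full
    # implementation. Assumes one JSON entry per line in a local log file.
    import json
    import os

    LOG_PATH = "raft_log.jsonl"

    def append_log_entry(term, index, command):
        # Each entry records the leader's term, its position in the log, and the
        # write command (e.g. a buy request) to be applied to the database once
        # the entry is committed on a majority of replicas.
        entry = {"term": term, "index": index, "command": command}
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())   # force the entry to disk before acknowledging

    # Example: the leader appends an incoming order before replicating it.
    append_log_entry(term=3, index=42, command={"op": "buy", "book_id": 1})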


Evaluation and Measurements

Lab 3 no longer requires you to write test cases, although you should feel free to use them to verify that your code works as expected. However, lab 3 still requires you to do performance measurements or experiments to evaluate specific functionality.

  1. Compute the average response time (query/buy) of your new system as before. What is the response time with and without caching? How much does caching help? (A small timing sketch appears after this list.)
  2. Construct a simple experiment that issues orders or catalog updates (i.e., database writes) to invalidate the cache and maintain cache consistency. What is the overhead of cache consistency operations? What is the latency of a subsequent request that sees a cache miss?
  3. Construct an experiment to show that your fault tolerance works as you expect. Start your system with no failures, introduce a crash fault, and show that the fault is detected and that the other replica can mask it. Also be sure to show the process of recovery and resynchronization.
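
For experiment 1, a simple client-side timing loop is enough; the sketch below assumes a /query/<id> endpoint on the front-end, and should be run once with caching enabled and once with it disabled.

    # Client-side timing loop for experiment 1; the front-end address and the
    # /query/<id> endpoint are placeholders for your own interface.
    import time

    import requests

    FRONT_END = "http://127.0.0.1:5000"

    def average_query_latency(book_id=1, n=100):
        total = 0.0
        for _ in range(n):
            start = time.time()
            requests.get(f"{FRONT_END}/query/{book_id}")
            total += time.time() - start
        return total / n

    print(f"average query latency: {average_query_latency() * 1000:.2f} ms")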

Make necessary plots to support your conclusions.


What You Will Submit

When you have finished implementing the complete assignment as described above, you will submit your solution to github. We expect that you have used github throughout for source code development for this lab; please use github to turn in all of the following (in addition to your code):

  1. Source code with inline comments/documentation.
  2. A copy of the output generated by running your program. This is similar to lab 2. Be sure to include output for each part of the lab.
  3. A separate document of approximately two to three pages describing the overall program design, a description of “how it works”, and the design tradeoffs considered and made. Also describe possible improvements and extensions to your program (and sketch how they might be made). You also need to describe clearly how we can run your program; if we can’t run it, we can’t verify that it works. Please submit the design document and the output in the docs directory of your repository.
  4. Include one single script that automatically compiles your code, deploys it on multiple EdLab machines, and runs all your tests and experiments on those machines. You can assume that the graders will have set up passwordless SSH to run your script.
  5. Performance results of your measurements/experiments should be included in the docs directory. Provide the results/data as simple graphs or tables with a brief explanation.

Grading policy for all programming assignments