Systems for Data Science

COMPSCI 590S

In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including MapReduce and data analytics systems like Hadoop and Apache Spark.

Class meetings: Tuesday/Thursday 2:30pm-3:45pm, Integrated Science Building (ISB) 221

TA: John Fallon (jtfallon@umass.edu); Office hours: Tuesday 11 am - 1 pm @ CS311, Cube 2

Prerequisites: COMPSCI 311, COMPSCI 345, and COMPSCI 377.

Credits: 3

Required Texts: This is an emerging topic so we will read and review recent technical papers.

Course Format

The course consists of two meetings per week. Each meeting includes a lecture. Readings will be assigned as preparation for each class meeting. Two projects will be assigned during the course. The projects provide students with an opportunity to explore the topics in more depth and in a specialized domain. A midterm exam and a final exam will be given. Grades will be determined by a combination of projects, exam scores, and class participation.

Course grades will be distributed as follows (subject to change):

  • Reviews and class participation: 10%
  • Midterm: 20%
  • Final exam: 30%
  • Projects: 40%

In order to pass the course, you will need to show sufficient performance in all of these activities.

Reviews

Before each class, you must read the paper(s) associated with the lecture and post a review on this review submission site. You need an invitation to access the system; please contact me if you have not yet received one. Reviews must be entered before the class associated with the paper. The deadline is 11 pm on the day before the class.

Advice on writing systems reviews (for conferences) is available here.

Projects

There will be two coding projects assigned. You may discuss high-level topics (understanding the requirements of the assignment) and low-level technical details (how to use certain language constructs, e.g. threads) with your colleagues, but not the specifics of the assignment (how to design the system required by the assignment).

We will employ reliable tools for detecting plagiarism in the code. These tools are robust to things like variable renaming and reordering of instructions.

The exams will include questions on the technical details of the projects. These questions will be difficult to answer if you have not written the code yourself. Unsatisfactory answers to these questions will result in a lower score for the project assignments.

Project 1 - due October 22, 11:59 PM. Wordcount is famously the “Hello, world!” of many data science platforms (e.g., MapReduce and Spark). Your first project is to implement a distributed, fault-tolerant version of wordcount in Java.
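For orientation only, the computation that wordcount performs can be sketched on a single machine as a tokenize step followed by a per-word sum. This is a minimal, non-distributed, non-fault-tolerant illustration, not the project's required design; the class and method names (WordCountSketch, tokenize, count) are hypothetical and are not part of any course-provided API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine sketch; names are illustrative assumptions.
public class WordCountSketch {
    // "Map"-like step: split one line into lowercase words.
    static List<String> tokenize(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // "Reduce"-like step: group identical words and count each group.
    // A TreeMap keeps the result sorted by word for readable output.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> tokenize(line).stream())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello, world!", "hello again");
        System.out.println(count(lines)); // prints {again=1, hello=2, world=1}
    }
}
```

A distributed version would partition the input across workers and merge the per-worker counts; how to do that, and how to tolerate worker failures, is the subject of the project.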

Project 2 - due November 26, 11:59 PM. For this project, you will implement the core functionality of a graph analytics system based on Pregel / Giraph. The API is explicitly modeled on Pregel’s.
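As a rough illustration of the vertex-centric model that Pregel popularized, the sketch below runs synchronous supersteps on a single machine: in each superstep an active vertex reads its incoming messages, possibly updates its value, sends messages to its neighbors, and votes to halt; a halted vertex is reactivated when a message arrives. The example propagates the maximum value through a small graph. Everything here (PregelSketch, Vertex, run, chain) is a hypothetical simplification and is neither Pregel's actual API nor the API used in the project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of vertex-centric supersteps.
public class PregelSketch {
    static class Vertex {
        final int id;
        int value;
        final List<Integer> neighbors = new ArrayList<>();
        boolean halted = false;
        Vertex(int id, int value) { this.id = id; this.value = value; }
    }

    // Builds an undirected path 0 - 1 - ... - (n-1) with the given vertex values.
    static Map<Integer, Vertex> chain(int... values) {
        Map<Integer, Vertex> graph = new TreeMap<>();
        for (int i = 0; i < values.length; i++) graph.put(i, new Vertex(i, values[i]));
        for (int i = 0; i + 1 < values.length; i++) {
            graph.get(i).neighbors.add(i + 1);
            graph.get(i + 1).neighbors.add(i);
        }
        return graph;
    }

    // Max-value propagation: stops when all vertices have halted and no messages remain.
    static Map<Integer, Integer> run(Map<Integer, Vertex> graph) {
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (int superstep = 0; ; superstep++) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            boolean anyActive = false;
            for (Vertex v : graph.values()) {
                List<Integer> messages = inbox.getOrDefault(v.id, new ArrayList<>());
                if (v.halted && messages.isEmpty()) continue; // stays inactive
                anyActive = true;
                boolean changed = (superstep == 0); // everyone announces its value once
                for (int m : messages) {
                    if (m > v.value) { v.value = m; changed = true; }
                }
                if (changed) {
                    for (int n : v.neighbors) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(v.value);
                    }
                }
                v.halted = true; // vote to halt; a later message reactivates the vertex
            }
            if (!anyActive) break;
            inbox = outbox; // messages are delivered at the next superstep
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Vertex v : graph.values()) result.put(v.id, v.value);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(chain(3, 6, 2))); // prints {0=6, 1=6, 2=6}
    }
}
```

A real system additionally partitions vertices across workers and exchanges messages over the network between supersteps.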

Exams

Midterm: October 25, 7-9 PM in GOES 20

Final: December 20, 3:30-5:30 PM in Goessmann Lab Addition, room 64

Accommodation Statement

The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.

Academic Honesty Statement

Since the integrity of the academic enterprise of any institution of higher education requires honesty in scholarship and research, academic honesty is required of all students at the University of Massachusetts Amherst. Academic dishonesty is prohibited in all programs of the University. Academic dishonesty includes but is not limited to: cheating, fabrication, plagiarism, and facilitating dishonesty. Appropriate sanctions may be imposed on any student who has committed an act of academic dishonesty. Instructors should take reasonable steps to address academic misconduct. Any person who has reason to believe that a student has committed academic dishonesty should bring such information to the attention of the appropriate course instructor as soon as possible. Instances of academic dishonesty not related to a specific course should be brought to the attention of the appropriate department Head or Chair. Since students are expected to be familiar with this policy and the commonly accepted standards of academic integrity, ignorance of such standards is not normally sufficient evidence of lack of intent (http://www.umass.edu/dean_students/codeofconduct/acadhonesty/).