Systems for Data Science (COMPSCI 532)
In this course, students will learn the fundamentals behind large-scale systems used for data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These issues include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including map-reduce and data analytics systems like Hadoop and Apache Spark, graph databases, stream processing systems, and systems for machine learning.
Class meetings: Tuesday/Thursday 2:30pm-3:45pm, Morrill Science Center Building II, 222
TA: Nathan Ng (email@example.com); Office hours: Tuesday 4:30-5:30 PM @ CS 207
Piazza (for Q&A): https://piazza.com/umass/fall2019/compsci532/home
Prerequisites: COMPSCI 311, COMPSCI 345, and COMPSCI 377.
Required Texts: This is an emerging topic so we will read and review recent technical papers, which represent the reading material for the exams. The course slides will be made available before exams. Here are some general background resources.
The course consists of two meetings per week. Each meeting includes a lecture. Readings will be assigned as preparation for each class meeting. Two to three projects will be assigned during the course. The projects give students an opportunity to explore the topics in more depth and in a specialized domain. A midterm exam and a final exam will be given. Attendance is mandatory. Exams and projects will be prepared assuming that students have attended all classes.
Course grades will be distributed as follows (subject to change):
- Midterm: 20%
- Final exam: 30%
- Projects: 50%
In order to pass the exam, you will need to show sufficient performance in all these activities.
Two to three coding projects will be assigned. Projects will be performed in groups of two people. Here is a more detailed description of the mechanics of forming a group and returning a project.
Project 1 - due October 10, 11:59PM
Project 1 evaluation: The evaluation of the project will consist of a 30-minute individual oral session in which each student will demo their implementation. Students will bring their laptop to the session, clone their code from GitHub into a fresh directory during the session, run the code, and answer questions on the implementation. Students are responsible for making sure that everything runs correctly on their laptop before the demo session.
Demo sessions will be held on November 1, 4, and 8. Slots can be booked by following the invitation on Piazza. First come, first served.
Project 2 - due November 14, 11:59PM
Project 2 evaluation: The evaluation will be similar to that of Project 1. Information about the demo session has been published on Piazza.
Project 3 - due December 10, 11:59PM
The exams will contain questions about the papers read during the course and the in-class discussions. The slides contain only a subset of the material and are not sufficient on their own to answer the exam questions.
Midterm: October 22 at 7-9pm in ILC S331
Final: December 18 at 3:30-5:30 PM, Location TBA
Course Schedule (subject to change)
- September 3 (Lecture #1) slides
- Course introduction
- Paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf)
- September 5 (Lecture #2) slides
- Background: Locality, parallelism, fault tolerance
- September 10 (Lecture #3) slides
- Data-parallel systems: MapReduce
- Paper: MapReduce: Simplified Data Processing on Large Clusters (pdf)
- September 12 (Lecture #4) slides
- Data-parallel systems: Spark
- Paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (pdf)
- September 17 (Lecture #5) slides
- Stream processing
- Paper: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (pdf)
- Paper: Apache Flink: Stream and Batch Processing in a Single Engine (pdf)
- Paper: Discretized Streams: Fault-Tolerant Streaming Computation at Scale (pdf). Sections 1 to 3 are mandatory, the rest is optional.
- September 19 (Lecture #6) slides
- Distributed database management systems
- Paper: The State of the Art in Distributed Query Processing. Sections 1-2 are mandatory, the rest is optional. (pdf)
- Paper: Join Processing in Relational Databases. Sections 1, 2, 5 are mandatory, the rest is optional. (pdf)
- Paper: Distributed Join Algorithms on Thousands of Cores (pdf)
- September 24 (Lecture #7) slides
- September 26 (Lecture #8) slides
- October 1 (Lecture #9) slides
- October 3
- Putting it together: Hadoop tutorial
- October 8 (Lecture #10) slides
- Graph databases
- Optional reading: Cypher: An Evolving Query Language for Property Graphs (pdf).
- October 10 (Lecture #11) slides
- Graph mining
- Paper: Arabesque: A System for Distributed Graph Mining. (pdf)
- Project 1 due
- October 15: No Class - UMass Monday
- October 17
- Recap class
- October 22
- No class
- October 22: Midterm
- 7-9pm in ILC S331
- October 24 (Lecture #12) slides
- Understanding performance: Memory management
- October 29
- Putting it together: Distributed programming tutorial
- October 31 (Lecture #13) slides
- Background: Scalability and Replication
- November 5 (Lecture #14) slides
- Data storage and file systems
- Paper: The Google File System. (pdf)
- November 7 (Lecture #15) slides
- Key-value stores
- Paper: Dynamo: Amazon’s Highly Available Key-value Store. (pdf)
- November 12 (Lecture #16) slides
- Distributed key-value stores
- November 14 (Lecture #17) slides
- Resource management
- Paper: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. (pdf)
- Project 2 due
- November 19 (Lecture #18) slides
- Cloud analytics
- Paper: Occupy the Cloud: Distributed Computing for the 99%. (pdf)
- November 21 (Lecture #19) slides
- November 26: No class - Thanksgiving
- November 28: No class - Thanksgiving
- December 3 (Lecture #20) slides
- Machine learning: TensorFlow
- Paper: TensorFlow: A System for Large-Scale Machine Learning (pdf)
- December 5 (Lecture #21) slides
- Machine learning pipelines
- Paper: Clipper: A Low-Latency Online Prediction Serving System (pdf)
- December 10
- Recap class
- December 18: Final exam
Laptops, tablets, phones and electronic device policy
Cell phones should be switched off or put on silent alert during class lectures. Texting or using phones for other purposes (e.g., email, social media, web browsing) during class is strictly prohibited. Laptops and tablets are NOT permitted during lectures. The use of such devices in class tends to be a distraction and hampers learning. Please respect this policy by not using laptops or tablets during the lecture. Any student with an electronic device that disrupts the class or violates this policy will lose 2 points from their final grade.
The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.
Academic Honesty Statement
Since the integrity of the academic enterprise of any institution of higher education requires honesty in scholarship and research, academic honesty is required of all students at the University of Massachusetts Amherst. Academic dishonesty is prohibited in all programs of the University. Academic dishonesty includes but is not limited to: cheating, fabrication, plagiarism, and facilitating dishonesty. Appropriate sanctions may be imposed on any student who has committed an act of academic dishonesty. Instructors should take reasonable steps to address academic misconduct. Any person who has reason to believe that a student has committed academic dishonesty should bring such information to the attention of the appropriate course instructor as soon as possible. Instances of academic dishonesty not related to a specific course should be brought to the attention of the appropriate department Head or Chair. Since students are expected to be familiar with this policy and the commonly accepted standards of academic integrity, ignorance of such standards is not normally sufficient evidence of lack of intent (http://www.umass.edu/dean_students/codeofconduct/acadhonesty/).