"Big Data" Research and Education
Big Data @ CSE
From scientific discovery to business intelligence, "Big Data" is changing our world. This revolution is being driven by many factors:
- A proliferation of sensors
- More generally, the creation of almost all information in digital form
- Dramatic cost reductions in storage
- Dramatic increases in network bandwidth
- Dramatic cost reductions and scalability improvements in computation
- Dramatic algorithmic breakthroughs in machine learning and other areas
The opportunities are enormous. So are the challenges, created by entirely new applications and by the relentlessly increasing volume, velocity, and variety of data. "Big Data" puts computer science at the center of advances in every imaginable field.
UW CSE is driving the "Big Data" revolution. Our traditional strength in data management (Magda Balazinska, Bill Howe, Dan Suciu), machine learning (Pedro Domingos), and open information extraction (Oren Etzioni, Dan Weld) has recently been augmented by key hires in machine learning (Emily Fox, Carlos Guestrin, Ben Taskar) and data visualization (Jeff Heer).
Our efforts are coordinated with those of outstanding researchers in the University of Washington's top-ten programs in Statistics, Biostatistics, and Applied Mathematics, among others. Through the University of Washington eScience Institute (directed by Ed Lazowska) we are integrally involved in ensuring that researchers across the campus have access to cutting-edge approaches to data-driven discovery.
No university offers a more vibrant environment!
Starting in September 2013, CSE will offer a new PhD in Computer Science with specialization in Big Data. This degree is part of a campus-wide effort to provide a multi-disciplinary Big Data PhD.
Many CSE research areas tackle problems related to "Big Data" and many of our research projects cross over between areas.
Big Data Management
Extracting knowledge from Big Data is a high-touch business today, requiring a human expert who deeply understands the application domain as well as a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but report that the majority of their time is spent scaling and optimizing the relatively basic data manipulation tasks that precede the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. The database group in Computer Science & Engineering has a set of projects that focus on Big Data Management and the related challenges. The most relevant projects include:
Big Data Machine Learning
Big Data Visualization
The increasing scale and accessibility of digital data provide an unprecedented resource for informing research, public policy, business strategy, and even our personal lives. Yet acquiring and storing this data is, by itself, of little value. We must make sense of data in order to produce value from it. Turning data into knowledge is a fundamental challenge for both computer systems and user interface research: it requires integrating analysis algorithms with human judgments of the meaning and significance of observed patterns.
Our research in big data visualization seeks to enhance our ability to analyze and communicate data through the design of interactive visual analysis tools. We study the perceptual, cognitive, and social factors affecting data analysis in order to enable a broad class of analysts to more effectively work with data: to improve the efficiency and scale at which experts work, and lower the threshold for non-experts. Our work spans new visualization systems, the design of visualization and interaction techniques, and evaluation through both deployment and controlled experiments.
Big Data Architecture and Programming
We are exploring two distinct directions in architecture, programming, and system support for Big Data. The first examines how emerging non-volatile resistive memory technologies such as phase-change memory (PCM) can enable new storage systems. The second aims to enable large-scale irregular parallel computation, such as graph analytics, to run well on commodity processors. Crunching large graphs is the basis of many emerging applications, such as social network analysis and bioinformatics. Graph analytics algorithms exhibit little locality and therefore present significant performance challenges. Hardware multithreading systems (e.g., Cray XMT) show that with enough concurrency, we can tolerate long latencies. Unfortunately, this solution is not available with commodity parts. Our goal is to develop a latency-tolerant system built out of commodity parts and mostly in software.
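The claim that enough concurrency can hide long latencies can be made concrete with Little's law, a standard queueing result (used here for illustration, not taken from the project itself): the number of requests that must be in flight equals the target throughput multiplied by the latency of each request. The sketch below is a back-of-the-envelope calculation; the function name and the latency and throughput figures are assumptions chosen for illustration, not measurements.

```python
def required_concurrency(latency_ns: float, requests_per_sec: float) -> float:
    """Little's law: requests in flight = throughput * latency."""
    return requests_per_sec * latency_ns * 1e-9


# Assumed, illustrative numbers (not from the source):
# a 2-microsecond remote memory access and a target of
# 100 million random accesses per second.
in_flight = required_concurrency(latency_ns=2_000, requests_per_sec=100e6)
print(f"outstanding requests needed: {in_flight:.0f}")
```

Under these assumed numbers, hundreds of independent accesses must be outstanding at once, which is why workloads with little locality need far more concurrency than a few hardware threads per core provide.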
For more details, please visit our project websites:
Big Data Systems
Big Data challenges are also affecting many aspects of the operating systems that must drive novel types of machines and workloads. Our department is tackling two representative challenges:
Very large memory machines
DRAM prices continue to drop, and simultaneously, new workloads are emerging that combine a lack of locality with large data sets. Together, these trends argue for building machines with very large amounts of physical memory. In this project, we are exploring the architectural, OS, and application-level implications of large memory machines. For example, at any moment, a TLB can only cover a tiny fraction of a 10TB physical address space; does this matter, and if so, what architectural innovations can help? As another example, are there any OS data structures or algorithms that scale poorly as the amount of physical memory grows very large? As a third example, how can we speed up the bulk-loading of terabytes of DRAM as new applications launch, or the machine reboots?
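To give a rough sense of the TLB coverage question, the sketch below computes what fraction of a 10TB physical address space a TLB can map at one time. The entry count is a hypothetical value chosen for illustration, the page sizes are common x86-64 options, and 10TB is treated as 10 TiB for simplicity; none of these figures come from the project.

```python
TIB = 1 << 40
MEMORY_BYTES = 10 * TIB      # "10TB" from the text, treated as 10 TiB
ASSUMED_TLB_ENTRIES = 1024   # hypothetical entry count, for illustration only


def tlb_coverage(entries: int, page_bytes: int, memory_bytes: int) -> float:
    """Fraction of physical memory the TLB can map at one time."""
    reach = entries * page_bytes  # bytes covered by all entries together
    return min(1.0, reach / memory_bytes)


for label, page_bytes in [("4 KiB", 4 << 10), ("2 MiB", 2 << 20), ("1 GiB", 1 << 30)]:
    fraction = tlb_coverage(ASSUMED_TLB_ENTRIES, page_bytes, MEMORY_BYTES)
    print(f"{label} pages: {fraction:.6%} of memory covered")
```

Under these assumptions, 4 KiB pages cover well under a thousandth of a percent of memory, and even 1 GiB pages cover only about a tenth of it.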
Predictable tail latency in large-scale data-parallel systems
Distributed and large-scale data-parallel systems are designed to provide high throughput and low average latency. However, the tail latencies (e.g., 99.9th percentile request latency) in these systems are also important, as stragglers introduced by high tail-latency events can slow down overall progress or impact interactivity. In this project, we are exploring causes of tail latency across all layers of the stack, including local storage hardware, distributed storage systems, the communication stack, and the operating system. As an example, we find that individually rare events, such as repartitioning fragments in a distributed storage system, become frequent at large scale, contributing to the latency tail.
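A short calculation shows why individually rare events come to dominate at scale. If a single request exceeds the 99.9th-percentile latency with probability 0.001, and a job fans out to many servers and waits for all of them, the chance that at least one request is slow grows quickly with the fan-out. The sketch below assumes that slow events are independent across servers, which is a simplification, and the function name and fan-out values are illustrative.

```python
def prob_any_slow(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent requests is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


P_SLOW = 0.001  # a request slower than the 99.9th percentile

for fanout in (1, 10, 100, 1000):  # illustrative fan-out values
    print(f"fan-out {fanout:>4}: {prob_any_slow(P_SLOW, fanout):.1%} chance of a straggler")
```

With a fan-out of 1000, roughly 63% of jobs would wait on at least one straggler under these assumptions, even though only one request in a thousand is slow.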
Big Data on the Web
How can a computer accumulate a massive body of knowledge? What will Web search engines look like in ten years?
To address these questions, the Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information, and stores it in a unified knowledge base for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.
Big Data and Discovery
At the University of Washington eScience Institute we are ushering in a new era of data-driven discovery: Jim Gray's "Fourth Paradigm." An interdisciplinary center rooted in Computer Science & Engineering with participants from across the campus, the eScience Institute strives to position UW at the forefront of both big data techniques and the fields that depend on them.