"Big Data" Research and Education

Big Data @ CSE

From scientific discovery to business intelligence, "Big Data" is changing our world. This revolution is being driven by many factors:

  • A proliferation of sensors
  • More generally, the creation of almost all information in digital form
  • Dramatic cost reductions in storage
  • Dramatic increases in network bandwidth
  • Dramatic cost reductions and scalability improvements in computation
  • Dramatic algorithmic breakthroughs in machine learning and other areas

The opportunities are enormous. So are the challenges, created by entirely new applications and by the relentlessly increasing volume, velocity, and variety of data. "Big Data" puts computer science at the center of advances in every imaginable field.

UW CSE is driving the "Big Data" revolution. Our traditional strengths in data management (Magda Balazinska, Bill Howe, Dan Suciu), machine learning (Pedro Domingos), and open information extraction (Oren Etzioni, Dan Weld) have recently been augmented by key hires in machine learning (Emily Fox, Carlos Guestrin, Ben Taskar) and data visualization (Jeff Heer).

Our efforts are coordinated with those of outstanding researchers in the University of Washington's top-ten programs in Statistics, Biostatistics, and Applied Mathematics, among others. Through the University of Washington eScience Institute (directed by Ed Lazowska) we are integrally involved in ensuring that researchers across the campus have access to cutting-edge approaches to data-driven discovery.

No university offers a more vibrant environment!

 


Education


Starting in September 2013, CSE will offer a new PhD in Computer Science with a specialization in Big Data. This degree is part of a campus-wide effort to provide a multidisciplinary Big Data PhD.

Through University of Washington Educational Outreach, we also offer a certificate program in Data Science, as well as a Data Science MOOC on the Coursera platform.

 


Research


Many CSE research areas tackle problems related to "Big Data," and many of our research projects cut across those areas.

Big Data Management


Extracting knowledge from Big Data is a high-touch business today, requiring a human expert who deeply understands both the application domain and a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but they report that the majority of their time is spent scaling and optimizing relatively basic data manipulation tasks in preparation for the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results (a toy sketch of these steps follows the project list below). The database group in Computer Science & Engineering has a set of projects that focus on Big Data Management and its related challenges. The most relevant projects include:
  1. AstroDB: Since 2008, the DB group has maintained a fruitful collaboration with faculty in the Astronomy department. Astronomy is an exemplar Big Data science, with an array of fascinating challenges and high potential impact. The database group is also an important participant in the UW eScience Institute, through which we collaborate with many other domain scientists across campus.
  2. Myria: Focuses on building a new Big Data Management system that is both fast and flexible, offering it as a cloud service, and addressing the theoretical and systems challenges that this setting raises.
  3. Nuage: Focuses on a variety of technical problems related to Big Data and Cloud Computing, with a special focus on scientific applications.
  4. CQMS: Focuses on the tools that facilitate the use of Big Data systems.
  5. Data Eco$y$tem: Focuses on cloud data markets and other questions at the boundary of pricing and data management.
  6. SQLShare: A cloud-hosted collaborative database-as-a-service platform aimed at automating or eliminating obstacles to using relational databases: installation, configuration, schema design, tuning, data ingest, and even application building. You simply upload your data and immediately start querying and sharing it.
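
To make the preparation steps enumerated above concrete at toy scale, here is a minimal sketch in Python with pandas. The file names, column names, and filter threshold are hypothetical assumptions for illustration; the pipelines these projects actually target are distributed and orders of magnitude larger.

    # A minimal, hypothetical sketch of the data preparation steps described
    # above (identify, clean, filter, join, group, transform, extract features,
    # evaluate). File names, columns, and thresholds are illustrative only.
    import pandas as pd

    events = pd.read_csv("events.csv")            # identify relevant data
    events = events.dropna(subset=["user_id"])    # clean: drop malformed rows
    events = events[events["duration_s"] > 0]     # filter: keep valid durations

    users = pd.read_csv("users.csv")
    joined = events.merge(users, on="user_id")    # join event and user tables

    features = (
        joined.groupby("user_id")                 # group per user
              .agg(total_time=("duration_s", "sum"),    # extract features
                   n_events=("duration_s", "count"))
              .reset_index()
    )
    print(features.describe())                    # evaluate the results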



Big Data Machine Learning

 


Big Data Visualization


The increasing scale and accessibility of digital data provides an unprecedented resource for informing research, public policy, business strategy, and even our personal lives. Yet, acquiring and storing this data is, by itself, of little value. We must make sense of data in order to produce value from it. Turning data into knowledge is a fundamental challenge for both computer systems and user interface research: it requires integrating analysis algorithms with human judgments of the meaning and significance of observed patterns.
 
Our research in big data visualization seeks to enhance our ability to analyze and communicate data through the design of interactive visual analysis tools. We study the perceptual, cognitive, and social factors affecting data analysis in order to enable a broad class of analysts to more effectively work with data: to improve the efficiency and scale at which experts work, and lower the threshold for non-experts. Our work spans new visualization systems, the design of visualization and interaction techniques, and evaluation through both deployment and controlled experiments.
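
One simple instance of "turning data into knowledge" is aggregating a large dataset before rendering it, so that the visual summary stays legible and the tool stays responsive. The sketch below is an illustrative example in Python with matplotlib on synthetic data; it is an assumption for exposition, not one of the group's own visualization systems.

    # A minimal sketch: aggregate one million synthetic records into 50 bins
    # before plotting, a common tactic for keeping visualizations of large
    # data responsive and legible. Purely illustrative; not project code.
    import numpy as np
    import matplotlib.pyplot as plt

    values = np.random.lognormal(mean=3.0, sigma=1.0, size=1_000_000)
    counts, edges = np.histogram(values, bins=50)   # 1M points -> 50 bins

    plt.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
    plt.xlabel("value")
    plt.ylabel("count")
    plt.title("Binned summary of one million synthetic records")
    plt.show()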
 

Big Data Architecture and Programming

We are exploring two distinct directions in architecture, programming, and system support for Big Data. The first asks how emerging non-volatile resistive memory technologies, such as phase-change memory (PCM), can enable new storage systems. The second aims to enable large-scale irregular parallel computation, such as graph analytics, to run well on commodity processors. Crunching large graphs is the basis of many emerging applications, such as social network analysis and bioinformatics. Graph analytics algorithms exhibit little locality and therefore present significant performance challenges. Hardware multithreading systems (e.g., the Cray XMT) show that with enough concurrency, we can tolerate long latencies. Unfortunately, this solution is not available with commodity parts. Our goal is to develop a latency-tolerant system built out of commodity parts and mostly in software.
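
The core idea, that enough outstanding requests can hide long per-access latencies, can be demonstrated with a toy simulation. The sketch below is illustrative Python, not the project's code; the synthetic graph, the latency value, and the concurrency levels are all assumptions.

    # Toy demonstration of latency tolerance through concurrency: each
    # neighbor fetch sleeps to mimic a long-latency remote access, and a
    # frontier expansion is timed with 1 vs. 256 outstanding requests.
    import time
    from concurrent.futures import ThreadPoolExecutor

    LATENCY_S = 0.01  # assumed per-access latency

    def fetch_neighbors(vertex):
        time.sleep(LATENCY_S)  # stand-in for a remote memory access
        return [(vertex * 7 + i) % 1000 for i in range(4)]  # synthetic graph

    def expand(frontier, workers):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return [v for nbrs in pool.map(fetch_neighbors, frontier)
                    for v in nbrs]

    frontier = list(range(256))
    for workers in (1, 256):  # serial vs. highly concurrent
        start = time.time()
        expand(frontier, workers)
        print(f"{workers:4d} outstanding requests: {time.time() - start:.2f}s")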

For more details, please visit our project websites.

Big Data Systems

Big Data challenges also affect many aspects of the operating systems that must drive novel types of machines and workloads. Two representative challenges being tackled in our department are the following:
 
Very large memory machines
 
DRAM prices continue to drop, and simultaneously, new workloads are emerging that combine a lack of locality with large data sets.  Together, these trends argue for building machines with very large amounts of physical memory.  In this project, we are exploring the architectural, OS, and application-level implications of large memory machines.  For example, at any moment, a TLB can only cover a tiny fraction of a 10TB physical address space;  does this matter, and if so, what architectural innovations can help?  As another example, are there any OS data structures or algorithms that scale poorly as the amount of physical memory grows very large?  As a third example, how can we speed up the bulk-loading of terabytes of DRAM as new applications launch, or the machine reboots?
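
A back-of-the-envelope calculation makes the TLB-coverage point concrete. The entry count and page sizes below are illustrative assumptions, not figures for any particular processor.

    # TLB coverage of a 10 TB physical address space, with hypothetical sizes.
    tlb_entries = 1536
    physical_memory = 10 * 1024**4               # 10 TB

    for page_size in (4 * 1024, 2 * 1024**2, 1024**3):   # 4 KB, 2 MB, 1 GB
        coverage = tlb_entries * page_size       # bytes reachable without a miss
        print(f"page size {page_size:>13,} B: TLB covers "
              f"{coverage / physical_memory:.2e} of physical memory")
    # Even with 1 GB huge pages, 1,536 entries reach ~1.5 TB of the 10 TB.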
 
 
Predictable tail latency in large-scale data-parallel systems
 
Distributed and large-scale data-parallel systems are designed to provide high throughput and low average latency.  However, the tail latencies (e.g., 99.9th percentile request latency) in these systems are also important, as stragglers introduced by high tail-latency events can slow down overall progress or impact interactivity.  In this project, we are exploring causes of tail latency across all layers of the stack, including local storage hardware, distributed storage systems, the communication stack, and the operating system.  As an example, we find that individually rare events, such as repartitioning fragments in a distributed storage system, become frequent at large scale, contributing to the latency tail.
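
The arithmetic behind this effect is simple and worth making explicit: if each of n sub-operations independently has probability p of being slow, the chance that at least one straggler slows the whole request grows rapidly with n. A minimal illustration, with assumed numbers:

    # Why the 99.9th percentile matters at scale: the probability that a
    # request with fan-out n hits at least one straggler, assuming each
    # sub-request is independently slow with probability p = 0.001.
    p_slow = 0.001
    for n in (1, 100, 1000):
        p_any = 1 - (1 - p_slow) ** n
        print(f"fan-out {n:5d}: P(at least one straggler) = {p_any:.1%}")
    # fan-out 1: 0.1%; fan-out 100: ~9.5%; fan-out 1000: ~63.2%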

Big Data on the Web

How can a computer accumulate a massive body of knowledge? What will Web search engines look like in ten years?

To address these questions, the Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information, and stores it in a unified knowledge base for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.
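
To make the contrast with traditional, fixed-relation extraction concrete, here is a toy sketch of the triple-extraction idea in Python. The pattern and sentences are hypothetical and vastly simpler than the linguistic analysis the real system performs.

    # Toy Open-IE-style extraction: pull (argument1, relation, argument2)
    # triples from text without a hand-built extractor for each relation.
    # Purely illustrative; the real system is far more sophisticated.
    import re

    PATTERN = re.compile(
        r"^(?P<arg1>[A-Z][\w ]*?) "
        r"(?P<rel>(?:was |is )?\w+(?: in| of| by)?) "
        r"(?P<arg2>[A-Z][\w ]*)\.$"
    )

    sentences = [
        "Ada Lovelace was born in London.",
        "Seattle is located in Washington.",
        "Alexander Fleming discovered Penicillin.",
    ]
    for s in sentences:
        m = PATTERN.match(s)
        if m:
            print((m.group("arg1"), m.group("rel"), m.group("arg2")))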

We have run Open IE over billions of webpages. The results are available at this demo. For more information, visit our project page.


Big Data and Discovery


At the University of Washington eScience Institute we are ushering in a new era of data-driven discovery: Jim Gray's "Fourth Paradigm." An interdisciplinary center rooted in Computer Science & Engineering with participants from across the campus, the eScience Institute strives to position UW at the forefront of both big data techniques and the fields that depend on them.

Last changed Thu, 2013-01-10 15:01