"Big Data" Research and Education

Big Data @ CSE

From scientific discovery to business intelligence, "big data" is changing our world. The dissemination of nearly all information in digital form, the proliferation of sensors, breakthroughs in machine learning and visualization, and dramatic improvements in cost, bandwidth, and scalability are combining to create enormous opportunity. The field also presents enormous challenges, thanks to the relentless increase in the volume, velocity, and variety of information ripe for mining and analysis.

UW CSE faculty and students are driving the "Big Data" revolution. We combine depth and breadth of expertise in data management, data visualization, sensing, open information extraction, and systems and networking. We have emerged as one of the world's preeminent centers for machine learning research.

Our efforts span the campus. The University of Washington eScience Institute - recipient of a $38.7 million award from the Moore and Sloan Foundations (joint with Berkeley and NYU), a $9.3 million award from the Washington Research Foundation, and a $2.8 million NSF IGERT award - positions UW at the forefront of data-intensive discovery.

No university offers a more vibrant environment than the University of Washington, and no program offers more opportunity to shape this field than UW CSE.

Learn more below.


Education


CSE offers a PhD in Computer Science with a specialization in Big Data. This degree is part of a campus-wide effort to provide a multi-disciplinary Big Data PhD.

In partnership with the University of Washington eScience Institute, we are also engaged in a variety of other initiatives in data science education.


Research


Many CSE research areas tackle problems related to "Big Data," and many of our research projects cross over between areas.

Big Data Management


Extracting knowledge from Big Data is a high-touch business today: it requires a human expert who deeply understands the application domain as well as a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but they report that the majority of their time is spent scaling and optimizing relatively basic data manipulation tasks in preparation for the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results (a sketch of such a pipeline appears after the list below). The database group in Computer Science & Engineering has a set of projects that focus on Big Data management and its related challenges. The most relevant projects include:
  1. AstroDB: Since 2008, the DB group has maintained a fruitful collaboration with faculty in the Astronomy department. Astronomy is an exemplar Big Data science, with an array of fascinating challenges and high potential impact. The database group is also an important participant in the UW eScience Institute, through which we collaborate with other domain scientists across campus.
  2. Myria: Focuses on building a new Big Data management system that is both fast and flexible, offering that system as a cloud service, and addressing both the theoretical and systems challenges of Big Data management as a cloud service. See also our earlier Nuage project.
  3. CQMS: Focuses on tools that facilitate the use of Big Data systems.
  4. Data Eco$y$tem: Focuses on cloud data markets and other questions at the boundary of pricing and data management.
  5. SQLShare: A cloud-hosted collaborative database-as-a-service platform aimed at automating or eliminating obstacles to using relational databases: installation, configuration, schema design, tuning, data ingest, and even application building. You simply upload your data and immediately start querying and sharing it.
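
To make the preparation steps above concrete, here is a minimal sketch of such a pipeline in pandas. The tables, columns, and thresholds are hypothetical, chosen only to illustrate the cleaning, filtering, joining, grouping, and feature-extraction steps the paragraph lists; none of this is code from the projects themselves.

    import pandas as pd

    # Hypothetical inputs: raw sensor readings plus a sensor-metadata table.
    readings = pd.DataFrame({
        "sensor_id": [1, 1, 2, 2, 3],
        "value":     [0.9, 1.1, 5.0, None, 2.2],
    })
    sensors = pd.DataFrame({
        "sensor_id": [1, 2, 3],
        "region":    ["north", "north", "south"],
    })

    prepared = (
        readings
        .dropna(subset=["value"])        # cleaning: drop missing readings
        .query("value < 4.0")            # filtering: discard outliers
        .merge(sensors, on="sensor_id")  # joining: attach metadata
        .groupby("region")["value"]      # grouping by region
        .agg(["mean", "count"])          # feature extraction: summary stats
    )
    print(prepared)

Each step corresponds to one of the preparation tasks above; at Big Data scale, the research challenge is making exactly these steps fast, declarative, and automatic.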



Big Data Machine Learning


UW is one of the world's top centers of research in machine learning. We are active in most major areas of ML and in a variety of applications like natural language processing, vision, computational biology, the Web, and social networks.

You can find more information on the ML page.


Big Data Visualization


The increasing scale and accessibility of digital data provides an unprecedented resource for informing research, public policy, business strategy, and even our personal lives. Yet, acquiring and storing this data is, by itself, of little value. We must make sense of data in order to produce value from it. Turning data into knowledge is a fundamental challenge for both computer systems and user interface research: it requires integrating analysis algorithms with human judgments of the meaning and significance of observed patterns.
 
Our research in big data visualization seeks to enhance our ability to analyze and communicate data through the design of interactive visual analysis tools. We study the perceptual, cognitive, and social factors affecting data analysis in order to enable a broad class of analysts to more effectively work with data: to improve the efficiency and scale at which experts work, and lower the threshold for non-experts. Our work spans new visualization systems, the design of visualization and interaction techniques, and evaluation through both deployment and controlled experiments.
 

Big Data Architecture and Programming

We are exploring two distinct directions in architecture, programming, and system support for Big Data. The first asks how emerging non-volatile resistive memory technologies, such as phase-change memory (PCM), can enable new storage systems. The second aims to make large-scale irregular parallel computation, such as graph analytics, run well on commodity processors. Crunching large graphs is the basis of many emerging applications, such as social network analysis and bioinformatics, but graph analytics algorithms exhibit little locality and therefore present significant performance challenges. Hardware multithreading systems (e.g., the Cray XMT) show that with enough concurrency, we can tolerate long latencies; unfortunately, that solution is not available with commodity parts. Our goal is to develop a latency-tolerant system built out of commodity parts and mostly in software.
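
How much concurrency is enough? Little's law gives a back-of-the-envelope answer: the number of operations that must be in flight equals the target throughput times the latency to be hidden. The sketch below is only an illustration with assumed numbers, not measurements from the project.

    # Little's law: in-flight operations = target throughput x latency.
    # Illustrative numbers only, not measurements from the project.

    def concurrency_needed(ops_per_sec: float, latency_sec: float) -> float:
        """Concurrent operations required to sustain a rate despite latency."""
        return ops_per_sec * latency_sec

    # Hiding 1 microsecond of average remote latency while sustaining
    # 100 million edge traversals per second takes ~100 in-flight operations.
    print(concurrency_needed(100e6, 1e-6))  # -> 100.0

This is why hardware systems like the Cray XMT keep many threads in flight per core, and why a commodity software system must find equivalent concurrency elsewhere, for example by batching and aggregating remote operations.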

For more details, please visit our project websites.

Big Data Systems

Big Data challenges also affect many aspects of the operating systems that must drive novel types of machines and workloads. Two representative challenges being tackled in our department are the following:
 
Very large memory machines
 
DRAM prices continue to drop, and simultaneously, new workloads are emerging that combine a lack of locality with large data sets.  Together, these trends argue for building machines with very large amounts of physical memory.  In this project, we are exploring the architectural, OS, and application-level implications of large memory machines.  For example, at any moment, a TLB can only cover a tiny fraction of a 10TB physical address space;  does this matter, and if so, what architectural innovations can help?  As another example, are there any OS data structures or algorithms that scale poorly as the amount of physical memory grows very large?  As a third example, how can we speed up the bulk-loading of terabytes of DRAM as new applications launch, or the machine reboots?
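
The TLB-coverage question can be made concrete with some rough arithmetic. The TLB size and page size below are assumptions typical of recent x86 cores, not a statement about any particular machine.

    # Rough arithmetic: how much of a 10 TB physical address space can a
    # TLB reach? Entry count and page size are illustrative assumptions.

    TLB_ENTRIES = 1536        # assumed second-level TLB size
    PAGE_SIZE = 2 * 2**20     # assume 2 MB huge pages
    PHYS_MEM = 10 * 2**40     # 10 TB of physical memory

    reach = TLB_ENTRIES * PAGE_SIZE
    print(f"TLB reach: {reach / 2**30:.1f} GiB")         # -> 3.0 GiB
    print(f"fraction of 10 TB: {reach / PHYS_MEM:.3%}")  # -> 0.029%

Even with huge pages, the TLB covers well under a tenth of a percent of the address space, which is what motivates the architectural questions above.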
 
Predictable tail-latency in large-scale data parallel systems
 
Distributed and large-scale data-parallel systems are designed to provide high throughput and low average latency.  However, the tail latencies (e.g., 99.9th percentile request latency) in these systems are also important, as stragglers introduced by high tail-latency events can slow down overall progress or impact interactivity.  In this project, we are exploring causes of tail latency across all layers of the stack, including local storage hardware, distributed storage systems, the communication stack, and the operating system.  As an example, we find that individually rare events, such as repartitioning fragments in a distributed storage system, become frequent at large scale, contributing to the latency tail.
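
A small calculation shows why the tail matters at scale. If a slow event is individually rare, a user-facing operation that fans out to many servers in parallel will still hit it frequently; the numbers below are illustrative assumptions, not measurements from the project.

    # Probability that a fan-out operation waits on at least one straggler.
    # Illustrative numbers, not measurements from the project.

    def prob_any_slow(p_slow: float, fanout: int) -> float:
        """Chance that at least one of `fanout` parallel requests is slow."""
        return 1.0 - (1.0 - p_slow) ** fanout

    # If 0.1% of individual requests are slow (the 99.9th percentile),
    # a fan-out of 1000 leaves ~63% of operations waiting on a straggler.
    print(f"{prob_any_slow(0.001, 1000):.1%}")  # -> 63.2%

This is the sense in which individually rare events become frequent at large scale.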

Big Data on the Web

How can a computer accumulate a massive body of knowledge? What will Web search engines look like in ten years?

To address these questions, the Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information, and stores it in a unified knowledge base for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.
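
As a toy illustration of the idea (and emphatically not the actual Open IE system, which relies on far more sophisticated linguistic analysis), the sketch below pulls (argument, relation, argument) triples out of text using a few hand-written relation patterns; the patterns and example sentence are hypothetical.

    import re

    # Toy Open-IE-style extraction: find (arg1, relation, arg2) triples.
    # The relation phrases here are hand-picked illustrations only.
    PATTERN = re.compile(
        r"(?P<arg1>[A-Z][\w ]*?)\s+"
        r"(?P<rel>was born in|is located in|invented)\s+"
        r"(?P<arg2>[A-Z][\w ]*)"
    )

    def extract_triples(text):
        """Yield (arg1, relation, arg2) triples matched by the toy patterns."""
        for m in PATTERN.finditer(text):
            yield (m.group("arg1"), m.group("rel"), m.group("arg2"))

    print(list(extract_triples("Ada Lovelace was born in London.")))
    # -> [('Ada Lovelace', 'was born in', 'London')]

A real Open IE system extracts arbitrary relation phrases rather than matching a fixed list, which is exactly what lets it acquire a large number of relations at once.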

We have run Open IE over billions of web pages. The results of our run are available in our online demo; for more information, visit our project page.


Campus-Wide Collaboration



At the University of Washington eScience Institute we are ushering in a new era of data-intensive discovery: Jim Gray's "Fourth Paradigm." An interdisciplinary center rooted in Computer Science & Engineering with participants from across the campus, the eScience Institute strives to position UW at the forefront of both big data techniques and the fields that depend on them.