Data Science at the Allen School
From scientific discovery to business intelligence, data science is changing our world. The dissemination of nearly all information in digital form, the proliferation of sensors, breakthroughs in machine learning and visualization, and dramatic improvements in cost, bandwidth, and scalability are combining to create enormous opportunity. The field also presents enormous challenges, thanks to the relentless increase in the volume, velocity, and variety of information ripe for mining and analysis.
Allen School faculty and students are driving the data science revolution. We combine depth and breadth of expertise in data management, data visualization, sensing, open information extraction, and systems and networking. We have emerged as one of the world's preeminent centers for machine learning research.
Our efforts span the campus. The University of Washington eScience Institute - recipient of a $38.7 million award from the Moore and Sloan Foundations (joint with Berkeley and NYU), a $9.3 million award from the Washington Research Foundation, and a $2.8 million NSF IGERT award - positions UW at the forefront of data-intensive discovery.
No university offers a more vibrant environment than the University of Washington — and no program offers more opportunity to shape this field than the Allen School!
The Allen School is offering multiple education programs in data science:
- Our Ph.D. in Computer Science now includes an Advanced Data Science specialization (Ph.D.). This specialization is part of a campus-wide effort to provide a unified, multi-disciplinary data science specialization in Ph.D. degrees across campus.
- Our undergraduate programs in Computer Science and Computer Engineering now include a Data Science specialization (Bachelor's). This specialization is also part of a campus-wide effort to provide coordinated data science specializations in majors across campus.
- We are one of the six departments behind UW's Data Science Master's program.
Many Allen School research labs tackle problems related to Big Data, and many of our research projects span multiple areas.
Big Data Management
Extracting knowledge from Big Data is a high-touch business today, requiring a human expert who deeply understands the application domain as well as a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but report that the majority of their time is spent scaling and optimizing the relatively basic data manipulation tasks in preparation for the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. The UW Database Group pursues a set of projects focused on Big Data management and its related challenges.
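To make the "basic" preparation steps concrete, here is a minimal sketch of the clean/join/group/aggregate pattern on toy data. The data, names, and schema below are hypothetical illustrations, not any UW Database Group system:

```python
# Toy illustration of the data-preparation steps that dominate analysts' time:
# clean, join, group, and aggregate. All data here is made up for the example.
from collections import defaultdict

readings = [  # raw rows: (sensor_id, value)
    ("s1", 10.0), ("s1", 14.0), ("s2", 3.0), ("s3", 7.0), ("s2", None),
]
sensors = {"s1": "lab", "s2": "field"}  # metadata table: sensor_id -> site

# 1. Clean: drop rows with missing values.
clean = [(sid, v) for sid, v in readings if v is not None]
# 2. Join: attach site metadata, discarding sensors with no metadata.
joined = [(sensors[sid], v) for sid, v in clean if sid in sensors]
# 3. Group + aggregate: mean value per site (a simple derived feature).
groups = defaultdict(list)
for site, v in joined:
    groups[site].append(v)
means = {site: sum(vs) / len(vs) for site, vs in groups.items()}
print(means)  # {'lab': 12.0, 'field': 3.0}
```

At cluster scale, each of these three steps becomes a distributed operator with its own performance and fault-tolerance concerns, which is precisely what makes the preparation phase so costly.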
Big Data Machine Learning
UW is one of the world's top centers of research in machine learning. We are active in most major areas of ML and in a variety of applications like natural language processing, vision, computational biology, the Web, and social networks.
You can find more information on the ML page.
Big Data Visualization
The increasing scale and accessibility of digital data provides an unprecedented resource for informing research, public policy, business strategy, and even our personal lives. Yet, acquiring and storing this data is, by itself, of little value. We must make sense of data in order to produce value from it. Turning data into knowledge is a fundamental challenge for both computer systems and user interface research: it requires integrating analysis algorithms with human judgments of the meaning and significance of observed patterns.
Our research in big data visualization seeks to enhance our ability to analyze and communicate data through the design of interactive visual analysis tools. We study the perceptual, cognitive, and social factors affecting data analysis in order to enable a broad class of analysts to more effectively work with data: to improve the efficiency and scale at which experts work, and lower the threshold for non-experts. Our work spans new visualization systems, the design of visualization and interaction techniques, and evaluation through both deployment and controlled experiments.
Big Data Architecture and Programming
We are exploring two distinct directions in architecture, programming, and system support for Big Data. The first examines how emerging non-volatile resistive memory technologies, such as phase-change memory (PCM), can enable new storage systems. The second focuses on enabling large-scale irregular parallel computation, such as graph analytics, to run well on commodity processors. Crunching large graphs underlies many emerging applications, such as social network analysis and bioinformatics, but graph analytics algorithms exhibit little locality and therefore present significant performance challenges. Hardware multithreading systems (e.g., the Cray XMT) show that with enough concurrency we can tolerate long latencies; unfortunately, such hardware is not available as commodity parts. Our goal is to develop a latency-tolerant system built out of commodity parts, mostly in software.
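The "enough concurrency tolerates long latencies" argument can be sketched with Little's law, which relates throughput, latency, and the number of requests kept in flight. The numbers below are illustrative assumptions, not measurements from our systems:

```python
# Back-of-the-envelope sketch using Little's law:
#   concurrency = throughput * latency
# i.e., hiding a long per-request latency only requires keeping enough
# requests outstanding at once. Numbers below are purely illustrative.
def concurrency_needed(target_ops_per_sec: float, latency_sec: float) -> float:
    """Outstanding requests needed to sustain the target rate."""
    return target_ops_per_sec * latency_sec

# e.g. sustaining 1M edge lookups/sec over a 100-microsecond remote latency:
print(round(concurrency_needed(1_000_000, 100e-6)))  # 100
```

This is why software multithreading with hundreds of lightweight contexts per core can, in principle, recover the throughput that poor locality would otherwise destroy.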
For more details, please visit our project websites.
Big Data Systems
Big Data challenges also affect many aspects of the operating systems that must drive novel types of machines and workloads. Two representative challenges being tackled in our department are the following:
Very large memory machines
DRAM prices continue to drop, and simultaneously, new workloads are emerging that combine a lack of locality with large data sets. Together, these trends argue for building machines with very large amounts of physical memory. In this project, we are exploring the architectural, OS, and application-level implications of large memory machines. For example, at any moment, a TLB can only cover a tiny fraction of a 10TB physical address space; does this matter, and if so, what architectural innovations can help? As another example, are there any OS data structures or algorithms that scale poorly as the amount of physical memory grows very large? As a third example, how can we speed up the bulk-loading of terabytes of DRAM as new applications launch, or the machine reboots?
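The TLB-coverage question admits a quick back-of-the-envelope calculation. The TLB entry count below is an assumed, illustrative figure, not a specific CPU's:

```python
# Illustrative arithmetic: what fraction of a 10 TB physical address space
# can a TLB map at once? The 1536-entry TLB size is an assumption for the
# sake of the example, not a particular processor.
TB = 1 << 40

def tlb_coverage(entries: int, page_bytes: int, mem_bytes: int = 10 * TB) -> float:
    """Fraction of physical memory reachable without a TLB miss."""
    return entries * page_bytes / mem_bytes

print(f"{tlb_coverage(1536, 4096):.8f}")     # 4 KiB pages:  0.00000057
print(f"{tlb_coverage(1536, 2 << 20):.6f}")  # 2 MiB pages:  0.000293
```

Even with 2 MiB huge pages, the TLB reaches well under a tenth of a percent of a 10 TB memory, which is what makes the architectural question above interesting.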
Predictable tail-latency in large-scale data parallel systems
Distributed and large-scale data-parallel systems are designed to provide high throughput and low average latency. However, the tail latencies (e.g., 99.9th percentile request latency) in these systems are also important, as stragglers introduced by high tail-latency events can slow down overall progress or impact interactivity. In this project, we are exploring causes of tail latency across all layers of the stack, including local storage hardware, distributed storage systems, the communication stack, and the operating system. As an example, we find that individually rare events, such as repartitioning fragments in a distributed storage system, become frequent at large scale, contributing to the latency tail.
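The observation that individually rare events become frequent at scale can be sketched with a simple independence calculation. The probabilities and fan-out below are illustrative assumptions, not measurements:

```python
# Why rare events dominate at scale (illustrative numbers): if each of n
# components independently hits a slow path with probability p, a request
# that fans out to all n of them stalls with probability 1 - (1 - p)**n.
def prob_any_slow(p: float, n: int) -> float:
    """Probability that at least one of n independent components is slow."""
    return 1 - (1 - p) ** n

# a 1-in-10,000 hiccup per server, fanned out across 1,000 servers:
print(f"{prob_any_slow(1e-4, 1000):.3f}")  # 0.095 -- nearly 1 in 10 requests
```

This is why taming the 99.9th percentile requires attacking causes across every layer of the stack rather than any single component.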
Big Data on the Web
How can a computer accumulate a massive body of knowledge? What will Web search engines look like in ten years?
To address these questions, the Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information, and stores it in a unified knowledge base for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.
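The core idea of the paradigm, representing extracted facts as (argument, relation, argument) triples without a predefined relation vocabulary, can be illustrated with a deliberately naive pattern matcher. This toy is not the actual Open IE system, whose extractors are far more general:

```python
import re

# Toy illustration of the Open IE triple representation: pull
# (argument1, relation, argument2) tuples out of simple "X <verb phrase> Y."
# sentences. The patterns and sentence are hypothetical examples.
PATTERN = re.compile(r"^(.+?)\s+(was born in|is located in|invented)\s+(.+?)\.$")

def extract(sentence: str):
    """Return an (arg1, relation, arg2) triple, or None if nothing matches."""
    m = PATTERN.match(sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

print(extract("Ada Lovelace was born in London."))
# ('Ada Lovelace', 'was born in', 'London')
```

A real open extractor learns to identify arbitrary relation phrases rather than enumerating them, which is exactly how it sidesteps the knowledge acquisition bottleneck described above.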
At the University of Washington eScience Institute we are ushering in a new era of data-intensive discovery: Jim Gray's "Fourth Paradigm." An interdisciplinary center rooted in Computer Science & Engineering with participants from across the campus, the eScience Institute strives to position UW at the forefront of both big data techniques and the fields that depend on them.