Data Science at the Allen School
From scientific discovery to business intelligence, data science is changing our world. The dissemination of nearly all information in digital form, the proliferation of sensors, breakthroughs in machine learning and visualization, and dramatic improvements in cost, bandwidth, and scalability are combining to create enormous opportunity. The field also presents enormous challenges, thanks to the relentless increase in the volume, velocity, and variety of information ripe for mining and analysis.
Allen School faculty and students are driving the data science revolution. We combine depth and breadth of expertise in data management, data visualization, sensing, open information extraction, and systems and networking. We have emerged as one of the world's preeminent centers for machine learning research.
Our efforts span the campus. The University of Washington eScience Institute - recipient of a $38.7 million award from the Moore and Sloan Foundations (joint with Berkeley and NYU), a $9.3 million award from the Washington Research Foundation, and a $2.8 million NSF IGERT award - positions UW at the forefront of data-intensive discovery.
No university offers a more vibrant environment than the University of Washington — and no program offers more opportunity to shape this field than the Allen School!
The Allen School is offering multiple education programs in data science:
- Our Ph.D. in Computer Science now includes an Advanced Data Science specialization (Ph.D.). This specialization is part of a campus-wide effort to provide a unified, multi-disciplinary data science specialization in Ph.D. degrees across campus.
- Our undergraduate programs in Computer Science and Computer Engineering now include a Data Science specialization (Bachelor's). This specialization is also part of a campus-wide effort to provide coordinated data science specializations in majors across campus.
- We are one of the six departments behind UW's Data Science Master's program.
Many Allen School research labs tackle problems related to Big Data, and many of our research projects span multiple areas.
Big Data Management
Extracting knowledge from Big Data is a high-touch business today, requiring a human expert who deeply understands the application domain as well as a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but report that the majority of their time is spent scaling and optimizing the relatively basic data manipulation tasks in preparation for the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. The UW Database Group pursues a set of projects focused on Big Data management and its related challenges.
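To make the "basic" preparation steps concrete, here is a minimal sketch of the clean/join/group/aggregate pattern on toy data. The data, names, and schema below are hypothetical illustrations, not any UW Database Group system:

```python
# Toy illustration of the data-preparation steps that dominate analysts' time:
# clean, join, group, and aggregate. All data here is made up for the example.
from collections import defaultdict

readings = [  # raw rows: (sensor_id, value)
    ("s1", 10.0), ("s1", 14.0), ("s2", 3.0), ("s3", 7.0), ("s2", None),
]
sensors = {"s1": "lab", "s2": "field"}  # metadata table: sensor_id -> site

# 1. Clean: drop rows with missing values.
clean = [(sid, v) for sid, v in readings if v is not None]
# 2. Join: attach site metadata, discarding sensors with no metadata.
joined = [(sensors[sid], v) for sid, v in clean if sid in sensors]
# 3. Group + aggregate: mean value per site (a simple derived feature).
groups = defaultdict(list)
for site, v in joined:
    groups[site].append(v)
means = {site: sum(vs) / len(vs) for site, vs in groups.items()}
print(means)  # {'lab': 12.0, 'field': 3.0}
```

At cluster scale, each of these three steps becomes a distributed operator with its own performance and fault-tolerance concerns, which is precisely what makes the preparation phase so costly.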
Big Data Machine Learning
UW is one of the world's top centers of research in machine learning. We are active in most major areas of ML and in a variety of applications like natural language processing, vision, computational biology, the Web, and social networks.
You can find more information on the ML page.
Big Data Visualization
The increasing scale and accessibility of digital data provides an unprecedented resource for informing research, public policy, business strategy, and even our personal lives. Yet, acquiring and storing this data is, by itself, of little value. We must make sense of data in order to produce value from it. Turning data into knowledge is a fundamental challenge for both computer systems and user interface research: it requires integrating analysis algorithms with human judgments of the meaning and significance of observed patterns.
Our research in big data visualization seeks to enhance our ability to analyze and communicate data through the design of interactive visual analysis tools. We study the perceptual, cognitive, and social factors affecting data analysis in order to enable a broad class of analysts to more effectively work with data: to improve the efficiency and scale at which experts work, and lower the threshold for non-experts. Our work spans new visualization systems, the design of visualization and interaction techniques, and evaluation through both deployment and controlled experiments.
Big Data Architecture and Programming
We are exploring two distinct directions in architecture, programming, and system support for Big Data. The first examines how emerging non-volatile resistive memory technologies, such as phase-change memory (PCM), can enable new storage systems. The second focuses on enabling large-scale irregular parallel computation, such as graph analytics, to run well on commodity processors. Crunching large graphs underlies many emerging applications, such as social network analysis and bioinformatics, but graph analytics algorithms exhibit little locality and therefore present significant performance challenges. Hardware multithreading systems (e.g., the Cray XMT) show that with enough concurrency we can tolerate long latencies; unfortunately, such hardware is not available as commodity parts. Our goal is to develop a latency-tolerant system built out of commodity parts, mostly in software.
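The "enough concurrency tolerates long latencies" argument can be sketched with Little's law, which relates throughput, latency, and the number of requests kept in flight. The numbers below are illustrative assumptions, not measurements from our systems:

```python
# Back-of-the-envelope sketch using Little's law:
#   concurrency = throughput * latency
# i.e., hiding a long per-request latency only requires keeping enough
# requests outstanding at once. Numbers below are purely illustrative.
def concurrency_needed(target_ops_per_sec: float, latency_sec: float) -> float:
    """Outstanding requests needed to sustain the target rate."""
    return target_ops_per_sec * latency_sec

# e.g. sustaining 1M edge lookups/sec over a 100-microsecond remote latency:
print(round(concurrency_needed(1_000_000, 100e-6)))  # 100
```

This is why software multithreading with hundreds of lightweight contexts per core can, in principle, recover the throughput that poor locality would otherwise destroy.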
For more details, please visit our project websites.
Big Data Systems
Big Data challenges also affect many aspects of the operating systems that must drive novel types of machines and workloads. Two representative challenges being tackled in our department are the following:
Very large memory machines
DRAM prices continue to drop, and simultaneously, new workloads are emerging that combine a lack of locality with large data sets. Together, these trends argue for building machines with very large amounts of physical memory. In this project, we are exploring the architectural, OS, and application-level implications of large memory machines. For example, at any moment, a TLB can only cover a tiny fraction of a 10TB physical address space; does this matter, and if so, what architectural innovations can help? As another example, are there any OS data structures or algorithms that scale poorly as the amount of physical memory grows very large? As a third example, how can we speed up the bulk-loading of terabytes of DRAM as new applications launch, or the machine reboots?
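The TLB-coverage question admits a quick back-of-the-envelope calculation. The TLB entry count below is an assumed, illustrative figure, not a specific CPU's:

```python
# Illustrative arithmetic: what fraction of a 10 TB physical address space
# can a TLB map at once? The 1536-entry TLB size is an assumption for the
# sake of the example, not a particular processor.
TB = 1 << 40

def tlb_coverage(entries: int, page_bytes: int, mem_bytes: int = 10 * TB) -> float:
    """Fraction of physical memory reachable without a TLB miss."""
    return entries * page_bytes / mem_bytes

print(f"{tlb_coverage(1536, 4096):.8f}")     # 4 KiB pages:  0.00000057
print(f"{tlb_coverage(1536, 2 << 20):.6f}")  # 2 MiB pages:  0.000293
```

Even with 2 MiB huge pages, the TLB reaches well under a tenth of a percent of a 10 TB memory, which is what makes the architectural question above interesting.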
Predictable tail-latency in large-scale data parallel systems
Distributed and large-scale data-parallel systems are designed to provide high throughput and low average latency. However, the tail latencies (e.g., 99.9th percentile request latency) in these systems are also important, as stragglers introduced by high tail-latency events can slow down overall progress or impact interactivity. In this project, we are exploring causes of tail latency across all layers of the stack, including local storage hardware, distributed storage systems, the communication stack, and the operating system. As an example, we find that individually rare events, such as repartitioning fragments in a distributed storage system, become frequent at large scale, contributing to the latency tail.
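The observation that individually rare events become frequent at scale can be sketched with a simple independence calculation. The probabilities and fan-out below are illustrative assumptions, not measurements:

```python
# Why rare events dominate at scale (illustrative numbers): if each of n
# components independently hits a slow path with probability p, a request
# that fans out to all n of them stalls with probability 1 - (1 - p)**n.
def prob_any_slow(p: float, n: int) -> float:
    """Probability that at least one of n independent components is slow."""
    return 1 - (1 - p) ** n

# a 1-in-10,000 hiccup per server, fanned out across 1,000 servers:
print(f"{prob_any_slow(1e-4, 1000):.3f}")  # 0.095 -- nearly 1 in 10 requests
```

This is why taming the 99.9th percentile requires attacking causes across every layer of the stack rather than any single component.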
Big Data on the Web
How can a computer accumulate a massive body of knowledge? What will Web search engines look like in ten years?
To address these questions, the Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information, and stores it in a unified knowledge base for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.
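The core idea of the paradigm, representing extracted facts as (argument, relation, argument) triples without a predefined relation vocabulary, can be illustrated with a deliberately naive pattern matcher. This toy is not the actual Open IE system, whose extractors are far more general:

```python
import re

# Toy illustration of the Open IE triple representation: pull
# (argument1, relation, argument2) tuples out of simple "X <verb phrase> Y."
# sentences. The patterns and sentence are hypothetical examples.
PATTERN = re.compile(r"^(.+?)\s+(was born in|is located in|invented)\s+(.+?)\.$")

def extract(sentence: str):
    """Return an (arg1, relation, arg2) triple, or None if nothing matches."""
    m = PATTERN.match(sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

print(extract("Ada Lovelace was born in London."))
# ('Ada Lovelace', 'was born in', 'London')
```

A real open extractor learns to identify arbitrary relation phrases rather than enumerating them, which is exactly how it sidesteps the knowledge acquisition bottleneck described above.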
At the University of Washington eScience Institute we are ushering in a new era of data-intensive discovery: Jim Gray's "Fourth Paradigm." An interdisciplinary center rooted in Computer Science & Engineering with participants from across the campus, the eScience Institute strives to position UW at the forefront of both big data techniques and the fields that depend on them.