Big data, enormous opportunity
UW core team (clockwise from lower left): Tom Daniel (Biology),
Andy Connolly (Astronomy), Bill Howe (CSE), Ed Lazowska (CSE),
Randy LeVeque (Applied Math), Tyler McCormick (Stat + Sociology),
Cecilia Aragon (HCDE), Ginger Armbrust (Oceanography),
Sarah Loebman (Astronomy). Missing: Magda Balazinska (CSE),
Josh Blumenstock (iSchool), Mark Ellis (Geography),
Carlos Guestrin (CSE), Thomas Richardson (Stat),
Werner Stuetzle (Stat), John Vidale (ESS).
Every credit card transaction, embedded sensor stream from sea floor or smartphone, web click on a social media site, or DNA sequencer adds to the petabytes of heterogeneous, noisy datasets now available to researchers. Figuring out how to extract value from this Big Data lies at the heart of 21st century discovery.
On November 12, 2013, at a White House Office of Science and Technology Policy (OSTP) event, a five-year, $37.8 million award from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation was announced for the University of Washington, the University of California at Berkeley, and New York University. The award aims to dramatically accelerate the growth of data-intensive discovery in a broad range of scientific fields. “In order to remain at the forefront,” said Ed Lazowska, Bill & Melinda Gates Chair in Computer Science & Engineering and Founding Director of the UW eScience Institute, “UW must be a leader in advancing the methodologies of data science and in putting these methodologies to work in the broadest imaginable range of fields.”
Lazowska led the UW’s team of more than a dozen faculty members from across the campus in winning this prestigious award. Berkeley’s team, led by Nobel laureate astrophysicist Saul Perlmutter, and NYU’s team, helmed by neuroscientist and computer scientist Yann LeCun, partnered with the UW group in the intense six-month collaboration that led to the grant.
At UW, the funds will pay salaries for data science experts who will work with researchers across campus on data analytics and methodologies, for postdoctoral data science fellows pursuing interdisciplinary research agendas, and for research scientists stationed in other departments and centers. A dedicated “data science studio” in the Physics/Astronomy Building will offer meeting areas and drop-in workspaces to encourage collaboration across the UW’s colleges and schools.
According to OSTP director John Holdren, “America is rich with institutions that are expert at generating data, but as a nation we have not fulfilled our potential to make the most of these data by merging pre-competitive resources, partnering on analytics, and sharing lessons learned.” UW, Berkeley, and NYU are now paving the way in defining and organizing efforts to fulfill this potential.
Representatives from the three schools have defined core goals to achieve over the five-year duration of the grant.
The Moore/Sloan award encourages cross-disciplinary cooperation in an academic world that can erect barriers to research and discovery. Methodology specialists often lack the discipline-specific knowledge needed to adapt and invent tools to investigate complex scientific problems. And domain scientists cannot realize the full potential of their research if they lack varied data science tools and know-how.
Furthermore, “brain drains” occur when students or post-docs walk away from universities for corporate life, leaving tools and methods undocumented and unshared and new opportunities unexplored. Incentives are needed to keep them in the academic setting. “These data scientists are coveted in industry as well as academia. One of the missions we have in this effort is to provide competitive career paths that allow these experts the freedom to remain in academia and apply their skills to the most important problems in science,” says Bill Howe, co-lead of the eScience Institute and Affiliate Assistant Professor of Computer Science & Engineering.
UW: Investing in “Big Data” Since Before It Was Cool
The University of Washington has already assumed a leadership role in addressing such problems and directing Big Data discussions. Its eScience Institute — created in 2008 as the brainchild of visionaries Ed Lazowska, Tom Daniel (Professor of Biology), Werner Stuetzle (Professor of Statistics and Divisional Dean, Arts & Sciences) and others — has collaborated with the UW’s Center for Statistics and the Social Sciences, now more than a decade old, to become a national leader in advancing the techniques and technologies of data-intensive discovery. Said Lazowska, “UW has been investing in ‘Big Data’ since long before it became cool. The eScience Institute, CSSS, and our partners across the entire campus are creating the intellectual infrastructure needed to ensure UW’s continued leadership in learning and discovery.”
Work in Progress
At the eScience Institute, researchers from many UW disciplines are already at work with statisticians, applied mathematicians and computer scientists to solve problems involving complex, massive-scale datasets. “This has enabled scientists to ‘self-serve,’” states Bill Howe, so “they no longer feel ‘locked out’ of their data (and) required to tap on the shoulder of some resident programmer just to do their science.”
Ongoing collaborations include work in fields such as oceanography, astronomy, biology, neuroscience, sociology, earth sciences, and global health. Two scenarios are highlighted in article subsections: (1) collaborative work on SQLShare, a tool that helps marine biologists with only rudimentary knowledge of database systems use them to integrate and analyze tabular science data, and (2) joint work to advance database architectures for next-generation survey astronomy.
The Data Science Incubation Program
The success of these one-off collaborations, further powered by Moore/Sloan funding, spurred the eScience Institute to scale up its work to reach more people and projects. One result is the Data Science Incubation Program, which solicits proposals for data science projects around campus where “a little bit of hands-on assistance can go a long way,” according to Howe.
Researchers send the eScience Institute short proposals describing their data analysis problem and identifying a scientist who will work closely with data science professionals in the Institute to solve it. The Institute then chooses the proposals that can make the strongest impact, and joint work begins in the data science studio. “There is a multiplicative effect because data science is the great unifier for the next decade,” notes Lazowska, “where people discover that, despite being in different fields, they have similar problems that admit similar solutions.”
The five projects chosen in the first round of Incubation applications include students, postdocs and faculty as project leads and span the fields of statistics, astronomy, seismology, economics, sociology, and health metrics.
What the Future Holds
“Data science is the rising tide that lifts all boats,” says Lazowska. “If UW doesn’t lead in data science, we won’t lead in any other field. We have smart homes, smart cars, smart robots. It’s time to put the ‘smarts’ into discovery.”
For more information on collaboration scenarios, see “Digital Exploration of the Microscopic Marine Ecosystem” and “Advanced Database Architectures for Analyzing a Universe of Data”.
Digital Exploration of the Microscopic Marine Ecosystem
Virginia Armbrust studies marine microbes, whose microscopic stature belies their importance to the planet. As Armbrust, Director of UW’s School of Oceanography, explains, “These microorganisms recycle the elements and drive critical biogeochemical ocean processes. By studying the temporal and spatial distribution of these microbes, their genes and DNA sequences, we can see how ecosystems are responding to environmental changes and how they might respond to future ocean conditions.”
Like most research scientists today, Armbrust has had to develop new ways of thinking in order to accommodate the massive flow of data now available. She notes that oceanography has gone from being a “very data-poor science, gathering data with a few research cruises, to a very data-rich science, with sensors throughout the ocean relaying data back to computers 24 hours a day.” The SeaFlow instrument developed in her lab, for example, analyzes 15,000 cells per second continuously during a multi-week research cruise — potentially 5–15 GB of data per day.
This scale of data required a new approach to analysis. She explains: “I was trained to plot all the data points, and look at the data in different ways. But suddenly we couldn’t look at all the data, which meant that we couldn’t figure out which questions to ask. We literally couldn’t even see the data that was coming in. I had a lot of people with great, interesting questions to ask, and they couldn’t ask them.”
“Cytograms” displaying data from the SeaFlow instrument. Each color represents a different category of microbe. The left and right images show very different population profiles — the kind of variance that SeaFlow is designed to help us understand.
SQLShare: Database-as-a-Service for Science
Bill Howe, Associate Director of the UW eScience Institute and Affiliate Assistant Professor in Computer Science & Engineering, reached out to the Armbrust Lab to see how the Institute might help. Through a close collaboration, the teams developed a tool called SQLShare, intended to make databases dramatically easier to deploy in science contexts. Hosted in the cloud, SQLShare simplified the use of database technology: there was no software to install, and data could be ingested directly, queried, and shared with others, all through a browser. “SQLShare was motivated by our observation that scientists weren’t often using database technology, even when it seemed like a good fit,” says Howe. “The problem turned out not to be the technology itself, but the way it was typically delivered.”
Dan Halperin giving a demonstration
of SQLShare to a rapt audience
The SQLShare technology made a believer of Armbrust: “It really opened my eyes. After seeing the power of databases, I became a born-again database person… And SQLShare demonstrated the power of the cloud, broke down that conceptual barrier for me, and helped me begin to see what cloud computing makes possible.” She further notes that “SQLShare allows more of a back and forth with the data, so you can look at the data and figure out what kinds of questions you want to ask. It helps you weed out the garbage data, lets you focus on the good data.”
The collaboration with the eScience Institute became a process of thinking together about the data. Armbrust describes the iterative process she began to use: “I think I know the question, and we answer it with SQLShare, and then I realize, ‘Oh, that’s not the question I want to ask, I want to ask something else.’ That kind of rapid iteration is what SQLShare makes possible.” Armbrust concluded that what took weeks of analysis using desktop tools took mere minutes using SQLShare.
Interdisciplinary Collaborative Analysis Enabled by SQLShare
In Spring 2013, the Armbrust and eScience teams conducted a “field test” of the SQLShare technology. In the context of an interdisciplinary oceanography project led by Armbrust, 40 oceanographers from varied sub-disciplines — physical, biological, chemical — organized a retreat to plan an integrative research agenda based on heterogeneous data they had independently collected the previous summer on a joint research cruise. Representatives from the eScience Institute brought all data pre-loaded into SQLShare. As a result, the meeting’s agenda shifted from “planning” to “doing.”
With help from an eScience team led by Dan Halperin, the Institute’s Director of Research in Scalable Data Analytics, researchers directly queried the data “at the speed of discussion.” For example, they asked about the relationship between Zinc and Cobalt at a particular location, or explored whether ocean salinity affected the count of a particular virus. These questions were directly translated into SQLShare queries and answered on the spot. Immediate answers enabled immediate discussion of results, a critical factor in making the best possible use of the rare combination of domain specialists present at the meeting. “Our hypothesis was that SQLShare could be used interactively to steer the discussion in real time,” says Halperin. However, “What we found was that using it in this way was not just faster, but fundamentally different — different questions and different science.”
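Questions like these translate almost mechanically into SQL, which is what made answering them “at the speed of discussion” possible. The sketch below is illustrative only: it uses an in-memory SQLite database and invented table and column names, not the actual cruise schema or SQLShare itself, but it shows the kind of join that answers such questions on the spot.

```python
import sqlite3

# Illustrative only: invented tables standing in for the cruise datasets.
# SQLShare accepts the same style of SQL, submitted through a browser.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE trace_metals (station TEXT, zinc_nM REAL, cobalt_pM REAL);
CREATE TABLE virus_counts (station TEXT, salinity_psu REAL, viruses_per_ml REAL);
INSERT INTO trace_metals VALUES ('P1', 0.45, 12.0), ('P2', 0.80, 30.5);
INSERT INTO virus_counts VALUES ('P1', 32.1, 5.2e6), ('P2', 33.4, 8.9e6);
""")

# "How do zinc and cobalt relate at each station, and does salinity
#  track the virus count?" -- one join brings both views together.
rows = cur.execute("""
    SELECT m.station, m.zinc_nM, m.cobalt_pM, v.salinity_psu, v.viruses_per_ml
    FROM trace_metals m
    JOIN virus_counts v ON v.station = m.station
    ORDER BY m.station
""").fetchall()
for row in rows:
    print(row)
```

Because the data are already loaded and shared, a question raised in discussion becomes a one-statement query rather than a multi-day scripting task.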
Myria: Scaling Up Beyond SQLShare
Armbrust envisions eventually deploying SeaFlow sensors on hundreds of commercial and research vessels simultaneously, all providing a real-time view of the ocean’s biological dynamics.
One of the oceanographers
using tools powered by SQLShare
at the retreat
At this scale, however, even SQLShare could not meet data storage and manipulation requirements. In a recent project called Myria, led by the CSE Database Group and the eScience Institute, three CSE faculty members began working with science users like Armbrust to develop the next generation of cloud-based data management and analytics systems. Like SQLShare, Myria is designed to be delivered as a service to support direct analysis through a web browser — a critical feature for researchers with limited IT support. Unlike SQLShare, Myria scales to hundreds of terabytes and supports much more complex analytics. Halperin, technical lead of the Myria project, is working closely with the Armbrust team to adapt the system for the scale and complexity envisioned in the future of the SeaFlow project.
Armbrust sees the shared, online, cloud-based approach as critical to empowering those in her field to work in data-intensive science: “My dream is to make this data and the data processing available to other scientists for their own research. I strongly believe that we move forward only when we’re all working together.”
Advanced Database Architectures for Analyzing a Universe of Data
One of Big Data’s biggest challenges lies in helping astronomers test theories of the creation and composition of the Universe. How can astronomers make optimal use of terabytes of simulation data and raw data drawn periodically from the Sloan Digital Sky Survey (SDSS) and eventually from the massively powerful Large Synoptic Survey Telescope (LSST)?
Multi-department collaborations through the eScience Institute often occur on what Magda Balazinska, Associate Professor in Computer Science & Engineering (CSE), calls a “What can we do together?” basis. In the field of Astronomy, the answer is the AstroDB collaboration.
Striving to create advanced database architectures and tools for next-generation survey Astronomy, AstroDB brings together students, postdocs and faculty from the Astronomy and Computer Science departments at the UW. Under its umbrella are many projects focused on “exploring new ways of analyzing data, either by pushing the scale at which data analysis becomes easy or by pushing on the available capabilities,” said Balazinska. These collaborative efforts aim not only to expand the boundaries of scientific discovery in astronomy, but also to produce tools and methodologies that are efficient, scalable, reproducible and, crucially, reusable in other scientific domains.
This simulation represents a uniform cosmological volume 80 million light years on a side that contains ~2 billion particles and is capable of resolving scales down to ~1,000 light years. This high resolution in such a large volume, possible only because of the scalability of our code, ChaNGa, resolves the morphologies of galaxies down to very small masses, giving us a large statistical sample of interesting objects. We evolved the simulation for ~1.5 billion years, creating a dataset of ~5 TB, which we will use to understand the formation and evolution of galaxies in the early universe.
Efficient Processing of Telescope Images: AscotDB and More
Astronomers currently analyze sky surveys by collecting raw images using large telescopes and digital cameras, processing the images with special-purpose data analysis pipelines (i.e., specialized computer programs that correct for data defects, determine exact positions of stars and galaxies, or perform other preliminary data processing tasks) and preparing catalogs of celestial objects with different properties. These catalogs are then available for general scientific inquiry.
In contrast, the AstroDB team’s new tool, called AscotDB, enables the direct analysis of raw pixel data. This lets scientists ask fundamentally different questions than when processing catalog data only. An extensible data analysis system, AscotDB provides interactive analysis of astronomical survey data. It uses the SciDB parallel array processing engine as a back-end; SciDB, however, was designed for processing arrays, while telescope data is “spherical.” AscotDB therefore wraps SciDB with Python middleware that provides efficient support for spherical data and adds iterative processing, a critical step in the analysis process. AscotDB’s front-end design integrates a graphical interface for locating moving/transient objects and a programmatic iPython interface for both detecting and measuring such objects, enabling both interactive exploration and deep data analysis.
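To see why spherical sky data and rectangular arrays are an awkward fit, consider naively binning sky positions (right ascension and declination) into array cells. The sketch below is a hand-rolled illustration, not AscotDB’s actual indexing scheme: with a simple equirectangular grid, cells shrink toward the celestial poles, which is exactly the kind of spherical-geometry distortion that middleware over an array engine must account for.

```python
import math

def radec_to_cell(ra_deg, dec_deg, cells_ra=360, cells_dec=180):
    """Map a sky position to integer array indices via naive
    equirectangular binning (illustrative; not AscotDB's scheme)."""
    i = int(ra_deg % 360.0 / 360.0 * cells_ra)
    j = int((dec_deg + 90.0) / 180.0 * cells_dec)
    return i, min(j, cells_dec - 1)

def cell_area_sq_deg(dec_deg, cells_ra=360, cells_dec=180):
    """Approximate solid angle of the cell at this declination:
    area shrinks as cos(dec), vanishing at the poles."""
    return (360.0 / cells_ra) * (180.0 / cells_dec) * math.cos(math.radians(dec_deg))

print(radec_to_cell(210.8, 54.3))                      # a position near the Big Dipper
print(cell_area_sq_deg(0.0), cell_area_sq_deg(80.0))   # equatorial vs. near-polar cell
```

Because cell areas vary with declination, queries that assume uniform cells would over- or under-weight regions of sky; handling this uniformly is one job of a spherical-data layer above the array store.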
According to Andy Connolly, Professor of Astronomy at the UW, “The work here, along with future directions of development in SciDB, points to a system where a full-sky worth of time-domain astronomy imaging data can be directly stored and indexed in a way that will enable efficient image analysis tasks to be performed on-demand by users without their needing to understand the details of how to parallelize and scale their applications.”
AscotDB is one of many ongoing collaborations spearheaded by CSE’s Database group and aimed at making sky survey image processing and analysis faster and more convenient. Other projects have analyzed Apache™ Hadoop® workloads, explored the use of iterative parallel array processing and parallel database management systems combined with machine-learning methods, and produced tools to help write SQL queries for SDSS image data. According to Jake Vanderplas, Director of Research in Physical Sciences at the eScience Institute, “Such a close collaboration between astronomers and computer scientists is, from what I’ve seen, fairly unique in the academic community. We’ve found it extremely fruitful: by combining these two realms of expertise, we’re able to explore novel research directions in both fields.”
By reducing data management overhead, computer and data scientists are helping astronomers explore the universe of miracles that await discovery.
Analyzing Astronomical Simulation Data: Myria with MyMergerTree Visualizations
How are comets delivered into Earth-crossing orbits? How do stars like the Sun migrate over great distances in the Milky Way? How do galaxies form and evolve?
The MyMergerTree service enables astronomers to specify
a set of galaxies at present day and have Myria compute
the merger trees for galaxies that show their evolution over time.
Astronomers investigate such questions using cosmological simulations — N-body simulations in particular. N-body simulations model and project the dynamic behavior of particles or objects, usually under the influence of physical forces such as gravity. They help to explore a broad range of problems, from characterizing the attributes of few-body systems, such as the Earth–Moon–Sun system, to understanding the large-scale structure of the cosmos.
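At its core, an N-body code repeatedly computes the gravitational forces between particles and advances their positions in time. The following is a minimal direct-summation sketch in Python — not the algorithm used by production codes like ChaNGa, which rely on tree methods to scale to billions of particles — shown only to make the basic computation concrete:

```python
import numpy as np

def nbody_step(pos, vel, mass, dt, G=1.0, soft=1e-3):
    """One leapfrog (kick-drift-kick) step of a direct-summation N-body
    integrator. Pairwise gravity is O(N^2); production codes use tree
    algorithms instead, but the physics being integrated is the same."""
    def accel(p):
        d = p[None, :, :] - p[:, None, :]        # pairwise displacement vectors
        r2 = (d ** 2).sum(-1) + soft ** 2        # softened squared distances
        np.fill_diagonal(r2, np.inf)             # exclude self-force
        return G * (mass[None, :, None] * d / r2[..., None] ** 1.5).sum(1)

    vel = vel + 0.5 * dt * accel(pos)   # kick
    pos = pos + dt * vel                # drift
    vel = vel + 0.5 * dt * accel(pos)   # kick
    return pos, vel

# Two equal masses on a circular orbit about their common center of mass.
v = (0.5) ** 0.5                      # circular-orbit speed for G=1, m=1, r=1
pos = np.array([[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
vel = np.array([[0.0, -v, 0.0], [0.0, v, 0.0]])
mass = np.ones(2)
for _ in range(1000):
    pos, vel = nbody_step(pos, vel, mass, dt=0.01)
```

A cosmological run is this same loop scaled to billions of particles over billions of simulated years, which is what produces the multi-terabyte outputs described below.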
The N-body Shop, an interdisciplinary UW group doing pioneering work in the area of astrophysical cosmological simulations, produces simulations that can range in size from tens of gigabytes to multiple terabytes; analyzing them requires scaling and optimizing many data-related tasks — such as data filtering and grouping — before statistical analysis and evaluation can even begin. For simulations, AstroDB initiatives focus on creating techniques and tools that use relational database management systems, both single-node and parallel, and new MapReduce-type systems to simplify data manipulation tasks so astronomers can focus on the pay-off activities of querying and evaluation.
One revolutionary system we are exploring is Myria, or “Big Data Management as a Cloud Service.” A cooperative effort guided by the CSE Database group, Myria facilitates the uploading of large datasets, the performance of complex data analysis tasks in the cloud, and the sharing of results with colleagues. Sarah Loebman (UW Astronomy PhD and Michigan Society of Fellows postdoctoral fellow) says the following of Myria and her ongoing work with the Database group:
“Collaborating with the UW CSE Database group has been a transformative experience for us; thanks to the group’s work with Myria, we’re able to ask fundamentally different scientific questions than we’ve ever asked before. For example, we are currently working with a 5 terabyte simulation that is very hard to analyze in its entirety through conventional means… Myria’s distributed platform and flexible design allows us to load all 5 terabytes of simulation data at once and trace any given subset of the data across time.”
Myria is being used in other projects as well. A recent astronomy application is MyMergerTree, a service that helps astronomers study the growth history of galaxies by following their merger trees in large-scale astronomical simulations. Of the powerful Myria/MyMergerTree duo, Loebman says, “This lets us consider how structures like galaxies form and evolve and how properties we observe today (like shape and luminosity) are correlated with a galaxy’s merger history.”
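Conceptually, a merger tree is a recursive walk backward in time over a “which halos merged to form this one” relation. The toy sketch below uses an invented in-memory data layout, not MyMergerTree’s actual schema (where Myria evaluates the recursion over terabytes of snapshots), simply to show the shape of the computation:

```python
# Each halo is (halo_id, snapshot); the mapping links a halo to its
# progenitors in the previous snapshot. Invented toy data, not
# MyMergerTree's schema.
progenitors = {
    ("G1", 3): [("A", 2), ("B", 2)],   # G1 today formed from A and B merging
    ("A", 2): [("A0", 1)],
    ("B", 2): [("B0", 1), ("B1", 1)],
}

def merger_tree(halo, depth=0, lines=None):
    """Recursively walk back in time, collecting an indented merger history."""
    if lines is None:
        lines = []
    lines.append("  " * depth + f"{halo[0]} @ snapshot {halo[1]}")
    for parent in progenitors.get(halo, []):
        merger_tree(parent, depth + 1, lines)
    return lines

print("\n".join(merger_tree(("G1", 3))))
```

The branching structure — each split marking a past merger — is what lets astronomers correlate a galaxy’s present-day shape and luminosity with its assembly history.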
For further information, please refer to the following:
Squeezing a Big Orange into Little Boxes: The AscotDB System for Parallel Processing of Data on a Sphere. Jacob Vanderplas, Emad Soroush, Simon Krughoff, Magdalena Balazinska, and Andrew Connolly. IEEE Data Engineering Bulletin, 36(4):11–20, 2013. (PDF)
Overview of SciDB: Large Scale Array Storage, Processing and Analysis. J. Rogers, R. Simakov, E. Soroush, P. Velikhov, M. Balazinska, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, S. Zdonik, A. Smirnov, K. Knizhnik, and P. Brown. SIGMOD 2010. (PDF)