
William Howe

Bill Howe

Director of Research, Data Management and Analytics, eScience Institute
Affiliate Assistant Professor, Department of Computer Science and Engineering
University of Washington

Office: 450 Paul G. Allen Center
Phone: 206-221-9261
Fax: 206-543-2969
email: billhowe at cs dot washington dot edu
NSF biosketch in pdf
OpenSSH rsa public key

 

Students

I am actively seeking students to collaborate with on the projects below. We have several projects underway involving cross-disciplinary teams of researchers. If you are a CSE student with an interest in advancing oceanography, biology, physics, or astronomy, or a domain science student with an interest in data-intensive scalable computing, databases, visualization, or algorithms, I would be happy to meet with you to discuss possible collaborations. Send me an email!
To new PhD students: Note that the first step in admissions to UW CSE is handled centrally by the department; an email to me does not influence the process.

Research

The bottleneck to scientific discovery is no longer data acquisition, but data analysis. This trend can be attributed to advances in data acquisition technology: high-throughput lab techniques, remote sensing platforms, and high-resolution computational modeling. While the technology and resources necessary to collect or generate such data en masse are becoming widely available, technology to manage and analyze the data has not kept pace. Traditionally, each data acquisition activity was coupled to a specific hypothesis, but now researchers collect data en masse (they "download the world"), exchanging a problem of how to extract knowledge from the environment for one of how to extract knowledge from a database. Data analysis, and not experimental data acquisition, is the new bottleneck to discovery.

Research Topics

Management of very large or very complex science data. Data-intensive scalable computing, scientific databases, visualization, mashups, integration of ad hoc science data.

Current Projects

Horizon: Visual Data Analytics in the Cloud
I am the lead PI on two NSF grants exploring the question of how cloud computing can support interactive, visual, exploratory science. Through an NSF Cluster Exploratory grant, and in partnership with visualization experts at the University of Utah, we are exploring the use of MapReduce as a common framework for both scalable data processing and scalable visualization. Through an NSF EAGER grant, I am developing a new visualization algebra for use with the Microsoft Azure platform. The core goal of both projects is to allow scientists to analyze terabytes of data in the cloud as efficiently, conveniently, and deeply as they can analyze megabytes of data on their laptops. This work led to the HaLoop system.


SQLShare: Database-as-a-Service for Long Tail Science
Informed by the dataspace abstraction proposed by Halevy, Franklin, and Maier, we are developing a platform for ad hoc databases called SQLShare that allows a user to bootstrap a collaborative database environment just by uploading data, writing queries, and sharing the results.

Our approach is to provide a basic system for querying data in the cloud (using Microsoft SQL Azure), then explore a set of smart services to streamline and automate analysis. Specifically: 1) queries are saved as views and can be shared with others for collaborative analysis, 2) we derive automatic starter queries directly from the data to bootstrap analysis, 3) we derive dashboards ("mashups") directly from the data to automate visual analysis, 4) we are working to translate English fragments into SQL fragments to assist SQL novices, and 5) we are building on previous work here at UW on SQL autocomplete features.
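The first two services can be illustrated with a minimal sketch. It is not SQLShare's implementation: it uses Python's built-in sqlite3 module in place of SQL Azure, and the helper names (upload, save_query, starter_query) and the sample "casts" data are hypothetical, chosen only for illustration.

```python
import sqlite3

# Illustrative stand-in only: SQLite replaces SQL Azure, and all helper
# names and sample data below are hypothetical.

def upload(conn, table, header, rows):
    """Create a table straight from uploaded rows (columns left untyped)."""
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)

def save_query(conn, name, sql):
    """Save a query as a named view so that others can query it in turn."""
    conn.execute(f'CREATE VIEW "{name}" AS {sql}')

def starter_query(conn, table):
    """Derive a simple starter query (row counts per value of the first column)."""
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    first = cur.description[0][0]  # name of the first column
    return f'SELECT "{first}", COUNT(*) AS n FROM "{table}" GROUP BY "{first}"'

conn = sqlite3.connect(":memory:")
upload(conn, "casts", ["station", "depth_m", "temp_c"],
       [("A", 5, 11.2), ("A", 50, 8.9), ("B", 5, 12.1)])

# A saved query becomes a view that a collaborator can build on.
save_query(conn, "shallow_casts", "SELECT * FROM casts WHERE depth_m < 10")
shallow = conn.execute(
    "SELECT station, temp_c FROM shallow_casts ORDER BY station").fetchall()

# A starter query generated from the uploaded data alone.
starter = conn.execute(starter_query(conn, "casts")).fetchall()
```

Because a saved query is just a view, a collaborator can select from shallow_casts exactly as they would from an uploaded table, which is what makes the sharing step cheap.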

Our motivation is long tail science. In contrast to "big science" projects such as the Large Synoptic Survey Telescope and the Large Hadron Collider, the challenge faced in the long tail of science is not only about data volume, but about data complexity. Projects in oceanography or the life sciences may involve cleaning and integrating data from hundreds of heterogeneous data sources. Although sheer scale is not typically the defining feature of these data sources, the volumes involved are not insignificant: In the life sciences, for example, a modern short-read sequencer can generate a terabyte per day. At the University of Washington, there are approximately ten of these sequencers used on campus, and 20 more are scheduled to be purchased in the next few years. Low-cost, high-throughput mass spectrometry, microarray, and flow cytometry are similarly poised to produce exponential growth in data volumes in the next few years.

This project is supported by a Moore Foundation Grant and a 2010 Jim Gray Seed Grant from Microsoft Research.

Parallel Datalog on new Computing Platforms
Building on our work on HaLoop, we are developing a Datalog interface to massively parallel platforms including HaLoop/Hadoop, the Cray XMT, and Microsoft's Daytona Platform on the Azure cloud. The Cray XMT supports massive multi-threading (millions of simultaneous threads accessing shared memory at low latency), eliminating dependence on a deep cache hierarchy for performance. PNNL is exploring the XMT as a platform for a graph database. While the XMT has proven capabilities in graph processing, a general-purpose semantic database necessarily involves "conventional" computation in addition to massively thread-parallel computation. A query language that insulates the user from this heterogeneity, transparently splitting a query into conventional and XMT components, does not exist. We are designing a prototype language with this property.
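To see why an iteration-aware runtime matters for Datalog, consider that a recursive rule is evaluated by repeating a join until no new facts appear. The sketch below is a single-machine illustration of that loop using standard semi-naive evaluation; it is not the project's prototype language, and the function name and example graph are hypothetical.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the Datalog program
         reach(x, y) :- edge(x, y).
         reach(x, z) :- reach(x, y), edge(y, z).
    Each round joins only the facts discovered in the previous round."""
    # Index edge(y, z) by y so the join is a dictionary lookup.
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)

    reach = set(edges)   # all facts derived so far
    delta = set(edges)   # facts that were new in the last round
    while delta:
        derived = {(x, z) for (x, y) in delta for z in succ.get(y, ())}
        delta = derived - reach   # keep only genuinely new facts
        reach |= delta
    return reach

result = transitive_closure({("a", "b"), ("b", "c"), ("c", "d")})
```

On a parallel platform each round of the while loop would correspond to one distributed join, so the cost of re-reading the unchanging edge relation on every round is the kind of overhead a loop-aware system can try to avoid.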

GridFields: Algebraic Manipulation of Unstructured Meshes
The large datasets produced by simulations typically have a grid structure that is not amenable to storage within traditional database systems. We've developed an algebra of GridFields that allows convenient manipulation of grid-structured datasets much in the way the relational algebra allows convenient manipulation of table-structured data. This work originated in the context of CMOP, the NSF Science and Technology Center for Coastal Margin Observation and Prediction.

This work is supported by a subcontract from Woods Hole Oceanographic Institution via the NSF-funded Ocean Observatories Initiative and an NSF EAGER award.

Data Pricing
I am a Co-PI on the Data Pricing project.

SciDB
I am on the Science Advisory Board for the SciDB project, representing requirements from the environmental modeling community.

Teaching

CS599c: Scientific Data Management, Spring 2010, University of Washington, with Magda Balazinska
CS410/510: Scientific Data Management, Summer 2006, Portland State University

Publications

  • Parallel Visualization on Large Clusters Using MapReduce
    Huy Vo, Brian Summa, J. Comba, Juliana Freire, Bill Howe, Claudio Silva, and Valerio Pascucci.
    In Proceedings of Large-Scale Data Analysis and Visualization (LDAV), 2011.
  • Data Markets in the Cloud: An Opportunity for the Database Community
    Magda Balazinska, Bill Howe, Dan Suciu
    VLDB 2011 (vision track) (2nd place best paper award)
  • Database-as-a-Service for Long Tail Science
    Bill Howe, Garret Cole, Emad Soroush, Paraschos Koutris, Alicia Key, Nodira Khoussainova, Leilani Battle
    SSDBM 2011
  • Automatic Example Queries for Ad Hoc Databases
    Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle
    SIGMOD 2011 (demo)
  • HaLoop: Efficient Iterative Data Processing On Large Clusters (pdf)
    YingYi Bu, Bill Howe, Magdalena Balazinska, Michael Ernst
    PVLDB, 3(1) (VLDB 2010) (selected for Best of VLDB 2010 issue of VLDB Journal)
  • SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
    Bill Howe, Garret Cole
    Microsoft eScience Workshop 2010
  • Scalable clustering algorithm for N-body simulations in a shared-nothing cluster (pdf)
    YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe and Sarah Loebman
    SSDBM 2010
  • Client + Cloud: Seamless Architectures for Visual Data Analytics in the Ocean Sciences (pdf)
    Keith Grochow, Bill Howe, Roger Barga, Ed Lazowska
    22nd International Statistical and Scientific Database Management Conference (SSDBM 2010)
  • Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions (pdf)
    YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia
    SOCC 2010.
  • Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?
    Sarah Loebman, Dylan Nunley, YongChul Kwon, Bill Howe, Magdalena Balazinska, and Jeffrey P. Gardner
    IASDS 2009
  • Scientific Mashups: Runtime-Configurable Data Product Ensembles (pdf)
    Bill Howe, Harrison Green-Fishback, David Maier
    21st International Statistical and Scientific Database Management Conference (SSDBM 2009)
  • End-to-End e-Science: Integrating Workflow, Query, Visualization, and Provenance at an Ocean Observatory
    Bill Howe, Peter Lawson, Renee Bellinger, Erik Anderson, Emanuele Santos, Juliana Freire, Carlos Scheidegger, Antonio Baptista, Claudio Silva (pdf)
    IEEE eScience Conference (eScience 2008)
  • Smoothing the ROI Curve for Scientific Data Management Applications (pdf)
    Bill Howe, David Maier, Laura Bright
    Third Biennial Conference on Innovative Data Systems Research (CIDR 2007)
  • Algebraic Manipulation of Scientific Datasets (extended)
    Bill Howe, David Maier
    VLDB Journal, 14(4), November 2005
  • Querying and Visualizing Gridded Datasets for e-Science (pdf) (quality color pdf handout) (smaller bw pdf handout)
    Bill Howe, David Maier
    21st International Conference on Data Engineering (ICDE 2005) (demo)
  • Retrofitting a Data Model to an Existing Environmental Repository (pdf)
    Bill Howe, David Maier
    17th International Statistical and Scientific Database Management Conference (SSDBM 2005)
  • Algebraic Manipulation of Scientific Datasets (pdf)
    Bill Howe, David Maier
    30th International Conference on Very Large Data Bases (VLDB 2004) (one of four selected for Best of VLDB 2004 issue of VLDB Journal)
  • Emergent Semantics: Towards Self-Organizing Scientific Metadata (pdf)
    Bill Howe, Kuldeep Tanna, Paul Turner, David Maier
    International Conference on Semantics for a Networked World (SFNW 2004), co-located with SIGMOD 2004.
  • Logical and Physical Data Independence for Native Scientific Data Repositories (pdf)
    Bill Howe and David Maier
    IEEE Data Engineering Bulletin, 27(4), December 2004

Dissertation

Selected Talks (powerpoint)

Some of these talks contain macros that require a visualization ActiveX control that you don't have, so you may safely respond with "disable macros" if prompted with a dialog.

All movies will appear as still images by default. If you want the movies to play, download them, unzip them in the same directory as the presentation, and make sure you open the presentation with the correct working directory (i.e., by double-clicking the file rather than by using File->Open).

Bio

Bill Howe is the Director of Research for Scalable Data Analytics at the UW eScience Institute and holds an Affiliate Assistant Professor appointment in Computer Science & Engineering, where he studies the application of scientific databases, cloud computing, and frameworks for scalable data analysis. Howe has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, and has had two papers selected to appear in VLDB Journal's "Best of Conference" issue (2004 and 2010) for work in data-intensive computing for science. Howe serves on the program and organizing committees for a number of conferences in the area of scientific data management, and serves on the Science Advisory Board of the SciDB project, a project to build a new database system expressly for science. He holds a Ph.D. in Computer Science from Portland State University, where he studied under Prof. David Maier, and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.

Professional Service

  • Program Committee, PVLDB 2012-2013
  • Demo Co-chair, SSDBM 2013
  • Program Committee, ScienceCloud 2012
  • Chair, Workshop on HPC meets Databases, co-located with Supercomputing 2011
  • Editorial Board, Journal of Data Semantics
  • Organizing Committee, XLDB 2011
  • Program Committee, LDAV 2011
  • Program Committee, ScienceCloud 2011
  • Co-Chair, Workshop on Array Databases
  • Registration Chair, SSDBM 2011
  • Program Committee, SSDBM 2011
  • Demonstrations Program Committee, SIGMOD 2011
  • Program Committee, EDBT 2010
  • Program Committee, SSDBM 2010
  • Program Committee, IIMAS Workshop, 2008
  • Reviewer, VLDB Journal, 2007
  • Program Committee, dg.o 2006
  • Program Committee, dg.o 2005
  • Demonstrations Program Committee, SIGMOD 2005
  • Student Session Program Committee, dg.o 2004

Professional Background

PhD, Computer Science, Portland State University, 2006
BS, Industrial and Systems Engineering, Georgia Tech, 1999

I've been working with databases since 1995, when I worked for Delta Airlines as a co-op in their Technical Operations facility. When I graduated from Georgia Tech, I went to work for Deloitte Consulting designing and building enterprise client-server applications, specifically Customer Relationship Management (CRM) systems with Siebel. After Deloitte and before graduate school, I worked as an independent contractor at Microsoft and at other companies ranging from newly deregulated telecommunications carriers to providers of oil field exploration services.

Page modified June 1, 2011
