================================================================ CSE 344 -- Fall 2011 Lecture 21: Parallel Databases ================================================================ Two MAJOR trends that are pushing Computer Science toward parallel computation: 1. Moore's law (exponential growth in density of transistors per chip) is no longer reflected in increased clock speeds. Increased hardware performance will be available only through parallelism. Think multicore: 4 cores today, perhaps 64 in a few years. 2. Cloud computing commoditizes access to large compute clusters. Ten years ago, only google could afford 1000 servers; today you can rent this by swiping your credit card at Amazon Web Services (AWS)... almost... if you need more than 20, you need to talk to them. Note: Moore's law says that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years [Intel co-founder Gordon E. Moore described the trend in his 1965 paper and predicted that it will last for at least 10 years] ================================================================ We also live in the era of "big data": Everyone (companies, scientists, etc.) have data that is too big, too fast, and too complex to be managed without changing tools and processes. Conveniently relational algebra is easy to parallelize and parallel DBMSs have already been studied in the 80's! As a result, we are seeing an explosion of and a huge success of db analytics companies -- Greenplum founded in 2003 acquired by EMC in 2010 A parallel shared-nothing DBMS (this lecture) -- Vertica founded in 2005 and acquired by HP in 2011 A parallel, column-store shared-nothing DBMS (see 444 for discussion of column-stores) -- DATAllegro founded in 2003 acquired by Microsoft in 2008 A parallel, shared-nothing DBMS Aster Data Systems founded in 2005 acquired by Teradata in 2011 A parallel, shared-nothing, MapReduce-based data processing system (next lecture). SQL on top of MapReduce Netezza founded in 2000 and acquired by IBM in 2010 A parallel, shared-nothing DBMS. Many companies in particular DATAllegro and Neteezza provide a data appliance: combination of hardware and software in one package. Others have appliance versions of their systems (DATAllegro). Others yet, use some proprietary hardware (Teradata) or at least identify preferred hardware configurations (Greenplum). It is a great time to be in the data management, data mining/statistics, or machine learning fields. ================================================================ There are two basic methods today for building a parallel data processing engines: -- Based on parallel DBMSs from the 80's (this lecture). -- Based on the MapReduce system from Google, published in 2004 (next lecture). A few observations: -- MapReduce-based systems all include many of the key parallel DBMS features including a declarative query language (SQL or some close variant), indexes, and more. -- Most companies use both types of systems at the same time. -- Most data processing system vendors have developed methods to integrate both types of systems within one engine. -- Neither tool is sufficient to address all of today's challenges. A lot more work is needed! ================================================================ Traditional parallel databases Parallel DBMSs can be used both for transaction processing (OLTP) and for running long queries that perform various data analytics (OLAP). We will focus on the latter in this course. Terminology: P = number of processors (or servers) Speedup: Query time should shrink proportionally to the number of processors used. -- Ideal: linear speedup. -- In practice it is hard to achieve linear speed-up due to start-up costs, interference, and skew. Scaleup: If we grow the input data size in proportion to the number of machines, the runtime should remain constant. -- Ideal: linear scaleup -- In practice: same problem as above. Other terminology commonly used today (just to make things more confusing): Scale up: Increase performance by using a more powerful machine. Scale out: Increase performance by adding machines to the system. ================================================================ Types of parallel architectures: 1. Shared Memory -- Processors share both RAM and disk -- Dozens to hundreds of processors -- Example: SQL Server runs on a single machine and can leverage many threads to get a query to run faster (see query plans on IISQLSRV) Characteristics: -- Easy to use and program. -- But very expensive to scale: last remaining cash cows in the hardware industry 2. Shared Disk -- All processors access the same disks -- Found in the largest "single-box" (non-cluster) multiprocessors -- Oracle dominates this class of systems. Characteristics: -- Also hard to scale past a certain point: existing deployments typically have fewer than 10 machines. 3. Shared nothing -- Cluster of machines on high-speed network -- Called "clusters" or "blade servers" -- Each machine has its own memory and disk: lowest contention. -- NOTE: Because all machines today have many cores and many disks, then shared-nothing systems typically run many "nodes" on a single physical machine. Characteristics: -- Today, this is the most scalable architecture. -- Most difficult to administer and tune. ================================================================ Basic query processing on one node: *** Discuss in class: -- selection -- group-by -- join ================================================================ Data partitioning on a shared nothing architecture. Have a large table R(K,A,B,C), need to partition it on shared-nothing architecture; what are the options ? -- Block-partition -- Range partition on an attribute A -- Hash partition on attribute A 1. discuss how to compute GroupBy(R, A, sum(C)) for each of the partitions. 2. discuss how to compute GroupBy(R, B, sum(B)) for each of the partitions. ================================================================ **** In class: Discuss Parallel Hash-Joins R(A,B) Join S(B,C) ================================================================ Extra notes: - Parallelism also means running many operators within one query in parallel and running many queries in parallel. - Relational operators can run unchanged in a parallel DBMS, they only need to be connected with special "exchange" operators that shuffle data.