================================================================
CSE 344 -- Fall 2011
Lecture 21:   Parallel Databases

================================================================

Two MAJOR trends that are pushing Computer Science toward parallel
computation:

1. Moore's law (exponential growth in density of transistors per chip)
   is no longer reflected in increased clock speeds.  Increased
   hardware performance will be available only through parallelism.
   Think multicore: 4 cores today, perhaps 64 in a few years.

2. Cloud computing commoditizes access to large compute clusters.  Ten
   years ago, only google could afford 1000 servers; today you can
   rent this by swiping your credit card at Amazon Web Services
   (AWS)... almost... if you need more than 20, you need to talk to
   them.

Note: Moore's law says that the number of transistors that can be
placed inexpensively on an integrated circuit doubles approximately
every two years [Intel co-founder Gordon E. Moore described the trend
in his 1965 paper and predicted that it will last for at least 10
years]


================================================================

We also live in the era of "big data": Everyone (companies,
scientists, etc.) have data that is too big, too fast, and too complex
to be managed without changing tools and processes.

Conveniently relational algebra is easy to parallelize and
parallel DBMSs have already been studied in the 80's!

As a result, we are seeing an explosion of and a huge success of db
analytics companies

-- Greenplum founded in 2003 acquired by EMC in 2010
A parallel shared-nothing DBMS (this lecture)

-- Vertica founded in 2005 and acquired by HP in 2011
A parallel, column-store shared-nothing DBMS (see 444 for discussion of column-stores)

-- DATAllegro founded in 2003 acquired by Microsoft in 2008 
A parallel, shared-nothing DBMS

Aster Data Systems founded in 2005 acquired by Teradata in 2011 
A parallel, shared-nothing, MapReduce-based data processing system (next
lecture).
SQL on top of MapReduce

Netezza founded in 2000 and acquired by IBM in 2010
A parallel, shared-nothing DBMS.

Many companies in particular DATAllegro and Neteezza provide
a data appliance: combination of hardware and software in one package.
Others have appliance versions of their systems (DATAllegro).
Others yet, use some proprietary hardware (Teradata)
or at least identify preferred hardware configurations (Greenplum).

It is a great time to be in the data management, data
mining/statistics, or machine learning fields.


================================================================
There are two basic methods today for building a parallel
data processing engines:
-- Based on parallel DBMSs from the 80's (this lecture).
-- Based on the MapReduce system from Google, published in 2004 (next
lecture).

A few observations:
-- MapReduce-based systems all include many of the key parallel
DBMS features including a declarative query language (SQL or some
close variant), indexes, and more.
-- Most companies use both types of systems at the same time.
-- Most data processing system vendors have developed methods to 
integrate both types of systems within one engine.
-- Neither tool is sufficient to address all of today's challenges. A lot
more work is needed!


================================================================
Traditional parallel databases

Parallel DBMSs can be used both for transaction processing (OLTP) and
for running long queries that perform various data analytics (OLAP).
We will focus on the latter in this course.

Terminology:

    P = number of processors (or servers)

    Speedup: Query time should shrink proportionally
    to the number of processors used.

         -- Ideal: linear speedup.
         -- In practice it is hard to achieve linear 
            speed-up due to start-up costs, interference, and skew.

   Scaleup: If we grow the input data size in proportion to the
   number of machines, the runtime should remain constant.

         -- Ideal: linear scaleup
         -- In practice: same problem as above.


Other terminology commonly used today (just to make things more
confusing):

Scale up: Increase performance by using a more powerful machine.
Scale out: Increase performance by adding machines to the system.
         

================================================================
Types of parallel architectures:

1. Shared Memory
    -- Processors share both RAM and disk
    -- Dozens to hundreds of processors
    -- Example: SQL Server runs on a single machine and can
       leverage many threads to get a query to run faster (see
       query plans on IISQLSRV)

  Characteristics:
    -- Easy to use and program.
    -- But very expensive to scale: last remaining cash
       cows in the hardware industry

2. Shared Disk
    -- All processors access the same disks
    -- Found in the largest "single-box" (non-cluster) multiprocessors
    -- Oracle dominates this class of systems.

  Characteristics:
    -- Also hard to scale past a certain point: existing deployments
       typically have fewer than 10 machines.

3. Shared nothing
    -- Cluster of machines on high-speed network
    -- Called "clusters" or "blade servers"
    -- Each machine has its own memory and disk: lowest contention.
    -- NOTE: Because all machines today have many cores and many
    disks, then shared-nothing systems typically run many "nodes" 
    on a single physical machine.

  Characteristics:
    -- Today, this is the most scalable architecture.
    -- Most difficult to administer and tune.

================================================================

Basic query processing on one node:

*** Discuss in class:
   -- selection
   -- group-by
   -- join

================================================================

Data partitioning on a shared nothing architecture.

Have a large table R(K,A,B,C), need to partition it on shared-nothing
architecture; what are the options ?

  -- Block-partition
  -- Range partition on an attribute A
  -- Hash partition on attribute A

1. discuss how to compute GroupBy(R, A, sum(C)) for each of the
   partitions.

2. discuss how to compute GroupBy(R, B, sum(B)) for each of the
   partitions.

================================================================

**** In class: Discuss Parallel Hash-Joins

R(A,B) Join S(B,C)


================================================================
Extra notes:

- Parallelism also means running many operators within one query
  in parallel and running many queries in parallel.

- Relational operators can run unchanged in a parallel DBMS,
  they only need to be connected with special "exchange" operators
  that shuffle data.