Amazon Redshift

What You Need To Know

PyTN 2014
Photo: NASA

Who Is This Guy?

Brian Dailey

CTO & Co-founder at Stratasan

Monkeys around a bit with Python, PostgreSQL,
Django, and a few more bits and pieces.

@byeliad

Table of Contents

  • What is Redshift?
  • Why We Use It
  • Cost Overview
  • Rough Benchmarks
  • Redshift Architecture
  • Building Tables
  • Distribution and Sort Keys
  • Loading Data
  • Querying Data
  • Extracting Data
  • Optimization Tips
  • Web UI

What's Redshift?

bizspeak:

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service ...

Ha ha business!

"It achieves its high performance through extreme parallelism, columnar data storage, and smart data compression." - AirBnB

Why we use it

  • It looks a lot like PostgreSQL
  • Much cheaper than analytic appliances (Vertica et al)
  • Easy and fast ad-hoc aggregate queries

If this is PostgreSQL:

normal

This is Redshift:

wax + F1

What's it Cost?

US East (N. Virginia)
Node          vCPU   ECU   Memory (GiB)   Storage      I/O        Price
DW1 - Dense Storage
dw1.xlarge    2      4.4   15             2TB HDD      0.30GB/s   $0.850 per Hour
dw1.8xlarge   16     35    120            16TB HDD     2.40GB/s   $6.800 per Hour
DW2 - Dense Compute
dw2.large     2      7     15             0.16TB SSD   0.20GB/s   $0.250 per Hour
dw2.8xlarge   32     104   244            2.56TB SSD   3.70GB/s   $4.800 per Hour

Per Terabyte

US East (N. Virginia)
Effective Price per TB per Year
Node          On-Demand   1yr RI    3yr RI
dw1.xlarge    $3,723      $2,192    $999
dw1.8xlarge   $3,723      $2,192    $999
dw2.large     $13,688     $8,794    $5,498
dw2.8xlarge   $16,425     $11,018   $5,498

Speed

Scan Query

(chart: scan query speed)

Aggregate Query

(chart: aggregate query speed)

Join Query

(chart: join query speed)
"Hive, Impala, and Shark are used is because they offer a high degree of flexibility, both in terms of the underlying format of the data and the type of computation employed."

Full Study: https://amplab.cs.berkeley.edu/benchmark/

AirBnB tested against Hadoop...

"Simple range query against a giant table with 3 billion rows, we saw 5x performance improvement over Hive!"

What It's Good For

Analytics

... And not good for

Your web application

Redshift
Architecture

redshift architecture

Leader Node

Photo: Frederik Magle

Compute Nodes

Photo: Paul Robinson

Types of nodes...

  1. XL: 2 cores, 15GiB memory, 3 disk drives with 2TB of local attached storage.
  2. 8XL: 16 cores, 120GiB memory, 24 disk drives (16TB).
  3. (New) SSD XL: 160 GB of SSD storage, 2 Intel Xeon E5-2670v2 virtual cores, and 15 GiB of RAM.
  4. (New) SSD 8XL: 2.56 TB of SSD storage, 32 Intel Xeon E5-2670v2 virtual cores, and 244 GiB of RAM.
  • Each node has multiple slices (one for each core)
  • Executes queries in parallel over each slice
  • Data is divvied up to each slice by a distribution key...

Building Tables

unenforced
constraints

Photo: Nicolas Raymond

“Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.”
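
For example (a minimal sketch; the table and values are hypothetical), a declared primary key is only a hint to the planner, so duplicate rows load without complaint:

-- PRIMARY KEY is accepted but not enforced.
CREATE TABLE visits (
    visit_id   BIGINT PRIMARY KEY,
    patient_id BIGINT,
    visit_date DATE
);

INSERT INTO visits VALUES (1, 100, '2014-01-01');
INSERT INTO visits VALUES (1, 100, '2014-01-01');  -- no error: the duplicate is stored

SELECT COUNT(*) FROM visits;  -- 2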

unsupported types

  • Arrays
  • INTERVAL, TIME, TIMESTAMP WITH TIME ZONE
  • SERIAL, MONEY
  • XML
  • UUID
  • ...and more...

supported types

  • SMALLINT
  • INTEGER
  • BIGINT
  • DECIMAL
  • REAL
  • DOUBLE PRECISION
  • BOOLEAN
  • CHAR
  • VARCHAR
  • DATE
  • TIMESTAMP
  • (New) JSON
Dragons
"CHAR and VARCHAR data types are defined
in terms of bytes, not characters."
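
In practice that means multi-byte UTF-8 data eats into the declared length faster than you might expect (sketch; hypothetical column):

-- VARCHAR(3) holds 3 bytes, not 3 characters.
CREATE TABLE names (name VARCHAR(3));

INSERT INTO names VALUES ('abc');  -- ok: 3 bytes
INSERT INTO names VALUES ('héy');  -- fails: 'é' is 2 bytes in UTF-8, so the value is 4 bytes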

Distributing Data

Distribution Key

  • Default is EVEN (round-robin) distribution.
  • KEY distribution stores rows with like values together on the same node slices.
  • ALL distribution copies the table to each node.
  • The aim is to distribute data evenly across slices.

In general, if you use KEY you still want to aim for even distribution!
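
A minimal sketch of the syntax (table and column names are hypothetical):

-- KEY distribution co-locates rows that share a value on the same slice.
CREATE TABLE page_views (
    user_id   BIGINT,
    viewed_at TIMESTAMP,
    url       VARCHAR(2048)
)
DISTSTYLE KEY
DISTKEY (user_id);

-- Small lookup tables can be copied to every node instead.
CREATE TABLE states (
    state_code CHAR(2),
    state_name VARCHAR(64)
)
DISTSTYLE ALL;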

Sort Key

Amazon Redshift stores your data on disk in sorted order according to the sort key.
“Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata.”
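
So sorting on the column you filter by most lets range queries skip whole blocks whose min/max fall outside the predicate. A sketch (hypothetical table):

CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    created_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (created_at);

-- Only touches blocks whose min/max overlap January 2014.
SELECT COUNT(*)
FROM events
WHERE created_at BETWEEN '2014-01-01' AND '2014-01-31';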

One more thing...

  • You can't ALTER COLUMN.
  • Only one ADD COLUMN per ALTER TABLE.


Photo credit: Ben Stanfield
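
The usual workaround for a column you can't ALTER is copy-and-swap (a sketch, reusing the hypothetical page_views table; note each ALTER TABLE adds only one column):

ALTER TABLE page_views ADD COLUMN url_wide VARCHAR(4096);
UPDATE page_views SET url_wide = url;
ALTER TABLE page_views DROP COLUMN url;
ALTER TABLE page_views RENAME COLUMN url_wide TO url;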

Loading Data Into Redshift

Your Options

  • S3
  • (New) Remote Host (via SSH)
  • DynamoDB

Best Practice

S3

(Per Amazon)

s3cmd is your friend.

http://s3tools.org/s3cmd

s3cmd put gigantic_csv_file.csv.part.gz.00000 s3://mybucket/mypath/

COPY table_name [ (column1 [, column2, ...]) ]
FROM 's3://objectpath'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
[ option [ ... ] ]
  • objectpath is a prefix. Loads in parallel.
  • gzip
  • encryption
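
A concrete version of that COPY, assuming gzipped CSV parts under the prefix above (bucket, table, and credentials are placeholders):

COPY page_views (user_id, viewed_at, url)
FROM 's3://mybucket/mypath/gigantic_csv_file.csv.part.gz'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
DELIMITER ',';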

Loading via Remote Host

Redshift connects to your instances via SSH, executes commands, and loads the generated text output.
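
Roughly, you point COPY at an SSH manifest in S3 — a JSON file listing each host, the command to run, and its public key (sketch; paths and table are hypothetical):

COPY page_views
FROM 's3://mybucket/ssh_manifest.json'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '\t'
SSH;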

Debugging

  • STL_LOAD_ERRORS
  • STL_LOADERROR_DETAIL
  • STL_S3CLIENT
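
For example (a sketch), the most recent load errors with the offending file, line, and column:

SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;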

Debugging

  • NOLOAD
  • MAXERROR
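
A sketch of both options on one COPY (placeholders throughout):

-- NOLOAD parses the files and reports errors without loading any rows.
-- MAXERROR lets a real load continue past a bounded number of bad rows.
COPY page_views
FROM 's3://mybucket/mypath/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
NOLOAD
MAXERROR 100;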
Dragons

You need free space of roughly 2.5x the data size to load sorted data or to vacuum a table.

Date and timestamp columns must all use the same format (it can only be defined once per COPY).

(ACCEPTANYDATE will load NULL when format does not match.)

Ingest via SQL

Sure...

WTF is wrong with you?

QUERYING DATA

Looks/smells like Postgres 8.0.2.

  • AVG(INT) will return INT.
  • Note that this is NOT listed in "Features implemented differently."
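
For illustration (hypothetical table), cast before averaging if you want the fractional part:

SELECT AVG(age) FROM patients;                 -- e.g. 41, silently truncated
SELECT AVG(age::DECIMAL(10,2)) FROM patients;  -- e.g. 41.70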

EXTRACTING DATA

unload ('select * from tablename')
to 's3://mybucket/'
credentials 'aws_access_key_id=[your-access-key-id];
aws_secret_access_key=[your-secret-access-key]';

UNLOAD dumps to multiple files

s3://bucket/table_data_0000
s3://bucket/table_data_0001

...unless you trick it.

unload ('select * from (
select * from ... where ...) limit 2147483647') ...

But if you're over 5.8GiB, files will still split.

It's also slower.

Optimization

Workload Manager (WLM)

Controls how many queries run concurrently. Fewer slots means more memory per query.

"By default, a cluster is configured with one queue that can run five queries concurrently."

VACUUM (ANALYZE)

Vacuum early, vacuum often.

Anytime you've made non-trivial changes to data.

VACUUM is expensive. Schedule for slow days.

Only one VACUUM at a time per cluster.

VACUUM doesn't resort unaffected slices.
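
The basic maintenance pass looks like this (hypothetical table): VACUUM re-sorts rows and reclaims space from deleted rows, ANALYZE refreshes planner statistics.

VACUUM page_views;
ANALYZE page_views;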

Debugging VACUUM

  • svv_vacuum_progress
  • svv_vacuum_summary
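
For example:

SELECT * FROM svv_vacuum_progress;  -- what the current (or last) vacuum is doing
SELECT * FROM svv_vacuum_summary;   -- summary of recent vacuum operations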

Size your columns appropriately.

DATE or TIMESTAMP > CHAR
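
For example (sketch), a DATE column is smaller than CHAR(10) and lets the block min/max metadata prune range scans:

CREATE TABLE visits_text (visit_date CHAR(10));      -- '2014-02-22' stored as text
CREATE TABLE visits_date (visit_date DATE SORTKEY);  -- native type, sortable, prunable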

Test different configurations!

  • 2 DW1 8XL nodes?
  • 16 DW1 XL nodes?
  • 5 DW2 XL nodes?

Web UI

(screenshots: create cluster wizard, performance console, queries, query detail, copy, copy detail)

Wrapping up

Redshift is cost-effective.

...fast.

...familiar.

Redshift

Nice balance!

"In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help." -- AirBnB

"Oddly enough, Redshift isn’t going to sell because devs think it’s super-duper-whizz-bang. It’s going to sell because it took a problem and an industry famous for it’s opaque pricing, high TCO, and unreliable results and completely turned it on its head." -- AK Tech Block

THE END

Thanks!

Questions? Tomatoes?

@byeliad
github.com/briandailey
dailytechnology.net/slides-amazon-redshift

Thanks to @jasonamyers for suffering through my first draft.