Amazon Redshift

What You Need To Know

PyTN 2014
Photo: NASA

Who Is This Guy?

Brian Dailey

CTO & Co-founder at Stratasan

Monkeys around a bit with Python, PostgreSQL,
Django, and a few more bits and pieces.

@byeliad

Table of Contents

  • What is Redshift?
  • Why We Use It
  • Cost Overview
  • Rough Benchmarks
  • Redshift Architecture
  • Building Tables
  • Distribution and Sort Keys
  • Loading Data
  • Querying Data
  • Extracting Data
  • Optimization Tips
  • Web UI

What's Redshift?

bizspeak:

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service ...

Ha ha business!

"It achieves its high performance through extreme parallelism, columnar data storage, and smart data compression." - AirBnB

Why we use it

  • It looks a lot like PostgreSQL
  • Much cheaper than analytic appliances (Vertica et al)
  • Easy and fast ad-hoc aggregate queries

If this is PostgreSQL:

normal

This is Redshift:

wax + F1

What's it Cost?

US East (N. Virginia)
Node          vCPU   ECU   Memory (GiB)   Storage      I/O        Price
DW1 - Dense Storage
dw1.xlarge    2      4.4   15             2TB HDD      0.30GB/s   $0.850 per Hour
dw1.8xlarge   16     35    120            16TB HDD     2.40GB/s   $6.800 per Hour
DW2 - Dense Compute
dw2.large     2      7     15             0.16TB SSD   0.20GB/s   $0.250 per Hour
dw2.8xlarge   32     104   244            2.56TB SSD   3.70GB/s   $4.800 per Hour

Per Terabyte

US East (N. Virginia)
Effective Price per TB per Year
Node          On-Demand   1yr RI    3yr RI
dw1.xlarge    $3,723      $2,192    $999
dw1.8xlarge   $3,723      $2,192    $999
dw2.large     $13,688     $8,794    $5,498
dw2.8xlarge   $16,425     $11,018   $5,498

Speed

Scan Query

(chart: scan query speed)

Aggregate Query

(chart: aggregate query speed)

Join Query

(chart: join query speed)
"Hive, Impala, and Shark are used is because they offer a high degree of flexibility, both in terms of the underlying format of the data and the type of computation employed."

Full Study: https://amplab.cs.berkeley.edu/benchmark/

AirBnB tested against Hadoop...

"Simple range query against a giant table with 3 billion rows, we saw 5x performance improvement over Hive!"

What It's Good For

Analytics

... And not good for

Your web application

Redshift
Architecture

redshift architecture

Leader Node

Photo: Frederik Magle

Compute Nodes

Photo: Paul Robinson

Types of nodes...

  1. XL: 2 cores, 15GiB memory, 3 disk drives with 2TB of local attached storage.
  2. 8XL: 16 cores, 120GiB memory, 24 disk drives (16TB).
  3. (New) SSD XL: 160 GB of SSD storage, 2 Intel Xeon E5-2670v2 virtual cores, and 15 GiB of RAM.
  4. (New) SSD 8XL: 2.56 TB of SSD storage, 32 Intel Xeon E5-2670v2 virtual cores, and 244 GiB of RAM.
  • Each node has multiple slices (one for each core)
  • Executes queries in parallel over each slice
  • Data is divvied up to each slice by a distribution key...

Building Tables

unenforced
constraints

Photo: Nicolas Raymond

“Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.”
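
For example (a minimal sketch; the table and values are hypothetical), a declared primary key is only a hint to the planner, so duplicate rows load without complaint:

-- PRIMARY KEY is accepted but not enforced.
CREATE TABLE visits (
    visit_id   BIGINT PRIMARY KEY,
    patient_id BIGINT,
    visit_date DATE
);

INSERT INTO visits VALUES (1, 100, '2014-01-01');
INSERT INTO visits VALUES (1, 100, '2014-01-01');  -- no error: the duplicate is stored

SELECT COUNT(*) FROM visits;  -- 2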

unsupported types

  • Arrays
  • INTERVAL, TIME, TIMESTAMP WITH TIME ZONE
  • SERIAL, MONEY
  • XML
  • UUID
  • ...and more...

supported types

  • SMALLINT
  • INTEGER
  • BIGINT
  • DECIMAL
  • REAL
  • DOUBLE PRECISION
  • BOOLEAN
  • CHAR
  • VARCHAR
  • DATE
  • TIMESTAMP
  • (New) JSON
Dragons
"CHAR and VARCHAR data types are defined
in terms of bytes, not characters."
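
In practice that means multi-byte UTF-8 data eats into the declared length faster than you might expect (sketch; hypothetical column):

-- VARCHAR(3) holds 3 bytes, not 3 characters.
CREATE TABLE names (name VARCHAR(3));

INSERT INTO names VALUES ('abc');  -- ok: 3 bytes
INSERT INTO names VALUES ('héy');  -- fails: 'é' is 2 bytes in UTF-8, so the value is 4 bytes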

Distributing Data

Distribution Key

  • Default is EVEN (round-robin) distribution.
  • KEY distribution stores rows with like values together on the same node slices.
  • ALL distribution copies the table to each node.
  • The aim is to distribute data evenly across slices.

In general, if you use KEY you still want to aim for even distribution!
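
A minimal sketch of the syntax (table and column names are hypothetical):

-- KEY distribution co-locates rows that share a value on the same slice.
CREATE TABLE page_views (
    user_id   BIGINT,
    viewed_at TIMESTAMP,
    url       VARCHAR(2048)
)
DISTSTYLE KEY
DISTKEY (user_id);

-- Small lookup tables can be copied to every node instead.
CREATE TABLE states (
    state_code CHAR(2),
    state_name VARCHAR(64)
)
DISTSTYLE ALL;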

Sort Key

Amazon Redshift stores your data on disk in sorted order according to the sort key.
“Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata.”
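
So sorting on the column you filter by most lets range queries skip whole blocks whose min/max fall outside the predicate. A sketch (hypothetical table):

CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    created_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (created_at);

-- Only touches blocks whose min/max overlap January 2014.
SELECT COUNT(*)
FROM events
WHERE created_at BETWEEN '2014-01-01' AND '2014-01-31';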

One more thing...

  • You can't ALTER COLUMN.
  • Only one ADD COLUMN per ALTER TABLE.


Photo credit: Ben Stanfield
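
The usual workaround for a column you can't ALTER is copy-and-swap (a sketch, reusing the hypothetical page_views table; note each ALTER TABLE adds only one column):

ALTER TABLE page_views ADD COLUMN url_wide VARCHAR(4096);
UPDATE page_views SET url_wide = url;
ALTER TABLE page_views DROP COLUMN url;
ALTER TABLE page_views RENAME COLUMN url_wide TO url;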

Loading Data Into Redshift

Your Options

  • S3
  • (New) Remote Host (via SSH)
  • DynamoDB

Best Practice

S3

(Per Amazon)

s3cmd is your friend.

http://s3tools.org/s3cmd

s3cmd put gigantic_csv_file.csv.part.gz.00000 s3://mybucket/mypath/

COPY table_name [ (column1 [, column2, ...]) ]
FROM 's3://objectpath'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
[ option [ ... ] ]
  • objectpath is a prefix. Loads in parallel.
  • gzip
  • encryption
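
A concrete version of that COPY, assuming gzipped CSV parts under the prefix above (bucket, table, and credentials are placeholders):

COPY page_views (user_id, viewed_at, url)
FROM 's3://mybucket/mypath/gigantic_csv_file.csv.part.gz'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
DELIMITER ',';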

Loading via Remote Host

Redshift connects to your instances via SSH, executes commands, and loads the generated text output.
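
Roughly, you point COPY at an SSH manifest in S3 — a JSON file listing each host, the command to run, and its public key (sketch; paths and table are hypothetical):

COPY page_views
FROM 's3://mybucket/ssh_manifest.json'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '\t'
SSH;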

Debugging

  • STL_LOAD_ERRORS
  • STL_LOADERROR_DETAIL
  • STL_S3CLIENT
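
For example (a sketch), the most recent load errors with the offending file, line, and column:

SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;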

Debugging

  • NOLOAD
  • MAXERROR
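
A sketch of both options on one COPY (placeholders throughout):

-- NOLOAD parses the files and reports errors without loading any rows.
-- MAXERROR lets a real load continue past a bounded number of bad rows.
COPY page_views
FROM 's3://mybucket/mypath/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
NOLOAD
MAXERROR 100;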
Dragons

You need free space of roughly 2.5x the data size to load sorted data or to vacuum a table.

Date and timestamp columns must all use the same format (it can only be defined once per COPY).

(ACCEPTANYDATE will load NULL when format does not match.)

Ingest via SQL

Sure...

WTF is wrong with you?

QUERYING DATA

Looks/smells like Postgres 8.0.2.

  • AVG(INT) will return INT.
  • Note that this is NOT listed in "Features implemented differently."
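
For illustration (hypothetical table), cast before averaging if you want the fractional part:

SELECT AVG(age) FROM patients;                 -- e.g. 41, silently truncated
SELECT AVG(age::DECIMAL(10,2)) FROM patients;  -- e.g. 41.70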

EXTRACTING DATA

unload ('select * from tablename')
to 's3://mybucket/'
credentials 'aws_access_key_id=[your-access-key-id];
aws_secret_access_key=[your-secret-access-key]';

UNLOAD dumps to multiple files

s3://bucket/table_data_0000
s3://bucket/table_data_0001

...unless you trick it.

unload ('select * from (
select * from ... where ...) limit 2147483647') ...

But if you're over 5.8GiB, files will still split.

It's also slower.

Optimization

Workload Manager (WLM)

Controls how many queries run concurrently. Fewer slots means more memory per query.

"By default, a cluster is configured with one queue that can run five queries concurrently."

VACUUM (ANALYZE)

Vacuum early, vacuum often.

Anytime you've made non-trivial changes to data.

VACUUM is expensive. Schedule for slow days.

Only one VACUUM at a time per cluster.

VACUUM doesn't resort unaffected slices.
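
The basic maintenance pass looks like this (hypothetical table): VACUUM re-sorts rows and reclaims space from deleted rows, ANALYZE refreshes planner statistics.

VACUUM page_views;
ANALYZE page_views;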

Debugging VACUUM

  • svv_vacuum_progress
  • svv_vacuum_summary
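
For example:

SELECT * FROM svv_vacuum_progress;  -- what the current (or last) vacuum is doing
SELECT * FROM svv_vacuum_summary;   -- summary of recent vacuum operations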

Size your columns appropriately.

DATE or TIMESTAMP > CHAR
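
For example (sketch), a DATE column is smaller than CHAR(10) and lets the block min/max metadata prune range scans:

CREATE TABLE visits_text (visit_date CHAR(10));      -- '2014-02-22' stored as text
CREATE TABLE visits_date (visit_date DATE SORTKEY);  -- native type, sortable, prunable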

Test different configurations!

  • 2 DW1 8XL nodes?
  • 16 DW1 XL nodes?
  • 5 DW2 XL nodes?

Web UI

(screenshots: create cluster wizard, performance console, queries, query detail, copy, copy detail)

Wrapping up

Redshift is cost-effective.

...fast.

...familiar.

Redshift

Nice balance!

"In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help." -- AirBnB

"Oddly enough, Redshift isn’t going to sell because devs think it’s super-duper-whizz-bang. It’s going to sell because it took a problem and an industry famous for it’s opaque pricing, high TCO, and unreliable results and completely turned it on its head." -- AK Tech Block

THE END

Thanks!

Questions? Tomatoes?

@byeliad
github.com/briandailey
dailytechnology.net/slides-amazon-redshift

Thanks to @jasonamyers for suffering through my first draft.