AWS

AWS is IaaS (Infrastructure as a Service), like Azure and Google Cloud: virtualized infrastructure used by large companies that need many different kinds of servers.

This is unlike DigitalOcean and Linode (VPS - virtual private server), which are more for building a WordPress blog or other small website that runs on a single server.

Services

  • CDN (CloudFront)
    Content delivery network; serves the website from the edge location closest to the user.
  • Glacier
    Stores data that is not accessed frequently (archival storage)
  • Storage
    Stores data that is accessed frequently
  • Virtual Server (EC2)
  • Lambda
    Pure compute without having to manage a server (see the handler sketch after this list).
  • Database
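
A minimal Lambda handler in Java, as a sketch of the "pure compute" idea above: you implement one method and AWS runs it on demand, so there is no server to manage. This assumes the aws-lambda-java-core library is on the classpath; the class name and greeting are made up for illustration.

  import com.amazonaws.services.lambda.runtime.Context;
  import com.amazonaws.services.lambda.runtime.RequestHandler;

  // Hypothetical handler: Lambda calls handleRequest() once per event,
  // so we only write the function body and never provision a server.
  public class HelloHandler implements RequestHandler<String, String> {
      @Override
      public String handleRequest(String input, Context context) {
          return "Hello, " + input;
      }
  }

You would package this as a jar, upload it to Lambda, and set the handler to HelloHandler; AWS provisions and scales the compute for you.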

Benefits

  • Scalable (just spend more money)
  • Total Cost of Ownership is low: running your own hardware means hiring people to deal with the servers and supporting systems, like power and cooling
  • Highly reliable for the price point
  • Centralized Billing and Management

Problems

  • vendor lock-in
  • learning curve
  • costs add up

Pricing

  • compute
  • storage
  • bandwidth
  • interaction

Normal File system

  • Linux default disk block size = 4 KB; if a file is smaller than a block, the rest of the block is wasted (e.g. a 1 KB file still occupies a full 4 KB block)
  • GFS (Google) <-> HDFS (open-source counterpart)
  • MapReduce (Google) <-> Hadoop MapReduce (open-source counterpart)

HDFS

  • Specially designed FS for storing big data with a streaming access pattern (write once, read as many times as you want)
  • default block size = 64 MB; if a file is smaller than a block, the rest of the block will NOT be wasted (see the rough numbers below)
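
Rough numbers to show why the bigger block matters for big data: a 1 GB file needs 1 GB / 4 KB = 262,144 blocks on a normal file system, but only 1 GB / 64 MB = 16 blocks on HDFS, so the name node tracks far less metadata per file and reads stay large and sequential.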

Hadoop

daemons

  • master daemons: name node, secondary name node, job tracker
  • slave daemons: data node, task tracker

example - theory

  • we (the client) have 200 MB of data, so we need 4 blocks (at 64 MB per block)
  • we need 1 name node (nn) and several data nodes (dn), e.g. 8 data nodes
  • the nn creates the metadata and the daemons
  • the nn passes the metadata back to the client; the client then distributes the blocks to the data nodes and replicates them based on the info from the name node (see the Java sketch after this list)
  • the data nodes send heartbeats back to the nn to signal that they are alive
  • the client sends the code to the data nodes
  • the job tracker tells the task trackers to do their jobs
  • after the map tasks are finished, the job tracker assigns a reducer
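
To make the metadata step concrete, here is a small Java sketch (illustration only, not from the lecture) that asks the name node where each block of a file lives, using the standard Hadoop FileSystem API. The path /user/class/input.txt reuses the example from the HDFS instructions below; the class name is made up.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ShowBlocks {
      public static void main(String[] args) throws Exception {
          // Connect to the name node configured in core-site.xml / hdfs-site.xml.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // Ask the name node for this file's metadata.
          Path p = new Path("/user/class/input.txt");
          FileStatus status = fs.getFileStatus(p);

          // For each block, the name node reports which data nodes hold a replica.
          for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
              System.out.println("offset " + b.getOffset()
                      + " length " + b.getLength()
                      + " hosts " + String.join(",", b.getHosts()));
          }
          fs.close();
      }
  }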

example - real world

  • split the data (documents) into input splits, pass them to record readers,
    and send the records to the mappers (the default for text jobs is to split the document into lines and send the lines to the mappers)
  • then shuffle the data so that pairs with the same key end up together; the default shuffle (sort) order in Hadoop is alphabetical
  • then reduce (each reduce call processes all the values for one key; see the WordCount sketch below)
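
The classic WordCount in Java makes this flow concrete. The sketch below follows the standard Hadoop example; the class name matches the WordCount run in step 5 of the HDFS instructions, but the package layout there is not given in these notes. The map step emits (word, 1) pairs, the framework shuffles and groups them by key, and each reduce call sums the counts for one word.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map: one call per input line; emit (word, 1) for every token.
      public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Reduce: the shuffle has grouped all (word, 1) pairs with the same key;
      // one reduce call sums the counts for that single word.
      public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          // args[0] = input path in HDFS, args[1] = output path (must not exist yet).
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

Run it against the file uploaded in step 2 below, e.g. hadoop jar wordcount.jar WordCount /user/class/input.txt /user/class/output (the output directory must not exist yet).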

HDFS instructions

  • step 1 basic commands
    hdfs dfs -ls /, hdfs dfs -mkdir, hdfs dfs -put, hdfs dfs -get
  • step 2 move the file to HDFS (a Java equivalent is sketched after this list)
    hdfs dfs -put input.txt /user/class/
  • step 3 compile
    javac -cp $HADOOP_core.jar *.java
  • step 4 package the classes into a jar
    jar cvf test.jar *.class
  • step 5 run the job
    hadoop jar wordcount.jar ...WordCount
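
The shell commands above also have Java equivalents in the Hadoop FileSystem API. The sketch below is illustrative only (the class name is made up, and the paths reuse the step 2 example); it assumes the Hadoop client jars are on the classpath.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsPutGet {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // hdfs dfs -mkdir /user/class
          fs.mkdirs(new Path("/user/class"));

          // hdfs dfs -put input.txt /user/class/
          fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/class/input.txt"));

          // hdfs dfs -ls /
          for (FileStatus s : fs.listStatus(new Path("/"))) {
              System.out.println(s.getPath());
          }

          // hdfs dfs -get /user/class/input.txt copy.txt
          fs.copyToLocalFile(new Path("/user/class/input.txt"), new Path("copy.txt"));

          fs.close();
      }
  }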

Setup

Set up your AWS account by following the steps below:

  1. Go to AWS (https://aws.amazon.com/) and create an account. You need to enter your credit card info.
  2. You can find your AWS account number in your AWS profile. Use that account number to apply for AWS Educate credits at https://aws.amazon.com/education/awseducate/apply/. It will take a few hours before you receive an email confirming your credits are active.

If you have not received your AWS Educate credits and you are not using free-tier services, your credit card will be charged for usage, and you will be responsible for any costs incurred.
