Machine Learning Scientist Sr at Robert Half

It's a great question and merits some elaboration. So the short answer is that Hadoop and Spark are not even an apples-to-apples comparison. Let me illustrate through my own personal experience.


1. Say a company wants to get into the big data ecosystem. A typical entry point that sets it up for success is to move its existing aggregations and reporting onto NoSQL.

2. Also, let's assume two user types in this company:
[a] Information/Data Analysts (can do RDB SQL pretty well)
[b] Developers who can do Java/Python and SQL well

3. The starting point is data. So the first step is to, say, replicate the entire DB into NoSQL on a daily basis. What this means is that you somehow extract and transport the data from the RDB to HDFS every day; every developer, including me, has their own personal flavor of how to do it (Sqoop, Pentaho, or plain data extracts and bash scp, whatever). One such flavor is sketched below.
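As a minimal sketch of that daily copy, here is one possible flavor using Spark's JDBC reader rather than Sqoop or Pentaho; the JDBC URL, credentials, table name and HDFS path are placeholders, not anything from the original setup:

```scala
import org.apache.spark.sql.SparkSession

// One flavor of the daily RDB -> HDFS replication: read a table over JDBC
// and land it on HDFS as tab-separated text.
object DailyTableExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-rdb-extract")
      .getOrCreate()

    // Read one source table from the relational DB (placeholder connection details).
    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://rdb-host:3306/enterprise")
      .option("dbtable", "customers")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("RDB_PASSWORD", ""))
      .load()

    // Land today's snapshot on HDFS as TSV; a full overwrite matches the
    // "replicate the entire DB daily" approach described above.
    customers.write
      .option("sep", "\t")
      .mode("overwrite")
      .csv("hdfs:///data/landing/customers")

    spark.stop()
  }
}
```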

4. But to put these flat-file extracts from the DB tables somewhere, you need a Hadoop ecosystem. You have several choices. I have shipped multiple Hadoop/Hive projects using the Cloudera CDH distribution and I have confidence in Cloudera. So you get a few EC2 instances, use Cloudera Manager to set up the Hadoop ecosystem, and it comes with Hive and Impala as well.

5. Next you define Hive table DDLs and point the location of each table to the location on HDFS where its flat files are (a sketch follows below). So in theory, if you have X tables in your RDB, I would start with X tables in Hive.
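A minimal sketch of such a DDL, with a made-up table and path; the statement can be run verbatim in the Hive shell or beeline, and it is issued through Spark's Hive support here only to keep all the examples in one language:

```scala
import org.apache.spark.sql.SparkSession

// Define an external Hive table whose LOCATION points at the HDFS directory
// holding the daily flat-file extract (columns and path are placeholders).
object DefineHiveTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("define-hive-tables")
      .enableHiveSupport() // talk to the shared Hive metastore
      .getOrCreate()

    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS customers (
        customer_id BIGINT,
        name        STRING,
        signup_date STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION 'hdfs:///data/landing/customers'
    """)

    spark.stop()
  }
}
```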

6. At this stage of your journey into the big data ecosystem, you can open the gates of your big data system to the [a] Information/Data Analysts, and they can run ad-hoc queries using Impala (in-memory and blazing fast) or Hive (more powerful functions available) and schedule daily reports etc. (Oozie is my choice for DAG workflows and scheduling). You may want to configure security on the Hive tables using Sentry. I use Presto as well, which sits very nicely on HDFS and uses the Hive metastore (like Impala).

7. With this ecosystem in place, the developers can start working on one or more of the following:
- Write Spark Streaming code (I prefer Scala) or Hadoop MapReduce code in Java to process logs and other data into, say, TSV or CSV that can be loaded into newly created Hive tables. These tables built from log files can then be used for any correlations with the Hive tables replicated from the enterprise DB.
- Build datasets from the above logs and tables to feed machine learning software such as Spark MLlib or H2O (by 0xdata).
- Spark SQL can talk nicely with the Hive metastore, so if you already have Hive tables, you can write SQL-ish code using Spark and get your answers (see the sketch after this list).
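Here is the sketch promised above, covering the first and third bullets under assumed table layouts: parse raw logs from HDFS, register the result in the Hive metastore, then correlate it with a table replicated from the enterprise DB using Spark SQL. The log path, column layout and table names are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Parse logs into a Hive table and correlate them with a replicated RDB table.
object LogsMeetWarehouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("logs-meet-warehouse")
      .enableHiveSupport() // reuse the same Hive metastore as Hive/Impala/Presto
      .getOrCreate()

    // Parse tab-separated web logs (placeholder layout: customer_id, url, ts).
    val logs = spark.read
      .option("sep", "\t")
      .csv("hdfs:///data/logs/web")
      .toDF("customer_id", "url", "ts")

    // Persist the parsed logs as a table in the shared Hive metastore.
    logs.write.mode("overwrite").saveAsTable("web_logs")

    // SQL-ish correlation between the log table and a table replicated from the RDB.
    val hitsPerCustomer = spark.sql("""
      SELECT c.name, COUNT(*) AS hits
      FROM web_logs l
      JOIN customers c
        ON CAST(l.customer_id AS BIGINT) = c.customer_id
      GROUP BY c.name
      ORDER BY hits DESC
    """)

    hitsPerCustomer.show(20)
    spark.stop()
  }
}
```

From here the same DataFrames could be handed to Spark MLlib, in the spirit of the second bullet.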

Sorry for the long answer to a short question! Hope it helps.
