A question about the Spark Thrift Server

Is the Spark Thrift Server intended to be used for direct queries, like normal RDBMS queries but with huge data behind it?
I am working on a web application that is supposed to display some graphs based on analyzing a huge number of tweets.


The tweets are stored in JSON format on HDFS.


My manager (not a technical guy, by the way) is convinced that we should use Apache Spark to query the data from the files stored in HDFS, because Spark is much faster than MapReduce. He also knows that there is a Spark Thrift Server for querying the data using Spark.


As for me, I am not sure that Apache Spark is intended to be used this way. I think Apache Spark is meant to be a programming framework for developing applications that complete long-running tasks in less time.


Although Spark provides a Thrift server for querying the data, this does not mean it can be used as a normal DBMS.
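To make the question concrete: as I understand it, "querying through the Spark Thrift Server" means connecting over the HiveServer2 JDBC protocol, roughly like the sketch below. The host, port, credentials, and the `tweets` table are placeholders, not our actual setup:

```scala
import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // The Spark Thrift Server speaks the HiveServer2 protocol, so the
    // standard Hive JDBC driver is what a client uses to reach it.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Placeholder host and credentials; 10000 is the default port.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()

    // "tweets" is a hypothetical table registered in the metastore.
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM tweets")
    while (rs.next()) println(s"tweet count: ${rs.getLong(1)}")

    rs.close(); stmt.close(); conn.close()
  }
}
```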


What do you think?
3 Answers
Chris Schrader, Business Intelligence Consultant
Thrift has nothing to do with "querying" data.  It is used as a backend framework for coordinating services across a distributed computing platform: Apache Thrift - Home.  


Apache Spark does have a SQL component, but it's intended to be an extension of the batch processing framework.


From reading what you wrote, you have two different problems to solve:


1. Performing analysis on a large number of tweets stored in HDFS (not sure what you mean by large, but I'll assume hundreds of gigabytes or terabytes)
2. Serving/displaying the results of that analysis in a web application
Those two problems are almost always solved by two completely different platforms. You could use Spark to solve the first problem. I've used Spark before to process JSON data that was streamed in, and it works great. You could also use something like Hive for ease of ad-hoc querying and analysis.
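A minimal sketch of that first step, using the modern DataFrame API; the HDFS paths and the `lang` field are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object TweetAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TweetAnalysis")
      .getOrCreate()

    // Spark infers a schema from the JSON tweets; the path is a placeholder.
    val tweets = spark.read.json("hdfs:///data/tweets/")

    // Example batch aggregation: tweet counts per language
    // (assumes the tweets carry a "lang" field).
    val counts = tweets.groupBy("lang").count()

    // Persist the small summary for a separate serving layer.
    counts.write.mode("overwrite").parquet("hdfs:///data/tweet_summary/")

    spark.stop()
  }
}
```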


The second problem is typically solved by a traditional RDBMS. You will need to store the results/summary of the processing from the first problem in a very low-latency platform (this is not what Spark is good at). Since it sounds like you're writing a custom web application to display the data, you could probably leverage something like HBase as well (if you wanted to stay within the core Hadoop stack).
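As a sketch of that handoff, Spark can write the summary table straight into an RDBMS over JDBC, and the web application then queries that database rather than Spark. The connection details, credentials, and table name here are assumptions:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object PublishSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PublishSummary").getOrCreate()

    // The summary written by the batch job above; path is a placeholder.
    val summary = spark.read.parquet("hdfs:///data/tweet_summary/")

    val props = new Properties()
    props.setProperty("user", "webapp")       // hypothetical credentials
    props.setProperty("password", "secret")

    // A PostgreSQL URL is assumed here, but any JDBC-accessible RDBMS
    // (or HBase through its own client API) would serve the same role.
    summary.write
      .mode("overwrite")
      .jdbc("jdbc:postgresql://dbhost:5432/analytics", "tweet_summary", props)

    spark.stop()
  }
}
```

The web application then runs millisecond-scale queries against `tweet_summary` and never touches Spark or HDFS directly.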


Lastly, I would encourage your manager to actually go read the documentation for these projects. It sounds like he read some random blog post or heard someone talking about these technologies and either completely misunderstood them or bought into some kind of hype/sales pitch.
Written 23 Mar 2015


William Emmanuel Yu, In the business of figuring out ... how to store, what to do, how to make sen...
For our projects, we don't expose Spark or other Hadoop tool interfaces to an end-user-facing real-time application. Instead, we generally use them to pre-process data and load the results into a more real-time transactional system. Lately, this has been NoSQL-type database systems.
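A minimal sketch of that pattern, assuming the spark-cassandra-connector as the NoSQL target; the keyspace and table are made up, and the connector jar plus `spark.cassandra.connection.host` must be configured:

```scala
import org.apache.spark.sql.SparkSession

object LoadToNoSql {
  def main(args: Array[String]): Unit = {
    // Assumes spark.cassandra.connection.host is set in the Spark config.
    val spark = SparkSession.builder().appName("LoadToNoSql").getOrCreate()

    // Pre-processed results from the batch layer; path is a placeholder.
    val summary = spark.read.parquet("hdfs:///data/tweet_summary/")

    // Hypothetical keyspace/table; requires the spark-cassandra-connector.
    summary.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "analytics")
      .option("table", "tweet_summary")
      .mode("append")
      .save()

    spark.stop()
  }
}
```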
Written 24 Mar 2015
Adriaan Bloem, without context Information is just Data
I agree -- Spark may be faster than MapReduce, but it's still not really intended to be directly exposed to a web interface querying it. Or to put it this way, I'm sure you could, but that doesn't mean you should!


Depending on how complex the analysis you need to do on the tweets is, this sounds like a job for Elasticsearch with Kibana. That would greatly reduce the complexity of your setup; you probably don't even need HDFS to store the tweets in the first place (depending on where you're getting them from).
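As a rough sketch of that alternative: the elasticsearch-hadoop connector can index the tweets straight from Spark (or from wherever they arrive), and Kibana then charts them with no custom web code. The node address, paths, and index name below are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object IndexTweets {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IndexTweets")
      .config("es.nodes", "localhost:9200")  // hypothetical Elasticsearch node
      .getOrCreate()

    // Placeholder path; the tweets could equally be indexed at ingest time.
    val tweets = spark.read.json("hdfs:///data/tweets/")

    // Requires the elasticsearch-spark (elasticsearch-hadoop) jar;
    // "tweets" is a made-up index name.
    tweets.write
      .format("org.elasticsearch.spark.sql")
      .mode("append")
      .save("tweets")

    spark.stop()
  }
}
```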
Written 23 Mar 2015

