While most databases are accessible via ODBC, where turbodbc gives us an efficient way to turn results into a pandas.DataFrame, there are nowadays many databases that either ship solely with a JDBC driver or whose non-JDBC drivers are not part of the free or open-source offering. To access these databases, you can use JayDeBeApi, which uses JPype to call the JDBC driver. JPype starts a JVM inside the Python process and exposes the Java APIs as plain Python objects. While this is really convenient to use, the Java-Python bridge sadly comes at a high serialisation cost.
One of the main goals of Apache Arrow is to remove the serialisation cost of tabular data between different languages. A typical example where this is already used successfully is the Scala-Python bridge in PySpark. There, the communication between the JVM and Python is done via Py4J, a bridge between the Python and JVM processes. As multiple processes are involved, the serialisation cost is reduced, but communication and data copies between the two ecosystems still exist.
In the following, we want to present an alternative approach to retrieving data via JDBC that keeps the overhead between the JVM and pandas as minimal as possible. It consists of retrieving the whole result on the JVM side, transforming it into an Arrow RecordBatch there, and then passing the memory pointer to that RecordBatch over to Python. The important detail is that we only pass a pointer to the data to Python, not the data itself.
Benchmark setup
In this benchmark, we will use Apache Drill as the database, accessed via its official JDBC driver. For the data, we will use the January 2016 Yellow Cab New York City trip data converted to Parquet. We start Drill in its embedded mode using ./bin/drill-embedded. There we can already peek into the data.
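A minimal sanity check from the Drill shell could look like this (the Parquet path stays elided, as elsewhere in this post):

```sql
-- Peek at a single row of the trip data from the embedded Drill shell.
SELECT *
FROM dfs.`/…/data/yellow_tripdata_2016-01.parquet`
LIMIT 1;
```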
As the main goal here is to show how to access databases via JDBC from Python, we will now use JayDeBeApi to connect to this running Drill instance. For that, we start a JVM with jpype and then connect to the database using jaydebeapi and the drill-jdbc-all-1.16.0.jar JAR. For JDBC connections, it is important that we either have a classpath containing all Java dependencies or, as in this case, a JAR that already bundles all dependencies. Finally, we execute the query and use the result to construct a pandas.DataFrame.
```python
import os

import jaydebeapi
import jpype
import pandas as pd

classpath = os.path.join(
    os.getcwd(),
    "apache-drill-1.16.0/jars/jdbc-driver/drill-jdbc-all-1.16.0.jar",
)
jpype.startJVM(jpype.getDefaultJVMPath(), f"-Djava.class.path={classpath}")

conn = jaydebeapi.connect(
    "org.apache.drill.jdbc.Driver",
    "jdbc:drill:drillbit=127.0.0.1",
)
cursor = conn.cursor()

query = """
    SELECT *
    FROM dfs.`/…/data/yellow_tripdata_2016-01.parquet`
    LIMIT 1
"""
cursor.execute(query)
columns = [c[0] for c in cursor.description]
data = cursor.fetchall()
df = pd.DataFrame(data, columns=columns)
```
To measure the performance, we initially tried to run the full query, but as this didn't finish after 10 min, we reverted to running the SELECT with different LIMIT sizes. This led to the following response times on my laptop (mean ± std. dev. of 7 runs):
| LIMIT n | Time |
| --- | --- |
| 10000 | 7.11 s ± 58.6 ms |
| 100000 | 1min 9s ± 1.07 s |
| 1000000 | 11min 31s ± 4.76 s |
Out of curiosity, we retrieved the full result set once; this took an overall time of 2h 42min 59s on a warm JVM.
pyarrow.jvm and a combined JAR
As the above times were quite frustrating, we have high hopes that using Apache Arrow can bring a decent speedup for this operation. To use Apache Arrow Java and the Drill JDBC driver together, we need to bundle both on the JVM classpath. The simplest way to do this is to generate a new JAR that includes all dependencies, using a build tool like Apache Maven. With the following pom.xml, you get a fat JAR via mvn assembly:single. It is important that your Apache Arrow Java version matches your pyarrow version; in this case, both are at 0.15.1. It might still work when they differ, but as there is limited API stability between the two implementations, this could otherwise lead to crashes.
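A minimal sketch of such a pom.xml follows; the com.example coordinates are placeholders, and the exact dependency set is an assumption that may need adjusting to your setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Placeholder coordinates for the dummy bundling project -->
  <groupId>com.example</groupId>
  <artifactId>arrow-jdbc-bundle</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- Arrow Java JDBC adapter; the version must match the installed pyarrow -->
    <dependency>
      <groupId>org.apache.arrow</groupId>
      <artifactId>arrow-jdbc</artifactId>
      <version>0.15.1</version>
    </dependency>
    <!-- The Drill JDBC driver, itself already bundling its dependencies -->
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-jdbc-all</artifactId>
      <version>1.16.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- mvn assembly:single then produces a single jar-with-dependencies -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```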
After the JAR has been built, we want to start the JVM with it on the classpath. Sadly, jpype has the limitation that you need to restart your Python process when you want to restart the JVM with different parameters. We thus adjust the JVM startup command to point at the new JAR:
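The following sketch assumes the JAR name that would result from the placeholder Maven coordinates above:

```python
import os

import jpype

# Point the classpath at the fat JAR built above; the file name follows
# from the placeholder coordinates in the pom.xml and is an assumption.
classpath = os.path.join(
    os.getcwd(),
    "target/arrow-jdbc-bundle-1.0-SNAPSHOT-jar-with-dependencies.jar",
)
jpype.startJVM(jpype.getDefaultJVMPath(), f"-Djava.class.path={classpath}")
```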
To use Apache Arrow Java to retrieve the result, we need to instantiate a RootAllocator, which Arrow Java uses to allocate the off-heap memory, and obtain a connection to the database via DriverManager.
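A minimal sketch of that setup via jpype, reusing the Drill connection string from above:

```python
import sys

import jpype

# Look up the Java classes through jpype.
RootAllocator = jpype.JClass("org.apache.arrow.memory.RootAllocator")
DriverManager = jpype.JClass("java.sql.DriverManager")

# The RootAllocator hands out Arrow's off-heap memory; we effectively
# place no limit on it here.
ra = RootAllocator(sys.maxsize)
connection = DriverManager.getConnection("jdbc:drill:drillbit=127.0.0.1")
```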
Once this is set up, we can use the Java method sqlToArrow to query a database via JDBC, retrieve the result, and convert it to an Arrow RecordBatch on the Java side. With the helper pyarrow.jvm.record_batch, we can take the jpype reference to the Java object, extract the memory address of the RecordBatch, and create a matching Python pyarrow.RecordBatch object that points to the same memory.
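Continuing with connection and ra from the previous snippet, a sketch of the retrieval could look like this (the sqlToArrow overload shown follows the Arrow 0.15 Java API and should be double-checked against your version):

```python
import jpype
import pyarrow.jvm

# JdbcToArrow ships in the arrow-jdbc artifact bundled into our JAR.
JdbcToArrow = jpype.JClass("org.apache.arrow.adapter.jdbc.JdbcToArrow")

query = "SELECT * FROM dfs.`/…/data/yellow_tripdata_2016-01.parquet` LIMIT 10000"

# Run the query and materialise the result as Arrow data on the JVM side.
root = JdbcToArrow.sqlToArrow(connection, query, ra)

# Wrap the JVM-side buffers as a Python pyarrow.RecordBatch; only the
# memory address crosses the language boundary, the data is not copied.
df = pyarrow.jvm.record_batch(root).to_pandas()
```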
Using these commands, we can now execute the same queries again and compare them to the jaydebeapi times:
| LIMIT n | Time (JayDeBeApi) | Time (pyarrow.jvm) | Speedup |
| --- | --- | --- | --- |
| 10000 | 7.11 s ± 58.6 ms | 165 ms ± 5.86 ms | 43x |
| 100000 | 1min 9s ± 1.07 s | 538 ms ± 29.6 ms | 128x |
| 1000000 | 11min 31s ± 4.76 s | 5.05 s ± 596 ms | 136x |
With the pyarrow.jvm approach, we now get times similar to those of turbodbc.fetchallarrow() on databases that come with an open ODBC driver. It also brings the retrieval of the whole result set down to a more sane 50.2 s instead of the hours-long wait with jaydebeapi.
Conclusion
By moving the row-to-columnar conversion to the JVM and avoiding the creation of intermediate Python objects before constructing a pandas.DataFrame, we can speed up retrieval times for JDBC drivers in Python by over *100x*. As a user, you need to change your calls from jaydebeapi to the Apache Arrow Java API and pyarrow.jvm. Additionally, you have to take care that Apache Arrow Java and the JDBC driver are on the Java classpath. Using a common Java build tool, this can be achieved by simply declaring them as dependencies of a dummy package.