This is the final installment of the 【HXE】 series, summarizing all the steps required to install HANA Express Edition.
Installing the database and configuring connections
Downloader
Go to the SAP developer website, read the installation roadmap carefully, and after confirming the installation requirements, download the downloader from that site. Below, I walk through the installation using the virtual machine install as an example.
Installation guide
Download the guide or tutorial (you can also refer to this video) and complete the installation.
Configuring the database connection
Configure the database connection by following this guide.
Configure Eclipse by following this guide.
Configuring other connections
For PuTTY, see this guide.
WinSCP: omitted.
Basic configuration
Checking information
See this guide.
After the HANA database is installed, the system creates one user, SYSTEM, and two database instances, SYSTEMDB and HXE; the steps below operate on these two databases.
As user SYSTEM on database SYSTEMDB, check the current database information:
SELECT DATABASE_NAME, DESCRIPTION, ACTIVE_STATUS, RESTART_MODE FROM SYS.M_DATABASES ORDER BY 1;
The result should be the following table:
| DATABASE_NAME | DESCRIPTION | ACTIVE_STATUS | RESTART_MODE |
|---|---|---|---|
| HXE | HXE-90 | YES | DEFAULT |
| SYSTEMDB | SystemDB-HXE-90 | YES | DEFAULT |
If there is no HXE entry, execute as user SYSTEM on database SYSTEMDB:
CREATE DATABASE HXE SYSTEM USER PASSWORD <password>;
If the HXE entry exists but its ACTIVE_STATUS is NO, execute as user SYSTEM on database SYSTEMDB:
ALTER SYSTEM START DATABASE HXE;
As user SYSTEM on database HXE, check the installed AFL packages:
SELECT * FROM SYS.AFL_PACKAGES;
Creating users and controlling privileges
As user SYSTEM on database HXE, create the user and grant privileges:
-- Uncomment this if you want to start from scratch
-- DROP USER ML_USER CASCADE;
CREATE USER ML_USER PASSWORD <password>;
-- Use this if you don't want to be forced to update your password on the first connection.
-- CREATE USER ML_USER PASSWORD NO FORCE_FIRST_PASSWORD_CHANGE;
-- or
ALTER USER ML_USER DISABLE PASSWORD LIFETIME;
GRANT AFLPM_CREATOR_ERASER_EXECUTE TO ML_USER;
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO ML_USER;
GRANT DATA ADMIN TO ML_USER;
GRANT IMPORT TO ML_USER;
GRANT EXECUTE on _SYS_REPO.GRANT_ACTIVATED_ROLE TO ML_USER;
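To double-check the grants, you can query the standard catalog views GRANTED_ROLES and GRANTED_PRIVILEGES (a quick verification sketch, not part of the original guide):
-- roles and privileges granted to ML_USER
SELECT ROLE_NAME FROM SYS.GRANTED_ROLES WHERE GRANTEE = 'ML_USER';
SELECT PRIVILEGE, IS_GRANTABLE FROM SYS.GRANTED_PRIVILEGES WHERE GRANTEE = 'ML_USER';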
As user ML_USER on database HXE (the initial password is the one defined in the statements above), create a dedicated schema:
-- Uncomment this if you want to start from scratch
-- DROP SCHEMA ML_DATA CASCADE;
CREATE SCHEMA ML_DATA;
SET SCHEMA ML_DATA;
As user ML_USER on database HXE, check the current user and schema:
SELECT CURRENT_USER || ' / ' || CURRENT_SCHEMA FROM DUMMY;
Installing components
Python
See: 【HXE】Python + HANA? Yes!
R
See this guide.
Create an R administrator user
In a shell, execute:
sudo useradd -m -d /home/r -c "R Administrator" radm
sudo passwd radm
Then, execute the following command to add the radm user to the sudoers list, which is required to proceed with the installation:
sudo bash -c 'echo "radm ALL=(ALL) NOPASSWD: ALL" >>/etc/sudoers'
sudo bash -c 'echo "umask 022" >>/home/r/.bashrc'
Now, you can switch to the radm user if not done yet:
sudo su -l radm
Activating the SUSE system
Check the SUSE activation status:
sudo SUSEConnect --status-text
If the following is returned, the system has been activated successfully:
Installed Products:
------------------------------------------
SUSE Linux Enterprise Server for SAP Applications 12 SP3
(SLES_SAP/12.3/x86_64)
Registered
------------------------------------------
If Not Registered is returned, register on the SUSE website, choose an activation code matching the system version shown above, and then execute:
sudo SUSEConnect -r <registration-code> -e <email>
After successful activation, the following is returned:
Registered SLES_SAP 12.2 x86_64
To server: https://scc.suse.com
Using E-Mail:
Successfully registered system.
After registration is complete, add the following products:
- SUSE Linux Package for SAP Applications 12 SP2
sudo SUSEConnect -p SLES_SAP/12.2/x86_64
This may return Error: Registration server returned 'Please provide a Registration Code for this product' (422); you can safely ignore it.
- SUSE Linux Enterprise Software Development Kit 12 SP2
sudo SUSEConnect -p sle-sdk/12.2/x86_64
- Toolchain Module
sudo SUSEConnect -p sle-module-toolchain/12/x86_64
Refresh the zypper repositories:
sudo zypper refresh
Install the compiler pattern (note: make sure the network connection is stable):
sudo zypper install --type pattern Basis-Devel
Install the following dependencies:
sudo zypper install \
xorg-x11-devel \
readline-devel \
libcurl-devel \
gcc-fortran \
gcc-c++ \
xz-devel \
pcre-devel \
texinfo \
cairo-devel
Installing Java
Go to SAP Development Tools and download the specified Java package, then run the following command:
sudo rpm -ivh <download-path>/sapjvm-<version>-linux-x64.rpm
Then you will need to update the "alternatives" and enable your flavor of java using the following commands:
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/java/sapjvm_8_latest/bin/java" 1
sudo update-alternatives --set java /usr/java/sapjvm_8_latest/bin/java
Check the Java version:
java -version
It should return:
java version "1.8.0_xx"
Java(TM) SE Runtime Environment (build 1.8.0_xx-yyy)
Installing Texinfo
Texinfo is the official documentation format of the GNU project and is used by multiple projects, including R, to build their manuals.
However, the texinfo package available in most repositories does not provide all the tools required to compile R from the ground up.
For more details about texinfo, visit: https://www.gnu.org/software/texinfo/.
The minimum texinfo version required for the compilation is 5.1, but in this example we will be using a newer version.
In the script below, curl is used to download the package; if your machine is not connected to the Internet, you can manually download the texinfo package from http://ftp.gnu.org/gnu/texinfo/ and transfer it.
From your terminal console, execute the following command:
cd ~
curl http://ftp.gnu.org/gnu/texinfo/texinfo-6.5.tar.gz -o ~/texinfo-6.5.tar.gz
tar -xf ~/texinfo-6.5.tar.gz
cd ~/texinfo-6.5
./configure --prefix=/usr --disable-static > install-texinfo.log
make clean >> install-texinfo.log
make >> install-texinfo.log
make info >> install-texinfo.log
sudo make install >> install-texinfo.log
sudo chmod -R 755 /usr/share/texinfo
make clean >> install-texinfo.log
# rm ~/texinfo-6.5.tar.gz
To verify that your setup is correct you can run the following command:
texi2any --help
No error message should be displayed.
Downloading, compiling, and installing R
As explained previously, we need to recompile R with shlib enabled in order to use it with SAP HANA, express edition.
In this example we will be using a newer version than the one listed in the PAM.
In the script below, curl is used to download the package; if your machine is not connected to the Internet, you can manually download the R sources from https://cran.r-project.org/ and transfer them.
cd ~
curl https://cloud.r-project.org/src/base/R-3/R-3.4.3.tar.gz -o ~/R-3.4.3.tar.gz
tar -xf ~/R-3.4.3.tar.gz
cd ~/R-3.4.3
./configure --prefix=/usr --enable-R-shlib > install-r.log
> Error: configure: WARNING: neither inconsolata.sty nor zi4.sty found: PDF vignettes and package manuals will not be rendered optimally
Fix: download http://mirrors.ctan.org/fonts/inconsolata.zip
Unzip it: unzip inconsolata.zip
Copy the files into the TeX tree: cp -Rfp inconsolata/* /usr/share/texmf/
Refresh the sty index: mktexlsr
[Reference](https://www.zhihu.com/question/47921911)
make clean >> install-r.log
make >> install-r.log
make info >> install-r.log
sudo make install >> install-r.log
sudo chmod -R 755 /usr/lib64/R
make clean >> install-r.log
# rm ~/R-3.4.3.tar.gz
To verify that your setup is correct you can run the following command:
echo "R.version.string" | R --save -q
Downloading, compiling, and installing Rserve
Rserve acts as a socket server (TCP/IP or local sockets) which allows binary requests to be sent to an R process.
Every connection has a separate workspace and working directory.
Client-side implementations are available for popular languages such as C/C++ and Java, allowing any application to use the facilities of R without needing to link against R code.
Rserve supports remote connections, user authentication, and file transfer.
If your host is connected to the Internet, you can leverage a CRAN mirror to install Rserve; otherwise you can download it manually and transfer it.
To install the Rserve package and make it available to every user, start R as the super user by running the following command:
sudo R
Then you can use the following command if your server is connected to the internet:
install.packages("Rserve")
You will be prompted to select one of the CRAN mirrors from which the package will be downloaded.
If your server is not connected to the internet you can use instead:
cd ~
curl https://cloud.r-project.org/src/contrib/Rserve_1.7-3.tar.gz -o Rserve_1.7-3.tar.gz
install.packages("<download-path>/Rserve_1.7-3.tar.gz", repos = NULL)
You can find the archive on the cloud mirror: https://cloud.r-project.org/src/contrib
You can pick version 1.7-3.
Type q() to quit your R session as super user.
To verify that the Rserve package is properly installed, open a new R session and execute the following command:
library("Rserve")
You should not receive any message after executing the command.
Now, since we installed Rserve as the super user, we need to grant proper rights to other users by executing the following command:
sudo chmod 755 /usr/lib64/R/bin/Rserve
# VM snapshot "BeforeRserver"
Starting Rserve
You can start Rserve using the following command:
R CMD Rserve --RS-port <port> --no-save --RS-encoding utf8
The port for starting Rserve has to be chosen carefully, as it will be configured in SAP HANA in the next step.
You can use 9999, as this port is rarely used:
R CMD Rserve --RS-port 9999 --no-save --RS-encoding utf8
The --no-save option ensures that the invoked R runtime does not store the R environment on the file system after the R execution has stopped.
This is important to prevent the file system from filling up over time due to repeated R runs.
There is currently no support for automatically starting the Rserve server after rebooting the Linux host.
To accomplish this, you can use crontab with a shell script like the following, which starts a new Rserve process if none is running:
pgrep -u <user> -f "Rserve --RS-port <port> --no-save --RS-encoding utf8" || R CMD Rserve --RS-port <port> --no-save --RS-encoding utf8
For example, with user radm on port 9999:
pgrep -u radm -f "Rserve --RS-port 9999 --no-save --RS-encoding utf8" || R CMD Rserve --RS-port 9999 --no-save --RS-encoding utf8
Configuring SAP HANA
To enable calling R procedures from SAP HANA, the index server configuration parameters in the calcEngine section must be set.
Connect to the HXE tenant using the SYSTEM user credentials and execute the following SQL statements:
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM') SET ('calcEngine', 'cer_rserve_addresses' ) = 'localhost:9999' WITH RECONFIGURE;
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM') SET ('calcEngine', 'cer_timeout' ) = '300' WITH RECONFIGURE;
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM') SET ('calcEngine', 'cer_rserve_maxsendsize') = '0' WITH RECONFIGURE;
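Before moving on, you can confirm the parameters were applied by querying M_INIFILE_CONTENTS, the same monitoring view used later in this guide (a verification sketch):
-- should list cer_rserve_addresses = localhost:9999, cer_timeout = 300, cer_rserve_maxsendsize = 0
SELECT KEY, VALUE FROM M_INIFILE_CONTENTS
WHERE FILE_NAME = 'indexserver.ini' AND SECTION = 'calcEngine';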
Note that the port number must correspond to the one used to start Rserve.
Next, create the Rserve remote source by executing the following SQL statement:
CREATE REMOTE SOURCE "Local Rserve"
ADAPTER "rserve"
CONFIGURATION 'server=localhost;port=9999';
Now, grant the ML_USER user the right to create R scripts by executing the following SQL statement:
GRANT CREATE R SCRIPT TO ML_USER;
Then allow the ML_USER user to access the Local Rserve source by executing the following SQL statement:
ALTER USER ML_USER SET PARAMETER RSERVE REMOTE SOURCES = 'Local Rserve';
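As an optional sanity check, the new source should now appear in the REMOTE_SOURCES catalog view:
-- the Local Rserve source should be listed with the rserve adapter
SELECT REMOTE_SOURCE_NAME, ADAPTER_NAME FROM SYS.REMOTE_SOURCES;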
Testing the installation
In order to test the configuration, you will execute a simple procedure that will read the Iris dataset and store it into a table.
Connect to the HXE tenant using the ML_USER credentials and execute the following SQL statements:
CREATE SCHEMA R_DATA;
SET SCHEMA R_DATA;
-- Uncomment the drop statements if you want to run it from scratch
DROP TABLE IRIS;
DROP PROCEDURE LOAD_IRIS;
DROP PROCEDURE DISPLAY_IRIS;
CREATE COLUMN TABLE IRIS (
"Sepal.Length" DOUBLE,
"Sepal.Width" DOUBLE,
"Petal.Length" DOUBLE,
"Petal.Width" DOUBLE,
"Species" VARCHAR(5000)
);
CREATE PROCEDURE LOAD_IRIS(OUT iris "IRIS")
LANGUAGE RLANG AS
BEGIN
library(datasets)
data(iris)
iris <- cbind(iris)
END;
CREATE PROCEDURE DISPLAY_IRIS()
AS BEGIN
CALL LOAD_IRIS(iris);
INSERT INTO IRIS SELECT * FROM :iris;
END;
CALL DISPLAY_IRIS();
SELECT * FROM IRIS;
The Iris dataset contains the measurements in centimeters of the sepal length and width and the petal length and width for 50 flowers from each of 3 species of iris, so the result should display 150 rows.
SELECT COUNT(1) FROM R_DATA.IRIS;
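Since the dataset holds 50 flowers per species, a per-species breakdown is another quick check (a small verification sketch):
-- expect 3 rows, each with a count of 50
SELECT "Species", COUNT(*) AS flower_count FROM R_DATA.IRIS GROUP BY "Species";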
Common operations
- Start Rserve (must be run after every server reboot):
R CMD Rserve --RS-port 9999 --no-save --RS-encoding utf8
- Check the Rserve process:
ps aux | grep Rserve
- Kill the R process:
kill <pid>
- Stop Rserve:
pgrep -u hxeadm -f "Rserve --RS-port 9999"
pkill -u hxeadm -f "Rserve --RS-port 9999"
MovieLens
Reference
Importing the data
Go to MovieLens and download the dataset. After downloading, transfer the files via FTP to /tmp/PAL/ml-latest/ on the server, then run the import as user ML_USER on database HXE:
-- create table
-- download the dataset and extract the files to path '/tmp/PAL/' (assumed schema is 'PAL')
SET SCHEMA PAL;
DROP TABLE movielens_links;
create column table movielens_links(
movieid integer not null,
imdbid integer,
tmdbid integer,
primary key (
movieid
)
);
IMPORT FROM CSV FILE '/tmp/PAL/ml-latest/links.csv' INTO movielens_links
WITH COLUMN LIST IN FIRST ROW RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' OPTIONALLY ENCLOSED BY '"' ERROR LOG 'movielens_links.err' THREADS 10 BATCH 10000 TABLE LOCK
;
DROP TABLE movielens_movies;
create column table movielens_movies(
movieid integer not null,
title nvarchar(255),
genres nvarchar(255),
primary key (
movieid
)
);
IMPORT FROM CSV FILE '/tmp/PAL/ml-latest/movies.csv'
INTO movielens_movies
WITH COLUMN LIST IN FIRST ROW RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' OPTIONALLY ENCLOSED BY '"' ERROR LOG 'movielens_movies.err' THREADS 10 BATCH 10000 TABLE LOCK
;
DROP TABLE movielens_ratings;
create column table movielens_ratings(
userid integer not null,
movieid integer not null,
rating decimal,
timestamp integer,
primary key (
userid,
movieid
)
);
IMPORT FROM CSV FILE '/tmp/PAL/ml-latest/ratings.csv'
INTO movielens_ratings
WITH COLUMN LIST IN FIRST ROW RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' OPTIONALLY ENCLOSED BY '"' ERROR LOG 'movielens_ratings.err' THREADS 10 BATCH 10000 TABLE LOCK
;
DROP TABLE movielens_tags;
create column table movielens_tags(
userid integer not null,
movieid integer not null,
tag nvarchar(255) not null,
timestamp integer,
primary key (
userid,
movieid,
tag
)
);
IMPORT FROM CSV FILE '/tmp/PAL/ml-latest/tags.csv'
INTO movielens_tags
WITH COLUMN LIST IN FIRST ROW RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' OPTIONALLY ENCLOSED BY '"' ERROR LOG 'movielens_tags.err' THREADS 10 BATCH 10000 TABLE LOCK
;
Data exploration
-- https://developers.sap.com/group.hxe-aa-movielens-sql.html
select 'links' as table_name, count(1) as count from movielens_links
union all
select 'movies' as table_name, count(1) as count from movielens_movies
union all
select 'ratings' as table_name, count(1) as count from movielens_ratings
union all
select 'tags' as table_name, count(1) as count from movielens_tags;
-- Check missing values
select 'links with missing movie' as label, count(1) as count
from movielens_links l
where not exists (select 1 from movielens_movies m where l.movieid = m.movieid)
union all
select 'movies with missing link', count(1)
from movielens_movies m
where not exists (select 1 from movielens_links l where l.movieid = m.movieid);
-- Check the genres
select 'movies with missing genres' as label, count(1) as count
from movielens_movies
where genres is null or length(genres)=0;
-- get the list of genres used across our 9125 movies with the following SQL:
do begin
declare genrearray nvarchar(255) array;
declare tmp nvarchar(255);
declare idx integer;
declare sep nvarchar(1) := '|';
declare cursor cur for select distinct genres from movielens_movies;
declare genres nvarchar (255) := '';
idx := 1;
for cur_row as cur() do
select cur_row.genres into genres from dummy;
tmp := :genres;
while locate(:tmp,:sep) > 0 do
genrearray[:idx] := substr_before(:tmp,:sep);
tmp := substr_after(:tmp,:sep);
idx := :idx + 1;
end while;
genrearray[:idx] := :tmp;
end for;
genrelist = unnest(:genrearray) as (genre);
select genre from :genrelist group by genre;
end;
-- get the number of movies associated with each genres by adjusting the previous SQL:
do begin
declare genrearray nvarchar(255) array;
declare tmp nvarchar(255);
declare idx integer;
declare sep nvarchar(1) := '|';
declare cursor cur for select distinct genres from movielens_movies;
declare genres nvarchar (255) := '';
idx := 1;
for cur_row as cur() do
select cur_row.genres into genres from dummy;
tmp := :genres;
while locate(:tmp,:sep) > 0 do
genrearray[:idx] := substr_before(:tmp,:sep);
tmp := substr_after(:tmp,:sep);
idx := :idx + 1;
end while;
genrearray[:idx] := :tmp;
end for;
genrelist = unnest(:genrearray) as (genre);
select genre, count(1) from :genrelist group by genre;
end;
-- get the number of genres associated with each movies using the following SQL:
select
movieid
, title
, occurrences_regexpr('[|]' in genres) + 1 as genre_count
, genres
from movielens_movies
order by genre_count asc;
-- count the movies per genre count using the following SQL:
select
genre_count, count(1)
from (
select occurrences_regexpr('[|]' in genres) + 1 genre_count
from movielens_movies
) group by genre_count order by genre_count;
-- have a look at the tags distribution using the following SQL:
select count(1)
from (
select movieid, count(1) as tag_count
from movielens_tags
group by movieid
);
-- determine the tag count distribution per movies using the following SQL:
select tag_count, count(1)
from (
select movieid, count(1) as tag_count
from movielens_tags
group by movieid
)
group by tag_count order by tag_count;
-- determine the rating count distribution per movies using the following SQL:
select rating_count, count(1) as movie_count
from (
select movieid, count(1) as rating_count
from movielens_ratings
group by movieid
)
group by rating_count order by rating_count asc;
-- some calculations:
select distinct
min(rating_count) over( ) as min,
max(rating_count) over( ) as max,
avg(rating_count) over( ) as avg,
sum(rating_count) over( ) as sum,
median(rating_count) over( ) as median,
stddev(rating_count) over( ) as stddev,
count(*) over( ) as category_count
from (
select movieid, count(1) as rating_count
from movielens_ratings
group by movieid
)
group by rating_count;
-- determine the rating count distribution per user using the following SQL:
select rating_count, count(1) as user_count
from (
select userid, count(1) as rating_count
from movielens_ratings
group by userid
)
group by rating_count order by 1 desc;
-- some calculations:
select distinct
min(rating_count) over( ) as min,
max(rating_count) over( ) as max,
avg(rating_count) over( ) as avg,
sum(rating_count) over( ) as sum,
median(rating_count) over( ) as median,
stddev(rating_count) over( ) as stddev,
count(*) over( ) as category_count
from (
select userid, count(1) as rating_count
from movielens_ratings
group by userid
)
group by rating_count order by 1 desc;
-- determine the rating notation distribution using the following SQL:
select rating, count(1) as rating_count
from movielens_ratings
group by rating order by 1 desc;
-- determine the users distribution per rating notation using the following SQL:
select rating, count(1) as users_count from (
select userid, rating, count(1) as rating_count
from movielens_ratings
group by userid, rating
)
group by rating order by 1 desc;
-- determine the movies distribution per rating notation using the following SQL:
select rating, count(1) as movie_count from (
select movieid, rating, count(1) as rating_count
from movielens_ratings
group by movieid, rating
)
group by rating order by 1 desc;
APL (omitted)
For details, see: MovieLens with SAP HANA APL Recommendation (MovieLens SQL)
Understand the capabilities and options made available with the SAP HANA Automated Predictive Library (APL), which algorithm can be used to address your goal, and apply it to the data set
PAL
There are two ways to call a PAL function:
The wrapper procedure technique
A dedicated AFL wrapper is generated for the PAL function to be used; because generating the wrapper requires the table types, the signature table, and the input and output tables, these must all be defined up front.
Once the AFL wrapper has been generated, it can be invoked with a CALL statement. Here is an example:
-- --------------------------------------------------------------------------
-- Create the table types
-- --------------------------------------------------------------------------
DROP TYPE PAL_APRIORI_DATA_T;
CREATE TYPE PAL_APRIORI_DATA_T AS TABLE(
"CUSTOMER" INTEGER,
"ITEM" VARCHAR(20)
);
DROP TYPE PAL_APRIORI_RESULT_T;
CREATE TYPE PAL_APRIORI_RESULT_T AS TABLE(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);
DROP TYPE PAL_APRIORI_PMMLMODEL_T;
CREATE TYPE PAL_APRIORI_PMMLMODEL_T AS TABLE(
"ID" INTEGER,
"PMMLMODEL" VARCHAR(5000)
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
"NAME" VARCHAR(100),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
-- --------------------------------------------------------------------------
-- Create the AFL wrapper corresponding to the target PAL function
-- --------------------------------------------------------------------------
DROP TYPE PROCEDURE_SIGNATURE_T;
CREATE TYPE PROCEDURE_SIGNATURE_T AS TABLE(
"NAME" VARCHAR (50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
DROP TYPE TRAINED_MODEL_T;
CREATE TYPE TRAINED_MODEL_T AS TABLE(
"NAME" VARCHAR (50),
"VALUE" VARCHAR (5000)
);
DROP TABLE OPERATION_CONFIG;
DROP TYPE OPERATION_CONFIG_T;
CREATE TYPE OPERATION_CONFIG_T AS TABLE(
"NAME" VARCHAR (50),
"VALUE" VARCHAR (5000)
);
-- --------------------------------------------------------------------------
-- Create the AFL wrapper corresponding to the target PAL function
-- --------------------------------------------------------------------------
DROP TABLE PAL_APRIORI_PDATA_TBL;
CREATE COLUMN TABLE PAL_APRIORI_PDATA_TBL(
"POSITION" INT,
"SCHEMA_NAME" NVARCHAR(256),
"TYPE_NAME" NVARCHAR(256),
"PARAMETER_TYPE" VARCHAR(7)
);
INSERT INTO PAL_APRIORI_PDATA_TBL VALUES (1, 'MYSCHEMA', 'PAL_APRIORI_DATA_T', 'IN');
INSERT INTO PAL_APRIORI_PDATA_TBL VALUES (2, 'MYSCHEMA', 'PAL_CONTROL_T', 'IN');
INSERT INTO PAL_APRIORI_PDATA_TBL VALUES (3, 'MYSCHEMA', 'PAL_APRIORI_RESULT_T', 'OUT');
INSERT INTO PAL_APRIORI_PDATA_TBL VALUES (4, 'MYSCHEMA', 'PAL_APRIORI_PMMLMODEL_T', 'OUT');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_DROP('MYSCHEMA', 'PAL_APRIORI_RULE_PROC');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_CREATE('AFLPAL', 'APRIORIRULE', 'MYSCHEMA', 'PAL_APRIORI_RULE_PROC', PAL_APRIORI_PDATA_TBL);
-- --------------------------------------------------------------------------
-- Create the Parameter table corresponding to the target PAL function
-- --------------------------------------------------------------------------
DROP TABLE #PAL_CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL(
"NAME" VARCHAR(100),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MIN_SUPPORT', null, 0.1, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MIN_CONFIDENCE', null, 0.3, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MIN_LIFT', null, 1.1, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MAX_CONSEQUENT', 1, null, null);
DROP TABLE PAL_APRIORI_RESULT_TBL;
CREATE COLUMN TABLE PAL_APRIORI_RESULT_TBL LIKE PAL_APRIORI_RESULT_T;
DROP TABLE PAL_APRIORI_PMMLMODEL_TBL;
CREATE COLUMN TABLE PAL_APRIORI_PMMLMODEL_TBL LIKE PAL_APRIORI_PMMLMODEL_T;
-- --------------------------------------------------------------------------
-- Call the target PAL function using the generated wrapper
-- --------------------------------------------------------------------------
CALL "MYSCHEMA".PAL_APRIORI_RULE_PROC(PAL_APRIORI_TRANS_TBL, #PAL_CONTROL_TBL, PAL_APRIORI_RESULT_TBL, PAL_APRIORI_PMMLMODEL_TBL) WITH OVERVIEW;
The direct technique
This approach is not only simpler and more direct, but it also runs more efficiently and scales better.
Rather than managing an AFL wrapper yourself, you call the built-in PAL procedures and let them handle the AFL plumbing. Here is an example:
-- --------------------------------------------------------------------------
-- Create the parameter table corresponding to the target PAL function
-- --------------------------------------------------------------------------
DROP TABLE PAL_PARAMETER_TBL;
CREATE TABLE PAL_PARAMETER_TBL (
"PARAM_NAME" VARCHAR(100),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" VARCHAR (100)
);
INSERT INTO PAL_PARAMETER_TBL VALUES ('MIN_SUPPORT', null, 0.1, null);
INSERT INTO PAL_PARAMETER_TBL VALUES ('MIN_CONFIDENCE', null, 0.3, null);
INSERT INTO PAL_PARAMETER_TBL VALUES ('MIN_LIFT', null, 1.1, null);
INSERT INTO PAL_PARAMETER_TBL VALUES ('MAX_CONSEQUENT', 1, null, null);
INSERT INTO PAL_PARAMETER_TBL VALUES ('PMML_EXPORT', 1, null, null);
DROP TABLE PAL_APRIORI_TRANS_TBL;
CREATE COLUMN TABLE PAL_APRIORI_TRANS_TBL (
"CUSTOMER" INTEGER,
"ITEM" VARCHAR(20)
);
INSERT INTO PAL_APRIORI_TRANS_TBL VALUES (<customer>, '<item>');
-- --------------------------------------------------------------------------
-- Execute the PAL function using its AFL wrapper and the actual input/output tables
-- --------------------------------------------------------------------------
CALL _SYS_AFL.PAL_APRIORI(PAL_APRIORI_TRANS_TBL, PAL_PARAMETER_TBL, ?, ?);
Configuration
As user ML_USER on database HXE, execute the following to clear previous results:
-- --------------------------------------------------------------------------
-- drop function in/out tables, helper tables and views
-- --------------------------------------------------------------------------
drop table pal_movielens_parameters;
drop table pal_movielens_apriori_pmmlmodel;
drop table pal_movielens_apriori_result;
drop view pal_movielens_apriori_data_input;
Create the structures for the input and output tables and the view:
-- --------------------------------------------------------------------------
-- create the config and output tables
-- --------------------------------------------------------------------------
create column table pal_movielens_parameters (
param_name varchar(100),
int_value integer,
double_value double,
string_value varchar (100)
);
create column table pal_movielens_apriori_result (
prerule varchar(500),
postrule varchar(500),
support double,
confidence double,
lift double
);
create row table pal_movielens_apriori_pmmlmodel (
row_index integer,
model_content clob
);
-- --------------------------------------------------------------------------
-- create the input data view
-- --------------------------------------------------------------------------
create view pal_movielens_apriori_data_input as
select userid, movieid
from movielens_ratings;
Configure the algorithm parameters. Note: the MIN_SUPPORT and MIN_CONFIDENCE parameters must be set.
-- --------------------------------------------------------------------------
-- configuration
-- --------------------------------------------------------------------------
truncate table pal_movielens_parameters;
insert into pal_movielens_parameters values ('MIN_SUPPORT' , null, 0.1 , null); -- no default
insert into pal_movielens_parameters values ('MIN_CONFIDENCE', null, 0.1 , null); -- no default
insert into pal_movielens_parameters values ('MIN_LIFT' , null, 0.0 , null); -- default is 0.0
insert into pal_movielens_parameters values ('MAX_CONSEQUENT', 1 , null, null); -- default is 500
insert into pal_movielens_parameters values ('MAXITEMLENGTH' , 2 , null, null); -- default is 5
insert into pal_movielens_parameters values ('UBIQUITOUS' , null, 1.0 , null); -- default is 1.0
insert into pal_movielens_parameters values ('PMML_EXPORT' , 1 , null, null); -- default is 0
insert into pal_movielens_parameters values ('TIMEOUT' , 3600, null, null); -- default is 3600
insert into pal_movielens_parameters values ('THREAD_RATIO' , null, 0.0 , null); -- default is 0.0
select * from pal_movielens_parameters;
Run the algorithm:
truncate table pal_movielens_apriori_result;
truncate table pal_movielens_apriori_pmmlmodel;
-- --------------------------------------------------------------------------
-- execute the pal function to train the model
-- --------------------------------------------------------------------------
call _sys_afl.pal_apriori(
pal_movielens_apriori_data_input
, pal_movielens_parameters
, pal_movielens_apriori_result
, pal_movielens_apriori_pmmlmodel
) with overview;
Collaborative filtering results
drop view pal_movielens_apriori_result_collaborative;
create view pal_movielens_apriori_result_collaborative as
select *
from (
select
t1.userid
, row_number() over(partition by t1.userid order by t1.score desc, t1.consequent desc ) as rank
, t1.consequent as movieid
, t1.score
, movies.title
, movies.genres
, links.imdbid
, links.tmdbid
from (
select input_data.userid, rules.postrule as consequent, max(rules.confidence) as score
from movielens_ratings as input_data
left outer join (select * from pal_movielens_apriori_result) rules on (cast (input_data.movieid as varchar(500)) = rules.prerule)
where rules.postrule is not null
group by input_data.userid, rules.postrule
) t1
left outer join movielens_movies movies on movies.movieid = t1.consequent
left outer join movielens_links links on links.movieid = t1.consequent
) t1
where t1.rank <= 5;
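To pull the top-5 list for a single user, you can filter the view directly (the userid below is only an illustration):
-- recommendations for one user; 123 is a placeholder userid
SELECT rank, movieid, title, score
FROM pal_movielens_apriori_result_collaborative
WHERE userid = 123
ORDER BY rank;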
Content-based filtering results
drop view pal_movielens_apriori_result_contentbased;
create view pal_movielens_apriori_result_contentbased as
select *
from (
select
t1.movieid
, row_number() over(partition by t1.movieid order by t1.score desc, t1.consequent desc ) as rank
, t1.consequent as similar_movieid
, t1.score
, movies.title
, movies.genres
, links.imdbid
, links.tmdbid
from (
select movieid, rules.postrule as consequent, rules.confidence as score
from movielens_movies as input_data
left outer join (select * from pal_movielens_apriori_result) rules on (cast (input_data.movieid as varchar(500)) = rules.prerule)
where rules.postrule is not null
) t1
left outer join movielens_movies movies on movies.movieid = t1.consequent
left outer join movielens_links links on links.movieid = t1.consequent
) t1
where t1.rank <= 5;
Displaying the results
Let's verify how many users will actually get recommendations using the following SQL:
select reco_count, count(1) as user_count
from (
select userid, max(rank) as reco_count
from pal_movielens_apriori_result_collaborative
group by userid
) group by reco_count order by reco_count desc;
Let's verify how many distinct movies will actually get recommended to a user (part of the top 5 scores) using the following SQL:
select
count(1) as movie_count
, count(1) *100 / (select count(1) as count from movielens_movies ) as movie_ratio
from (
select movieid
from pal_movielens_apriori_result_collaborative
group by movieid
);
Let's verify how many distinct movies will potentially get recommended to a user (not just the top 5 scores) using the following SQL:
select
count(1) as movie_count
, count(1) *100 / (select count(1) as count from movielens_movies ) as movie_ratio
from (
select prerule as movieid
from pal_movielens_apriori_result
where prerule not like '%&%'
group by prerule
);
Based on the above results, we can conclude:
- all 602 users receive 5 movie recommendations
- only 0.51% of the movies are actually recommended to users
- only 2.5% of the movies could potentially be recommended to users
Measuring the results
Let's verify how many movies will actually get recommendations using the following SQL:
select reco_count, count(1) as movie_count
from (
select movieid, max(rank) as reco_count
from pal_movielens_apriori_result_contentbased
group by movieid
) group by reco_count order by 1 desc;
Let's verify how many distinct movies will actually get recommended to a user (part of the top 5 scores) using the following SQL:
select
count(1) as movie_count
, count(1) *100 / (select count(1) as count from movielens_movies ) as movie_ratio
from (
select movieid
from pal_movielens_apriori_result_contentbased
group by movieid
);
Let's verify how many ratings the movies with no recommendations have, using the following SQL:
select rating_count, count(1) as movie_count
from (
select ratings.movieid, count(1) as rating_count
from movielens_ratings ratings
left outer join (
select movieid
from (
select prerule as movieid
from pal_movielens_apriori_result
where prerule not like '%&%'
group by prerule
)
) t1 on (ratings.movieid = t1.movieid)
where t1.movieid is null
group by ratings.movieid
) group by rating_count;
As you can see, the movies with no recommendations have up to 92 ratings, and this list includes the 3063 movies with only one rating and the 1202 with only 2 ratings.
Official PAL guide
Importing the official PAL datasets
Setting the import path
For security reasons, CSV files can only be imported via SQL from the paths defined by the csv_import_path_filter parameter.
As user ML_USER on database HXE, check the current setting of this parameter:
SELECT
*
FROM
M_INIFILE_CONTENTS
WHERE
SECTION = 'import_export'
AND KEY = 'enable_csv_import_path_filter';
By default, this parameter is true, and the path list defined by csv_import_path_filter is empty.
As user SYSTEM on database HXE, lift this restriction:
ALTER SYSTEM
ALTER CONFIGURATION ('indexserver.ini', 'database')
SET ('import_export', 'enable_csv_import_path_filter') = 'false'
WITH RECONFIGURE;
As user SYSTEM on database HXE, check the parameter setting:
SELECT
*
FROM
M_INIFILE_CONTENTS
WHERE
SECTION = 'import_export'
AND KEY = 'csv_import_path_filter';
By default, this returns an empty value, in which case the following three paths can all be used:
$DIR_INSTANCE/work
$DIR_INSTANCE/backup
$SAP_RETRIEVAL_PATH/trace
Normally, the absolute path of $DIR_INSTANCE is /usr/sap/HXE/HDB90. If you want to allow other paths, execute the following as user SYSTEM on the corresponding database:
ALTER SYSTEM
ALTER CONFIGURATION ('indexserver.ini', 'database')
SET ('import_export', 'csv_import_path_filter') = '/path1;/path2'
WITH RECONFIGURE;
Once the paths are set, go to saphanaacademy/PAL to download the required code and datasets, and follow the videos to import the data:
Predictive Analysis Library on Youtube
Predictive Analysis Library on bilibili (partial)
Note: check your HANA database version and choose the files matching that version.
Error: insufficient privilege: Not authorized
Fix: a privilege is missing; as user SYSTEM on the corresponding database, execute: GRANT IMPORT TO <user>;
Checking server status and granting privileges
-- SAP_HANA_Predictive_Analysis_Library_PAL_en SETUP GUIDE (HANA 2.0 SPS 03)
-- PLEASE USE THE SYSTEM USER TO EXECUTE THIS FILE
-- check tenant database exists and is started
SELECT * FROM SYS.M_DATABASES;
-- check script server
SELECT * FROM SYS.M_SERVICES;
-- add script server to tenant database
-- ALTER DATABASE HXE ADD 'scriptserver';
-- START SCRIPT SERVER
-- ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini', 'SYSTEM') SET ('scriptserver', 'instances') = '1' WITH RECONFIGURE;
-- CHECK AFL PAL FUNCTIONS ARE INSTALLED
SELECT * FROM SYS.AFL_FUNCTIONS WHERE PACKAGE_NAME='PAL' AND FUNCTION_NAME LIKE '%_ANY';
-- SYSTEM SETUP
DROP SCHEMA DM_PAL CASCADE;
CREATE SCHEMA DM_PAL;
-- authorize access to SYS views
GRANT CATALOG READ TO DEVUSER;
-- AUTHORIZE CREATION & REMOVAL OF PAL PROCEDURES
GRANT AFLPM_CREATOR_ERASER_EXECUTE TO DEVUSER;
-- AUTHORIZE EXECUTION OF PAL PROCEDURES
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO DEVUSER;
-- https://archive.sap.com/discussions/thread/3673775
-- AUTHORIZE READ ACCESS TO INPUT DATA
GRANT SELECT ON SCHEMA DM_PAL TO DEVUSER;
-- AUTHORIZE MODELING FOR DATA PREVIEW
GRANT MODELING TO DEVUSER;
-- authorize data administration & import
GRANT DATA ADMIN TO DEVUSER;
GRANT IMPORT TO DEVUSER;
GRANT EXECUTE on _SYS_REPO.GRANT_ACTIVATED_ROLE TO DEVUSER;
-- Grant the required access and try to execute the algorithm through a user:
-- https://archive.sap.com/discussions/thread/3373573
GRANT CREATE ANY, ALTER, DROP, EXECUTE, SELECT, INSERT, UPDATE, DELETE, INDEX ON SCHEMA DM_PAL TO DEVUSER;
GRANT SELECT ON SCHEMA DM_PAL TO DEVUSER WITH GRANT OPTION;
GRANT SELECT ON SCHEMA DM_PAL TO _SYS_REPO WITH GRANT OPTION;
GRANT CONTENT_ADMIN TO DEVUSER;
-- create the wrapper generator & eraser .... and assign access to the user
GRANT EXECUTE ON system.afl_wrapper_generator to DEVUSER;
GRANT EXECUTE ON system.afl_wrapper_eraser to DEVUSER;
-- TROUBLESHOOTING
-- Could not execute SAP DBTech JDBC: [258]: insufficient privilege: Not authorized
-- GRANT ALL *AFL* ROLES TO DEVUSER;
Since HXE does not have XS installed, I am not using Web IDE or HDI containers for now.
-- authorize WebIDE developers to access AFL metadata
GRANT SELECT ON "_SYS"."AFL_AREAS" TO "SYS_XS_HANA_BROKER"."XSA_DEV_USER_ROLE";
GRANT SELECT ON "_SYS"."AFL_FUNCTION_PARAMETERS" TO "SYS_XS_HANA_BROKER"."XSA_DEV_USER_ROLE";
GRANT SELECT ON "_SYS"."AFL_FUNCTION_PROPERTIES" TO "SYS_XS_HANA_BROKER"."XSA_DEV_USER_ROLE";
GRANT SELECT ON "_SYS"."AFL_FUNCTIONS" TO "SYS_XS_HANA_BROKER"."XSA_DEV_USER_ROLE";
GRANT SELECT ON "_SYS"."AFL_PACKAGES" TO "SYS_XS_HANA_BROKER"."XSA_DEV_USER_ROLE";
GRANT SELECT ON "_SYS"."AFL_TEXTS" TO "SYS_XS_HANA_BROKER"."XSA_DEV_USER_ROLE";
-- authorize HDI container owner to access AFL metadata
GRANT SELECT ON "_SYS"."AFL_AREAS" TO "_SYS_DI_OO_DEFAULTS";
GRANT SELECT ON "_SYS"."AFL_FUNCTION_PARAMETERS" TO "_SYS_DI_OO_DEFAULTS";
GRANT SELECT ON "_SYS"."AFL_FUNCTION_PROPERTIES" TO "_SYS_DI_OO_DEFAULTS";
GRANT SELECT ON "_SYS"."AFL_FUNCTIONS" TO "_SYS_DI_OO_DEFAULTS";
GRANT SELECT ON "_SYS"."AFL_PACKAGES" TO "_SYS_DI_OO_DEFAULTS";
GRANT SELECT ON "_SYS"."AFL_TEXTS" TO "_SYS_DI_OO_DEFAULTS";
-- authorize HDI container owner to execute PAL procedures
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO "_SYS_DI_OO_DEFAULTS";
K-Means
Prerequisites:
- DM_PAL is a schema belonging to DEVUSER;
- DEVUSER has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE or AFL__SYS_AFL_AFLPAL_EXECUTE_WITH_GRANT_OPTION role (a grant sketch follows this list).
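If these prerequisites are not yet in place, they can be set up roughly like this, mirroring the setup script above (a sketch; run as SYSTEM and adjust to your environment):
-- as SYSTEM: let DEVUSER execute PAL and work in the DM_PAL schema
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO DEVUSER;
GRANT CREATE ANY, DROP, EXECUTE, SELECT, INSERT, UPDATE, DELETE ON SCHEMA DM_PAL TO DEVUSER;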
Example (wrapper procedure technique)
SET SCHEMA DM_PAL ;
DROP TYPE PAL_KMEANS_DATA_T;
CREATE TYPE PAL_KMEANS_DATA_T AS TABLE(
"ID" INTEGER,
"V000" DOUBLE,
"V001" VARCHAR(2),
"V002" DOUBLE
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
"NAME" VARCHAR (100),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
DROP TYPE PAL_KMEANS_ASSIGNED_T;
CREATE TYPE PAL_KMEANS_ASSIGNED_T AS TABLE(
"ID" INTEGER,
"CLUSTER" INTEGER,
"DISTANCE" DOUBLE,
"SLIGHT_SILHOUETTE" DOUBLE
);
DROP TYPE PAL_KMEANS_CENTERS_T;
CREATE TYPE PAL_KMEANS_CENTERS_T AS TABLE(
"CLUSTER_ID" INTEGER,
"V000" DOUBLE,
"V001" VARCHAR(2),
"V002" DOUBLE
);
DROP TYPE PAL_KMEANS_SIL_CENTERS_T;
CREATE TYPE PAL_KMEANS_SIL_CENTERS_T AS TABLE(
"CLUSTER_ID" INTEGER,
"SLIGHT_SILHOUETTE" DOUBLE
);
DROP TYPE PAL_KMEANS_STATISTIC_T;
CREATE TYPE PAL_KMEANS_STATISTIC_T AS TABLE(
"NAME" VARCHAR(50),
"VALUE" DOUBLE
);
DROP TYPE PAL_KMEANS_MODEL_T;
CREATE TYPE PAL_KMEANS_MODEL_T AS TABLE(
"JID" INTEGER,
"JSMODEL" VARCHAR(5000)
);
DROP TABLE PAL_KMEANS_PDATA_TBL;
CREATE COLUMN TABLE PAL_KMEANS_PDATA_TBL("POSITION" INT, "SCHEMA_NAME" NVARCHAR(256), "TYPE_NAME" NVARCHAR(256), "PARAMETER_TYPE" VARCHAR(7));
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (1, 'DM_PAL', 'PAL_KMEANS_DATA_T', 'IN');
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (2, 'DM_PAL', 'PAL_CONTROL_T', 'IN');
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (3, 'DM_PAL', 'PAL_KMEANS_ASSIGNED_T', 'OUT');
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (4, 'DM_PAL', 'PAL_KMEANS_CENTERS_T', 'OUT');
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (5, 'DM_PAL', 'PAL_KMEANS_SIL_CENTERS_T', 'OUT');
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (6, 'DM_PAL', 'PAL_KMEANS_STATISTIC_T', 'OUT');
INSERT INTO PAL_KMEANS_PDATA_TBL VALUES (7, 'DM_PAL', 'PAL_KMEANS_MODEL_T', 'OUT');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_DROP('DM_PAL', 'PAL_KMEANS_PROC');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_CREATE('AFLPAL', 'KMEANS', 'DM_PAL', 'PAL_KMEANS_PROC', PAL_KMEANS_PDATA_TBL);
DROP TABLE #PAL_CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL(
"NAME" VARCHAR (100),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('GROUP_NUMBER', 4, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('INIT_TYPE', 1, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('DISTANCE_LEVEL',2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MAX_ITERATION', 100, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('EXIT_THRESHOLD', null, 1.0E-6, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('CATEGORY_WEIGHTS', null, 0.5, null);
DROP TABLE PAL_KMEANS_DATA_TBL;
CREATE COLUMN TABLE PAL_KMEANS_DATA_TBL LIKE PAL_KMEANS_DATA_T;
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (0, 0.5, 'A', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (1, 1.5, 'A', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (2, 1.5, 'A', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (3, 0.5, 'A', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (4, 1.1, 'B', 1.2);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (5, 0.5, 'B', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (6, 1.5, 'B', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (7, 1.5, 'B', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (8, 0.5, 'B', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (9, 1.2, 'C', 16.1);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (10, 15.5, 'C', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (11, 16.5, 'C', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (12, 16.5, 'C', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (13, 15.5, 'C', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (14, 15.6, 'D', 16.2);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (15, 15.5, 'D', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (16, 16.5, 'D', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (17, 16.5, 'D', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (18, 15.5, 'D', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (19, 15.7, 'A', 1.6);
DROP TABLE PAL_KMEANS_ASSIGNED_TBL;
CREATE COLUMN TABLE PAL_KMEANS_ASSIGNED_TBL LIKE PAL_KMEANS_ASSIGNED_T;
DROP TABLE PAL_KMEANS_CENTERS_TBL;
CREATE COLUMN TABLE PAL_KMEANS_CENTERS_TBL LIKE PAL_KMEANS_CENTERS_T;
DROP TABLE PAL_KMEANS_SIL_CENTERS_TBL;
CREATE COLUMN TABLE PAL_KMEANS_SIL_CENTERS_TBL LIKE PAL_KMEANS_SIL_CENTERS_T;
DROP TABLE PAL_KMEANS_STATISTIC_TBL;
CREATE COLUMN TABLE PAL_KMEANS_STATISTIC_TBL LIKE PAL_KMEANS_STATISTIC_T;
DROP TABLE PAL_KMEANS_MODEL_TBL;
CREATE COLUMN TABLE PAL_KMEANS_MODEL_TBL LIKE PAL_KMEANS_MODEL_T;
CALL "DM_PAL".PAL_KMEANS_PROC(PAL_KMEANS_DATA_TBL, #PAL_CONTROL_TBL, PAL_KMEANS_ASSIGNED_TBL, PAL_KMEANS_CENTERS_TBL, PAL_KMEANS_SIL_CENTERS_TBL, PAL_KMEANS_STATISTIC_TBL, PAL_KMEANS_MODEL_TBL) with OVERVIEW;
SELECT * FROM PAL_KMEANS_ASSIGNED_TBL;
SELECT * FROM PAL_KMEANS_CENTERS_TBL;
SELECT * FROM PAL_KMEANS_SIL_CENTERS_TBL;
SELECT * FROM PAL_KMEANS_STATISTIC_TBL;
SELECT * FROM PAL_KMEANS_MODEL_TBL;
Example (direct technique)
SET SCHEMA DM_PAL;
DROP TABLE PAL_KMEANS_DATA_TBL;
CREATE COLUMN TABLE PAL_KMEANS_DATA_TBL(
"ID" INTEGER,
"V000" DOUBLE,
"V001" VARCHAR(2),
"V002" DOUBLE
);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (0, 0.5, 'A', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (1, 1.5, 'A', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (2, 1.5, 'A', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (3, 0.5, 'A', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (4, 1.1, 'B', 1.2);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (5, 0.5, 'B', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (6, 1.5, 'B', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (7, 1.5, 'B', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (8, 0.5, 'B', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (9, 1.2, 'C', 16.1);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (10, 15.5, 'C', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (11, 16.5, 'C', 15.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (12, 16.5, 'C', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (13, 15.5, 'C', 16.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (14, 15.6, 'D', 16.2);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (15, 15.5, 'D', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (16, 16.5, 'D', 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (17, 16.5, 'D', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (18, 15.5, 'D', 1.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (19, 15.7, 'A', 1.6);
DROP TABLE #PAL_PARAMETER_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL(
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THREAD_RATIO', NULL, 0.2, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('GROUP_NUMBER', 4, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('INIT_TYPE', 1, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('DISTANCE_LEVEL',2, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('MAX_ITERATION', 100, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('EXIT_THRESHOLD', NULL, 1.0E-6, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CATEGORY_WEIGHTS', NULL, 0.5, NULL);
CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TBL, "#PAL_PARAMETER_TBL", ?, ?, ?, ?, ?);
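If you would rather persist the outputs than return them to the client, the direct call can be combined with WITH OVERVIEW, assuming the output tables created for the wrapper example above still exist (a sketch, not from the original guide):
-- write the results into the tables created for the wrapper variant
CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TBL, "#PAL_PARAMETER_TBL",
PAL_KMEANS_ASSIGNED_TBL, PAL_KMEANS_CENTERS_TBL, PAL_KMEANS_SIL_CENTERS_TBL,
PAL_KMEANS_STATISTIC_TBL, PAL_KMEANS_MODEL_TBL) WITH OVERVIEW;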
Measurement
SET SCHEMA DM_PAL;
DROP TABLE PAL_SILHOUETTE_DATA_TBL;
CREATE COLUMN TABLE PAL_SILHOUETTE_DATA_TBL(
"ID" INTEGER,
"V000" DOUBLE,
"A0" INTEGER,
"A1" INTEGER,
"A2" INTEGER,
"A3" INTEGER,
"V002" DOUBLE
);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (0, 0.5, 1, 0, 0, 0, 0.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (1, 1.5, 1, 0, 0, 0, 0.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (2, 1.5, 1, 0, 0, 0, 1.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (3, 0.5, 1, 0, 0, 0, 1.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (4, 1.1, 0, 1, 0, 0, 1.2);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (5, 0.5, 0, 1, 0, 0, 15.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (6, 1.5, 0, 1, 0, 0, 15.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (7, 1.5, 0, 1, 0, 0, 16.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (8, 0.5, 0, 1, 0, 0, 16.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (9, 1.2, 0, 0, 1, 0, 16.1);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (10, 15.5, 0, 0, 1, 0, 15.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (11, 16.5, 0, 0, 1, 0, 15.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (12, 16.5, 0, 0, 1, 0, 16.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (13, 15.5, 0, 0, 1, 0, 16.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (14, 15.6, 0, 0, 0, 1, 16.2);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (15, 15.5, 0, 0, 0, 1, 0.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (16, 16.5, 0, 0, 0, 1, 0.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (17, 16.5, 0, 0, 0, 1, 1.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (18, 15.5, 0, 0, 0, 1, 1.5);
INSERT INTO PAL_SILHOUETTE_DATA_TBL VALUES (19, 15.7, 1, 0, 0, 0, 1.6);
DROP TABLE PAL_SILHOUETTE_ASSIGN_TBL;
CREATE COLUMN TABLE PAL_SILHOUETTE_ASSIGN_TBL(
"ID" INTEGER,
"CLASS_LABEL" INTEGER
);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (0, 0);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (1, 0);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (2, 0);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (3, 0);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (4, 0);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (5, 1);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (6, 1);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (7, 1);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (8, 1);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (9, 1);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (10, 2);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (11, 2);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (12, 2);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (13, 2);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (14, 2);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (15, 3);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (16, 3);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (17, 3);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (18, 3);
INSERT INTO PAL_SILHOUETTE_ASSIGN_TBL VALUES (19, 3);
DROP TABLE #PAL_PARAMETER_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL(
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('VARIABLE_NUM', 6, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THREAD_RATIO', NULL, 0.2, NULL);
CALL _SYS_AFL.PAL_VALIDATE_KMEANS(PAL_SILHOUETTE_DATA_TBL, PAL_SILHOUETTE_ASSIGN_TBL, "#PAL_PARAMETER_TBL", ?);
Comparison
wrapper procedure
Statement 'CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_CREATE('AFLPAL', 'KMEANS', 'DM_PAL', 'PAL_KMEANS_PROC', ...'
successfully executed in 580 ms 543 µs (server processing time: 579 ms 118 µs) - Rows Affected: 0
Statement 'CALL "DM_PAL".PAL_KMEANS_PROC(PAL_KMEANS_DATA_TBL, #PAL_CONTROL_TBL, PAL_KMEANS_ASSIGNED_TBL, ...'
successfully executed in 250 ms 264 µs (server processing time: 247 ms 596 µs)
direct technique
Statement 'CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TBL, "#PAL_PARAMETER_TBL", ?, ?, ?, ?, ?)'
successfully prepared in 270 ms 696 µs
Statement 'CALL _SYS_AFL.PAL_KMEANS(PAL_KMEANS_DATA_TBL, "#PAL_PARAMETER_TBL", ?, ?, ?, ?, ?)'
successfully executed in 63 ms 45 µs (server processing time: 58 ms 286 µs)
Accelerated K-Means
Example (wrapper procedure technique)
This example is documented for HANA 2.0 SPS 01 and earlier. Prerequisites:
- DM_PAL is a schema belonging to DEVUSER; and
- DEVUSER has been assigned the AFLPM_CREATOR_ERASER_EXECUTE role; and
- DEVUSER has been assigned the AFL__SYS_AFL_AFLPAL_EXECUTE or AFL__SYS_AFL_AFLPAL_EXECUTE_WITH_GRANT_OPTION role.
SET SCHEMA DM_PAL;
DROP TYPE PAL_ACCKMEANS_DATA_T;
CREATE TYPE PAL_ACCKMEANS_DATA_T AS TABLE (
"ID" INTEGER,
"V000" DOUBLE,
"V001" VARCHAR (2),
"V002" DOUBLE
);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE (
"NAME" VARCHAR (100),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
DROP TYPE PAL_ACCKMEANS_ASSIGNED_T;
CREATE TYPE PAL_ACCKMEANS_ASSIGNED_T AS TABLE (
"ID" INTEGER,
"CLUSTER" INTEGER,
"DISTANCE" DOUBLE,
"SLIGHT_SILHOUETTE" DOUBLE
);
DROP TYPE PAL_ACCKMEANS_CENTERS_T;
CREATE TYPE PAL_ACCKMEANS_CENTERS_T AS TABLE (
"CLUSTER_ID" INTEGER,
"V000" DOUBLE,
"V001" VARCHAR (2),
"V002" DOUBLE
);
DROP TYPE PAL_ACCKMEANS_SIL_CENTERS_T;
CREATE TYPE PAL_ACCKMEANS_SIL_CENTERS_T AS TABLE (
"CLUSTER_ID" INTEGER,
"SLIGHT_SILHOUETTE" DOUBLE
);
DROP TYPE PAL_ACCKMEANS_STATISTIC_T;
CREATE TYPE PAL_ACCKMEANS_STATISTIC_T AS TABLE (
"NAME" VARCHAR (50),
"VALUE" DOUBLE
);
DROP TYPE PAL_ACCKMEANS_MODEL_T;
CREATE TYPE PAL_ACCKMEANS_MODEL_T AS TABLE (
"JID" INTEGER,
"JSMODEL" VARCHAR (5000)
);
DROP TABLE PAL_ACCKMEANS_PDATA_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_PDATA_TBL (
"POSITION" INT,
"SCHEMA_NAME" NVARCHAR (256),
"TYPE_NAME" NVARCHAR (256),
"PARAMETER_TYPE" VARCHAR (7)
);
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (1, 'DM_PAL', 'PAL_ACCKMEANS_DATA_T', 'IN');
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (2, 'DM_PAL', 'PAL_CONTROL_T', 'IN');
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (3, 'DM_PAL', 'PAL_ACCKMEANS_ASSIGNED_T', 'OUT');
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (4, 'DM_PAL', 'PAL_ACCKMEANS_CENTERS_T', 'OUT');
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (5, 'DM_PAL', 'PAL_ACCKMEANS_SIL_CENTERS_T', 'OUT');
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (6, 'DM_PAL', 'PAL_ACCKMEANS_STATISTIC_T', 'OUT');
INSERT INTO PAL_ACCKMEANS_PDATA_TBL VALUES (7, 'DM_PAL', 'PAL_ACCKMEANS_MODEL_T', 'OUT');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_DROP('DM_PAL', 'PAL_ACCKMEANS_PROC');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_CREATE('AFLPAL', 'ACCELERATEDKMEANS', 'DM_PAL', 'PAL_ACCKMEANS_PROC', PAL_ACCKMEANS_PDATA_TBL);
DROP TABLE #PAL_CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL (
"NAME" VARCHAR (100),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('GROUP_NUMBER', 4, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('INIT_TYPE', 1, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('DISTANCE_LEVEL',2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MAX_ITERATION', 100, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('EXIT_THRESHOLD', null, 1.0E-6, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('CATEGORY_WEIGHTS', null, 0.5, null);
DROP TABLE PAL_ACCKMEANS_DATA_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_DATA_TBL LIKE PAL_ACCKMEANS_DATA_T;
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (0, 0.5, 'A', 0.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (1, 1.5, 'A', 0.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (2, 1.5, 'A', 1.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (3, 0.5, 'A', 1.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (4, 1.1, 'B', 1.2);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (5, 0.5, 'B', 15.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (6, 1.5, 'B', 15.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (7, 1.5, 'B', 16.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (8, 0.5, 'B', 16.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (9, 1.2, 'C', 16.1);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (10, 15.5, 'C', 15.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (11, 16.5, 'C', 15.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (12, 16.5, 'C', 16.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (13, 15.5, 'C', 16.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (14, 15.6, 'D', 16.2);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (15, 15.5, 'D', 0.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (16, 16.5, 'D', 0.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (17, 16.5, 'D', 1.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (18, 15.5, 'D', 1.5);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (19, 15.7, 'A', 1.6);
DROP TABLE PAL_ACCKMEANS_ASSIGNED_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_ASSIGNED_TBL LIKE PAL_ACCKMEANS_ASSIGNED_T;
DROP TABLE PAL_ACCKMEANS_CENTERS_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_CENTERS_TBL LIKE PAL_ACCKMEANS_CENTERS_T;
DROP TABLE PAL_ACCKMEANS_SIL_CENTERS_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_SIL_CENTERS_TBL LIKE PAL_ACCKMEANS_SIL_CENTERS_T;
DROP TABLE PAL_ACCKMEANS_STATISTIC_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_STATISTIC_TBL LIKE PAL_ACCKMEANS_STATISTIC_T;
DROP TABLE PAL_ACCKMEANS_MODEL_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_MODEL_TBL LIKE PAL_ACCKMEANS_MODEL_T;
CALL "DM_PAL".PAL_ACCKMEANS_PROC (
PAL_ACCKMEANS_DATA_TBL,
#PAL_CONTROL_TBL,
PAL_ACCKMEANS_ASSIGNED_TBL,
PAL_ACCKMEANS_CENTERS_TBL,
PAL_ACCKMEANS_SIL_CENTERS_TBL,
PAL_ACCKMEANS_STATISTIC_TBL,
PAL_ACCKMEANS_MODEL_TBL) with OVERVIEW;
SELECT * FROM PAL_ACCKMEANS_ASSIGNED_TBL;
SELECT * FROM PAL_ACCKMEANS_CENTERS_TBL;
SELECT * FROM PAL_ACCKMEANS_SIL_CENTERS_TBL;
SELECT * FROM PAL_ACCKMEANS_STATISTIC_TBL;
SELECT * FROM PAL_ACCKMEANS_MODEL_TBL;
Example (direct technique)
This example is documented for HANA 2.0 SPS 02 and later.
SET SCHEMA DM_PAL;
DROP TABLE PAL_ACCKMEANS_DATA_TBL;
CREATE COLUMN TABLE PAL_ACCKMEANS_DATA_TBL(
"ID" INTEGER,
"V000" DOUBLE,
"V001" VARCHAR(2),
"V002" INTEGER
);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (0, 0.5, 'A', 0);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (1, 1.5, 'A', 0);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (2, 1.5, 'A', 1);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (3, 0.5, 'A', 1);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (4, 1.1, 'B', 1);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (5, 0.5, 'B', 15);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (6, 1.5, 'B', 15);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (7, 1.5, 'B', 16);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (8, 0.5, 'B', 16);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (9, 1.2, 'C', 16);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (10, 15.5, 'C', 15);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (11, 16.5, 'C', 15);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (12, 16.5, 'C', 16);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (13, 15.5, 'C', 16);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (14, 15.6, 'D', 16);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (15, 15.5, 'D', 0);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (16, 16.5, 'D', 0);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (17, 16.5, 'D', 1);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (18, 15.5, 'D', 1);
INSERT INTO PAL_ACCKMEANS_DATA_TBL VALUES (19, 15.7, 'A', 1);
DROP TABLE #PAL_PARAMETER_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL(
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THREAD_RATIO', NULL, 0.5, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('GROUP_NUMBER', 4, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('INIT_TYPE', 1, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('DISTANCE_LEVEL',2, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('MAX_ITERATION', 100, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CATEGORY_WEIGHTS', NULL, 0.5, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CATEGORICAL_VARIABLE', NULL, NULL, 'V002');
CALL _SYS_AFL.PAL_ACCELERATED_KMEANS(PAL_ACCKMEANS_DATA_TBL, #PAL_PARAMETER_TBL, ?, ?, ?, ?, ?);
Comparison
wrapper procedure
Statement 'CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_CREATE('AFLPAL', 'ACCELERATEDKMEANS', 'DM_PAL', ...'
successfully executed in 787 ms 810 µs (server processing time: 787 ms 278 µs) - Rows Affected: 0
Statement 'CALL "DM_PAL".PAL_ACCKMEANS_PROC ( PAL_ACCKMEANS_DATA_TBL, #PAL_CONTROL_TBL, ...'
successfully executed in 208 ms 991 µs (server processing time: 205 ms 846 µs)
direct technique
Statement 'CALL _SYS_AFL.PAL_ACCELERATED_KMEANS(PAL_ACCKMEANS_DATA_TBL, #PAL_PARAMETER_TBL, ?, ?, ?, ?, ?)'
successfully prepared in 257 ms 96 µs
Statement 'CALL _SYS_AFL.PAL_ACCELERATED_KMEANS(PAL_ACCKMEANS_DATA_TBL, #PAL_PARAMETER_TBL, ?, ?, ?, ?, ?)'
successfully executed in 246 ms 214 µs (server processing time: 245 ms 256 µs)
R integration
References:
- 【SAP HANA】R Integration for SAP HANA on bilibili
- R Integration for SAP HANA on youtube
- R Integration for SAP HANA on Github
- SAP HANA R Integration Guide
Clustering in R
Execute as user ML_USER on database HXE:
-- clean up
DROP TYPE "T_DATA";
DROP TYPE "T_PARAMS";
DROP TYPE "T_RESULTS";
DROP PROCEDURE "R_CLUSTER";
DROP TABLE "DATA";
DROP TABLE "PARAMS";
DROP TABLE "RESULTS";
-- create table types
CREATE TYPE "T_DATA" AS TABLE ("ID" INTEGER, "LIFESPEND" DOUBLE, "NEWSPEND" DOUBLE, "INCOME" DOUBLE, "LOYALTY" DOUBLE);
CREATE TYPE "T_PARAMS" AS TABLE ("NAME" VARCHAR(100), "VALUE" INTEGER);
CREATE TYPE "T_RESULTS" AS TABLE ("ID" INTEGER, "CLUSTER" INTEGER);
-- create stored procedure with R script
CREATE PROCEDURE "R_CLUSTER" (IN data "T_DATA", IN params "T_PARAMS", OUT results "T_RESULTS")
LANGUAGE RLANG AS
BEGIN
library(cluster)
clusters <- kmeans(data[c('LIFESPEND','NEWSPEND','INCOME','LOYALTY')], params[params$NAME=='CLUSTERS',]$VALUE)
results <- cbind(data[c('ID')], CLUSTER=clusters$cluster)
END;
-- create tables
CREATE COLUMN TABLE "DATA" LIKE "T_DATA";
CREATE COLUMN TABLE "PARAMS" LIKE "T_PARAMS";
CREATE COLUMN TABLE "RESULTS" LIKE "T_RESULTS";
-- data
INSERT INTO "DATA" VALUES (1,7.2,3.6,6.1,2.5);
INSERT INTO "DATA" VALUES (2,5.4,3.4,1.5,0.4);
INSERT INTO "DATA" VALUES (3,6.9,3.2,5.7,2.3);
INSERT INTO "DATA" VALUES (4,5.5,2.3,4,1.3);
INSERT INTO "DATA" VALUES (5,6.1,2.9,4.7,1.4);
INSERT INTO "DATA" VALUES (6,5,3.3,1.4,0.2);
INSERT INTO "DATA" VALUES (7,5.8,2.7,5.1,1.9);
INSERT INTO "DATA" VALUES (8,5.1,3.4,1.5,0.2);
INSERT INTO "DATA" VALUES (9,6.4,3.2,5.3,2.3);
INSERT INTO "DATA" VALUES (10,5.7,2.8,4.5,1.3);
INSERT INTO "DATA" VALUES (11,6.8,3,5.5,2.1);
INSERT INTO "DATA" VALUES (12,4.3,3,1.1,0.1);
INSERT INTO "DATA" VALUES (13,7,3.2,4.7,1.4);
INSERT INTO "DATA" VALUES (14,5.4,3.4,1.7,0.2);
INSERT INTO "DATA" VALUES (15,5.4,3,4.5,1.5);
INSERT INTO "DATA" VALUES (16,5.7,2.9,4.2,1.3);
INSERT INTO "DATA" VALUES (17,6.3,2.9,5.6,1.8);
INSERT INTO "DATA" VALUES (18,5.1,3.7,1.5,0.4);
INSERT INTO "DATA" VALUES (19,6.3,2.8,5.1,1.5);
INSERT INTO "DATA" VALUES (20,5.6,2.5,3.9,1.1);
INSERT INTO "DATA" VALUES (21,5.7,2.6,3.5,1);
INSERT INTO "DATA" VALUES (22,5.4,3.9,1.3,0.4);
INSERT INTO "DATA" VALUES (23,6,3,4.8,1.8);
INSERT INTO "DATA" VALUES (24,5.5,2.6,4.4,1.2);
INSERT INTO "DATA" VALUES (25,4.7,3.2,1.3,0.2);
INSERT INTO "DATA" VALUES (26,6.8,2.8,4.8,1.4);
INSERT INTO "DATA" VALUES (27,5.9,3,4.2,1.5);
INSERT INTO "DATA" VALUES (28,5.1,3.8,1.9,0.4);
INSERT INTO "DATA" VALUES (29,5,3.5,1.3,0.3);
INSERT INTO "DATA" VALUES (30,5.8,4,1.2,0.2);
INSERT INTO "DATA" VALUES (31,7.7,2.8,6.7,2);
INSERT INTO "DATA" VALUES (32,5.2,2.7,3.9,1.4);
INSERT INTO "DATA" VALUES (33,4.4,2.9,1.4,0.2);
INSERT INTO "DATA" VALUES (34,5,3,1.6,0.2);
INSERT INTO "DATA" VALUES (35,7.6,3,6.6,2.1);
INSERT INTO "DATA" VALUES (36,5.1,3.5,1.4,0.3);
INSERT INTO "DATA" VALUES (37,6,3.4,4.5,1.6);
INSERT INTO "DATA" VALUES (38,5.7,3.8,1.7,0.3);
INSERT INTO "DATA" VALUES (39,6.2,3.4,5.4,2.3);
INSERT INTO "DATA" VALUES (40,7.3,2.9,6.3,1.8);
INSERT INTO "DATA" VALUES (41,5.7,2.5,5,2);
INSERT INTO "DATA" VALUES (42,6.7,3.1,5.6,2.4);
INSERT INTO "DATA" VALUES (43,6.1,2.8,4.7,1.2);
INSERT INTO "DATA" VALUES (44,5.7,4.4,1.5,0.4);
INSERT INTO "DATA" VALUES (45,5.6,2.7,4.2,1.3);
INSERT INTO "DATA" VALUES (46,5.5,4.2,1.4,0.2);
INSERT INTO "DATA" VALUES (47,6.3,3.4,5.6,2.4);
INSERT INTO "DATA" VALUES (48,6.7,3.1,4.7,1.5);
INSERT INTO "DATA" VALUES (49,6.3,3.3,6,2.5);
INSERT INTO "DATA" VALUES (50,6.3,2.3,4.4,1.3);
-- parameters
INSERT INTO "PARAMS" VALUES ('CLUSTERS',3);
-- call : results inline
--CALL "R_CLUSTER" ("DATA", "PARAMS", ?);
-- call : results in table
TRUNCATE TABLE "RESULTS";
CALL "R_CLUSTER" ("DATA", "PARAMS", "RESULTS") WITH OVERVIEW;
SELECT * FROM "RESULTS";
SQLScript Plan Profiler
The following experiment builds on the clustering results above.
Execute as user ML_USER on database HXE to see the concrete execution time and details for each line of the R code:
-- results inline
CALL "R_CLUSTER" ("DATA", "PARAMS", "RESULTS") WITH OVERVIEW WITH HINT(SQLSCRIPT_PLAN_PROFILER);
Execute the following as user SYSTEM on database HXE. For the concrete steps, see R 15 SQLScript Plan Profiler in 【SAP HANA】R Integration for SAP HANA on bilibili.
-- results stored
ALTER SYSTEM START SQLSCRIPT PLAN PROFILER FOR PROCEDURE "ML_USER"."R_CLUSTER";  -- start collecting
ALTER SYSTEM CLEAR SQLSCRIPT PLAN PROFILER FOR PROCEDURE "ML_USER"."R_CLUSTER";  -- clear collected data
ALTER SYSTEM STOP SQLSCRIPT PLAN PROFILER FOR PROCEDURE "ML_USER"."R_CLUSTER";   -- stop collecting
SELECT * FROM "M_SQLSCRIPT_PLAN_PROFILERS";         -- profilers that are currently active
SELECT * FROM "M_SQLSCRIPT_PLAN_PROFILER_RESULTS";  -- the collected measurements
Appendix
HANA Content Package permission issue
Error: a newly created user has no privileges for the HANA Modeler.
Solution: following Authorization of User Privilege on SAP HANA Modeler Development and Read/Write Authorization for a single package, execute the following statement:
GRANT MODELING TO DEVUSER;
JupyterLab
(This can be configured after installing R. Since the SUSE system does not support Python 3, it is left disabled for now.)
Create a user
sudo useradd -m -d /home/jupyteradm -c "Jupyter Administrator" jupyteradm
Then, execute the following commands to properly configure the jupyteradm user so that it can proceed with the installation:
sudo bash -c 'echo "jupyteradm ALL=(ALL) NOPASSWD: ALL" >>/etc/sudoers'
sudo bash -c 'echo "umask 022" >>/home/jupyteradm/.bashrc'
sudo bash -c 'echo "PATH=/home/jupyteradm/.local/bin:$PATH" >>/home/jupyteradm/.bashrc'
Now, you can switch to the jupyteradm user if not done yet:
sudo su -l jupyteradm
Check the system registration status:
sudo SUSEConnect --status-text
If it is not registered, follow the registration steps described earlier.
Install the python-devel package
sudo zypper install python-devel
Install virtualenv
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
pip install --user virtualenv
Download clients_linux_x86_64.tgz
/usr/sap/HXE/home/bin/HXEDownloadManager_linux.bin linuxx86_64 installer \
-d . \
clients_linux_x86_64.tgz
Extract clients_linux_x86_64.tgz to /opt/hxe
tar -xvzf ./clients_linux_x86_64.tgz -C .
Go to /opt/hxe and extract hdb_client_linux_x86_64.tgz
tar -xvzf ./hdb_client_linux_x86_64.tgz -C .
Run the installation
cd ./HDB_CLIENT_LINUX_X86_64
./hdbinst
cd ~/
virtualenv ~/jupyter
source ~/jupyter/bin/activate
Install JupyterLab
pip install jupyterlab
Generate the default configuration file
jupyter-lab --generate-config
Let's now create a local dummy certificate using openssl:
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /home/jupyteradm/.jupyter/mykey.key -out /home/jupyteradm/.jupyter/mycert.pem
Once the certificate is created, you can edit the generated configuration file:
vi /home/jupyteradm/.jupyter/jupyter_notebook_config.py
Add the following content to the file:
# Set options for certfile
c.NotebookApp.certfile = u'/home/jupyteradm/.jupyter/mycert.pem'
c.NotebookApp.keyfile = u'/home/jupyteradm/.jupyter/mykey.key'
# Set ip to '*' to bind on all interfaces (ips) for the public server
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to set a known, fixed port for server access
c.NotebookApp.port = 8888
Set a password
jupyter notebook password
Start JupyterLab
jupyter lab --ip=hxehost &
Then open https://hxehost:8888/tree in a browser.
Use the Python kernel as a SQL tool
Activate the Jupyter virtual environment
source ~/jupyter/bin/activate
Install the Python modules
pip install \
sqlalchemy \
sqlalchemy-hana \
ipython-sql \
/home/jupyteradm/sap/hdbclient/hdbcli-x.y.Z.tar.gz # look in this folder for the matching hdbcli version
Assuming you have completed the configuration described in this article, you can now test the connection:
import sqlalchemy
%reload_ext sql
%config SqlMagic.displaylimit = 5
hxe_connection = 'hana://ML_USER:@localhost:39015'
%sql $hxe_connection
The returned result should be u'Connected: ML_USER@None'
%sql select count(1) FROM TABLES;
Format the output:
result = _
print(result)
print(type(result))
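The ipython-sql result object can also be converted into a pandas DataFrame, which is convenient inside JupyterLab. A minimal sketch, assuming pandas is installed in the same virtual environment:
# `result` is the ipython-sql result set captured above via `result = _`
df = result.DataFrame()  # requires pandas
print(df.head())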
Fetch data with the Python DB API
from hdbcli import dbapi
# connect to the HXE tenant database (fill in the ML_USER password)
conn = dbapi.connect(
    address="localhost",
    port=39015,
    user="ML_USER",
    password=""
)
# run a query and fetch a single row
with conn.cursor() as cursor:
    cursor.execute("SELECT CURRENT_USER FROM DUMMY")
    result = cursor.fetchone()
    print(result[0])  # should print ML_USER
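hdbcli follows the Python DB API (PEP 249) and accepts qmark-style parameters, so values can be bound instead of being concatenated into the SQL string. A small sketch reusing the conn object from above (filtering on the ML_DATA schema is just an example of mine):
# reuse `conn` from the snippet above; the ? placeholder is bound safely
with conn.cursor() as cursor:
    cursor.execute(
        "SELECT SCHEMA_NAME, TABLE_NAME FROM TABLES WHERE SCHEMA_NAME = ?",
        ("ML_DATA",))
    for schema_name, table_name in cursor.fetchall():
        print(schema_name, table_name)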
Use the SQLAlchemy engine
from sqlalchemy import create_engine
engine = create_engine('hana://ML_USER:@localhost:39015')
print(type(engine))
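To actually run a statement through the engine, open a connection from it. A minimal sketch with the password placeholder filled in, using the SQLAlchemy 1.x style API that sqlalchemy-hana targeted at the time (plain SQL strings are accepted there):
from sqlalchemy import create_engine

engine = create_engine('hana://ML_USER:@localhost:39015')  # fill in the password
with engine.connect() as connection:
    row = connection.execute("SELECT CURRENT_USER FROM DUMMY").fetchone()
    print(row[0])  # should print ML_USER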
Configure the R kernel (not used for now)
Log in as the jupyteradm user and start R:
sudo R
Install IRkernel in R
install.packages('devtools')
install.packages('IRkernel')
Make JupyterLab use IRkernel
IRkernel::installspec()
# IRkernel::installspec(user = FALSE) # make it available to all users
Error: jupyter-client has to be installed but “jupyter kernelspec --version” exited with code 127.
Solution: find the jupyter installation path with which jupyter, then link it into /usr/bin/jupyter:
sudo ln -s /home/jupyteradm/jupyter/bin/jupyter /usr/bin/jupyter
Install RODBC in R
install.packages("RODBC")
Error: ODBC headers sql.h and sqlext.h not found
Solution: sudo zypper install unixODBC*
Error: [unixODBC][Driver Manager]Data source name not found, and no default driver specified
Solution: follow mlb-hxe-tools-sql-odbc to complete the relevant configuration
Error: [unixODBC][Driver Manager]Can't open lib '/usr/sap/hdbclient/libodbcHDB.so' : file not found
Solution: since the hdbclient was already installed earlier, locate the missing file with sudo find / -name filename, then update the corresponding entry in /etc/odbcinst.ini
With the configuration in place, you should then be able to use the R kernel in JupyterLab:
library("RODBC")
odbcConnection <- odbcConnect(
  dsn="DSN_HXE",
  uid="ML_USER",
  pwd="ERdfcv34"
)
result <- sqlQuery(odbcConnection, "select CURRENT_USER from DUMMY")
print(result)
odbcClose(odbcConnection)
Note: in my setup, however, the R kernel never actually worked and R code would not run in JupyterLab, so I eventually stopped using the R kernel there.
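If the R kernel refuses to cooperate, the same DSN can at least be sanity-checked from the Python kernel. A minimal sketch of mine with pyodbc (the pyodbc package and the password placeholder are my assumptions; DSN_HXE is the data source configured above):
import pyodbc

# connect through the unixODBC data source (DSN) configured earlier
conn = pyodbc.connect("DSN=DSN_HXE;UID=ML_USER;PWD=")  # fill in the password
cursor = conn.cursor()
cursor.execute("SELECT CURRENT_USER FROM DUMMY")
print(cursor.fetchone()[0])  # should print ML_USER
conn.close()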
References
Installation and Configuration
【HXE:1】SAP HANA, express edition: the HXE installation guide
【HXE:1.1】Power BI + HANA? Yes!: connecting Power BI to HANA to consume Calculation Views
【HXE:2】Python + HANA? Yes!: connecting to the HANA database with the Python package hdbcli
【HXE:2.1】R + HANA? Yes!:
Installation
Install SAP HANA 2.0, express edition on a preconfigured virtual machine
Install SAP HANA, express edition on a native Linux machine
Setup Putty and WinSCP to access your HANA Express Edition on Google Cloud Platform
How to install SAP Hana Express Edition 2.0 on Ubuntu 18.04 (Bionic Beaver)
Configurations
Use the SAP HANA Tools for Eclipse as a SQL query tool with SAP HANA, express edition
Use Jupyter Notebook with SAP HANA, express edition
Configure the TensorFlow integration (SAP HANA EML) with SAP HANA, express edition
Once everything is properly configured, you can start your own machine learning project leveraging the power of HXE. Here are some jolly good guides for a deep dive:
Introducing “Project: Machine Learning in a Box”
Others
Bring your SAP HANA data to life with Microsoft Power BI
Consuming SAP HANA Express Edition information models in Microsoft Power BI using live connection
Learning Resources
Learning Journey
A learning journey is a visual guide to help you think about the path to become fully competent with a new SAP innovation.
This visualization should make it easier to see every offering that is available for a topic and understand how things are connected.
The journey is not a “mandatory” track or sequence of courses. It offers you a high-level view of your choices. Based on your goals and your prior knowledge, you can select what would work for you.
It is recommended that you choose the field you are interested in and take the courses at your own pace.
Predictive Analytics
SAP Predictive Analytics from SAP Blog: PA-related news
SAP Predictive Analytics 3.x Workshop: Overview, Installation, and Modeling: an introduction to the new features in PA 3.x
Using SAP BusinessObjects Predictive Analytics with SAP HANA, express edition: configuring PA to connect to HXE
A Great Combination: Python and SAP Predictive Analytics: configuring the connection between PA and Python
Predictive Analytics (Data Science): official HANA Academy tutorial videos
Predictive Analysis Library (GitHub Repo): companion material for the official HANA Academy tutorial videos
openSAP
Getting Started with Data Science
Data Science in Action - Building a Predictive Churn Model
Enterprise Deep Learning with TensorFlow
Miscellaneous
Authorization Dependency Viewer
SAP HANA SYSTEM VIEWS学习–Overview Tab
Machine learning basics with SAP HANA Express edition & Tensorflow Part 1.
SAP HANA External Machine Learning: Take 2
深入理解SAP HANA与R整合的原理(一)
深入理解SAP HANA与R整合的原理(二)
R语言包安装并实现与HANA的整合
SAP HANA中PAL算法使用入门
SAP HANA整合R的性能与PAL比较
Custom time series analytics with HANA, R and UI5
Working with R integration in HANA 2.0 SPS02
Basket Analysis with SAP Predictive Analysis and SAP HANA – Part 1