SciDB使用笔记(根据项目需要,主要针对的聚合操作)

1.   SciDB Basics

1.      进入AQL/AFL界面:

1)使用AQL:到bin目录下,使用./iquery

AQL% 输入AQL语言,并且以分号结尾。

2)使用AFL:到bin目录下,使用./iquery –a

AFL% 输入AFL语言,并且以分号结尾。

按Ctrl+C或者输入quit/exit退出界面。

命令对大小写不敏感,一条命令可以拆分成任意多行。

可以使用help;来查找帮助。

set lang afl; –> 将查询界面语言设为AFL;

set timer; –> 设置timer计时;

set no fetch –> 不输出结果;

2.      新建数组:

AQL%  CREATE ARRAY array_name <attribute_list>[dimension_list];

AFL界面也可以。

其中dimension list里每维的后面两个参数表示chunk sizechunk overlap

单个Chunk最好设置为包含10至20MB的数据。例如,对于双精度浮点数,设一个500,000至1,000,000之间的值作为chunk size最好。

*对于skewed data,可能需要进行repartitioning来重新设置chunksize。

 

Example:

CREATE ARRAY test_array<val:int32> [x=0:3, 4, 0, y=0:2, 3, 0];

生成一个43列的数组,其中X是行数,y是列数,维数从高到低。Chunk也是43列。

3.      查看数组的schema

AFL%  SHOW(array_name);

4.      用随机数初始化数组:

AFL%  STORE(BUILD(array_name,random()%grid_size/1.0), array_name);

5.      重命名数组:

AFL% RENAME(array_name, new_array_name);

6.      给数组载入数据:

AFL%  LOAD(array_name, ‘file_path’);

csv2scidb [options] < input_file >  >  output_file

AQL%  LOAD array_name FROM ' file_path '; /* thefile is in the scidb format */

 

csv2scidb: Utilityfor conversion of CSV file to SciDB input text

format

Usage: csv2scidb[options] [ < input-file ] [ > output-file ]

Default: -f 0 -c1000000 -q

Options:

-v version ofutility

-i PATH input file

-o PATH output file

-a PATH appendedoutput file

-c INT length ofchunk

-f INT startingcoordinate

-n INT number ofinstances

-d char delimiter -default ,

-p STRtype pattern - N number, S string, s nullable-string,

C char, cnullable-char

-q Quote the inputline exactly, simply wrap it in ()

-s Nskip N lines at the beginning of the file

-h prints thishelpful error message

You cannot specify both -qand a set of -p options.

 

./csv2scidb-s 1 -p SNSN < .../examples/olympic_data.csv > .../examples/

olympic_data.scidb    (bin目录下)

AQL% LOAD winnersFlatFROM'../examples/olympic_data.scidb';

7.      打印数组:

AQL% SELECT * FROM array_name;

AFL% SCAN(array_name);

8.      删除数组:

AQL%  DROP ARRAY array_name;

9.      存储输出结果到一个定义好的数组:

AFL%  STORE(output_array_name, query_statement);

2.   SciDB Aggregates

1.      Grid Aggregates

AQL% SELECT aggregate(attribute)[,aggregate(attribute)] ...

INTO dst-array

FROM src-array | array-expression

WHERE where-expression

REGRIDAS

( PARTITION BY dimension1dimension1-size

[, dimension2 dimension2-size]... ) ;

 

Example:

AQL%

SELECT max(attr1), min(attr1)

FROM m4x4

REGRID AS (PARTITION BY x 2, y 2);

 

AFL% regrid(array,grid_1, grid_2[,...,grid_N],

aggregate_call_1 [, aggregate_call_2,...,aggregate_call_N]);

 

Example:

AFL% regrid(m4x4, 2,2,avg(val));

 

AFL%  REGRID(array_name,grid_size1, grid_size2, …, aggregate(val));

2.      Window Aggregates

AQL% SELECT aggregate(attribute)[, aggregate (attribute)]...

INTO dst-array

FROM src-array | array-expression

WHERE where-expression

FIXED | VARIABLE WINDOW AS

(PARTITION BY dimension1dim1-low PRECEDING ANDdim1-

high FOLLOWING

[, dimension2dim2-low PRECEDING AND dim2-

high FOLLOWING ]... );

 

Example:

AQL%

SELECT sum(attr1)

FROM m4x4

FIXEDWINDOW AS

(PARTITION BY x1 PRECEDING AND 1 FOLLOWING,

y 1 PRECEDING AND 1 FOLLOWING);

 

AFL% window(array,dim_1_low,dim_1_high,[dim_2_low,dim_2_high,]...

aggregate_1[,aggrgegate_2, ...]

Example:

AFL% window(m4x4,1,1,1,1,sum(attr1));

 

AFL% WINDOW(array_name, dim1_low, dim1_high, …, aggregate(val));

3.   SciDB Sampling

1.      一次抽取一个元素:

AFL%  BERNOULLI (array_name, probability:double [,seed:int64]);

2.      一次抽取一个chunkchunk大小在数组声明时设定:

AFL%  SAMPLE(array_name, probability:double [,seed:int64]);

 

Sampling的结果为保持原来数组形状sparse array

4.   SciDB Script

1)      AQL script:

./iquery –q “……”;

2)      AFL script:

./iquery –aq “……”;

 

Use the command “su” to maintainthe sudo privilege, so that the user doesn’t have to repeatedly type theannoying sudo.

 

Use the built-in “time”utility in front of the SciDB command. Example: time ./iquery –aq …

5.   SciDB Other Commands

1.      Redimension

AFL% create arraytwo_dim<val:double>[d(string)=5,5,0,

p(string)=5,5,0];

 

When you examine the data,notice that it could be expressed in a 2-dimensional format like this:

probe-0 probe-1 probe-2 probe-3probe-4

device-0 0.01 30.36 111.21242.56 424.41

device-1 2.04 42.49 133.44274.89 466.84

device-2 6.09 56.64 157.69309.24 511.29

device-3 12.16 72.81 183.96345.61 557.76

device-4 20.25 91 212.25 384606.25

SciDB allows you to redimensionthe data so that you can store it in this 2-dimensional format. First, create anarray with 2 dimensions:

 

AFL% create arraytwo_dim<val:double>[d(string)=5,5,0,

p(string)=5,5,0];

 

Each of the dimensions is ofsize 5, corresponding to a dimension in the 5-by-5 table. Now, you can use theredimension_store operator to redimension the array device_probe into the arraytwo_dim:

 

AFL%redimension_store(device_probe, two_dim);

[

[(0.01),(30.36),(111.21),(242.56),(424.41)],

[(2.04),(42.49),(133.44),(274.89),(466.84)],

[(6.09),(56.64),(157.69),(309.24),(511.29)],

[(12.16),(72.81),(183.96),(345.61),(557.76)],

[(20.25),(91),(212.25),(384),(606.25)]

]

Now the data is stored so that device and probe numbers are the dimensions of the array. This means that you can use the dimension indices to select data from the array. For example, to select the second device from the third probe, use the dimension indices:


下面说说自己的用户体验:

1)单线程的SciDB(single Instance)还是比较健壮t的,基本上不会随便垮掉。

2)分布式的SciDB做得有点烂,很多时候不知道为什么会直接垮掉,而且最要命的是有时候重启数据库还启动不了,即使重启所有机器后都不行。最后只能够重新初始化数据库并重新导入数据——致命缺陷啊!

3)不怎么区分基于数组坐标的subsetting和基于数组元素值的subsetting,如果把这些predicate都写在AQL的where语句里,SciDB会全部把数组全都扫一遍再作过滤,这样的话对于基于数组坐标的效率会非常的低。不过还好SciDB提供了一个自定义的between函数,算是可以勉强解决这个问题。所以啊,基于数组坐标的subsetting千万不能用直接写到where语句里!

你可能感兴趣的:(Data,scientific,SciDB,Mana)