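4. ES Rally resource downloads

Reference: Offline Usage

1. About the downloads

At run time, Rally needs to download data from the public internet:

Benchmark scenario configuration files (tracks) from GitHub. Even with network access, this download was essentially unusable here, so switch to a manual download.
Compressed sample data archives from AWS S3. The S3 URL could not be fetched with curl in this environment, so this also has to be downloaded manually.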

Downloading from GitHub fails

Standard output shows:

[WARNING] No Internet connection detected. Automatic download of track data sets etc. is disabled.

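Download manually instead:

git clone git@...

or

git clone https://...

or

download the zip archive and extract it to the target directory.

Prefer the git (SSH) protocol; HTTPS requires authentication.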

Downloading sample data from AWS S3 fails

The geopoint sample data is used as the example.

Standard output shows:

[ERROR] Cannot race. Error in track preparator (('Cannot find /home/apps/.rally/benchmarks/data/geopoint/documents.json.bz2. Please disable offline mode and retry again.', None))

With DEBUG logging enabled, the log shows the following errors:

2019-07-09 10:29:02,78 -not-actor-/PID:1800 esrally.racecontrol ERROR A benchmark failure has occurred
2019-07-09 10:29:02,79 -not-actor-/PID:1800 esrally.racecontrol INFO Telling benchmark actor to exit.
2019-07-09 10:29:00,659 ActorAddr-(T|:38646)/PID:2134 esrally.track.loader INFO Downloading data from [http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geopoint/documents.json.bz2] (482 MB) to [/home/apps/.rally/benchmarks/data/geopoint/documents.json.bz2].
2019-07-09 10:29:02,64 ActorAddr-(T|:38646)/PID:2134 esrally.actor ERROR Error in track preparator
Traceback (most recent call last):

  File "/apps/svr/python-3.5.2/lib/python3.5/site-packages/esrally/actor.py", line 84, in guard
    return f(self, msg, sender)

  File "/apps/svr/python-3.5.2/lib/python3.5/site-packages/esrally/driver/driver.py", line 307, in receiveMsg_PrepareTrack
    track.prepare_track(msg.track, cfg)

  File "/apps/svr/python-3.5.2/lib/python3.5/site-packages/esrally/track/loader.py", line 286, in prepare_track
    prep.prepare_document_set(document_set, data_root[0])

  File "/apps/svr/python-3.5.2/lib/python3.5/site-packages/esrally/track/loader.py", line 424, in prepare_document_set
    self.download(document_set.base_url, target_path, expected_size, msg)

  File "/apps/svr/python-3.5.2/lib/python3.5/site-packages/esrally/track/loader.py", line 345, in download
    net.download(data_url, target_path, size_in_bytes, progress_indicator=progress)

  File "/apps/svr/python-3.5.2/lib/python3.5/site-packages/esrally/utils/net.py", line 156, in download
    (local_path, download_size, expected_size_in_bytes))

esrally.exceptions.DataError: ('Download of [/home/apps/.rally/benchmarks/data/geopoint/documents.json.bz2] is corrupt. Downloaded [2548] bytes but [505295401] bytes are expected. Please retry.', None)

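In other words, downloading the sample data file documents.json.bz2 with curl failed:

# 482 MB
http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geopoint/documents.json.bz2

The URL is redirected, so curl ends up with a 2548-byte HTML page instead of the archive. Download the file with a browser instead, then upload it with the rz command to the data directory:

~/.rally/benchmarks/data/geopoint/

The other file, documents-2.json.bz2, hits the same problem and is handled the same way:

# 252 MB
http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geopoint/documents-2.json.bz2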
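
The failure above can be detected from the file itself: Rally expected 505295401 bytes and received a 2548-byte HTML page. The following is a minimal sketch of a check to run after a manual download, assuming a shell with wc and head -c available. check_download is a hypothetical helper name and is not part of Rally.

```shell
# Hypothetical helper (not part of Rally): verify a manually downloaded corpus file.
# Usage: check_download FILE EXPECTED_BYTES
check_download() {
    file=$1
    expected_bytes=$2

    # Compare the on-disk size with the size Rally expects
    actual_bytes=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual_bytes" != "$expected_bytes" ]; then
        echo "size mismatch: got $actual_bytes, expected $expected_bytes"
        return 1
    fi

    # A bzip2 archive starts with the magic bytes "BZh"; an HTML page does not
    magic=$(head -c 3 "$file")
    if [ "$magic" != "BZh" ]; then
        echo "not a bzip2 file"
        return 1
    fi

    echo "ok"
}
```

For the geopoint file, check_download ~/.rally/benchmarks/data/geopoint/documents.json.bz2 505295401 should print ok once the upload is complete.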

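2. Downloading resources manually

2.1 Download the benchmark scenario configuration (tracks)

mkdir -p ~/.rally/benchmarks
cd ~/.rally/benchmarks
sudo update-ca-trust
git clone https://github.com/elastic/rally-tracks.git
# or, with an SSH public key configured on GitHub:
git clone git@github.com:elastic/rally-tracks.git

List the ready-made benchmark scenarios that the tracks project provides: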

esrally list tracks

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[WARNING] No Internet connection detected. Automatic download of track data sets etc. is disabled.
Available tracks:

Name           Description                                                                                                                                                                        Documents    Compressed Size    Uncompressed Size    Default Challenge        All Challenges
-------------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  -----------  -----------------  -------------------  -----------------------  ---------------------------------------------------------------------------------------------------------------------------
eventdata      This benchmark indexes HTTP access logs generated based sample logs from the elastic.co website using the generator available in https://github.com/elastic/rally-eventdata-track  20,000,000   755.1 MB           15.3 GB              append-no-conflicts      append-no-conflicts
geonames       POIs from Geonames                                                                                                                                                                 11,396,505   252.4 MB           3.3 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
geopoint       Point coordinates from PlanetOSM                                                                                                                                                   60,844,404   481.9 MB           2.3 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
geopointshape  Point coordinates from PlanetOSM indexed as geoshapes                                                                                                                              60,844,404   470.5 MB           2.6 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
geoshape       Shapes from PlanetOSM                                                                                                                                                              60,523,283   13.4 GB            45.4 GB              append-no-conflicts      append-no-conflicts
http_logs      HTTP server log data                                                                                                                                                               247,249,096  1.2 GB             31.1 GB              append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-index-only-with-ingest-pipeline,update
metricbeat     Metricbeat data                                                                                                                                                                    1,079,600    87.6 MB            1.2 GB               append-no-conflicts      append-no-conflicts
nested         StackOverflow Q&A stored as nested docs                                                                                                                                            11,203,029   663.1 MB           3.4 GB               nested-search-challenge  nested-search-challenge,index-only
noaa           Global daily weather measurements from NOAA                                                                                                                                        33,659,481   947.3 MB           9.0 GB               append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only
nyc_taxis      Taxi rides in New York in 2015                                                                                                                                                     165,346,692  4.5 GB             74.3 GB              append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts-index-only,update,append-ml
percolator     Percolator benchmark based on AOL queries                                                                                                                                          2,000,000    102.7 kB           104.9 MB             append-no-conflicts      append-no-conflicts
pmc            Full text benchmark with academic papers from PMC                                                                                                                                  574,199      5.5 GB             21.7 GB              append-no-conflicts      append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
so             Indexing benchmark using up to questions and answers from StackOverflow                                                                                                            36,062,278   8.9 GB             33.1 GB              append-no-conflicts      append-no-conflicts

-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

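When the list command runs, Rally automatically performs a copy, equivalent to:

cd ~/.rally/benchmarks; mkdir tracks; cp -r rally-tracks tracks/default

so the rally-tracks directory can now be deleted:

rm -rf rally-tracks

Be sure to pick one track under tracks/default (the copy of rally-tracks) and work out what each of its files does. That makes the whole benchmark flow clear.

As an aside, a benchmark boils down to the following steps:

Specify or create the target ES cluster
Create the index and mapping
Load the sample data
Run read and write operations
Report the benchmark results

With this sequence in mind, a track's configuration is easy to follow.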

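2.2 Download the sample data manually (data)

The list command shows that the sample data size differs from one benchmark scenario to another, so download only what you need. geopoint is used as the example here.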

# Add python3 and git 1.9 to PATH
export PATH=/apps/svr/python-3.5.2/bin:$PATH
export PATH=/apps/svr/git/bin:/apps/svr/git/libexec/git-core:$PATH

# List all default tracks
esrally list tracks

# Get the base-url of a track, geopoint in this example
grep base-url ~/.rally/benchmarks/tracks/default/geopoint/track.json

    "base-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geopoint",

# Get the file names
cat ~/.rally/benchmarks/tracks/default/geopoint/files.txt

documents.json.bz2
documents-1k.json.bz2

# Build the download URLs: base-url + file name
http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geopoint/documents.json.bz2
http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geopoint/documents-1k.json.bz2

# After downloading with a browser, upload to the data directory with rz
cd ~/.rally/benchmarks/data/geopoint/
rz
du -sh *.bz2

253M    documents-2.json.bz2
482M    documents.json.bz2

# Verify: check that the files match the information in track.json

vim ~/.rally/benchmarks/tracks/default/geopoint/track.json

      "documents": [
        {
          "source-file": "documents.json.bz2",
          "document-count": 60844404,
          "compressed-bytes": 505295401,
          "uncompressed-bytes": 2448564579
        }

cd ~/.rally/benchmarks/data/geopoint
bzip2 -dk documents.json.bz2

wc -l documents.json
60844404 documents.json

du -b documents.json.bz2 documents.json
505295401   documents.json.bz2
2448564579  documents.json
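
The manual comparison above can be scripted. The following is a minimal sketch that pulls one expected value out of track.json, assuming the layout shown in the excerpt (one field per line, with "source-file" as the first field of each entry). track_field is a hypothetical helper name, not something Rally provides.

```shell
# Hypothetical helper: print a numeric field of one document entry in track.json.
# Usage: track_field TRACK_JSON SOURCE_FILE FIELD
# Assumes one field per line and "source-file" listed first in each entry.
track_field() {
    track_json=$1
    source_file=$2
    field=$3
    awk -v file="$source_file" -v field="$field" '
        # Start reading at the entry for the requested source file
        index($0, "\"source-file\": \"" file "\"") { in_block = 1; next }
        # Stop at the closing brace of that entry
        in_block && /}/ { in_block = 0 }
        # Print the digits of the requested field and quit
        in_block && index($0, "\"" field "\"") {
            gsub(/[^0-9]/, "", $2); print $2; exit
        }
    ' "$track_json"
}
```

The result can then be compared with the actual size, for example [ "$(wc -c < documents.json.bz2 | tr -d ' ')" = "$(track_field track.json documents.json.bz2 compressed-bytes)" ].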

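A small script to list the URLs of all sample data that needs to be downloaded:

listfiles.sh

#!/bin/bash
# Run from ~/.rally/benchmarks/tracks/default: prints each track name
# followed by the download URLs of its data files.
for track_file in */track.json; do
    track_name=${track_file%%/*}
    echo "$track_name"

    # The first base-url in track.json, with quotes and the trailing comma stripped
    baseurl=$(grep base-url "$track_file" | awk '{print $2}' | sed -e 's/,//g' -e 's/"//g' | head -n 1)

    # One URL per entry in files.txt: base-url + "/" + file name
    for data_file in $(cat "$track_name/files.txt"); do
        echo "$baseurl/$data_file"
    done | sort | uniq
    echo
done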
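
Once some files have been uploaded, it helps to know which ones are still missing. The following is a minimal sketch, assuming the directory layout used in this section (files.txt under the tracks directory, archives under the data directory). missing_corpus_files is a hypothetical helper name.

```shell
# Hypothetical helper: list the files named in a track's files.txt
# that are not yet present in the data directory.
# Usage: missing_corpus_files TRACKS_ROOT DATA_ROOT TRACK
missing_corpus_files() {
    tracks_root=$1
    data_root=$2
    track=$3
    # "|| [ -n "$name" ]" also handles a last line without a trailing newline
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue
        [ -f "$data_root/$track/$name" ] || echo "$name"
    done < "$tracks_root/$track/files.txt"
}
```

For example, missing_corpus_files ~/.rally/benchmarks/tracks/default ~/.rally/benchmarks/data geopoint prints nothing when every listed file has been uploaded.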

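3. Download ES configurations (teams) [optional]

By default, the benchmark measures a local ES instance that Rally itself sets up, which requires the cars configuration (each car represents one ES configuration).

cd ~/.rally/benchmarks/
mkdir teams
git clone https://github.com/elastic/rally-teams.git
# or, with an SSH public key configured on GitHub:
git clone git@github.com:elastic/rally-teams.git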

esrally list cars

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

Available cars:

Name                     Type    Description
-----------------------  ------  ----------------------------------
16gheap                  car     Sets the Java heap to 16GB
1gheap                   car     Sets the Java heap to 1GB
24gheap                  car     Sets the Java heap to 24GB
2gheap                   car     Sets the Java heap to 2GB
4gheap                   car     Sets the Java heap to 4GB
8gheap                   car     Sets the Java heap to 8GB
defaults                 car     Sets the Java heap to 1GB
basic-license            mixin   Basic License
debug-non-safepoints     mixin   More accurate CPU profiles
ea                       mixin   Enables Java assertions
fp                       mixin   Preserves frame pointers
g1gc                     mixin   Enables the G1 garbage collector
trial-license            mixin   Trial License
unpooled                 mixin   Enables Netty's unpooled allocator
x-pack-ml                mixin   X-Pack Machine Learning
x-pack-monitoring-http   mixin   X-Pack Monitoring (HTTP exporter)
x-pack-monitoring-local  mixin   X-Pack Monitoring (local exporter)
x-pack-security          mixin   X-Pack Security

-------------------------------
[INFO] SUCCESS (took 3 seconds)
-------------------------------

As with tracks, running the list cars command performs the following copy:

cp -r rally-teams teams/default

so the rally-teams directory can be deleted:

rm -rf rally-teams

Next, look at the default ES configuration:

cd ~/.rally/benchmarks/teams/default/cars/v1; ll; cat defaults.ini

[meta]
description=Sets the Java heap to 1GB
type=car

[config]
base=vanilla

[variables]
heap_size=1g

The heap size is 1 GB.
It builds on the configuration in the vanilla directory; tree vanilla shows:

vanilla
├── config.ini
├── README.md
└── templates
    └── config
        ├── elasticsearch.yml
        ├── jvm.options
        └── log4j2.properties

See elasticsearch.yml and jvm.options for the details; they are not covered here.
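
To compare cars quickly, the [variables] section of each .ini file can be read from the shell. The following is a minimal sketch, assuming the simple key=value layout of defaults.ini shown above. car_variable is a hypothetical helper name, not a Rally command.

```shell
# Hypothetical helper: print one key from the [variables] section of a car .ini file.
# Usage: car_variable INI_FILE KEY
car_variable() {
    ini_file=$1
    key=$2
    awk -F '=' -v key="$key" '
        # Remember which section we are in
        /^\[/ { section = $0; next }
        # Print the value only when the key is inside [variables]
        section == "[variables]" && $1 == key { print $2; exit }
    ' "$ini_file"
}
```

For the file shown above, car_variable defaults.ini heap_size prints 1g.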

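4. Downloading the ES distribution

At run time, Rally takes the ES version from the --distribution-version=5.5.2 parameter and automatically downloads the matching ES release package into the distributions directory, for example:

~/.rally/benchmarks/distributions/elasticsearch-5.5.2.tar.gz

It then unpacks and installs it into the races directory: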

~/.rally/benchmarks/races/2019-07-10-11-42-55/rally-node-0/install/elasticsearch-5.5.2

ES log directory:

~/.rally/benchmarks/races/2019-07-10-11-42-55/rally-node-0/logs

Heap dump directory:

~/.rally/benchmarks/races/2019-07-10-11-42-55/rally-node-0/heapdump
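
The three paths above follow one pattern, so they can be derived from the race directory name. The following is a minimal sketch, assuming the layout shown in this section. race_paths is a hypothetical helper name, and the node name defaults to rally-node-0 as in the examples.

```shell
# Hypothetical helper: print the install, log and heap dump directories of one race node.
# Usage: race_paths RACE_ID [NODE]
race_paths() {
    race_id=$1
    node=${2:-rally-node-0}
    base="$HOME/.rally/benchmarks/races/$race_id/$node"
    for sub in install logs heapdump; do
        echo "$base/$sub"
    done
}
```

For example, race_paths 2019-07-10-11-42-55 prints the three directories listed above, one per line.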