Google BigQuery Dataset Download Examples

Google BigQuery public datasets site:

  • https://cloud.google.com/bigquery/public-data/

Downloading with the Java client

https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries

  • Enable the BigQuery API (it is enabled by default)

  • In IAM, create a role and add permissions: search for "BigQuery Admin" and add every permission that can be added

https://console.cloud.google.com/iam-admin/iam

  • Create the authentication JSON key

https://cloud.google.com/docs/authentication/getting-started

https://console.cloud.google.com/apis/credentials/serviceaccountkey?_ga=2.123224587.-124782506.1551416872&project=download-dataset&folder&organizationId

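In the "Service account" field, select a service account that has BigQuery permissions, then download the JSON key. This is the file referred to as "credentialFileName.json".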

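  • Add the dependency

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-bigquery</artifactId>
      <version>1.38.0</version>
    </dependency>
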
  • Query table data
package mypackage;

import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.UUID;

public class App {
    public static void main(String... args) throws Exception {
        // Path to the service account key downloaded in the previous step.
        String jsonPath = "/credentialFileName.json";
        GoogleCredentials credentials;
        try (InputStream keyStream = new FileInputStream(jsonPath)) {
            credentials = GoogleCredentials.fromStream(keyStream);
        }

        BigQuery bigquery = BigQueryOptions.newBuilder().setCredentials(credentials).build().getService();

        QueryJobConfiguration queryConfig =
                QueryJobConfiguration.newBuilder(
                        "SELECT * FROM `bigquery-public-data.noaa_gsod.gsod2019` LIMIT 1000")
                        // Use standard SQL syntax for queries.
                        // See: https://cloud.google.com/bigquery/sql-reference/
                        .setUseLegacySql(false)
                        .build();

        // Create a job ID so that we can safely retry.
        JobId jobId = JobId.of(UUID.randomUUID().toString());
        Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());

        // Wait for the query to complete.
        queryJob = queryJob.waitFor();

        // Check for errors.
        if (queryJob == null) {
            throw new RuntimeException("Job no longer exists");
        } else if (queryJob.getStatus().getError() != null) {
            // You can also look at queryJob.getStatus().getExecutionErrors() for all
            // errors, not just the latest one.
            throw new RuntimeException(queryJob.getStatus().getError().toString());
        }

        // Get the results.
        TableResult result = queryJob.getQueryResults();

        // Print all pages of the results.
        // A single field can be read by column name, e.g. row.get("<column>").getStringValue().
        for (FieldValueList row : result.iterateAll()) {
            System.out.println(row.toString());
        }
    }
}
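The program above only prints each row. To keep the downloaded rows, they could be written to a local CSV file instead. The following is a minimal sketch using only the JDK and is not part of the original program: the class name `CsvLines`, its methods, the placeholder rows, and the output file name `gsod2019.csv` are all hypothetical. It assumes each row has already been converted to a list of strings (for example from `FieldValue.getValue()` inside the `iterateAll()` loop), with `null` standing for a SQL NULL.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns rows of string values into CSV text.
final class CsvLines {
    private CsvLines() {}

    // Quotes a value only when it contains a comma, a quote, or a line break.
    static String escape(String value) {
        if (value == null) {
            return ""; // assumption: SQL NULL is written as an empty field
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        // Embedded quotes are doubled, then the whole value is wrapped in quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins one row into a single CSV line (without the line terminator).
    static String toCsvLine(List<String> row) {
        List<String> escaped = new ArrayList<>();
        for (String value : row) {
            escaped.add(escape(value));
        }
        return String.join(",", escaped);
    }

    // Writes all rows to the given file, one CSV line per row.
    static void write(Path target, List<List<String>> rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (List<String> row : rows) {
                writer.write(toCsvLine(row));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder rows; in the real program these would come from the query result.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("station-a", "name, with comma", "1.5"),
                Arrays.asList("station-b", null, "2.5"));
        write(Paths.get("gsod2019.csv"), rows);
    }
}
```

Inside the result loop, each `FieldValueList` row would be converted to a `List<String>` and collected before calling `CsvLines.write`.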

Downloading from the web UI

  • Description

    A public dataset collected by National Climatic Data Center, which contains the daily climatic data collected at their climatic stations around the globe since 1929. We define the schema of the raw table as {metric, station-name, time, value}. The data we use has 350 million rows in total, with three metrics: temperature, wind speed, and dew point.

  • Web interface

  • Try the new UI
  • Browse the public datasets

You may need to create a project first.

  • Weather and Climate -> GHCN Daily
  • ghcn_d.ghcnd_1763
  • Query the table to generate a sample query statement
  • Run the query and save the query results
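A sample query like the one the web UI generates for this table could also be run through the Java client from the earlier section. As a rough sketch, the helper below only builds the standard SQL text. The class `SampleQuery` and its method are hypothetical, and it is an assumption that the table lives in the `bigquery-public-data` project, as the `noaa_gsod` table in the Java example does.

```java
import java.util.Locale;

// Hypothetical helper that builds a preview query for a public table.
final class SampleQuery {
    private SampleQuery() {}

    // Returns a standard SQL statement selecting at most `limit` rows.
    static String selectAll(String project, String dataset, String table, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        // Backticks quote the full table path, as in the Java client example.
        return String.format(Locale.ROOT,
                "SELECT * FROM `%s.%s.%s` LIMIT %d", project, dataset, table, limit);
    }

    public static void main(String[] args) {
        // Assumption: ghcn_d.ghcnd_1763 is hosted in the bigquery-public-data project.
        System.out.println(selectAll("bigquery-public-data", "ghcn_d", "ghcnd_1763", 1000));
    }
}
```

The returned string can be passed to `QueryJobConfiguration.newBuilder(...)` in place of a hard-coded query.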

Downloading via FTP

  • https://cloud.google.com/bigquery/public-data/noaa-gsod

  • Description

    The NOAA dataset (∼800GB) contains global surface weather data from the USAF Climatology Center collected daily from over 9000 stations between 1929 and 2016.

The data can be downloaded directly over FTP: ftp://ftp.ncdc.noaa.gov/pub/data/gsod/

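In each year's directory, every gz file holds the data of a single observation station, and the tar file at the end is the bundle for the whole year, so downloading only the tar files is enough.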

#!/bin/bash
# Download the yearly tar bundle for every year from 1929 to 2016.
for i in $(seq 1929 2016)
do
  wget --execute robots=off --accept=tar -r -np -nH --cut-dirs=4 -R "index.html*" "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/${i}/gsod_${i}.tar"
done
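The same loop could be written in Java, the language of the client example above. This is a sketch under assumptions rather than a tested downloader: the class `GsodTarUrls` and its methods are hypothetical, the URL layout is copied from the script, and the download step relies on the JDK's built-in `ftp:` URL handler and on the server accepting anonymous FTP.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the wget loop.
final class GsodTarUrls {
    private static final String BASE = "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/";

    private GsodTarUrls() {}

    // URL of the yearly bundle, e.g. .../1929/gsod_1929.tar
    static String tarUrl(int year) {
        return BASE + year + "/gsod_" + year + ".tar";
    }

    // URLs for every year in the inclusive range.
    static List<String> tarUrls(int firstYear, int lastYear) {
        List<String> urls = new ArrayList<>();
        for (int year = firstYear; year <= lastYear; year++) {
            urls.add(tarUrl(year));
        }
        return urls;
    }

    // Copies one remote file into targetDir, keeping its file name.
    static Path download(String url, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        String fileName = url.substring(url.lastIndexOf('/') + 1);
        Path target = targetDir.resolve(fileName);
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // With no argument this is a dry run that only lists the URLs;
        // with a directory argument each file is downloaded into it.
        for (String url : tarUrls(1929, 2016)) {
            if (args.length > 0) {
                System.out.println("saved " + download(url, Paths.get(args[0])));
            } else {
                System.out.println(url);
            }
        }
    }
}
```

Running it without arguments lists the 88 yearly URLs, which is a cheap way to check the range before starting a download of this size.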
