bigquery数据类型
Previously in BigQuery Explained, we reviewed BigQuery architecture, storage management, and ingesting data into BigQuery. In this post, we will cover querying datasets in BigQuery using SQL, saving and sharing queries , creating views and materialized views. Let’s get started!
之前在BigQuery Explained中,我们介绍了BigQuery体系结构,存储管理以及将数据提取到BigQuery中。 在本文中,我们将介绍使用SQL在BigQuery中查询数据集,保存和共享查询,创建视图和实例化视图。 让我们开始吧!
标准SQL (Standard SQL)
BigQuery supports two SQL dialects: standard SQL and legacy SQL. Standard SQL is preferred for querying data stored in BigQuery because it’s compliant with the ANSI SQL 2011 standard. It has other advantages over legacy SQL, such as automatic predicate push down for JOIN operations and support for correlated subqueries. Refer to Standard SQL highlights for more information.
BigQuery支持两种SQL方言:标准SQL和传统SQL 。 首选标准SQL查询BigQuery中存储的数据,因为它符合ANSI SQL 2011标准。 与旧版SQL相比,它还有其他优点,例如JOIN操作的自动谓词下推以及对相关子查询的支持。 有关更多信息,请参见标准SQL高亮。
When you run a SQL query in BigQuery, it automatically creates, schedules and runs a query job. BigQuery runs query jobs in two modes: interactive (default) and batch.
在BigQuery中运行SQL查询时,它会自动创建,安排和运行查询作业。 BigQuery以两种模式运行查询作业:交互(默认)和batch 。
Interactive (on-demand) queries are executed as soon as possible, and these queries count towards concurrent rate limit and daily limit.
交互式(按需)查询将尽快执行,并且这些查询将计入并发速率限制和每日限制。
Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. If BigQuery hasn’t started the query within 24 hours, job priority is changed to interactive. Batch queries don’t count towards your concurrent rate limit. They use the same resources as interactive queries.
一旦BigQuery共享资源池中有可用的空闲资源,批处理查询就会排队并开始运行,这通常在几分钟之内发生。 如果BigQuery在24小时之内没有启动查询,则作业优先级将更改为交互式。 批量查询不计入您的并发速率限制。 它们使用与交互式查询相同的资源。
The queries in this post follow Standard SQL dialect and run in interactive mode, unless otherwise mentioned.
除非另有说明,否则本文中的查询均遵循标准SQL方言并以交互模式运行。
BigQuery表类型 (BigQuery Table Types)
Every table in BigQuery is defined by a schema describing the column names, data types, and other metadata. BigQuery supports the following table types:
BigQuery中的每个表格均由描述列名称,数据类型和其他元数据的架构定义。 BigQuery支持以下表格类型:
BigQuery架构 (BigQuery Schemas)
In BigQuery, schemas are defined at the table level and provide structure to the data. Schema describes column definitions with their name, data type, description and mode.
在BigQuery中,架构是在表格级别定义的,并为数据提供结构。 模式描述列定义及其名称,数据类型,描述和模式。
Data types can be simple data types, such as integers, or more complex, such as
ARRAY
andSTRUCT
for nested and repeated values.数据类型可以是简单的数据类型,例如整数,也可以是更复杂的数据类型,例如用于嵌套和重复值的
ARRAY
和STRUCT
。Column modes can be
NULLABLE
,REQUIRED
, orREPEATED
.列模式可以为
NULLABLE
,REQUIRED
或REPEATED
。
Table schema is specified when loading data into the table or when creating an empty table. Alternatively, when loading data, you can use schema auto-detection for self-describing source data formats such as Avro, Parquet, ORC, Cloud Firestore or Cloud Datastore export files. Schema can be defined manually or in a JSON file as shown.
在将数据加载到表中或创建空表时,将指定表架构。 或者,在加载数据时,您可以使用模式自动检测来自描述源数据格式,例如Avro,Parquet,ORC,Cloud Firestore或Cloud Datastore导出文件。 可以手动定义架构,也可以在JSON文件中定义架构,如图所示。
[
{
"description": "[DESCRIPTION]",
"name": "[NAME]",
"type": "[TYPE]",
"mode": "[MODE]"
},
{
"description": "[DESCRIPTION]",
"name": "[NAME]",
"type": "[TYPE]",
"mode": "[MODE]"
}
]
使用SQL进行分析 (Using SQL for Analysis)
Now let’s analyze one of the public BigQuery datasets related to NCAA Basketball games and players using SQL. The game data covers play-by-play and box scores dated back to 2009. We will look at a specific game from the 2014 season between Kentucky’s Wildcats and Notre Dame’s Fighting Irish. This game had an exciting finish. Let’s find out what made it exciting!
现在,让我们分析与使用SQL的NCAA篮球比赛和运动员有关的公共BigQuery数据集之一。 游戏数据涵盖了追溯到2009年的逐项比赛和盒式得分。我们将研究2014赛季肯塔基肯塔基州的野猫队和巴黎圣母院的格斗爱尔兰人之间的特定比赛。 这场比赛令人兴奋。 让我们找出使它令人兴奋的原因!
On your BigQuery Sandbox, open the NCAA Basketball dataset from public datasets. Click on the “VIEW DATASET” button to open the dataset in BigQuery web UI.
在您的BigQuery沙盒上,从公共数据集中打开NCAA Basketball数据集。 点击“查看数据集”按钮,以在BigQuery Web UI中打开数据集。
Navigate to table mbb_pbp_sr
under ncaa_basketball
dataset to look at the schema. This table has play-by-play information of all men’s basketball games in the 2013–2014 season, and each row in the table represents a single event in a game.
导航到ncaa_basketball
数据集下的表mbb_pbp_sr
,以查看模式。 该表具有2013-2014赛季所有男篮比赛的逐项比赛信息,并且表中的每一行代表一场比赛中的单个事件。
Check the Details section of table mbb_pbp_sr
. There are ~4 million game events with a total volume of ~3GB.
检查表mbb_pbp_sr
的“详细信息”部分。 有大约400万个游戏事件,总容量约为3GB。
Let’s run a query to filter the events for the game we are interested in. The query selects the following columns from the mbb_pbp_sr
table:
让我们运行一个查询来过滤我们感兴趣的游戏的事件。该查询从mbb_pbp_sr
表中选择以下列:
game_clock
: Time left in the game before the finishgame_clock
:游戏结束前剩余的时间points_scored
: Points were scored in an eventpoints_scored
:事件中得分team_name
: Name of the team who scored the pointsteam_name
:获得积分的球队名称event_description
: Description about the eventevent_description
:有关事件的描述timestamp
: Time when the event occurredtimestamp
:事件发生的时间
SELECT
game_clock,
points_scored,
team_name,
event_description
FROM
`bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
WHERE
season = 2014
AND home_name = 'Wildcats'
AND away_name = 'Fighting Irish'
AND points_scored IS NOT NULL
ORDER BY
timestamp DESC
LIMIT 10;
A breakdown of what this query is doing:
此查询的工作细目:
The
SELECT
statement retrieves the rows and the specified columnsFROM
the tableSELECT
语句FROM
表中检索行和指定的列The
WHERE
clause filters the rows returned bySELECT
. This query filters to return rows for the specific game we are interested in.WHERE
子句过滤SELECT
返回的行。 此查询过滤器返回我们感兴趣的特定游戏的行。The
ORDER BY
statement controls the order of rows in the result set. This query sorts the rows resulting fromSELECT
by timestamp in descending order.ORDER BY
语句控制结果集中的行顺序。 此查询按时间戳按降序对SELECT
产生的行进行排序。Finally, the
LIMIT
constraints the amount of data returned from the query. This query returns 10 events from the results set after the rows are sorted. Note that addingLIMIT
does not reduce the amount of data processed by the query engine.最后,
LIMIT
限制了从查询返回的数据量。 在对行进行排序之后,此查询从结果集中返回10个事件。 请注意,添加LIMIT
不会减少查询引擎处理的数据量。
Now let’s look at the results.
现在让我们看一下结果。
Query results from the analysis 查询分析结果From the results, it appears the player Andrew Harrison made two free throws scoring 2 points with only 6 seconds remaining in the game. This doesn’t tell us much except there were points scored towards the very end of the game.
从结果看,球员安德鲁·哈里森(Andrew Harrison)罚球两次,得到2分,比赛还剩6秒。 除了在比赛即将结束时得分以外,这并不能告诉我们太多。
Tip: Avoid using SELECT *
in the query. Instead query only the columns needed. To exclude only certain columns use SELECT * EXCEPT
.
提示:避免在查询中使用SELECT *
。 而是仅查询所需的列。 要仅排除某些列,请使用SELECT * EXCEPT
。
Let’s modify the query to include cumulative sum of scores for each team rolling up to the event time, using analytic (window) functions. Analytic functions computes aggregates for each row over a group of rows defined by a window whereas aggregate functions compute a single aggregate value over a group of rows.
让我们使用分析(窗口)功能修改查询,以包括累积到活动时间的每个团队的累积分数总和。 解析函数计算窗口定义的一组行中每一行的聚合,而聚合函数计算一组行中的单个聚合值。
Run the below query with two new columns added — wildcats_score
and fighting_irish_score
, calculated on-the-fly using points_scored
column.
运行下面的查询添加了两个新列- wildcats_score
和fighting_irish_score
,计算使用的即时points_scored
列。
SELECT
game_clock,
SUM(
CASE
WHEN team_name = 'Wildcats' THEN points_scored
END
) OVER(ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS wildcats_score,
SUM(
CASE
WHEN team_name = 'Fighting Irish' THEN points_scored
END
) OVER(ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS fighting_irish_score,
team_name,
event_description
FROM
`bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
WHERE
season = 2014
AND home_name = 'Wildcats'
AND away_name = 'Fighting Irish'
AND points_scored IS NOT NULL
ORDER BY
timestamp DESC
LIMIT 10;
A breakdown of what this query is doing:
此查询的工作细目:
Calculate cumulative
SUM
of scores by each team in the game — specified byCASE
statement计算游戏中每个团队的累积得分
SUM
-由CASE
语句指定SUM
is calculated on scores in the window defined withinOVER
clauseSUM
是根据OVER
定义的窗口中的分数计算得出的 条款OVER
clause references a window (group of rows) to useSUM
OVER
子句引用一个窗口(一组行)以使用SUM
ORDER BY
is part of window specification that defines sort order within a partition. This query orders rows bytimestamp
ORDER BY
是窗口规范的一部分,该窗口规范定义了分区内的排序顺序。 该查询按timestamp
行进行timestamp
Define the window frame from the start of the game specified by
UNBOUNDED PRECEDING
to theCURRENT ROW
over which the analytic functionSUM()
is evaluated.从指定的游戏开始定义窗口框架
UNBOUNDED PRECEDING
到CURRENT ROW
在其上分析函数SUM()
进行评价。
From the results, we can see how the game ended. The Fighting Irish held the lead by four points with 04:28
minutes remaining. Karl-Anthony Towns of Wildcats was able to tie the game on a layup with 01:12
minutes remaining, and Andrew Harrison made two free throws with 00:06
seconds remaining, setting the stage for the Wildcats’ win. That was a nail biting finish indeed!
从结果中,我们可以看到游戏如何结束。 爱尔兰格斗队以4分的优势领先,还剩下04:28
分钟。 野猫队的Karl-Anthony Towns能够在比赛还剩01:12
分钟的情况下扳平比分,而安德鲁·哈里森在还剩00:06
秒的时间内罚了两罚,为野猫队的获胜奠定了基础。 那确实是一个令人a舌的结局!
In addition to aggregate and analytic functions, BigQuery also supports functions and operators such as string manipulation, date/time, mathematical functions, JSON extract and more, as shown.
如图所示,除了聚合和分析功能外,BigQuery还支持函数和运算符,例如字符串操作,日期/时间,数学函数,JSON提取等。
BigQuery SQL Functions BigQuery SQL函数Refer BigQuery SQL function reference for the complete list. In the upcoming posts, we will cover other advanced query features in BigQuery such as User Defined Functions, Spatial functions and more.
有关完整列表,请参阅BigQuery SQL函数参考。 在接下来的帖子中,我们将介绍BigQuery中的其他高级查询功能,例如用户定义函数,空间函数等。
BigQuery SQL查询的寿命 (The Life of a BigQuery SQL Query)
Under the hood when you run the SQL query it does the following:
在后台运行SQL查询时,它会执行以下操作:
A QueryJob is submitted to the BigQuery service. As reviewed in the BigQuery architecture, the BigQuery compute is decoupled from the BigQuery storage, and they are designed to work together to organize the data to make queries efficient over huge datasets.
QueryJob提交到BigQuery服务。 如BigQuery体系结构中所述,BigQuery计算与BigQuery存储是分离的,它们被设计为一起工作以组织数据以提高对庞大数据集的查询效率。
- Each query executed is broken up into stages which are then processed by workers (slots) and written back out to Shuffle. Shuffle provides resilience to failures within workers themselves, say a worker were to have an issue during query processing. 每个执行的查询都分为几个阶段,然后由工作程序(插槽)进行处理,然后写回Shuffle。 Shuffle提供了对工作人员自身故障的恢复能力,例如,工作人员在查询处理期间会遇到问题。
BigQuery engine utilizes BigQuery’s columnar storage format to scan only the required columns to run the query. One of the best practices to control costs is to query only the columns that you need.
BigQuery引擎利用BigQuery的列存储格式来仅扫描运行查询所需的列。 控制成本的最佳实践之一是仅查询所需的列。
- After the query execution is completed, the query service persists the results into a temporary table, and the web UI displays that data. You can also request to write results into a permanent table. 查询执行完成后,查询服务会将结果持久保存到临时表中,然后Web UI会显示该数据。 您还可以请求将结果写入永久表。
保存和共享查询 (Saving and Sharing Queries)
保存查询(Saving Queries)
Now that you have run SQL queries to perform analysis, how would you save those results? BigQuery writes all query results to a table. The table is either explicitly identified by the user as a destination table or a temporary cached results table. This temporary table is stored for 24 hours, so if you run the exact same query again (exact string match), and if the results would not be different, then BigQuery will simply return a pointer to the cached results. Queries that can be served from the cache do not incur any charges. Refer to the documentation to understand limitations and exceptions to query caching.
既然您已经运行了SQL查询来执行分析,那么如何保存这些结果? BigQuery将所有查询结果写入表格。 用户将该表明确标识为目标表或临时缓存的结果表。 该临时表将存储24小时,因此,如果您再次运行完全相同的查询(完全匹配字符串),并且如果结果没有不同,则BigQuery只会返回指向缓存结果的指针。 可以从缓存中提供的查询不会产生任何费用。 请参阅文档以了解查询缓存的限制和例外。
You can view cached query results from the Query History tab on BigQuery UI. This history includes all queries submitted by you to the service, not just those submitted via the web UI.
您可以从BigQuery UI的“查询历史记录”选项卡中查看缓存的查询结果。 此历史记录包括您向服务提交的所有查询,而不仅仅是通过Web UI提交的查询。
Query History 查询历史You can disable retrieval of cached results from the query settings when executing the query. This requires BigQuery to compute the query result, which will result in charges for the query to execute. This is typically used in benchmarking scenarios, such as in the previous post comparing performance of partitioned and clustered tables against non-partitioned tables.
执行查询时,可以禁用从查询设置中检索缓存的结果。 这需要BigQuery计算查询结果,这将导致查询执行产生费用。 这通常用于基准测试方案中,例如在先前的文章中比较了分区表和集群表与未分区表的性能。
You can also request the query write to a destination table. You will have control on when the table is deleted. Because the destination table is permanent, you will be charged for the storage of the results.
您还可以请求将查询写入目标表。 您可以控制何时删除表。 由于目标表是永久性的,因此需要为存储结果付费。
共享查询 (Sharing Queries)
BigQuery allows you to share queries with others. When you save a query, it can be private (visible only to you), shared at the project level (visible to project members), or public (anyone can view it).
BigQuery允许您与他人共享查询。 保存查询时,查询可以是私有的(仅对您可见),在项目级别共享(对项目成员可见)或公共(任何人都可以查看)。
Check this video to know how to save and share your queries in BigQuery.
观看此视频,了解如何在BigQuery中保存和共享您的查询。
演示地址
标准视图(Standard Views)
A view is a virtual table defined by a SQL query. A view has properties similar to a table and can be queried as a table. The schema of the view is the schema that results from running the query. The query results from the view contain data only from the tables and fields specified in the query that defines the view.
视图是由SQL查询定义的虚拟表。 视图的属性类似于表,可以作为表查询。 视图的架构是运行查询所产生的架构。 视图的查询结果仅包含定义视图的查询中指定的表和字段的数据。
You can create a view by saving the query from the BigQuery UI using the “Save View” button or using BigQuery DDL — CREATE VIEW
statement. Saving a query as a view does not persist results aside from the caching to a temporary table, which expires within a 24 hour window. The behavior is similar to the query executed on the tables.
你可以创建一个视图-通过使用“保存视图”按钮保存来自BigQuery的UI查询或使用BigQuery的DDL CREATE VIEW
语句。 除了将查询缓存为临时表外,将查询另存为视图不会保留结果,该临时表会在24小时内过期。 该行为类似于在表上执行的查询。
何时使用标准视图?(When to use standard Views?)
- Let’s say you want to expose queries with complex logic to your users and you want to avoid the users needing to remember the logic, then you can roll those queries into a view. 假设您要向用户公开具有复杂逻辑的查询,并且希望避免用户需要记住逻辑,然后可以将这些查询滚动到视图中。
Another use case is — views can be placed into datasets and offer fine-grained access controls to share dataset with specific users and groups without giving them access to underlying tables. These views are called authorized views. We will look into securing and accessing datasets in detail in a future post.
另一个用例是-可以将视图放入数据集中,并提供细粒度的访问控制,以与特定用户和组共享数据集,而无需授予他们访问基础表的权限。 这些视图称为授权视图。 在以后的文章中,我们将详细研究如何保护和访问数据集。
Refer to BigQuery documentation for creating and managing standard views.
请参阅BigQuery文档,以创建和管理标准视图。
物化视图 (Materialized Views)
BigQuery supports Materialized Views (MV) — a beta feature. MV are precomputed views that periodically cache results of a query for increased performance and efficiency. Queries using MV are generally faster and consume less resources than queries retrieving the same data only from the base table. They can significantly boost performance of workloads with common and repeated queries.
BigQuery支持实例化视图(MV) -Beta功能。 MV是预先计算的视图,可定期缓存查询结果以提高性能和效率。 与仅从基表中检索相同数据的查询相比,使用MV的查询通常更快,消耗的资源更少。 它们可以通过常见和重复查询大大提高工作负载的性能。
Following are the key features of materialized views:
以下是实例化视图的主要功能:
Zero Maintenance
零维护
BigQuery leverages precomputed results from MV and whenever possible reads only delta changes from the base table to compute up-to-date results. It automatically synchronizes data refreshes with data changes in base tables. No user inputs required. You also have the option to trigger manual refresh the views to control the costs of refresh jobs.
BigQuery利用MV预先计算的结果,并尽可能从基表中读取增量变化以计算最新结果。 它自动将数据刷新与基表中的数据更改同步。 无需用户输入。 您还可以选择触发手动刷新视图,以控制刷新作业的成本。
Always fresh
永远新鲜
- MV is always consistent with the source table. They can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables. MV始终与源表一致。 它们可以直接查询,也可以由BigQuery优化器用来处理对基表的查询。
Smart Tuning
智能调优
- MV supports query rewrites. If a query against the source table can instead be resolved by querying the MV, BigQuery will rewrite (reroute) the query to the MV for better performance and/or efficiency. MV支持查询重写。 如果可以通过查询MV来解决针对源表的查询,则BigQuery会将查询重写(重新路由)到MV,以提高性能和/或效率。
You create MV using BigQuery DDL — CREATE MATERIALIZED VIEW
statement.
您可以使用BigQuery DDL- CREATE MATERIALIZED VIEW
语句创建MV。
何时使用物化视图? (When to use Materialized Views?)
MV are suited for cases when you need to query the latest data while cutting down latency and cost by reusing the previously computed results. MV act as pseudo-indexes, accelerating queries to the base table without updating any existing workflows.
MV适用于需要查询最新数据,同时通过重用先前计算的结果来减少延迟和成本的情况。 MV充当伪索引,可在不更新任何现有工作流程的情况下加速对基表的查询。
物化视图的局限性 (Limitations of Materialized Views)
- As of this writing, joins are not currently supported in MV. However, you can leverage MV in a query that does aggregation on top of joins. This reduces cost and latency of the query. 在撰写本文时,MV当前不支持联接。 但是,您可以在查询中利用MV进行查询,该查询在联接之上进行汇总。 这样可以减少成本和查询延迟。
MV supports a limited set of aggregation functions and restricted SQL. Refer the query patterns supported by materialized views..
MV支持一组有限的聚合功能和受限制SQL。 请参考实例化视图支持的查询模式。
Refer to BigQuery documentation for working with materialized views, and best practices.
有关使用实例化视图和最佳实践的信息,请参阅BigQuery文档。
接下来是什么? (What Next?)
In this post, we reviewed the lifecycle of a SQL query in BigQuery, working with window functions, creating standard and materialized views, saving and sharing queries.
在本文中,我们回顾了BigQuery中SQL查询的生命周期,使用窗口函数,创建标准视图和实例化视图,保存和共享查询。
BigQuery Functions reference
BigQuery函数参考
Analytic (window) functions in BigQuery
BigQuery中的分析(窗口)功能
Standard views in BigQuery [Docs]
BigQuery中的标准视图[文档]
Materialized views in BigQuery [Docs]
BigQuery中的物化视图[文档]
Saving and sharing queries [Video] [Docs]
保存和共享查询[视频] [文档]
BigQuery best practices for query performance
BigQuery查询性能最佳做法
Codelab
编解码器
Try this codelab to query a large public dataset based on Github archives.
尝试使用此代码实验室来查询基于Github档案的大型公共数据集。
In the next post, we will dive into joins, optimizing join patterns and denormalizing data with nested and repeated data structures.
在下一篇文章中,我们将深入研究联接,优化联接模式以及使用嵌套和重复的数据结构对数据进行规范化。
Stay tuned. Thank you for reading! Have a question or want to chat? Find me on Twitter or LinkedIn.
敬请关注。 感谢您的阅读! 有问题或想聊天? 在Twitter或LinkedIn上找到我。
Thanks to Yuri Grinshsteyn and Alicia Williams for helping with the post.
感谢Yuri Grinshsteyn和Alicia Williams为这个职位提供的帮助。
翻译自: https://medium.com/google-cloud/bigquery-explained-querying-your-data-9e017f2714a3
bigquery数据类型