You can load data directly into a DataFrame, and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first.
# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile("file:/var/log/system.log")
allSysLogs = sc.textFile("file:/var/log/system.log*")
allLogs = sc.textFile("file:/var/log/*.log")
# let's count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254
That’s great, but you can’t query this yet. You’ll need to convert the data to Rows, add a schema, and convert it to a DataFrame.
Once the data is at least a DataFrame with a schema, you can talk SQL to it.
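For reference, here is a minimal sketch of that Row conversion, assuming each raw log line simply becomes a one-field Row named log (this is one way the logsRDD used below could be built):
# hedged sketch: wrap each raw log line in a single-field Row
from pyspark.sql import Row
logsRDD = oneSysLog.map(lambda line: Row(log=line))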
# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
But, you can also load certain types of data and store it directly as a DataFrame. This allows you to get to SQL quickly. Both JSON and Parquet formats can be loaded as a DataFrame straightaway because they contain enough schema information to do so.
# load parquet straight into DF, and write some SQL
logsDF = spark.read.parquet("file:/logs.parquet")
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
In fact, Spark SQL now even has support for querying Parquet files directly. Easy peasy!
# query the parquet file directly with SQL
>>> spark.sql("""
...   SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1
... """).show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
That’s one aspect of loading data. The other aspect is using the protocols for cloud storage (i.e. s3://). In some cloud ecosystems, support for their storage protocol comes installed already.
# i.e. on AWS EMR, s3:// is installed already.
sc = spark.sparkContext
decemberLogs = sc.textFile("s3://acme-co/logs/2016/12/")
# count the lines in all of the december 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur poplar.
Sometimes you actually need to provide support for those protocols if your VM’s OS doesn’t have it already.
my-linux-shell$ pyspark \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
  demo2.py
>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
>>> rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.
Now you should have several ways to load data to quickly start writing SQL with Apache Spark.
Comparing SQL and the DataFrame API: the differences between them
What is a DataFrame? You can think of DataFrames as RDDs with a schema. Note: “A DataFrame is just a type alias for Dataset of Row” — Databricks.
Why a DataFrame over an RDD? Catalyst optimization and schemas.
What kind of data can DataFrames handle? Text, JSON, XML, Parquet, and more.
What can I do with a DataFrame? Use SQL-like functions and actual SQL. You can also apply schemas to your data and benefit from the performance enhancements of the Catalyst optimizer.
Still Catalyst-optimized: both SQL and the API functions on DataFrames sit atop Catalyst.
DataFrame functions provide a bridge between the two flavors of the Spark APIs.
SQL with DataFrames gives you a familiar way to interact with the data.
SQL-like functions in the DataFrame API: for many of the expected features of SQL, there are similar functions in the DataFrame API that do practically the same thing, allowing for .functional().chaining() — see the sketch below.
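For example, the same filter-and-aggregate can be written either way (a sketch only; the logs view and log column come from the earlier examples, and the error filter is made up):
# DataFrame API, chained
from pyspark.sql import functions as F
logsDF.where(F.col("log").like("%error%")) \
      .groupBy(F.substring("log", 1, 6).alias("day")) \
      .count() \
      .orderBy(F.desc("count")) \
      .show()
# the equivalent SQL
spark.sql("""
  SELECT substring(log, 1, 6) AS day, COUNT(*) AS count
  FROM logs
  WHERE log LIKE '%error%'
  GROUP BY substring(log, 1, 6)
  ORDER BY count DESC
""").show()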
Schemas: implicit and explicit schemas explained, and data types
Schemas can be inferred, i.e. guessed, by spark. With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema.
# load as RDD and map it to a row with multiple fields
from pyspark.sql import Row

rdd = sc.textFile("file:/people.txt")

def mapper(line):
    s = line.split("|")
    return Row(id=s[0], name=s[1], company=s[2], state=s[4])

peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)

# we didn't actually pass anything into that 2nd param.
# yet, behind the scenes, there's still a schema.
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
Spark SQL can certainly handle queries where id is a string, but what if we don’t want it to be? Say it should be an int.
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile("file:/people.txt")

def mapper(line):
    s = line.split("|")
    return Row(id=int(s[0]), name=s[1], company=s[2], state=s[4])

peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
You can actually provide a schema, too, which will be more authoritative.
# load as RDD and map it to a row with multiple fields
import pyspark.sql.types as types

rdd = sc.textFile("file:/people.txt")

def mapper(line):
    s = line.split("|")
    # use a positional Row (and cast id) so the values line up with the explicit schema below
    return Row(int(s[0]), s[1], s[2], s[4])

schema = types.StructType([
    types.StructField('id', types.IntegerType(), False),
    types.StructField('name', types.StringType()),
    types.StructField('company', types.StringType()),
    types.StructField('state', types.StringType())
])

peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD, schema)
Gotcha alert: Spark doesn’t seem to care when you leave dates as strings.

# Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql("""
  SELECT * FROM NewHires n WHERE n.start_date > "2016-01-01"
""").show()
Now you know about inferred and explicit schemas, and the available types you can use.
Loading data and saving results
Loading and saving is fairly straightforward. Save your DataFrames in your desired format.
# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet", format="parquet")
# format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text
When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema.
# read from stored parquet
peopleDF = spark.read.parquet("s3://acme-co/people.parquet")

# read from stored JSON
peopleDF = spark.read.json("s3://acme-co/people.json")
Spark 1.6 and earlier: limited support for subqueries and various other noticeable SQL functionality; runs roughly half of the 99 TPC-DS benchmark queries; more SQL support was available through HiveContext.

Spark 2.0, in Databricks’ words:
SQL2003 support
Runs all 99 of TPC-DS benchmark queries
A native SQL parser that supports both ANSI-SQL as well as Hive QL
Native DDL command implementations
Subquery support, including
Uncorrelated Scalar Subqueries
Correlated Scalar Subqueries
NOT IN predicate Subqueries (in WHERE/HAVING clauses)
IN predicate subqueries (in WHERE/HAVING clauses)
(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
View canonicalization support
In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
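To make the subquery support concrete, a query like the following now runs in Spark 2.0 (a sketch over hypothetical employees and departments temp views):
# hypothetical tables, for illustration only
spark.sql("""
  SELECT e.name
  FROM employees e
  WHERE EXISTS (SELECT 1
                FROM departments d
                WHERE d.id = e.dept_id AND d.region = 'EMEA')
""").show()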
ETL with SQL
These are some of the practices we learned to adopt after working with Spark for a while.
Tip 1: In production, break your applications into smaller apps as steps. I.e. “Pipeline pattern”
Tip 2: When tinkering locally, save a small version of the dataset via Spark and test against that (see the sketch after these tips).
Tip 3: If using EMR, create a cluster with your desired steps, prove it works, then export a CLI command to reproduce it, and run it in Data Pipeline to start recurring pipelines / jobs.
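As a sketch of Tip 2 (the fullDF name and output path are made up): cut a small sample once, write it locally, and develop against that.
smallDF = fullDF.sample(withReplacement=False, fraction=0.01).limit(10000)
smallDF.write.mode("overwrite").parquet("file:/tmp/sample.parquet")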
Working with JSON data
JSON data is most easily read in as line-delimited JSON objects*:

{"n": "sarah", "age": 29}
{"n": "steve", "age": 45}
Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done. Access arrays with inline array syntax
SELECT col[1], col[3] FROM json
If you want to flatten your JSON data, use the explode method (works in both the DF API and SQL).

# json explode example
>>> spark.read.json("file:/json.explode.json").createOrReplaceTempView("json")
>>> spark.sql("SELECT * FROM json").show()
+----+------------+
|   x|           y|
+----+------------+
|row1| [1,2,3,4,5]|
|row2|[6,7,8,9,10]|
+----+------------+
>>> spark.sql("SELECT x, explode(y) FROM json").show()
+----+---+
|   x|col|
+----+---+
|row1|  1|
|row1|  2|
|row1|  3|
|row1|  4|
|row1|  5|
|row2|  6|
|row2|  7|
|row2|  8|
|row2|  9|
|row2| 10|
+----+---+
Access nested objects with dot syntax:

SELECT field.subfield FROM json

For multi-line JSON files, you’ve got to do much more:
import re

# a list of data from files; each tuple is (path, jsonData)
files = sc.wholeTextFiles("data.json")
rawJSON = files.map(lambda x: x[1])

# sanitize the data
cleanJSON = rawJSON.map(
    lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE)
)

# finally, you can then read that in as "JSON"
spark.read.json(cleanJSON)

# PS -- the same goes for XML.
Reading from and writing to external databases
To read from an external database, you’ve got to have your JDBC connector (jar) handy. In order to pass a jar package into Spark, you’d use the --jars flag when starting pyspark or spark-submit.
# launching pyspark with a JDBC connector jar
my-linux-shell$ pyspark \
  --jars /path/to/mysql-jdbc.jar
# (alternatively, use --packages with the connector's Maven coordinates)

# note: you can also add the path to your jar in the spark defaults config
# file under these settings:
spark.driver.extraClassPath
spark.executor.extraClassPath
Once you’ve got your connector jars successfully imported, now you can read an existing database into your spark application or spark shell as a dataframe.
# line broken for readability
sqlURL = ("jdbc:mysql://<db-host>:<port>"
          "?user=<user>&password=<pass>&rewriteBatchedStatements=true"
          "&continueBatchOnError=true")
df = spark.read.jdbc(url=sqlURL, table="<table-name>")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()
If you’ve done some work and created or manipulated a DataFrame, you can write it to a database using the df.write.jdbc method. Be prepared: it can take a while.
Also, be warned, save modes in Spark can be a bit destructive. “Overwrite” doesn’t just overwrite your data, it overwrites your schema too. Say goodbye to those precious indices.
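Here is a minimal sketch of the write side, reusing the sqlURL from the read example (the table name, mode, and driver property are assumptions):
# "append" sidesteps the schema-dropping behavior of "overwrite"
df.write.jdbc(url=sqlURL, table="processed_logs", mode="append",
              properties={"driver": "com.mysql.jdbc.Driver"})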
Testing your SQL in a real environment
If testing locally, do not load data from S3 or other similar types of cloud storage. Construct your applications as much as you can in advance; cloud clusters are expensive. In the cloud, you can test a lot of your code reliably with a 1-node cluster. Get really comfortable using .parallelize() to create dummy data. If you’re using big data and many nodes, don’t use .collect() unless you intend to pull everything back onto the driver.
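A quick sketch of the .parallelize() idea (the schema here is made up):
from pyspark.sql import Row
dummyRDD = sc.parallelize([Row(id=i, name="user_%d" % i, state="CA") for i in range(1000)])
dummyDF = spark.createDataFrame(dummyRDD)
dummyDF.createOrReplaceTempView("dummy")
spark.sql("SELECT state, COUNT(*) FROM dummy GROUP BY state").show()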
How to use the pyspark.sql interfaces
(See the official pyspark.sql API documentation.)
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)
SparkSession.createDataFrame(data, schema=None) creates a DataFrame from an RDD, a list, or a pandas.DataFrame. The data parameter is the input; if it is a list, each element becomes one row. schema can be a pyspark.sql.types.DataType, a datatype string, or a list of column names; the default is None.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pandas

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

a = [('Alice', 1)]                 # from a list; each tuple becomes one row
a_dataframe = spark.createDataFrame(a)
print(a_dataframe.collect())
a_dataframe = spark.createDataFrame(a, ['name', 'age'])   # specify the column names
print(a_dataframe.collect())
d = [{'name': 'Alice', 'age': 1}]  # from a list of dicts, so the keys become column names
d_dataframe = spark.createDataFrame(d)
print(d_dataframe.collect())
rdd = sc.parallelize(a)            # from an RDD
df = spark.createDataFrame(rdd)
print(df.collect())
# specify each column's data type with the pyspark.sql types
schema = StructType([StructField('name', StringType(), True),
                     StructField('age', IntegerType(), True)])
df3 = spark.createDataFrame(rdd, schema)
print(df3.collect())
# or give names and types directly as a DDL-style string
print(spark.createDataFrame(rdd, 'name: string, age: int').collect())
# from a pandas.DataFrame; .toPandas() converts a Spark DataFrame to pandas
print(spark.createDataFrame(df.toPandas()).collect())
print(spark.createDataFrame(pandas.DataFrame([['age', 1]])).collect())
The results look like:
[Row(_1='Alice', _2=1)]
[Row(name='Alice', age=1)]
Creating a DataFrame from a JSON file
# spark is an existing SparkSession
df = spark.read.json("/home/qjzh/miniconda/envs/water_meter2/projects/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
The pyspark.sql.functions module documents its helpers roughly as follows.

Column and sort helpers:
- lit: creates a Column of literal value.
- col / column: returns a Column based on the given column name.
- asc / desc: returns a sort expression based on the ascending / descending order of the given column name.
- upper / lower: converts a string expression to upper / lower case.
- sqrt: computes the square root of the specified float value.
- abs: computes the absolute value.

Aggregate functions:
- max / min: returns the maximum / minimum value of the expression in a group.
- first / last: returns the first / last value in a group.
- count: returns the number of items in a group.
- sum: returns the sum of all values in the expression.
- avg / mean: returns the average of the values in a group.
- sumDistinct: returns the sum of distinct values in the expression.
- stddev / stddev_samp: returns the unbiased sample standard deviation of the expression in a group.
- stddev_pop: returns the population standard deviation of the expression in a group.
- variance / var_pop: returns the population variance of the values in a group.
- var_samp: returns the unbiased variance of the values in a group.
- skewness: returns the skewness of the values in a group.
- kurtosis: returns the kurtosis of the values in a group.
- collect_list: returns a list of objects with duplicates.
- collect_set: returns a set of objects with duplicate elements eliminated.

Math functions:
- acos: computes the cosine inverse of the given value; the returned angle is in the range 0.0 through pi.
- asin: computes the sine inverse of the given value; the returned angle is in the range -pi/2 through pi/2.
- atan: computes the tangent inverse of the given value.
- cbrt: computes the cube-root of the given value.
- ceil / floor: computes the ceiling / floor of the given value.
- cos / cosh: computes the cosine / hyperbolic cosine of the given value.
- sin / sinh: computes the sine / hyperbolic sine of the given value.
- tan / tanh: computes the tangent / hyperbolic tangent of the given value.
- exp / expm1: computes the exponential of the given value / the exponential minus one.
- log / log10 / log1p: computes the natural logarithm, the base-10 logarithm, and the natural logarithm of the value plus one.
- rint: returns the double value that is closest in value to the argument and is equal to a mathematical integer.
- signum: computes the signum of the given value.
- toDegrees / toRadians: converts an angle measured in radians / degrees to an approximately equivalent angle measured in degrees / radians.
- bitwiseNOT: computes bitwise not.

Math functions that take two arguments as input:
- atan2: returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
- hypot: computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
- pow: returns the value of the first argument raised to the power of the second argument.

Window functions:
- row_number: returns a sequential number starting at 1 within a window partition (rowNumber is deprecated since 1.6).
- dense_rank: returns the rank of rows within a window partition, without any gaps (denseRank is deprecated since 1.6).
- rank: returns the rank of rows within a window partition. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties: if three people tie for second place, all three are in second place and the next person comes in third. rank is equivalent to the RANK function in SQL.
- cume_dist: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row (cumeDist is deprecated since 1.6).
- percent_rank: returns the relative rank (i.e. percentile) of rows within a window partition (percentRank is deprecated since 1.6).
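To make the window functions concrete, here is a hedged sketch (the department and salary columns are hypothetical):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# rank rows within each department by descending salary
w = Window.partitionBy("department").orderBy(F.desc("salary"))
ranked = df.withColumn("salary_rank", F.dense_rank().over(w))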
------
avg(*cols): computes the average value for each numeric column of each group. mean is an alias for avg. Parameters: cols – a list of column names (strings); non-numeric columns are ignored.
fillna(value, subset=None): replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other. Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
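A small sketch of fillna in both forms (the column names are hypothetical):
df.fillna(0)                              # one value for every compatible column
df.fillna({"age": 0, "name": "unknown"})  # per-column replacement values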
join(other, on=None, how=None) parameters:
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti.
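A sketch of the two common join forms (df1, df2, and the columns are hypothetical):
df1.join(df2, on="id", how="inner")                      # equi-join on a shared column name
df1.join(df2, df1.id == df2.user_id, how="left_outer")   # explicit join expression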
union(other): returns a new DataFrame containing the union of rows in this and another DataFrame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
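For example (df1 and df2 are hypothetical):
allRows = df1.union(df2)                 # UNION ALL semantics
dedupedRows = df1.union(df2).distinct()  # SQL-style UNION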
Adding, deleting, modifying, and querying
PySpark DataFrame methods for adding, deleting, modifying, and querying data:
df.dropna(how='any', thresh=None, subset=None) (also available as df.na.drop)
Functionality for working with missing data in a DataFrame. Parameters:
how – 'any' or 'all'. If 'any', drop a row if it contains any nulls. If 'all', drop a row only if all its values are null.
thresh – int, default None. If specified, drop rows that have fewer than thresh non-null values. This overwrites the how parameter.
subset – optional list of column names to consider.
df.fillna(value, subset=None) (also available as df.na.fill)
df.replace(to_replace, value, subset=None)
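A short sketch of these missing-data helpers (column names are hypothetical):
df.dropna(how="any", subset=["name", "age"])   # drop rows with nulls in these columns
df.na.fill({"age": 0})                         # same effect as df.fillna({"age": 0})
df.replace("N/A", "unknown", subset=["name"])  # swap a sentinel value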
df.withColumn(colName, col)
Returns a new DataFrame by adding a new column, or replacing an existing column that has the same name. colName is a string, the name of the new column. col is a Column expression for the new column. The expression must operate on this same DataFrame; trying to pull in a column from a different DataFrame may fail with AssertionError: col should be Column. The col expression can, however, combine several existing columns, as in the sketch below.
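For example (price and quantity are hypothetical columns of the same DataFrame):
from pyspark.sql import functions as F
df2 = df.withColumn("total", F.col("price") * F.col("quantity"))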
pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration=None, startTime=None): bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
The time column must be of pyspark.sql.types.TimestampType.
Durations are provided as strings, e.g. ‘1 second’, ‘1 day 12 hours’, ‘2 minutes’. Valid interval strings are ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’. If the slideDuration is not provided, the windows will be tumbling windows.
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15… provide startTime as 15 minutes.
The output column will be a struct called ‘window’ by default with the nested columns ‘start’ and ‘end’, where ‘start’ and ‘end’ will be of pyspark.sql.types.TimestampType.
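A hedged sketch of time-window bucketing (event_time and value are hypothetical columns, with event_time of TimestampType):
from pyspark.sql import functions as F
windowed = (df.groupBy(F.window("event_time", "10 minutes", "5 minutes"))
              .agg(F.sum("value").alias("total")))
windowed.select("window.start", "window.end", "total").show()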