一个散步者的梦

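PySpark API Reference

This content is a translation and extension of the official PySpark Spark SQL documentation.

Official documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html

For details, call help() or dir() on an object in your program to see its documentation and attributes. In IPython or Jupyter you can also append ? to an object to view its help, for example SparkSession.builder? (this is an IPython feature, not part of standard Python).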

Spark Session

The main entry point for Spark SQL programs

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.sql.shuffle.partitions",24)\
    .config("spark.executor.cores",8)\
    .config("spark.executor.memory","16g")\
    .config("spark.executor.instances",1)\
    .config("spark.driver.memory","4g")\
    .appName('pyspark')\
    .getOrCreate()

`SparkSession.builder.appName`(name)	Sets the application name.
`SparkSession.builder.config`([key, value, …])	Sets a configuration option.
`SparkSession.builder.enableHiveSupport`()	Enables Hive support, so that Hive tables can be read.
`SparkSession.builder.getOrCreate`()	Gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder.
`SparkSession.builder.master`(master)	Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.
`SparkSession.builder.remote`(url)	Sets the Spark remote URL to connect to, such as “sc://host:port” to run it via Spark Connect server.
`SparkSession.catalog`	Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
`SparkSession.conf`	Runtime configuration interface for Spark.
`SparkSession.createDataFrame`(data[, schema, …])	Creates a `DataFrame`.
`SparkSession.getActiveSession`()	Returns the active SparkSession for the current thread, as returned by the builder.
`SparkSession.newSession`()	Returns a new SparkSession as a new session, which has its own SQLConf, registered temporary views and UDFs, but shares the SparkContext and table cache.
`SparkSession.range`(start[, end, step, …])	Create a `DataFrame` with single `pyspark.sql.types.LongType` column named `id`, containing elements in a range from `start` to `end` (exclusive) with step value `step`.
`SparkSession.read`	Returns a `DataFrameReader` that can be used to read data in as a `DataFrame`.
`SparkSession.readStream`	Returns a `DataStreamReader` that can be used to read data streams as a streaming `DataFrame`.
`SparkSession.sparkContext`	Returns the underlying `SparkContext`.
`SparkSession.sql`(sqlQuery[, args])	Runs a SQL query and returns the result as a `DataFrame`.
`SparkSession.stop`()	Stops the underlying `SparkContext`.
`SparkSession.streams`	Returns a `StreamingQueryManager` that allows managing all the `StreamingQuery` instances active on this context.
`SparkSession.table`(tableName)	Returns the specified table as a `DataFrame`.
`SparkSession.udf`	Returns a `UDFRegistration` for UDF registration.
`SparkSession.version`	The version of Spark on which this application is running.

Configuration

Spark SQL configuration, such as driver and executor resources and shuffle settings

`RuntimeConfig`(jconf)	User-facing configuration API, accessible through `SparkSession.conf`.

Input/Output

SparkSession.read returns a DataFrameReader object; DataFrame.write returns a DataFrameWriter object.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local[8]")\
    .appName('pyspark')\
    .enableHiveSupport()\
    .getOrCreate()

data = [[1,'alice'],[2,'mary']]
spark_df = spark.createDataFrame(data,['id','name'])
spark_df.write.saveAsTable('dw_pub_safe.dw_pub_xxx_xxx')   # save as a Hive table

File input and output

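`DataFrameReader.csv`(path[, schema, sep, …])	Loads a CSV file and returns the result as a `DataFrame`.
`DataFrameReader.format`(source)	Specifies the input data source format.
`DataFrameReader.jdbc`(url, table[, column, …])	Constructs a `DataFrame` representing the database table named table, accessible via a JDBC URL and connection properties.
`DataFrameReader.json`(path[, schema, …])	Loads JSON files and returns the result as a `DataFrame`.
`DataFrameReader.load`([path, format, schema])	Loads data from a data source and returns it as a `DataFrame`; the data format can be given through the format parameter.
`DataFrameReader.option`(key, value)	Adds an input option for the underlying data source.
`DataFrameReader.options`(**options)	Adds input options for the underlying data source; several options can be passed at once.
`DataFrameReader.orc`(path[, mergeSchema, …])	Loads ORC files and returns the result as a `DataFrame`.
`DataFrameReader.parquet`(*paths, **options)	Loads Parquet files and returns the result as a `DataFrame`.
`DataFrameReader.schema`(schema)	Specifies the input schema (column names and types).
`DataFrameReader.table`(tableName)	Returns the specified table as a `DataFrame`.
`DataFrameReader.text`(paths[, wholetext, …])	Loads text files and returns a `DataFrame`.
`DataFrameWriter.bucketBy`(numBuckets, col, *cols)	Buckets the output by the given columns.
`DataFrameWriter.csv`(path[, mode, …])	Saves the content of the `DataFrame` in CSV format at the specified path.
`DataFrameWriter.format`(source)	Specifies the output data source format.
`DataFrameWriter.insertInto`(tableName[, …])	Inserts the content of the `DataFrame` into the specified table.
`DataFrameWriter.jdbc`(url, table[, mode, …])	Saves the content of the `DataFrame` to an external database table via JDBC.
`DataFrameWriter.json`(path[, mode, …])	Saves the content of the `DataFrame` in JSON format at the specified path.
`DataFrameWriter.mode`(saveMode)	Specifies the behavior when the data or table already exists.
`DataFrameWriter.options`(**options)	Adds output options for the underlying data source.
`DataFrameWriter.orc`(path[, mode, …])	Saves the content of the `DataFrame` in ORC format at the specified path.
`DataFrameWriter.parquet`(path[, mode, …])	Saves the content of the `DataFrame` in Parquet format at the specified path.
`DataFrameWriter.partitionBy`(*cols)	Partitions the output by the given columns.
`DataFrameWriter.save`([path, format, mode, …])	Saves the content of the `DataFrame` to a data source in the specified format.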

DataFrame

Some attributes and methods of DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName('pyspark')\
    .enableHiveSupport()\
    .getOrCreate()
    
sc = spark.sparkContext
rdd = sc.parallelize([[1,'alice'],[2,'mary']])   # build an RDD
spark_df = rdd.toDF(schema=['id','name'])    # convert it to a DataFrame

# view the help for agg (in IPython, `spark_df.agg?` also works)
help(spark_df.agg)
# count the id column and rename the resulting column
spark_df.agg({'id':'count'}).withColumnRenamed('count(id)','cnt').show()

sql = "select * from tablename1 where field_name = 'xxx'"
spark_df1 = spark.sql(sql)
spark_df1.printSchema()

`DataFrame.__getattr__`(name)	Returns the `Column` denoted by `name`.
`DataFrame.__getitem__`(item)	Returns the `Column` denoted by `item`.
`DataFrame.agg`(*exprs)	Aggregates over the entire `DataFrame` without groups (shorthand for `df.groupBy().agg()`).
`DataFrame.alias`(alias)	Returns a new `DataFrame` with an alias set.
`DataFrame.approxQuantile`(col, probabilities, …)	Calculates the approximate quantiles of numerical columns of a `DataFrame`.
`DataFrame.cache`()	Persists the `DataFrame` with the default storage level (MEMORY_AND_DISK).
`DataFrame.checkpoint`([eager])	Returns a checkpointed version of this `DataFrame`.
`DataFrame.coalesce`(numPartitions)	Returns a new `DataFrame` that has exactly numPartitions partitions. It can only reduce the partition count, which is useful after a filter has shrunk the data.
`DataFrame.colRegex`(colName)	Selects column based on the column name specified as a regex and returns it as `Column`.
`DataFrame.collect`()	Returns all the records to the driver as a list of `Row`.
`DataFrame.corr`(col1, col2[, method])	Calculates the correlation of two columns of a `DataFrame` as a double value.
`DataFrame.count`()	Returns the number of rows in this `DataFrame`.
`DataFrame.createGlobalTempView`(name)	Creates a global temporary view with this `DataFrame`.
`DataFrame.createOrReplaceGlobalTempView`(name)	Creates or replaces a global temporary view using the given name.
`DataFrame.createOrReplaceTempView`(name)	Creates or replaces a local temporary view with this `DataFrame`.
`DataFrame.createTempView`(name)	Creates a local temporary view with this `DataFrame`.
`DataFrame.crossJoin`(other)	Returns the Cartesian product with another `DataFrame`.
`DataFrame.crosstab`(col1, col2)	Computes a pair-wise frequency table of the given columns.
`DataFrame.cube`(*cols)	Creates a multi-dimensional cube for the current `DataFrame` using the specified columns, so we can run aggregations on them.
`DataFrame.describe`(*cols)	Computes basic statistics for numeric and string columns.
`DataFrame.distinct`()	Returns a new `DataFrame` containing the distinct rows in this `DataFrame`.
`DataFrame.drop`(*cols)	Returns a new `DataFrame` without the specified columns.
`DataFrame.dropDuplicates`([subset])	Returns a new `DataFrame` with duplicate rows removed, optionally only considering certain columns.
`DataFrame.drop_duplicates`([subset])	Alias for `dropDuplicates()`.
`DataFrame.dropna`([how, thresh, subset])	Returns a new `DataFrame` omitting rows with null values.
`DataFrame.dtypes`	Returns all column names and their data types as a list.
`DataFrame.exceptAll`(other)	Return a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame` while preserving duplicates.
`DataFrame.explain`([extended, mode])	Prints the (logical and physical) plans to the console for debugging purposes.
`DataFrame.fillna`(value[, subset])	Replaces null values; alias for `na.fill()`.
`DataFrame.filter`(condition)	Filters rows using the given condition.
`DataFrame.first`()	Returns the first row as a `Row`.
`DataFrame.foreachPartition`(f)	Applies the `f` function to each partition of this `DataFrame`.
`DataFrame.freqItems`(cols[, support])	Finding frequent items for columns, possibly with false positives.
`DataFrame.groupBy`(*cols)	Groups the `DataFrame` using the specified columns, so we can run aggregations on them.
`DataFrame.head`([n])	Returns the first n rows.
`DataFrame.hint`(name, *parameters)	Specifies some hint on the current `DataFrame`.
`DataFrame.inputFiles`()	Returns a best-effort snapshot of the files that compose this `DataFrame`.
`DataFrame.intersect`(other)	Returns a new `DataFrame` containing only the rows found in both this `DataFrame` and another `DataFrame` (the intersection).
`DataFrame.intersectAll`(other)	Return a new `DataFrame` containing rows in both this `DataFrame` and another `DataFrame` while preserving duplicates.
`DataFrame.isEmpty`()	Returns `True` if this `DataFrame` is empty.
`DataFrame.isLocal`()	Returns `True` if the `collect()` and `take()` methods can be run locally (without any Spark executors).
`DataFrame.isStreaming`	Returns `True` if this `DataFrame` contains one or more sources that continuously return data as it arrives.
`DataFrame.join`(other[, on, how])	Joins with another `DataFrame`, using the given join expression.
`DataFrame.limit`(num)	Limits the result count to the number specified.
`DataFrame.localCheckpoint`([eager])	Returns a locally checkpointed version of this `DataFrame`.
`DataFrame.mapInPandas`(func, schema)	Maps an iterator of batches in the current `DataFrame` using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a `DataFrame`.
`DataFrame.mapInArrow`(func, schema)	Maps an iterator of batches in the current `DataFrame` using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as a `DataFrame`.
`DataFrame.melt`(ids, values, …)	Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
`DataFrame.na`	Returns a `DataFrameNaFunctions` for handling missing values.
`DataFrame.observe`(observation, *exprs)	Define (named) metrics to observe on the DataFrame.
`DataFrame.orderBy`(*cols, **kwargs)	Returns a new `DataFrame` sorted by the specified column(s).
`DataFrame.persist`([storageLevel])	Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed.
`DataFrame.printSchema`()	Prints out the schema in tree format.
`DataFrame.randomSplit`(weights[, seed])	Randomly splits this `DataFrame` with the provided weights.
`DataFrame.rdd`	Returns the content as an RDD of `pyspark.sql.Row`.
`DataFrame.registerTempTable`(name)	Registers this `DataFrame` as a temporary table using the given name (deprecated; use `createOrReplaceTempView`).
`DataFrame.repartition`(numPartitions, *cols)	Returns a new `DataFrame` partitioned by the given partitioning expressions.
`DataFrame.repartitionByRange`(numPartitions, …)	Returns a new `DataFrame` partitioned by the given partitioning expressions.
`DataFrame.replace`(to_replace[, value, subset])	Returns a new `DataFrame` replacing a value with another value.
`DataFrame.rollup`(*cols)	Creates a multi-dimensional rollup for the current `DataFrame` using the specified columns, so we can run aggregations on them.
`DataFrame.sameSemantics`(other)	Returns True when the logical query plans inside both `DataFrame`s are equal and therefore return the same results.
`DataFrame.sample`([withReplacement, …])	Returns a sampled subset of this `DataFrame`.
`DataFrame.sampleBy`(col, fractions[, seed])	Returns a stratified sample without replacement based on the fraction given on each stratum.
`DataFrame.schema`	Returns the schema of this `DataFrame` as a `pyspark.sql.types.StructType`.
`DataFrame.select`(*cols)	Projects a set of expressions and returns a new `DataFrame`; similar to SELECT in SQL.
`DataFrame.selectExpr`(*expr)	Projects a set of SQL expressions and returns a new `DataFrame`.
`DataFrame.semanticHash`()	Returns a hash code of the logical query plan against this `DataFrame`.
`DataFrame.show`([n, truncate, vertical])	Prints the first n rows to the console.
`DataFrame.sort`(*cols, **kwargs)	Returns a new `DataFrame` sorted by the specified column(s).
`DataFrame.sortWithinPartitions`(*cols, **kwargs)	Returns a new `DataFrame` with each partition sorted by the specified column(s).
`DataFrame.sparkSession`	Returns the Spark session that created this `DataFrame`.
`DataFrame.stat`	Returns a `DataFrameStatFunctions` for statistic functions.
`DataFrame.storageLevel`	Gets the `DataFrame`'s current storage level.
`DataFrame.subtract`(other)	Return a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame`.
`DataFrame.summary`(*statistics)	Computes specified statistics for numeric and string columns.
`DataFrame.tail`(num)	Returns the last num rows as a list of `Row`.
`DataFrame.take`(num)	Returns the first num rows as a list of `Row`.
`DataFrame.to`(schema)	Returns a new `DataFrame` where each row is reconciled to match the specified schema.
`DataFrame.toDF`(*cols)	Returns a new `DataFrame` with the newly specified column names.
`DataFrame.toJSON`([use_unicode])	Converts a `DataFrame` into an RDD of strings.
`DataFrame.toLocalIterator`([prefetchPartitions])	Returns an iterator that contains all of the rows in this `DataFrame`.
`DataFrame.toPandas`()	Returns the contents of this `DataFrame` as a pandas `DataFrame`.
`DataFrame.to_pandas_on_spark`([index_col])	Deprecated; use `DataFrame.pandas_api` instead.
`DataFrame.transform`(func, *args, **kwargs)	Returns a new `DataFrame`; a concise syntax for chaining custom transformations.
`DataFrame.union`(other)	Returns a new `DataFrame` containing the union of rows in this and another `DataFrame`. Duplicates are kept (like SQL UNION ALL); follow with `distinct()` to remove them.
`DataFrame.unionAll`(other)	Returns a new `DataFrame` containing the union of rows in this and another `DataFrame`. Duplicates are kept; equivalent to `union()`.
`DataFrame.unionByName`(other[, …])	Returns a new `DataFrame` containing union of rows in this and another `DataFrame`.
`DataFrame.unpersist`([blocking])	Marks the `DataFrame` as non-persistent, and removes all blocks for it from memory and disk.
`DataFrame.unpivot`(ids, values, …)	Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
`DataFrame.where`(condition)	Alias for `filter()`.
`DataFrame.withColumn`(colName, col)	Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name.
`DataFrame.withColumns`(*colsMap)	Returns a new `DataFrame` by adding multiple columns or replacing the existing columns that have the same names.
`DataFrame.withColumnRenamed`(existing, new)	Returns a new `DataFrame` by renaming an existing column.
`DataFrame.withColumnsRenamed`(colsMap)	Returns a new `DataFrame` by renaming multiple columns.
`DataFrame.withMetadata`(columnName, metadata)	Returns a new `DataFrame` by updating an existing column with metadata.
`DataFrame.withWatermark`(eventTime, …)	Defines an event time watermark for this `DataFrame`.
`DataFrame.write`	Interface for saving the content of the non-streaming `DataFrame` out into external storage.
`DataFrame.writeStream`	Interface for saving the content of the streaming `DataFrame` out into external storage.
`DataFrame.writeTo`(table)	Create a write configuration builder for v2 sources.
`DataFrame.pandas_api`([index_col])	Converts the existing `DataFrame` into a pandas-on-Spark DataFrame.
`DataFrameNaFunctions.drop`([how, thresh, subset])	Returns a new `DataFrame` omitting rows with null values.
`DataFrameNaFunctions.fill`(value[, subset])	Replace null values, alias for `na.fill()`.
`DataFrameNaFunctions.replace`(to_replace[, …])	Returns a new `DataFrame` replacing a value with another value.
`DataFrameStatFunctions.approxQuantile`(col, …)	Calculates the approximate quantiles of numerical columns of a `DataFrame`.
`DataFrameStatFunctions.corr`(col1, col2[, method])	Calculates the correlation of two columns of a `DataFrame` as a double value.
`DataFrameStatFunctions.cov`(col1, col2)	Calculate the sample covariance for the given columns, specified by their names, as a double value.
`DataFrameStatFunctions.crosstab`(col1, col2)	Computes a pair-wise frequency table of the given columns.
`DataFrameStatFunctions.freqItems`(cols[, support])	Finding frequent items for columns, possibly with false positives.
`DataFrameStatFunctions.sampleBy`(col, fractions)	Returns a stratified sample without replacement based on the fraction given on each stratum.

Column

Columns of a DataFrame

spark_df.filter(spark_df.id.between(2,3)).show()   # keep the rows whose id is between 2 and 3 (inclusive)

`Column.__getattr__`(item)	An expression that gets an item at position `ordinal` out of a list, or gets an item by key out of a dict.
`Column.__getitem__`(k)	An expression that gets an item at position `ordinal` out of a list, or gets an item by key out of a dict.
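`Column.alias`(*alias, **kwargs)	Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
`Column.asc`()	Returns a sort expression based on the ascending order of the column.
`Column.asc_nulls_first`()	Returns a sort expression based on the ascending order of the column, and null values return before non-null values.
`Column.asc_nulls_last`()	Returns a sort expression based on the ascending order of the column, and null values appear after non-null values.
`Column.astype`(dataType)	`astype()` is an alias for `cast()`; converts the data type.
`Column.between`(lowerBound, upperBound)	True if the current column is between the lower bound and upper bound, inclusive.
`Column.bitwiseAND`(other)	Compute bitwise AND of this expression with another expression.
`Column.bitwiseOR`(other)	Compute bitwise OR of this expression with another expression.
`Column.bitwiseXOR`(other)	Compute bitwise XOR of this expression with another expression.
`Column.cast`(dataType)	Casts the column into type dataType.
`Column.contains`(other)	Contains the other element (substring match), similar to pandas `Series.str.contains`.
`Column.desc`()	Returns a sort expression based on the descending order of the column.
`Column.desc_nulls_first`()	Returns a sort expression based on the descending order of the column, and null values appear before non-null values.
`Column.desc_nulls_last`()	Returns a sort expression based on the descending order of the column, and null values appear after non-null values.
`Column.dropFields`(*fieldNames)	An expression that drops fields in `StructType` by name.
`Column.endswith`(other)	String ends with.
`Column.eqNullSafe`(other)	Equality test that is safe for null values.
`Column.getField`(name)	An expression that gets a field by name in a `StructType`.
`Column.getItem`(key)	An expression that gets an item at position `ordinal` out of a list, or gets an item by key out of a dict.
`Column.ilike`(other)	SQL ILIKE expression (case-insensitive LIKE).
`Column.isNotNull`()	True if the current expression is NOT null.
`Column.isNull`()	True if the current expression is null.
`Column.isin`(*cols)	A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
`Column.like`(other)	SQL LIKE expression.
`Column.name`(*alias, **kwargs)	`name()` is an alias for `alias()`.
`Column.otherwise`(value)	Evaluates a list of conditions and returns one of multiple possible result expressions.
`Column.over`(window)	Defines a windowing column.
`Column.rlike`(other)	SQL RLIKE expression (LIKE with regex).
`Column.startswith`(other)	String starts with.
`Column.substr`(startPos, length)	Returns a `Column` which is a substring of the column.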
`Column.when`(condition, value)	Evaluates a list of conditions and returns one of multiple possible result expressions.
`Column.withField`(fieldName, col)	An expression that adds/replaces a field in `StructType` by name.

Data Types

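The data types live in pyspark.sql.types (from pyspark.sql.types import *); you may need them when specifying the column types of a schema while building a DataFrame.

Spark SQL data types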

`ArrayType`(elementType[, containsNull])	Array data type.
`BinaryType`	Binary (byte array) data type.
`BooleanType`	Boolean data type.
`ByteType`	Byte data type, i.e. a signed integer in a single byte.
`DataType`	Base class for data types.
`DateType`	Date (datetime.date) data type.
`DecimalType`([precision, scale])	Decimal (decimal.Decimal) data type.
`DoubleType`	Double data type, representing double precision floats.
`FloatType`	Float data type, representing single precision floats.
`IntegerType`	Int data type, i.e. a signed 32-bit integer.
`LongType`	Long data type, i.e. a signed 64-bit integer.
`MapType`(keyType, valueType[, valueContainsNull])	Map data type.
`NullType`	Null type.
`ShortType`	Short data type, i.e. a signed 16-bit integer.
`StringType`	String data type.
`CharType`(length)	Char data type
`VarcharType`(length)	Varchar data type
`StructField`(name, dataType[, nullable, metadata])	A field in `StructType`.
`StructType`([fields])	Struct type, consisting of a list of `StructField`.
`TimestampType`	Timestamp (datetime.datetime) data type.
`TimestampNTZType`	Timestamp (datetime.datetime) data type without timezone information.
`DayTimeIntervalType`([startField, endField])	DayTimeIntervalType (datetime.timedelta).

Row

Every row of a DataFrame is a Row object

`Row.asDict`([recursive])	Returns the row as a dict.

Functions

The Spark SQL function library

from pyspark.sql.functions import *

spark_df.select(upper(spark_df.name)).show()    # convert the column to upper case with the upper function

String Functions

Functions for string columns

`ascii`(col)	Computes the numeric value of the first character of the string column.
`base64`(col)	Computes the BASE64 encoding of a binary column and returns it as a string column.
`bit_length`(col)	Calculates the bit length for the specified string column.
`concat_ws`(sep, *cols)	Concatenates multiple input string columns together into a single string column, using the given separator.
`decode`(col, charset)	Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
`encode`(col, charset)	Computes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
`format_number`(col, d)	Formats the number X to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
`format_string`(format, *cols)	Formats the arguments in printf-style and returns the result as a string column.
`initcap`(col)	Translate the first letter of each word to upper case in the sentence.
`instr`(str, substr)	Locate the position of the first occurrence of substr column in the given string.
`length`(col)	Computes the character length of string data or number of bytes of binary data.
`lower`(col)	Converts a string expression to lower case.
`levenshtein`(left, right)	Computes the Levenshtein distance of the two given strings.
`locate`(substr, str[, pos])	Locate the position of the first occurrence of substr in a string column, after position pos.
`lpad`(col, len, pad)	Left-pad the string column to width len with pad.
`ltrim`(col)	Trim the spaces from left end for the specified string value.
`octet_length`(col)	Calculates the byte length for the specified string column.
`regexp_extract`(str, pattern, idx)	Extract a specific group matched by a Java regex, from the specified string column.
`regexp_replace`(string, pattern, replacement)	Replace all substrings of the specified string value that match regexp with replacement.
`unbase64`(col)	Decodes a BASE64 encoded string column and returns it as a binary column.
`rpad`(col, len, pad)	Right-pad the string column to width len with pad.
`repeat`(col, n)	Repeats a string column n times, and returns it as a new string column.
`rtrim`(col)	Trim the spaces from right end for the specified string value.
`soundex`(col)	Returns the SoundEx encoding for a string
`split`(str, pattern[, limit])	Splits str around matches of the given pattern.
`substring`(str, pos, len)	Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
`substring_index`(str, delim, count)	Returns the substring from string str before count occurrences of the delimiter delim.
`overlay`(src, replace, pos[, len])	Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
`sentences`(string[, language, country])	Splits a string into arrays of sentences, where each sentence is an array of words.
`translate`(srcCol, matching, replace)	A function translate any character in the srcCol by a character in matching.
`trim`(col)	Trim the spaces from both ends for the specified string column.
`upper`(col)	Converts a string expression to upper case.

Sort Functions

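Functions that return sort expressions

`asc`(col)	Returns a sort expression based on the ascending order of the given column name.
`asc_nulls_first`(col)	Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
`asc_nulls_last`(col)	Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
`desc`(col)	Returns a sort expression based on the descending order of the given column name.
`desc_nulls_first`(col)	Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.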
`desc_nulls_last`(col)	Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.

Window Functions

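Window functions, applied to a window specification through `Column.over`

`cume_dist`()	Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
`dense_rank`()	Window function: returns the rank of rows within a window partition, without any gaps, e.g. 1, 2, 2, 3.
`lag`(col[, offset, default])	Window function: returns the value that is offset rows before the current row, and default if there are fewer than offset rows before the current row.
`lead`(col[, offset, default])	Window function: returns the value that is offset rows after the current row, and default if there are fewer than offset rows after the current row.
`nth_value`(col, offset[, ignoreNulls])	Window function: returns the value that is the offsetth row of the window frame (counting from 1), and null if the size of window frame is less than offset rows.
`ntile`(n)	Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
`percent_rank`()	Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
`rank`()	Window function: returns the rank of rows within a window partition, leaving gaps after ties, e.g. 1, 2, 2, 4.
`row_number`()	Window function: returns a sequential number starting at 1 within a window partition; tied values still receive different numbers.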

Aggregate Functions

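`approx_count_distinct`(col[, rsd])	Returns a new column for the approximate distinct count of column col.
`avg`(col)	Returns the average of the values in a group.
`collect_list`(col)	Returns a list of objects with duplicates.
`collect_set`(col)	Aggregate function: returns a set of objects with duplicate elements eliminated. Because the computation is distributed, the element order may differ between runs.
`corr`(col1, col2)	Returns a new column for the Pearson correlation coefficient of col1 and col2.
`count`(col)	Returns the number of items in a group.
`count_distinct`(col, *cols)	Returns a new column for the distinct count of col or cols.
`covar_samp`(col1, col2)	Returns a new column for the sample covariance of col1 and col2.
`first`(col[, ignorenulls])	Returns the first value in a group.
`grouping`(col)	Indicates whether a specified column in a GROUP BY list is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set. Same as the Hive function of that name.
`grouping_id`(*cols)	Aggregate function: returns the level of grouping, equals to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + … + grouping(cn).
`kurtosis`(col)	Returns the kurtosis of the values in a group.
`last`(col[, ignorenulls])	Returns the last value in a group.
`max`(col)	Returns the maximum value of the expression in a group.
`max_by`(col, ord)	Returns the value associated with the maximum value of ord.
`mean`(col)	Returns the mean of the values in a group.
`median`(col)	Returns the median of the values in a group.
`min`(col)	Returns the minimum value of the expression in a group.
`min_by`(col, ord)	Returns the value associated with the minimum value of ord.
`mode`(col)	Returns the most frequent value in a group (the statistical mode).
`percentile_approx`(col, percentage[, accuracy])	Returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than or equal to that value.
`product`(col)	Returns the product of the values in a group.
`skewness`(col)	Returns the skewness of the values in a group.
`stddev`(col)	Alias for stddev_samp (standard deviation).
`stddev_pop`(col)	Returns the population standard deviation of the expression in a group.
`stddev_samp`(col)	Returns the unbiased sample standard deviation of the expression in a group (divides by n-1).
`sum`(col)	Returns the sum of all values in the expression.
`sum_distinct`(col)	Aggregate function: returns the sum of distinct values in the expression.
`sumDistinct`(col)	Returns the sum of distinct values in the expression (deprecated alias of sum_distinct).
`var_pop`(col)	Returns the population variance of the values in a group.
`var_samp`(col)	Returns the unbiased sample variance of the values in a group.
`variance`(col)	Alias for var_samp.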

Datetime Functions

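Date and time functions

`add_months`(start, months)	Returns the date that is months months after start.
`current_date`()	Returns the current date at the start of query evaluation as a DateType column.
`current_timestamp`()	Returns the current timestamp at the start of query evaluation as a TimestampType column.
`date_add`(start, days)	Returns the date that is days days after start.
`date_format`(date, format)	Converts a date/timestamp/string to a string value in the format specified by the date format given by the second argument.
`date_sub`(start, days)	Returns the date that is days days before start.
`date_trunc`(format, timestamp)	Returns timestamp truncated to the unit specified by the format.
`datediff`(end, start)	Returns the number of days from start to end.
`dayofmonth`(col)	Extracts the day of the month of a given date/timestamp as integer.
`dayofweek`(col)	Extracts the day of the week of a given date/timestamp as integer.
`dayofyear`(col)	Extracts the day of the year of a given date/timestamp as integer.
`second`(col)	Extracts the seconds of a given date as integer.
`weekofyear`(col)	Extracts the week number of a given date as integer.
`year`(col)	Extracts the year of a given date/timestamp as integer.
`quarter`(col)	Extracts the quarter of a given date/timestamp as integer.
`month`(col)	Extracts the month of a given date/timestamp as integer.
`last_day`(date)	Returns the last day of the month which the given date belongs to, similar to Excel's EOMONTH function.
`localtimestamp`()	Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column.
`minute`(col)	Extracts the minutes of a given timestamp as integer.
`months_between`(date1, date2[, roundOff])	Returns the number of months between dates date1 and date2.
`next_day`(date, dayOfWeek)	Returns the first date which is later than the value of the date column, based on the dayOfWeek argument.
`hour`(col)	Extracts the hours of a given timestamp as integer.
`make_date`(year, month, day)	Returns a date built from the year, month and day columns.
`from_unixtime`(timestamp[, format])	Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
`unix_timestamp`([timestamp, format])	Converts a time string with the given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to a Unix timestamp (in seconds), using the default timezone and the default locale; returns null if it fails.
`to_timestamp`(col[, format])	Converts a column into pyspark.sql.types.TimestampType using the optionally specified format.
`to_date`(col[, format])	Converts a column into pyspark.sql.types.DateType using the optionally specified format.
`trunc`(date, format)	Returns date truncated to the unit specified by the format.
`from_utc_timestamp`(timestamp, tz)	This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. It takes a timezone-agnostic timestamp, interprets it as a timestamp in UTC, and renders it as a timestamp in the given time zone.
`to_utc_timestamp`(timestamp, tz)	This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. It takes a timezone-agnostic timestamp, interprets it as a timestamp in the given time zone, and renders it as a timestamp in UTC.
`window`(timeColumn, windowDuration[, …])	Bucketizes rows into one or more time windows given a timestamp specifying column.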
`session_window`(timeColumn, gapDuration)	Generates session window given a timestamp specifying column.
`timestamp_seconds`(col)	Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
`window_time`(windowColumn)	Computes the event time from a window column.

Collection Functions

Functions for arrays, maps, and JSON/CSV data

`array`(*cols)	Creates a new array column.
`array_contains`(col, value)	Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
`arrays_overlap`(a1, a2)	Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
`array_join`(col, delimiter[, null_replacement])	Concatenates the elements of column using the delimiter.
`create_map`(*cols)	Creates a new map column.
`slice`(x, start, length)	Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
`concat`(*cols)	Concatenates multiple input columns together into a single column.
`array_position`(col, value)	Collection function: Locates the position of the first occurrence of the given value in the given array.
`element_at`(col, extraction)	Collection function: Returns element of array at given index in extraction if col is array.
`array_append`(col, value)	Collection function: returns an array of the elements in col with value appended at the end.
`array_sort`(col[, comparator])	Collection function: sorts the input array in ascending order.
`array_insert`(arr, pos, value)	Collection function: adds an item into a given array at a specified array index.
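`array_remove`(col, element)	Collection function: removes all elements equal to element from the given array.
`array_distinct`(col)	Collection function: removes duplicate values from the array.
`array_intersect`(col1, col2)	Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
`array_union`(col1, col2)	Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
`array_except`(col1, col2)	Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
`array_compact`(col)	Collection function: removes null values from the array.
`transform`(col, f)	Returns an array of elements after applying a transformation to each element in the input array.
`exists`(col, f)	Returns whether a predicate holds for one or more elements in the array.
`forall`(col, f)	Returns whether a predicate holds for every element in the array.
`filter`(col, f)	Returns an array of elements for which a predicate holds in a given array.
`aggregate`(col, initialValue, merge[, finish])	Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
`zip_with`(left, right, f)	Merge two given arrays, element-wise, into a single array using a function.
`transform_keys`(col, f)	Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
`transform_values`(col, f)	Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
`map_filter`(col, f)	Returns a map whose key-value pairs satisfy a predicate.
`map_from_arrays`(col1, col2)	Creates a new map from two arrays.
`map_zip_with`(col1, col2, f)	Merges two given maps, key-wise, into a single map using a function.
`explode`(col)	Returns a new row for each element in the given array or map.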
`explode_outer`(col)	Returns a new row for each element in the given array or map.
`posexplode`(col)	Returns a new row for each element with position in the given array or map.
`posexplode_outer`(col)	Returns a new row for each element with position in the given array or map.
`inline`(col)	Explodes an array of structs into a table.
`inline_outer`(col)	Explodes an array of structs into a table.
`get`(col, index)	Collection function: Returns element of array at given (0-based) index.
`get_json_object`(col, path)	Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
`json_tuple`(col, *fields)	Creates a new row for a json column according to the given field names.
`from_json`(col, schema[, options])	Parses a column containing a JSON string into a `MapType` with `StringType` as keys type, `StructType` or `ArrayType` with the specified schema.
`schema_of_json`(json[, options])	Parses a JSON string and infers its schema in DDL format.
`to_json`(col[, options])	Converts a column containing a `StructType`, `ArrayType` or a `MapType` into a JSON string.
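`size`(col)	Collection function: returns the length of the array or map stored in the column.
`struct`(*cols)	Creates a new struct column.
`sort_array`(col[, asc])	Collection function: sorts the input array in ascending or descending order according to the natural ordering of the array elements.
`array_max`(col)	Collection function: returns the maximum value of the array.
`array_min`(col)	Collection function: returns the minimum value of the array.
`shuffle`(col)	Collection function: generates a random permutation of the given array.
`reverse`(col)	Collection function: returns a reversed string or an array with reverse order of elements.
`flatten`(col)	Collection function: creates a single array from an array of arrays.
`sequence`(start, stop[, step])	Generates a sequence of integers from start to stop, incrementing by step.
`array_repeat`(col, count)	Collection function: creates an array containing a column repeated count times.
`map_contains_key`(col, value)	Returns true if the map contains the key.
`map_keys`(col)	Collection function: returns an unordered array containing the keys of the map.
`map_values`(col)	Collection function: returns an unordered array containing the values of the map.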
`map_entries`(col)	Collection function: Returns an unordered array of all entries in the given map.
`map_from_entries`(col)	Collection function: Converts an array of entries (key value struct types) to a map of values.
`arrays_zip`(*cols)	Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
`map_concat`(*cols)	Returns the union of all the given maps.
`from_csv`(col, schema[, options])	Parses a column containing a CSV string to a row with the specified schema.
`schema_of_csv`(csv[, options])	Parses a CSV string and infers its schema in DDL format.
`to_csv`(col[, options])	Converts a column containing a `StructType` into a CSV string.

Math Functions

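`sqrt`(col)	Computes the square root of the specified float value.
`abs`(col)	Computes the absolute value.
`acos`(col)	Computes inverse cosine of the input column.
`acosh`(col)	Computes inverse hyperbolic cosine of the input column.
`asin`(col)	Computes inverse sine of the input column.
`asinh`(col)	Computes inverse hyperbolic sine of the input column.
`atan`(col)	Computes inverse tangent of the input column.
`atanh`(col)	Computes inverse hyperbolic tangent of the input column.
`atan2`(col1, col2)	Returns the angle theta of the polar coordinates (r, theta) that correspond to the Cartesian point (x, y), with col1 as the y coordinate and col2 as the x coordinate.
`bin`(col)	Returns the string representation of the binary value of the given column.
`cbrt`(col)	Computes the cube root of the given value.
`ceil`(col)	Computes the ceiling of the given value.
`conv`(col, fromBase, toBase)	Converts a number in a string column from one base to another.
`cos`(col)	Computes cosine of the input column.
`cosh`(col)	Computes hyperbolic cosine of the input column.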
`cot`(col)	Computes cotangent of the input column.
`csc`(col)	Computes cosecant of the input column.
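`exp`(col)	Computes the exponential of the given value.
`expm1`(col)	Computes the exponential of the given value minus one.
`factorial`(col)	Computes the factorial of the given value.
`floor`(col)	Computes the floor of the given value.
`hex`(col)	Computes hex value of the given column, which could be `pyspark.sql.types.StringType`, `pyspark.sql.types.BinaryType`, `pyspark.sql.types.IntegerType` or `pyspark.sql.types.LongType`.
`unhex`(col)	Inverse of hex.
`hypot`(col1, col2)	Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
`log`(arg1[, arg2])	Returns the first argument-based logarithm of the second argument; with a single argument, returns its natural logarithm.
`log10`(col)	Computes the logarithm of the given value in base 10.
`log1p`(col)	Computes the natural logarithm of the given value plus one.
`log2`(col)	Returns the base-2 logarithm of the argument.
`pmod`(dividend, divisor)	Returns the positive value of dividend mod divisor.
`pow`(col1, col2)	Returns the value of the first argument raised to the power of the second argument.
`rint`(col)	Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
`round`(col[, scale])	Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.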
`bround`(col[, scale])	Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
`sec`(col)	Computes secant of the input column.
`shiftleft`(col, numBits)	Shift the given value numBits left.
`shiftright`(col, numBits)	(Signed) shift the given value numBits right.
`shiftrightunsigned`(col, numBits)	Unsigned shift the given value numBits right.
`signum`(col)	Computes the signum of the given value.
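`sin`(col)	Computes the sine of the input column.
`sinh`(col)	Computes the hyperbolic sine of the input column.
`tan`(col)	Computes the tangent of the input column.
`tanh`(col)	Computes the hyperbolic tangent of the input column.
`toDegrees`(col)	Deprecated alias of `degrees`. New in version 1.4.0.
`degrees`(col)	Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
`toRadians`(col)	Deprecated alias of `radians`. New in version 1.4.0.
`radians`(col)	Converts an angle measured in degrees to an approximately equivalent angle measured in radians.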
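
The difference between `round` (HALF_UP) and `bround` (HALF_EVEN), and the meaning of `pmod`, are easy to misread from one-line descriptions. The following is a minimal pure-Python sketch of those rules, not Spark code: the helper names `round_half_up`, `round_half_even` and `pmod` are hypothetical, values are passed as strings so that exact ties can be represented, and `pmod` is only sketched for a positive divisor.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value: str, scale: int = 0) -> Decimal:
    """Illustrates the HALF_UP rule described for `round`: ties move away from zero."""
    # Decimal(1).scaleb(-scale) is the quantum: 0.01 for scale=2, 1E+1 for scale=-1.
    return Decimal(value).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)


def round_half_even(value: str, scale: int = 0) -> Decimal:
    """Illustrates the HALF_EVEN rule described for `bround`: ties go to the even digit."""
    return Decimal(value).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_EVEN)


def pmod(dividend: int, divisor: int) -> int:
    """Illustrates a positive modulus (assumption: divisor > 0)."""
    if divisor <= 0:
        raise ValueError("this sketch only covers a positive divisor")
    # math.fmod keeps the sign of the dividend, like a truncating remainder.
    remainder = int(math.fmod(dividend, divisor))
    # Shift a negative remainder into the range [0, divisor).
    return remainder + divisor if remainder < 0 else remainder


if __name__ == "__main__":
    for value, scale in [("2.5", 0), ("2.345", 2), ("125", -1)]:
        print(value, scale, round_half_up(value, scale), round_half_even(value, scale))
    print(pmod(-7, 3), pmod(7, 3))
```

With a negative scale the rounding happens in the integral part, so a scale of -1 rounds to the nearest ten.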

Normal Functions

General-purpose functions

`col`(col)	Returns a Column based on the given column name.
`column`(col)	Returns a Column based on the given column name.
`lit`(col)	Creates a Column of literal value.
`broadcast`(df)	Marks a DataFrame as small enough for use in broadcast joins.
`coalesce`(*cols)	Returns the first column that is not null.
`input_file_name`()	Creates a string column for the file name of the current Spark task.
`isnan`(col)	An expression that returns true if the column is NaN.
`isnull`(col)	An expression that returns true if the column is null.
`monotonically_increasing_id`()	A column that generates monotonically increasing 64-bit integers.
`nanvl`(col1, col2)	Returns col1 if it is not NaN, or col2 if col1 is NaN.
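`rand`([seed])	Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
`randn`([seed])	Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.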
`spark_partition_id`()	A column for partition ID.
`when`(condition, value)	Evaluates a list of conditions and returns one of multiple possible result expressions.
`bitwise_not`(col)	Computes bitwise not.
`bitwiseNOT`(col)	Computes bitwise not.
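`expr`(str)	Parses the expression string into the column that it represents.
`greatest`(*cols)	Returns the greatest value of the list of column names, skipping null values.
`least`(*cols)	Returns the least value of the list of column names, skipping null values.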

Grouping

The DataFrame.groupBy method returns a GroupedData object, which provides the following methods.

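`GroupedData.agg`(*exprs)	Computes aggregates and returns the result as a DataFrame.
`GroupedData.apply`(udf)	An alias of pyspark.sql.GroupedData.applyInPandas(); however, it takes a pyspark.sql.functions.pandas_udf(), whereas pyspark.sql.GroupedData.applyInPandas() takes a Python native function.
`GroupedData.applyInPandas`(func, schema)	Maps each group of the current DataFrame using a pandas udf and returns the result as a DataFrame.
`GroupedData.applyInPandasWithState`(func, …)	Applies the given function to each group of data, while maintaining a user-defined per-group state.
`GroupedData.avg`(*cols)	Computes the average of each numeric column for each group.
`GroupedData.cogroup`(other)	Cogroups this group with another group so that cogrouped operations can be run.
`GroupedData.count`()	Counts the number of records in each group.
`GroupedData.max`(*cols)	Computes the maximum of each numeric column for each group.
`GroupedData.mean`(*cols)	Computes the mean of each numeric column for each group.
`GroupedData.min`(*cols)	Computes the minimum of each numeric column for each group.
`GroupedData.pivot`(pivot_col[, values])	Pivots a column of the current DataFrame and performs the specified aggregation.
`GroupedData.sum`(*cols)	Computes the sum of each numeric column for each group.
`PandasCogroupedOps.applyInPandas`(func, schema)	Applies a function to each cogroup using pandas and returns the result as a DataFrame.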

UDF

User-defined functions

strlen = spark.udf.register("len", lambda x: len(x))   # register a Python lambda as a UDF
spark.sql("SELECT len('id')").collect()    # call it from SQL by its registered name, len

# Call it on a DataFrame through the returned function object, strlen
spark.sql("SELECT '1' AS text union all select '31' as text").select(strlen("text")).show()

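`UDFRegistration.register`(name, f[, returnType])	Registers a Python function (including a lambda function) or a user-defined function as a SQL function.
`UDFRegistration.registerJavaFunction`(name, …)	Registers a Java user-defined function as a SQL function.
`UDFRegistration.registerJavaUDAF`(name, …)	Registers a Java user-defined aggregate function as a SQL function.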
