All the data types in pyspark.sql.types are:
__all__ = [ "DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType", "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]
I have to write a UDF (in PySpark) that returns an array of tuples. What should I give as its second argument, the return type of the udf method? It would be something along the lines of ArrayType(TupleType())…
There is no such thing as TupleType in Spark. Product types are represented as structs with fields of specific types. For example, if you want to return an array of pairs (integer, string), you can use a schema like this:
from pyspark.sql.types import *

schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)
]))
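As a side note, if you are on Spark 2.3 or later (an assumption about your version), the same return type can also be passed to udf as a DDL-formatted string instead of building the DataType objects by hand. A minimal sketch:

# Assumes Spark 2.3+: udf accepts a DDL-formatted type string as returnType
from pyspark.sql.functions import udf
from collections import Counter

char_count_udf = udf(
    lambda s: Counter(s).most_common(),
    "array<struct<char:string,count:int>>"  # same shape as the schema above
)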
Example usage:
from pyspark.sql.functions import udf
from collections import Counter

char_count_udf = udf(
    lambda s: Counter(s).most_common(),
    schema
)

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])
df.select("*", char_count_udf(df["value"])).show(2, False)

## +---+-----+-------------------------+
## |id |value|PythonUDF#<lambda>(value)|
## +---+-----+-------------------------+
## |1  |foo  |[[o,2], [f,1]]           |
## |2  |bar  |[[r,1], [a,1], [b,1]]    |
## +---+-----+-------------------------+
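If you later need the individual fields of those structs, one option (a sketch; the "counts" and "pair" aliases are just illustrative names) is to explode the array and address the struct fields with dot notation:

from pyspark.sql.functions import explode, col

result = df.select("id", char_count_udf(df["value"]).alias("counts"))

# explode turns each (char, count) struct into its own row; the exploded
# column is aliased to "pair" and its fields are selected as pair.char etc.
result.select("id", explode("counts").alias("pair")) \
      .select("id", col("pair.char"), col("pair.count")) \
      .show()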