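I have a PySpark DataFrame that contains a column named Filters of type "array<struct<Op:string,Type:string,Val:string>>".
I want to save the DataFrame to a CSV file, and to do that I need to convert the array to a string type.
I tried casting it with DF.Filters.tostring() and with DF.Filters.cast(StringType()), but both produce the following message instead of the array contents for every row in the Filters column: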
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19
The code is as follows:
from pyspark.sql.types import StringType

DF.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- Op: string (nullable = true)
|    |    |-- Type: string (nullable = true)
|    |    |-- Val: string (nullable = true)

DF_cast = DF.select('ClientNum', DF.Filters.cast(StringType()))

DF_cast.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: string (nullable = true)

DF_cast.show()
| ClientNum | Filters
| 32103     | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d9e517ce
| 218056    | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3c744494
Sample JSON data:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
Thanks!
I created a sample JSON dataset to match that schema:

{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}

Casting the column on this dataset (loaded as s) reproduces the problem:

s.select(s.col("ClientNum"),s.col("Filters").cast(StringType)).show(false)

+---------+------------------------------------------------------------------+
|ClientNum|Filters                                                           |
+---------+------------------------------------------------------------------+
|abc123   |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@60fca57e|
+---------+------------------------------------------------------------------+
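Your problem is best solved with the explode() function, which flattens the array into one row per element, followed by star-expansion notation to turn the struct fields into columns: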
s.selectExpr("explode(Filters) AS structCol").selectExpr("structCol.*").show()

+---+----+---+
| Op|Type|Val|
+---+----+---+
|foo| bar|baz|
+---+----+---+
To turn it into a single comma-separated string column:
s.selectExpr("explode(Filters) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("single_col")).show()

+-----------+
| single_col|
+-----------+
|foo,bar,baz|
+-----------+