The data looks like this:
+-----------+-----------+-----------------------------+
|         id|      point|                         data|
+-----------+-----------+-----------------------------+
|        abc|          6|{"key1":"124", "key2": "345"}|
|        dfl|          7|{"key1":"777", "key2": "888"}|
|        4bd|          6|{"key1":"111", "key2": "788"}|
+-----------+-----------+-----------------------------+
I am trying to break it out into the following format, with one column per JSON key.
+-----------+-----------+-----------+-----------+
|         id|      point|       key1|       key2|
+-----------+-----------+-----------+-----------+
|        abc|          6|        124|        345|
|        dfl|          7|        777|        888|
|        4bd|          6|        111|        788|
+-----------+-----------+-----------+-----------+
The explode function splits the data frame into multiple rows, which is not what I want here; I need the keys as separate columns.
As long as you are using Spark 2.1 or later, pyspark.sql.functions.from_json should get you the desired result, but you first need to define the required schema.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the JSON string stored in the "data" column
schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

# Parse the JSON string into a struct, then expand the struct into columns
df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()
This should give you:
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
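To make the row-level effect of parsing the JSON and then selecting data.* easier to see, here is a minimal plain-Python sketch that does not need a Spark session. It is only an illustration of the idea, not how Spark implements it: flatten_row is a hypothetical helper, and the choice to fill missing or unparsable values with None is an assumption made to resemble nullable schema fields.

```python
import json


def flatten_row(row, keys=("key1", "key2")):
    """Hypothetical helper: parse row["data"] as JSON and lift the
    listed keys into top-level fields, dropping the "data" field."""
    try:
        parsed = json.loads(row["data"])
    except (TypeError, ValueError):
        # Assumption: treat missing or malformed JSON as "no values"
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    # Keep every original field except the raw JSON string
    out = {k: v for k, v in row.items() if k != "data"}
    # Add one field per requested key; absent keys become None
    for key in keys:
        out[key] = parsed.get(key)
    return out


# Sample rows copied from the question
rows = [
    {"id": "abc", "point": 6, "data": '{"key1":"124", "key2": "345"}'},
    {"id": "dfl", "point": 7, "data": '{"key1":"777", "key2": "888"}'},
    {"id": "4bd", "point": 6, "data": '{"key1":"111", "key2": "788"}'},
]

flat = [flatten_row(r) for r in rows]
for r in flat:
    print(r)
```

The keys tuple plays the role of the schema in the answer above: only the keys listed there end up as columns, which is why the schema has to be defined before the JSON can be expanded.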