I have data like this:
    df = sqlContext.createDataFrame([
        ('1986/10/15', 'z', 'null'),
        ('1986/10/15', 'z', 'null'),
        ('1986/10/15', 'c', 'null'),
        ('1986/10/15', 'null', 'null'),
        ('1986/10/16', 'null', '4.0')],
        ('low', 'high', 'normal'))
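I want to compute the number of days between each date in the low column and 2017-05-02, and replace the low column with that difference. I have tried the related solutions on Stack Overflow, but none of them work.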
You need to convert the low column to a date before you can use datediff() together with lit(). With Spark 2.2 or later:
    from pyspark.sql.functions import datediff, to_date, lit

    # Parse "low" with its yyyy/MM/dd pattern, then count the days up to 2017-05-02
    df.withColumn(
        "test",
        datediff(to_date(lit("2017-05-02")), to_date("low", "yyyy/MM/dd"))
    ).show()

    +----------+----+------+-----+
    |       low|high|normal| test|
    +----------+----+------+-----+
    |1986/10/15|   z|  null|11157|
    |1986/10/15|   z|  null|11157|
    |1986/10/15|   c|  null|11157|
    |1986/10/15|null|  null|11157|
    |1986/10/16|null|   4.0|11156|
    +----------+----+------+-----+
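The values in the test column can be cross-checked without Spark. The following is only a sanity check using Python's standard datetime module, not part of the Spark solution; the variable names are illustrative. It subtracts in the same direction as datediff(end, start), that is, end minus start.

```python
from datetime import date

# Reference date used in the answer
reference = date(2017, 5, 2)

# Same direction as datediff(end, start): end minus start, in whole days
diff_oct15 = (reference - date(1986, 10, 15)).days
diff_oct16 = (reference - date(1986, 10, 16)).days

print(diff_oct15, diff_oct16)  # 11157 11156
```

Both numbers match the Spark output above, which suggests the yyyy/MM/dd pattern is parsing the low column as intended.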
Before Spark 2.2, to_date() does not accept a format string, so we first need to convert the low column to a timestamp with unix_timestamp():
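    from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

    # unix_timestamp() parses the string with the given pattern;
    # the cast to timestamp lets to_date() extract the date part
    df.withColumn(
        "test",
        datediff(
            to_date(lit("2017-05-02")),
            to_date(unix_timestamp('low', "yyyy/MM/dd").cast("timestamp"))
        )
    ).show()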