Hi, I have the following problem:
numeric.registerTempTable("numeric").
All the values I want to filter on are the literal string 'null', not N/A or Null values.
I tried the following three options:
numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')
numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')
sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")
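Unfortunately, numeric_filtered is always empty. I checked, and numeric does contain data that should be filtered based on these conditions.
Here are some sample values: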
Low   High  Normal
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0
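You are using logical conjunction (AND). This means that all columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example: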
numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
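To see why chaining the three filters drops every row, the same row-by-row logic can be sketched in plain Python, without Spark. This is only an illustration on the sample values from the question; the `rows` list and the two helper functions are hypothetical names, not part of the PySpark API.

```python
# Sample rows from the question as (low, high, normal) tuples of strings.
rows = [
    ("3.5", "5.0", "null"),
    ("2.0", "14.0", "null"),
    ("null", "38.0", "null"),
    ("null", "null", "null"),
    ("1.0", "null", "4.0"),
]


def keep_with_and(row):
    """Mimic the chained filters: every column must differ from 'null'."""
    return all(value != "null" for value in row)


def keep_with_or(row):
    """Mimic the OR version: at least one column must differ from 'null'."""
    return any(value != "null" for value in row)


and_result = [row for row in rows if keep_with_and(row)]
or_result = [row for row in rows if keep_with_or(row)]

print(and_result)  # no sample row has all three columns filled in
print(or_result)   # only the all-'null' row is dropped
```

Every sample row has at least one 'null' column, so the AND version keeps nothing, while the OR version removes only the row in which all three columns are 'null'.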
All the other methods you tried follow exactly the same pattern. What you need here is logical disjunction (OR).
from pyspark.sql.functions import col

# Keep a row if at least one of the three columns is not the string 'null'.
numeric_filtered = numeric.where(
    (col('LOW') != 'null') |
    (col('NORMAL') != 'null') |
    (col('HIGH') != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
Or with raw SQL:
numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
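If there are more than a handful of columns, the OR condition in the raw SQL version could be generated rather than typed by hand. The following is a minimal sketch in plain Python; `build_not_all_null_query` is a hypothetical helper, not a Spark function, and it assumes the table and column names are trusted identifiers (it does no escaping).

```python
def build_not_all_null_query(table, columns, marker="null"):
    """Build a SELECT that keeps rows where at least one column differs from marker."""
    # One "column != 'marker'" comparison per column, joined with OR.
    conditions = " OR ".join(
        "{} != '{}'".format(column, marker) for column in columns
    )
    return "SELECT * FROM {} WHERE {}".format(table, conditions)


query = build_not_all_null_query("numeric", ["low", "normal", "high"])
print(query)
# SELECT * FROM numeric WHERE low != 'null' OR normal != 'null' OR high != 'null'
```

The resulting string is the same query as above, so it can be passed to sqlContext.sql(query) once the temp table has been registered.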
See also: Pyspark: multiple conditions in when clause