I have a DataFrame that looks like this:
    +--------------------+------------------+
    |            features|            labels|
    +--------------------+------------------+
    |[-0.38475, 0.568...]|            label1|
    |[0.645734, 0.699...]|            label2|
    |               .....|               ...|
    +--------------------+------------------+
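Both columns are of String type (StringType()), and I want to feed this into a Spark ML random forest. To do that, I need to convert the features column into a vector of floats. Does anyone know how to do this?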
If you are using Spark 2.x, I believe this is what you need:
    from pyspark.sql.functions import udf
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.linalg import VectorUDT
    from pyspark.ml.feature import StringIndexer

    df = spark.createDataFrame(
        [("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")],
        ("features", "label"))

    def parse(s):
        # Parse the string with the mllib parser, then convert to an ml vector;
        # return None (null) for strings that cannot be parsed.
        try:
            return Vectors.parse(s).asML()
        except Exception:
            return None

    parse_ = udf(parse, VectorUDT())

    parsed = df.withColumn("features", parse_("features"))

    # Index the string labels as doubles
    indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

    indexer.fit(parsed).transform(parsed).show()
    ## +----------------+------+-------------+
    ## |        features| label|label_indexed|
    ## +----------------+------+-------------+
    ## |[-0.38475,0.568]|label1|          0.0|
    ## |[0.645734,0.699]|label2|          1.0|
    ## +----------------+------+-------------+
With Spark 1.6 it is not much different:
    from pyspark.sql.functions import udf
    from pyspark.ml.feature import StringIndexer
    from pyspark.mllib.linalg import Vectors, VectorUDT

    df = sqlContext.createDataFrame(
        [("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")],
        ("features", "label"))

    parse_ = udf(Vectors.parse, VectorUDT())

    parsed = df.withColumn("features", parse_("features"))

    indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

    indexer.fit(parsed).transform(parsed).show()
    ## +----------------+------+-------------+
    ## |        features| label|label_indexed|
    ## +----------------+------+-------------+
    ## |[-0.38475,0.568]|label1|          0.0|
    ## |[0.645734,0.699]|label2|          1.0|
    ## +----------------+------+-------------+
Vectors has a parse function that can help you achieve what you are trying to do.
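To make concrete what parsing such a string involves, here is a minimal plain-Python sketch. It is not Spark's implementation: `parse_dense` is a hypothetical helper name, and it assumes the strings always use the bracketed, comma-separated dense form shown in the question (which happens to be valid JSON).

```python
import json
from typing import List, Optional


def parse_dense(s: str) -> Optional[List[float]]:
    """Hypothetical helper: turn "[-0.38475, 0.568]" into a list of floats.

    Returns None when the input is not a bracketed list of numbers,
    mirroring the "return None on failure" behavior of the UDF above.
    """
    try:
        values = json.loads(s)  # the bracketed form is valid JSON
    except (TypeError, ValueError):
        return None
    if not isinstance(values, list):
        return None
    try:
        return [float(v) for v in values]
    except (TypeError, ValueError):
        return None


print(parse_dense("[-0.38475, 0.568]"))  # [-0.38475, 0.568]
print(parse_dense("not a vector"))       # None
```

Inside a Spark UDF, a list like this could presumably be passed to Vectors.dense to build the vector, but for strings in this format Vectors.parse already does the job in one step.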