
I am very new to PySpark and need to perform prediction. I've already done everything in Python, but because the data I have to apply the logic to is huge, I need to convert everything to PySpark.

The problem is: I have 2 dataframes. The first dataframe is for training purposes and has the Y column; the second one is for testing (from my S3 bucket).

  1. I already cleaned and prepared the training dataframe in Python and converted it to a PySpark df:

    from pyspark.sql import SparkSession
    import pandas as pd

    # Create a SparkSession
    spark = SparkSession.builder.appName('myapp').getOrCreate()

    # Load the pandas DataFrame
    pdf = dataframe  # dataframe is the name of my pandas df

    # Convert the pandas DataFrame to a PySpark DataFrame
    df_spark = spark.createDataFrame(pdf)

  2. I selected the training columns:

    y = df_spark.select("Y")
    X = df_spark.select([c for c in df_spark.columns if c != "Y"])
    training_columns = X.columns

  3. I followed a tutorial and used VectorAssembler:

    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=training_columns, outputCol='features')
    output = assembler.transform(df_spark)
    final_data = output.select('features', 'Y')

  4. I trained the model - Random Forest:

    from pyspark.ml.classification import RandomForestClassifier

    # Set the hyperparameters
    n_estimators = 100
    max_depth = 5

    # Create the model
    rfc = RandomForestClassifier(featuresCol='features', labelCol='Y',
                                 numTrees=n_estimators, maxDepth=max_depth)

    # Fit the model on the training data
    model = rfc.fit(final_data)

  5. I checked model evaluation on the training data:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    predictions = model.transform(final_data)
    my_binary_eval = BinaryClassificationEvaluator(labelCol='Y')
    print(my_binary_eval.evaluate(predictions))

  6. Now comes the moment where I want to apply the model to a different PySpark dataframe - the test data, with a lot of records = df_to_predict.

df_to_predict might have a slightly different set of columns compared to the training data, as I dropped columns with no variance. Hence I first need to restrict it to the columns from the training set:

df_to_predict = df_to_predict.select(training_columns) 

Next I would like to apply the model and do the equivalent of predict_proba - this is from the sklearn package, and I do not know how to apply/convert this code to PySpark:

df_to_predict["Predicted_Y_Probability"] = model.predict_proba(df_to_predict)[::, 1] 

1 Answer

You can use a UDF to wrap your custom function. Check this answer:

Unable to make prediction with Sklearn model on pyspark dataframe
