How to create a UDF using two columns and an if expression

0

I’m working on a job using PySpark and SQL.

I have this function:

selection = F.udf(lambda x: cumulative_sum._1 if 100*random.random()<= x, FloatType())
cumulative_sum.withColumn('Selection2', selection(cumulative_sum.cumSum)

My table is called cumulative_sum, and my goal is to check whether each value in the cumSum column is less than a randomly generated number. If it is, I want to put the corresponding value of column _1 in the new column Selection2. However, when I run this code, I get this error:

SyntaxError: invalid syntax 

How can I fix this?

  • I’m sure you can do that without a UDF, which would be more efficient. Can you add an example input dataset and one showing the result you want?

2 answers

0

The conditional expression needs its else branch on the same line. It should look like this:

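# Pass both columns to the UDF: inside the lambda, cumulative_sum._1 is a Column, not a row value
selection = F.udf(lambda v, x: float(v) if 100*random.random() <= x else 0.0, FloatType())
cumulative_sum.withColumn('Selection2', selection(cumulative_sum._1, cumulative_sum.cumSum))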
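
To see why the lambda in the question fails before Spark is even involved, here is a minimal plain-Python sketch. The helper name select_value and the sample pairs are hypothetical, made up only for illustration; the rule itself (a draw in [0, 100) compared against cumSum) is taken from the question.

```python
import random

# A conditional expression needs both branches: `A if cond else B`.
# Leaving out the `else` part is what raises the SyntaxError in the question.
try:
    compile("lambda x: 1.0 if x > 0", "<example>", "eval")
    missing_else_compiles = True
except SyntaxError:
    missing_else_compiles = False


def select_value(value, cum_sum, rng=random):
    """Return value when a random draw in [0, 100) is <= cum_sum, else 0.0."""
    return float(value) if 100 * rng.random() <= cum_sum else 0.0


rng = random.Random(0)  # seeded only so the example is repeatable
# (value, cum_sum) pairs chosen so the outcome does not depend on the draw:
# cum_sum = 100.0 always passes the check, cum_sum = -1.0 never does.
picked = [select_value(v, c, rng) for v, c in [(10, 100.0), (20, -1.0)]]
```

A function shaped like this takes both row values as arguments, which is what a UDF over two columns would need as well.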

0

I’m not sure I understood the question correctly; I haven’t spoken Portuguese in a long time. Is this what you want to do?

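from pyspark.sql import Window

# F.lag is a window function and needs .over(...); ordering by cumSum is an assumption
w = Window.orderBy('cumSum')
df.withColumn('selection2',
               F.when(F.lag('cumSum').over(w) < F.col('cumSum'), F.col('cumSum'))
               .otherwise(0.0)  # or .otherwise(None)
              )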
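
As the comment under the question suggests, the random selection can probably be written without a UDF. The sketch below is one possible way, using the built-in F.rand and F.when. The function name add_selection and its parameters are hypothetical, and it assumes _1 is a numeric column, which the question does not confirm.

```python
def add_selection(df, value_col="_1", cum_col="cumSum", seed=None):
    """Add a Selection2 column using built-in column functions instead of a UDF.

    Assumes df is a PySpark DataFrame with a numeric value_col and cum_col.
    """
    # Imported inside the function so this module loads even without Spark installed.
    from pyspark.sql import functions as F

    # One uniform draw in [0, 100) per row, generated on the Spark side.
    draw = F.rand(seed) * 100

    # Keep the value when the draw is <= cumSum, otherwise fall back to 0.0.
    selected = F.when(draw <= F.col(cum_col), F.col(value_col)).otherwise(F.lit(0.0))
    return df.withColumn("Selection2", selected)
```

It would be called as add_selection(cumulative_sum). Because the comparison runs as a native column expression, Spark does not have to send each row through a Python function.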
