Saturday 3 March 2018

Conditional Join in Spark using DataFrames

Let's see how we can add conditions to a DataFrame join in Spark.
Say we have two DataFrames, dataFrame1 and dataFrame2:
val dataFrame1 = hc.sql("select * from tbl1") //id,name,code
val dataFrame2 = hc.sql("select * from tbl2") //id,name,code
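(If you don't have those Hive tables handy, here is a minimal sketch that builds two equivalent DataFrames in memory; the SparkSession named spark and the sample rows are just assumptions for illustration.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ConditionalJoinExample")
  .master("local[*]")   // local mode, only for trying this out
  .getOrCreate()

import spark.implicits._

// Stand-ins for tbl1 and tbl2, both with columns id, name, code
val dataFrame1 = Seq((1, "alpha", "A"), (2, "beta", "B"), (3, "gamma", "C"))
  .toDF("id", "name", "code")
val dataFrame2 = Seq((1, "alpha", "A"), (2, "beta", "X"), (4, "delta", "D"))
  .toDF("id", "name", "code")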

We need to join these two DataFrames on different columns depending on a condition.

We have a decision flag coming in with a true/false value.
If the decision flag is true, the join condition should use both the id and code columns; otherwise it should use only the id column.

So how can we achieve this in Scala?
val decision: Boolean = false

Let's build the join expression:
val exprs = if (!decision) {
  dataFrame1.col("id").equalTo(dataFrame2.col("id"))
} else {
  dataFrame1.col("id").equalTo(dataFrame2.col("id"))
    .and(dataFrame1.col("code").equalTo(dataFrame2.col("code")))
}

and then join:
dataFrame1.join(dataFrame2, exprs).show
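Equivalently, the same condition can be written with the === and && column operators, which many find more readable; you can also pass a join type as the third argument if you don't want the default inner join. A sketch of the same logic (exprs2 and the "left_outer" choice are just for illustration):

val exprs2 = if (!decision) {
  dataFrame1("id") === dataFrame2("id")
} else {
  dataFrame1("id") === dataFrame2("id") && dataFrame1("code") === dataFrame2("code")
}

// Same conditional join, here as a left outer join instead of the default inner join
dataFrame1.join(dataFrame2, exprs2, "left_outer").show()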

This is how you join two DataFrames in Spark with a conditional join expression.