1

I am trying to perform selfjoin using DataFrame Scala API. Here are my code snippets; Can you please tell me what's wrong with the first solution?

val df= sqlc.read.json("empMgr.json");

empMgr.json

{"ID":101,"ename":"Peter","sal":24.24,"dept":"11","country":"US","doj":"1/12/2017","mgr":201} {"ID":201,"ename":"John","sal":1300,"dept":"232","country":"IN","doj":"4/22/2016","mgr":111} {"ID":301,"ename":"Sam","dept":"22","country":"KR","doj":"5/22/2015","mgr":201}

// 1. following is not working var df_right=df; df.join(df_right, df("mgr") === df_right("ID")).show() df.join(df, df("mgr") === df("ID")).show() /* * Output: * +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+ | ID|country|dept|doj|ename|mgr|sal| ID|country|dept|doj|ename|mgr|sal| +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+ +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+ * */ //2. following works fine df_right= sqlc.read.json("file:///opt/data/empMgr.json"); df.join(df_right, df("mgr") === df_right("ID")).show() /* *Output: * +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ | ID|country|dept| doj|ename|mgr| sal| ID|country|dept| doj|ename|mgr| sal| +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ |101| US| 11|1/12/2017|Peter|201|24.24|201| IN| 232|4/22/2016| John|111|1300.0| |301| KR| 22|5/22/2015| Sam|201| null|201| IN| 232|4/22/2016| John|111|1300.0| +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ * */ //3. following works fine df.registerTempTable("empMgr") sqlc.sql("select b.ename, a.ename as mgr,b.mgr from empMgr a join empMgr b on a.ID=b.mgr").show(); /* * output * +-----+----+---+ |ename| mgr|mgr| +-----+----+---+ |Peter|John|201| | Sam|John|201| +-----+----+---+ * */ 
1
  • what is your question? Am I mistaken or there is a extra line in point 1 that is not supposed to be there? Please clarify. Commented Oct 12, 2016 at 7:03

1 Answer 1

2

Use Dataframe's as() method to remove ambiguity when referencing similar names.

df.as("a").join(df.as("b"), $"a.mgr" === $"b.ID").show +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ | ID|country|dept| doj|ename|mgr| sal| ID|country|dept| doj|ename|mgr| sal| +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ |101| US| 11|1/12/2017|Peter|201|24.24|201| IN| 232|4/22/2016| John|111|1300.0| |301| KR| 22|5/22/2015| Sam|201| null|201| IN| 232|4/22/2016| John|111|1300.0| +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+ 
Sign up to request clarification or add additional context in comments.

3 Comments

I could not test this as I am getting below error. **'value $ is not a member of StringContext' ** Here are my maven dependencies ` <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.6.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.10</artifactId> <version>1.6.0</version> </dependency> </dependencies>`
Try importing the implicits and see if that fixes it: import sqlContext.implicits._
Thanks, It worked. Up voted. 1. Can you also tell me issue with solution#1? is it in-correct syntax? 2.Can someone also provide me the Java fix?

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.