6

My requirement is to retrieve the order number from the comment column which is in a column comment and always starts with R. The order number should be added as a new column to the table.

Input data:

code,id,mode,location,status,comment AS-SD,101,Airways,hyderabad,D,order got delayed R1657 FY-YT,102,Airways,Delhi,ND,R7856 package damaged TY-OP,103,Airways,Pune,D,Order number R5463 not received 

Expected output:

AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657 FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856 TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463 

I have tried it in spark-sql, the query I am using is given below:

val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A") 

However, I'm getting the following error:

org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0 
0

2 Answers 2

12

You can use regexp_extract which has the definition :

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column 

(R\\d{4}) means R followed by 4 digits. You can easily accommodate any other case by using a valid regex

df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show +-----+---+-------+---------+------+--------------------+-------+ | code| id| mode| location|status| comment|orderId| +-----+---+-------+---------+------+--------------------+-------+ |AS-SD|101|Airways|hyderabad| D|order got delayed...| R1657| |FY-YT|102|Airways| Delhi| ND|R7856 package dam...| R7856| |TY-OP|103|Airways| Pune| D|Order number R546...| R5463| +-----+---+-------+---------+------+--------------------+-------+ 
Sign up to request clarification or add additional context in comments.

Comments

3

You can use a udf function as following

import org.apache.spark.sql.functions._ def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head) df.withColumn("newColumn", extractString($"comment")).show(false) 

where the comment column is splitted with space and filtering the words that starts with R. head will take the first word that was filtered starting with R.

Updated

To ensure that the returned string is order number starting with R and rest of the strings are digits, you can add additional filter

import scala.util.Try def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head) 

You can edit the filter according to your need.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.