How to extract number from string column?

Question

My requirement is to retrieve the order number from the comment column which is in a column comment and always starts with R. The order number should be added as a new column to the table.

Input data:

code,id,mode,location,status,comment AS-SD,101,Airways,hyderabad,D,order got delayed R1657 FY-YT,102,Airways,Delhi,ND,R7856 package damaged TY-OP,103,Airways,Pune,D,Order number R5463 not received

Expected output:

AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657 FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856 TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463

I have tried it in spark-sql, the query I am using is given below:

val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")

However, I'm getting the following error:

org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0

philantrovert · Accepted Answer · 2017-12-13 08:44:14Z

You can use regexp_extract which has the definition :

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column

(R\\d{4}) means R followed by 4 digits. You can easily accommodate any other case by using a valid regex

df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show +-----+---+-------+---------+------+--------------------+-------+ | code| id| mode| location|status| comment|orderId| +-----+---+-------+---------+------+--------------------+-------+ |AS-SD|101|Airways|hyderabad| D|order got delayed...| R1657| |FY-YT|102|Airways| Delhi| ND|R7856 package dam...| R7856| |TY-OP|103|Airways| Pune| D|Order number R546...| R5463| +-----+---+-------+---------+------+--------------------+-------+

Anahcolus · Accepted Answer · 2017-12-13 09:43:15Z

You can use a udf function as following

import org.apache.spark.sql.functions._ def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head) df.withColumn("newColumn", extractString($"comment")).show(false)

where the comment column is splitted with space and filtering the words that starts with R. head will take the first word that was filtered starting with R.

Updated

To ensure that the returned string is order number starting with R and rest of the strings are digits, you can add additional filter

import scala.util.Try def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)

You can edit the filter according to your need.

Collectives™ on Stack Overflow

How to extract number from string column?

2 Answers 2

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Linked

Related