I have approached it in the following way:

    $parser = new Parser();
    $pdf = $parser->parseFile($_FILES['pdf']['tmp_name']);
    $text = $pdf->getText();

    // Normalize NBSP → plain space
    $text = preg_replace('/\x{00A0}/u', ' ', $text);
    $text = preg_replace('/\R+(?=\s*Weekly\s+Totals)/iu', ' ', $text);

    // Split into lines
    $lines = preg_split('/\R/u', $text);
    $results = [];
    $currentType = null;

    foreach ($lines as $line) {
        $line = trim($line);

and it detects names with this check:

        // Only lines with "Weekly Totals" matter now
        if (stripos($line, 'Weekly Totals') === false) {
            continue;
        }

        // Match: Name Weekly Totals <left> $<cost> <actual>
        if (preg_match('/^(.+?)\s+Weekly\s+Totals\s+\d+\.\d{2}\s+\$[\d,]+\.\d{2}\s+(\d+\.\d{2})$/i', $line, $m)) {
            $name = ucwords(strtolower($m[1]));

But in my PDF some names are long and get split across two lines within the same column. For example, "Josh Brook Silvester Damien Junior Weekly Totals" appears in the PDF as:

Josh Brook Silvester Damien 34.90 $259. 32.00

Junior Weekly Totals

Hence, the name is not detected in the results. The whole person's details are ignored and the next person in the PDF is retrieved (left 34.90, cost 259, actual 32).
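A minimal reproduction of the problem, assuming the extracted text looks like the two lines above. The `$259.00` cost and the second person ("Amy Lee") are made up for illustration; the two regexes are the ones from my code.

```php
<?php
// Hypothetical extracted text modeled on the sample above:
// the numbers sit on the first line, the rest of the name on the second.
$text = "Josh Brook Silvester Damien 34.90 \$259.00 32.00\n"
      . "Junior Weekly Totals\n"
      . "Amy Lee Weekly Totals 12.50 \$100.00 10.00";

// Same preprocessing as in my code: only a line break that sits
// directly before "Weekly Totals" is joined, so nothing changes here.
$text = preg_replace('/\R+(?=\s*Weekly\s+Totals)/iu', ' ', $text);

$matched = [];
$unmatched = [];
foreach (preg_split('/\R/u', $text) as $line) {
    $line = trim($line);
    if (stripos($line, 'Weekly Totals') === false) {
        continue; // the line holding the numbers is skipped here
    }
    if (preg_match('/^(.+?)\s+Weekly\s+Totals\s+\d+\.\d{2}\s+\$[\d,]+\.\d{2}\s+(\d+\.\d{2})$/i', $line, $m)) {
        $matched[ucwords(strtolower($m[1]))] = $m[2];
    } else {
        $unmatched[] = $line; // "Junior Weekly Totals" has no numbers after it
    }
}

print_r($matched);
print_r($unmatched);
```

With this input only the single-line name ends up in `$matched`; the "Junior Weekly Totals" line lands in `$unmatched`, and the line carrying the numbers is never examined.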

How can I approach this situation? I tried adjusting the regular expression in different ways, but nothing worked.

The issue only affects names that are split across two lines. All other names, which fit on a single line together with 'Weekly', are detected.

  • Just for my own clarification, you are trying to read a PDF and parse it out, right? Commented Apr 22 at 18:53
  • It would be clearer if you used code blocks for the sample "Josh" data. Perhaps provide some more sample data with examples of a few other lines that are affected and a few that are not; some common features may emerge. Also, consider not trimming, as that may hide some useful details. Commented Apr 22 at 19:19
  • @chly when I use that one it does not retrieve any data. Commented Apr 24 at 6:56
  • @chly From what I can see you mainly answer regex questions, and you're probably using a tool for that. Answers like this one or this one seem to contain large sections that are likely copied from this tool? There's a fine line in how much can be copied without attribution. Commented Apr 24 at 10:08
  • @chly That being said, I see no real harm in this. Regexes are nasty things that need a lot of detailed explaining, and using a tool for that makes perfect sense. You could probably reference it? Something like: "I used RegEx101 to create this explanation." That can also be useful for readers. Commented Apr 24 at 10:11

1 Answer

This pattern captures the name(s), followed by an integer or float, then a dollar amount ($ plus an integer or float), then another integer or float, into named capture groups.

ASSUMPTIONS:

  • The names field consists of 1 or more names.
  • Each name is followed by 0-2 spaces, optionally followed by a newline.
  • A name may contain alphanumeric characters or a literal _, @, #, . or - character.
  • A name cannot be Weekly Totals.
  • A name cannot be a float, e.g. 22.23.

REGEX PATTERN (PCRE2 Flavor. Flags: gm)

^(?!Weekly Totals)(?!\d+\.\d+?\s)(?<names>[\w@#.-]+(?:[ ]{0,2}[ \n](?!Weekly Totals)(?!\d+\.\d+?\s)[\w@#.-]+?)*)[ ]{0,2}\n?(Weekly\s{0,2}Totals)\s+(?<number1>\d+\.?(?:\d{2})?)\s+(?<number2>\$[\d,]+\.?(?:\d{2})?)\s+(?<number3>\d+\.?(?:\d{2})?)[ ]*(?:\n) 

Regex demo: https://regex101.com/r/UzQDnj/9
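A minimal sketch of how the pattern might be applied in PHP with `preg_match_all()`. The sample text, the result keys (`name`, `left`, `cost`, `actual`) and the normalization steps are assumptions for illustration, not taken from the real PDF. The sketch also assumes the layout used in the test string, where the numbers follow `Weekly Totals`.

```php
<?php
// The pattern from above, wrapped in "/" delimiters with the m flag.
// preg_match_all() finds every match, so regex101's g flag is not needed.
$pattern = '/^(?!Weekly Totals)(?!\d+\.\d+?\s)(?<names>[\w@#.-]+(?:[ ]{0,2}[ \n](?!Weekly Totals)(?!\d+\.\d+?\s)[\w@#.-]+?)*)[ ]{0,2}\n?(Weekly\s{0,2}Totals)\s+(?<number1>\d+\.?(?:\d{2})?)\s+(?<number2>\$[\d,]+\.?(?:\d{2})?)\s+(?<number3>\d+\.?(?:\d{2})?)[ ]*(?:\n)/m';

// Hypothetical extracted text: the first name wraps onto a second line.
$text = "Josh Brook Silvester Damien\nJunior Weekly Totals 34.90 \$259.00 32.00\n"
      . "Josh Brook Weekly Totals 12.50 \$1,200.00 40.00";

// Assumed normalization: NBSP to space, any line break to \n, and a
// trailing \n, because the pattern requires a newline after number3.
$text = preg_replace('/\x{00A0}/u', ' ', $text);
$text = preg_replace('/\R/u', "\n", $text);
$text = rtrim($text, "\n") . "\n";

$results = [];
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    $results[] = [
        // Collapse the embedded line break in a wrapped name to one space.
        'name'   => ucwords(strtolower(preg_replace('/\s+/', ' ', $m['names']))),
        'left'   => (float) $m['number1'],
        'cost'   => (float) str_replace(['$', ','], '', $m['number2']),
        'actual' => (float) $m['number3'],
    ];
}

print_r($results);
```

Because the pattern runs over the whole text rather than line by line, a name that wraps before `Weekly Totals` is captured in one piece; the captured `names` group still contains the line break, which is why it is collapsed afterwards.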

TEST STRING:

Josh Brook Silvester Damien 34.90 $259.99 55 Josh Brook Silvester Damien 34.90 Junior Weekly Totals Josh Brook Silvester Damien Junior Weekly Totals 34.90 $259. 32.00 22.11 22 Josh Brook Silvester Damien Junior Weekly Totals 34.90 $259. 32.00 Josh Brook Weekly Totals 34.90 $259. 32 Josh Brook Silvester Lucky7 Damien Junior Weekly Totals 34.90 $259. 32.00 Lucky7 Weekly Totals 34.90 $259. 32.00 555Wildcard-Willy Weekly Totals 34.90 $259. 32.00 #Harry Weekly Totals 34.90 $259. 32.00 @Max Weekly Totals 34. $59 0 Junior Weekly Totals Josh Weekly Totals Brook Silvester Weekly Totals Damien 34.90 $259.99 55 Josh Weekly Brook Totals Week Total Silvester Weekly Totals 34.90 $259.99 55 Josh Brook Silvester Damien 34.90 

MATCHES / GROUPS:

| names | number1 | number2 | number3 |
|---|---|---|---|
| Josh Brook Silvester Damien Junior | 34.90 | $259. | 32.00 |
| 22 Josh Brook Silvester Damien Junior | 34.90 | $259. | 32.00 |
| Josh Brook | 34.90 | $259. | 32 |
| Josh Brook Silvester Lucky7 Damien Junior | 34.90 | $259. | 32.00 |
| Lucky7 | 34.90 | $259. | 32.00 |
| 555Wildcard-Willy | 34.90 | $259. | 32.00 |
| #Harry | 34.90 | $259. | 32.00 |
| @Max | 34. | $59 | 0 |
| Josh Weekly Brook Totals Week Total Silvester | 34.90 | $259.99 | 55 |