How to format a string based on a regular expression?

Question

I'm writing a PHP application that gets data from an API (let's call it A) and writes to another one (I'll call it B). I'm struggling with a specific field: postal code.

API A returns all postal codes as a 7-digit string, without any separator. If a postal code has fewer than 7 digits, it pads the value with zeros on the left. This way, 50-224 (a postal code from Poland) becomes 0050224. I have no control over this output, and it's probably stored that way. I know that's a Polish postal code because the response also gives me the country code, PL.

The issue is that API B validates the postal code and requires the right format.

I found a PHP library on GitHub that has a regular expression with the postal code format for each country. Like this: resources/address_format/PL.json.

What I want to do is use the expression provided by that lib to format the value returned by A.

My current code looks like this:

    use CommerceGuys\Addressing\Repository\AddressFormatRepository;

    $country = 'US';
    $postalcode = '0031401';

    $repo = new AddressFormatRepository();
    $pattern = $repo
        ->get($country)
        ->getPostalCodePattern()
    ;

    $postalcode = preg_replace(
        '/^.*(' . $pattern . ')$/',
        '$1',
        $postalcode
    );

For the case above, a U.S. ZIP code, it works fine because the second part of the code is optional in the expression: (\d{5})(?:[ \-](\d{4}))?. I started to have problems when other countries showed up, specifically where the postal code has other characters than letters and numbers.
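To make the failure concrete, here is a minimal sketch of the same preg_replace call with two patterns hard-coded (the US one quoted above and the Polish one, \d{2}-\d{3}), so it runs without the library. The helper name extractWithPattern is hypothetical, and `~` is used as the delimiter only as a precaution.

```php
<?php
// Hypothetical helper wrapping the preg_replace call from the question.
// '~' is the delimiter so that a '/' inside a pattern would not break the regex.
function extractWithPattern(string $pattern, string $padded): string
{
    return preg_replace('~^.*(' . $pattern . ')$~', '$1', $padded);
}

// Patterns hard-coded for illustration instead of being read from the library.
$us = '(\d{5})(?:[ \-](\d{4}))?';
$pl = '\d{2}-\d{3}';

echo extractWithPattern($us, '0031401'), "\n"; // 31401
echo extractWithPattern($pl, '0050224'), "\n"; // 0050224 (unchanged)
```

The Polish pattern requires a literal hyphen that the padded value does not contain, so nothing matches and preg_replace returns the input untouched.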

BTW, I've looked through several questions here on S.O., however, none of them seems to ask for what I'm trying to achieve.

UPDATE

Despite the Polish example above, my code should work for any country. I just wanted to provide some background on what I'm trying to do. As I've stated in the question title, I'm hoping to take advantage of the regular expression from the addressing lib.

A couple more examples, from other countries:

Country | Postal code
--------+------------
PH      | 0002010
LB      | 0001201
JO      | 0000962

In brief: you cannot use matching patterns to format postal codes. API A is wrong, don't use it (even if you have to, don't use it).
– Iłya Bursov, Commented Apr 29, 2016 at 20:42
Could you confirm that this Polish zip code (if it exists), 50-1, will be stored by API A this way: 0050001?
– Casimir et Hippolyte, Commented Apr 29, 2016 at 20:51
@CasimiretHippolyte Unfortunately I can't add new data to API A. But I think it'll be stored as 0000501. Anyway, your example doesn't seem to be a valid postal code. Just to make it clearer, my code should work for any country.
– Gustavo Straube, Commented Apr 29, 2016 at 21:04
You can't use these patterns. You need to write your own patterns for each possible format.
– Casimir et Hippolyte, Commented Apr 30, 2016 at 21:20
Regular expressions are used to validate a string, not to format it. One is dealing with input, the other with output. Your library is supplying the first argument to preg_match but you will have to provide the second. Your first API seems very lacking. What will it do with a Brazilian postal code, or a US ZIP+4?
– miken32, Commented May 2, 2016 at 20:51

Khalid Habib · Accepted Answer · 2016-05-04 13:11:36Z


    /* Try this out to format your postal code */
    /* preg_replace(pattern, replacement, subject) */
    $result = preg_replace('/(\d{3})(\d{3})$/', '$1-$2', '0050224'); // "0050-224"
    echo substr($result, 2); // Output: 50-224

Click the given link for more info about preg_replace

edited May 4, 2016 at 13:11

answered May 4, 2016 at 6:35


2 Comments

Gustavo Straube Over a year ago

Hey, thanks for your answer. Unfortunately, I didn't want to write new expressions. Using the expressions from the lib I mentioned would be great. Or any other source that I can use without writing a new regex for each country. For the Poland case, I have the expression \d{2}-\d{3}, for example.

Gustavo Straube Over a year ago

BTW, your regex should be /(\d{3})(\d{3})$/ to give 050-224 as result.

bishop · Accepted Answer · 2016-05-09 13:07:47Z

You can generate all possible combinations from a regular expression. Faker does it, for example, with its regexify formatter.

The problem is that the valid postal codes are a subset of the possible matches. For example, the US 5 digit ZIP code regex (\d{5}) produces 100,000 candidates, but there are only (approximately) 43,000 5 digit ZIP codes.

This, to me, sounds like a classic case of GIGO - Garbage In, Garbage Out. You are given a denormalized data point, and asked to normalize it from first principles. This is hard. And sometimes impossible.

If I were you, I'd start from a simple list of formats, like this one (or this one if the original is offline) based on the United Nations list. Then pull one character at a time from your input, in reverse, and match it. Let's take an example.

API A tells you that 0001201 is Liberia. From the list, you see that Liberia's format is 9999. Reverse both of those strings: 1021000 and 9999 respectively. Now walk the format one character at a time, matching. First character from format is a 9, which is a digit placeholder. Is the first character from the reversed input a digit? Yes: 1, remember that. Ok, second character. 9 and 0, the zero matches so remember it. Repeat until we run out of format or input, or we hit a non-match on format.

In this example, we'll run out of format digits before input digits and we won't hit an error, finding that the reversed input 1021 matches the reversed format 9999. So we're done. Now do a final reverse on the match: 1021 becomes 1201, which is a valid Liberian postal code.
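The walk described above could be sketched roughly as follows. Reading both strings from the right is equivalent to reversing them first. The function name formatFromRight and the format alphabet are assumptions for illustration: `9` stands for a digit, `A` for a letter, and any other character is treated as a literal separator. The actual notation in the format list may differ.

```php
<?php
// Sketch of the right-to-left walk. Assumed format alphabet:
// '9' = digit placeholder, 'A' = letter placeholder, anything else = literal separator.
function formatFromRight(string $format, string $padded): ?string
{
    $out = '';
    $i = strlen($padded) - 1;             // read position in the input, from the right

    for ($f = strlen($format) - 1; $f >= 0; $f--) {
        $slot = $format[$f];
        if ($slot !== '9' && $slot !== 'A') {
            $out = $slot . $out;          // separator: copy it, consume no input
            continue;
        }
        if ($i < 0) {
            return null;                  // ran out of input before the format ended
        }
        $char = $padded[$i--];
        $fits = $slot === '9' ? ctype_digit($char) : ctype_alpha($char);
        if (!$fits) {
            return null;                  // non-match on the format
        }
        $out = $char . $out;
    }
    return $out;
}

echo formatFromRight('9999', '0001201'), "\n";   // 1201
echo formatFromRight('99-999', '0050224'), "\n"; // 50-224
```

A null return means the input cannot satisfy the format, which is the "hit a non-match" case in the description.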

I tried to access that link with the list of formats but it's not working right now. I liked this way to handle the problem, however, I don't want to create a list of formats by hand, unless it's the last and only option. So I gave the bounty to Solarflare's answer because it uses a source that is already compiled.
@GustavoStraube Totally, don't do by hand what can be done by machine. Here's the link from web.archive.org, with all the postal codes ready to use.

Solarflare · Accepted Answer · 2016-05-07 13:57:07Z

As others pointed out, there is no general way to get the original text from a regular expression, since there are usually a lot of possibilities.

However, since you have the digits of the "original text", you can recreate the text when those digits are the only information missing from the pattern. E.g., in your Polish example \d{2}-\d{3}, you can replace \d{2} and \d{3} in the pattern with 2 and 3 digits of your postal code from API A, and the pattern will give you the additional "-".

Examples for cases you can't reconstruct:

SO: "[A-Z]{2}[ ]?\d{5}" because you don't get letters from API A, so you can't reconstruct them.
BR: "\d{5}[\-]?\d{3}" because you don't get 8 digits from API A.
Anything with optional parts, because it is not defined which of the options is the right one. There might be several valid solutions that depend on special conditions (e.g. you have to use the additional 3 digits in \d{4}(-\d{3})? for cities with more than 10000 houses or something like that, or you have to use the - in \d{2}[-]?\d{2} only for the capital of the state, or maybe you can just use it as you like). That includes terms like \d{1,4}, since the length might depend on other values. And you might get problems if leading 0's are allowed in your code: for an input 0000001, any of 1, 01, 001 and 0001 might be the correct solution for \d{1,4} (though I would assume that leading 0's in practice will only happen with a fixed length); and for \d{4}(-\d{3})?, 0001002 might mean 0001-002 (large city) or 1002 (small city).

The usual way to get the correct postal code in these (and, to be honest, in all) cases would be to look it up in a database by city and street name. (You can buy access to such databases from your local postal service, or create a database from e.g. OpenStreetMap data.)

Having said that, here is some example code that will reconstruct codes that are only missing a fixed number of digits, e.g. PL (\d{2}-\d{3}). It will also work for patterns like FK ("FIQQ 1ZZ"), provided the code from A is "0000001". I assume it will work for about 50%-60% of countries.

    use CommerceGuys\Addressing\Repository\AddressFormatRepository;

    $country = 'PL';
    $postalcodeA = '0031401';

    $repo = new AddressFormatRepository();
    $pattern = $repo
        ->get($country)
        ->getPostalCodePattern()
    ;

    $ok = 1;
    $pospattern = 0;   // characters of the pattern consumed so far, counted from the right
    $posA = 0;         // digits of A's code consumed so far, counted from the right
    $postalcodeB = '';
    while ( ($pospattern < strlen($pattern)) and ($ok==1) ) {
        $pospattern += 1;
        $charact = substr($pattern, -$pospattern,1);
        if (strcmp($charact,'}') == 0) {
            // Only a single-digit count is supported: \d{n}
            if (strcmp(substr($pattern, -$pospattern - 4, 3),'\d{') == 0) {
                $cnt = substr($pattern, -$pospattern - 1,1);
                $postalcodeB = substr($postalcodeA, -$posA - $cnt, $cnt) . $postalcodeB;
                $posA += $cnt;
                $pospattern += 4;
            } else {
                $ok = 0;
            }
        } elseif ( ctype_digit($charact) ) {
            // A literal digit in the pattern must match the digit from A
            if ( strcmp($charact,substr($postalcodeA,-$posA-1,1)) !== 0) {
                $ok = 0;
            }
            $postalcodeB = $charact . $postalcodeB;
            $posA += 1;
        } elseif ( preg_match('/[\(\)\[\]\{\}\$\?\\\]/', $charact) ) {
            // Any other regex syntax is not supported
            $ok = 0;
        } else {
            // Literal separator or letter: copy it to the result
            $postalcodeB = $charact . $postalcodeB;
        }
    }
    # USE WITH CARE! READ INFO!
    # if ($ok == 0) {
    #     $postalcodeB = preg_replace(
    #         '/^.*(' . $pattern . ')$/',
    #         '$1',
    #         $postalcodeA
    #     );
    #     if (strcmp($postalcodeA,$postalcodeB) !== 0) {
    #         $ok = 1;
    #     }
    #}
    // Final verification: the result has to fit the library's pattern
    if (!preg_match('/^' . $pattern . '$/', $postalcodeB)) {
        $ok = 0;
    }
    if (!$ok) {
        echo "Pattern ",$pattern," not supported or no match to ",$postalcodeA,"\r\n";
    } else {
        echo "Pattern ",$pattern," ok: ",$postalcodeA," -> ",$postalcodeB,"\r\n";
    }

It will replace every occurrence of \d{n} in the pattern with n digits, starting at the end of the string. In case it doesn't understand the pattern (e.g. because it has optional parts), you might want to try preg_replace. I wouldn't use it (and commented it out) because it can give you unpredictable and wrong results (see the Boston City Hall example below), but I added it in case you want to use it, e.g. because you can make sure the client for API A will never allow a zip+4 code to be entered. As a last step it will verify that the result fits the pattern.

You can easily add support for \d (a single digit).
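One possible way to sketch that extension is to tokenize the pattern first, instead of walking it character by character as the code above does. This is a different technique with the same idea. The function name fillFixedPattern is hypothetical. It understands only \d{n}, a bare \d, and literal characters, and returns null for everything else.

```php
<?php
// Sketch: rebuild a code from a pattern made only of \d{n}, \d and literal characters.
// Returns null for any pattern or input it does not understand.
function fillFixedPattern(string $pattern, string $padded): ?string
{
    // Tokenize into digit groups (\d{n} or a bare \d) and single literal characters.
    preg_match_all('~\\\\d(?:\{(\d+)\})?|(.)~', $pattern, $tokens, PREG_SET_ORDER);

    $out = '';
    $rest = $padded;                              // digits from A not consumed yet
    foreach (array_reverse($tokens) as $t) {      // work from the right, like the loop above
        if (isset($t[2])) {                       // literal character
            $lit = $t[2];
            if (strpbrk($lit, '()[]{}$?|*+.^\\') !== false) {
                return null;                      // regex syntax this sketch does not handle
            }
            if (ctype_digit($lit)) {              // a literal digit must match the input
                if (substr($rest, -1) !== $lit) {
                    return null;
                }
                $rest = substr($rest, 0, -1);
            }
            $out = $lit . $out;
            continue;
        }
        $n = (int) ($t[1] ?? 1);                  // a bare \d means exactly one digit
        if ($n < 1 || $n > strlen($rest)) {
            return null;
        }
        $out = substr($rest, -$n) . $out;
        $rest = substr($rest, 0, -$n);
    }
    // Final verification against the original pattern.
    return preg_match('~^(?:' . $pattern . ')$~', $out) === 1 ? $out : null;
}

echo fillFixedPattern('\d{2}-\d{3}', '0050224'), "\n";    // 50-224
echo fillFixedPattern('\d{3}-\d{3}-\d', '1234567'), "\n"; // 123-456-7
```

Patterns with optional or variable-length parts, such as \d{4}(-\d{3})?, fall through to null here, so the caller can decide what to do with them.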

You can try to add support for terms like \d{1,4} by e.g. checking how many digits API A has and doesn't use in other terms, and using the remaining digits. For example, \d{2}-\d{1,4} with an input 0001245 has 4 digits and uses 2 for the first term \d{2}, so it has 2 digits left for \d{1,4}. But keep in mind the things I wrote above: you might get wrong results if zero is an allowed digit at the beginning, e.g. 00-1245, 01-245 or 12-45 might all be valid results (in this case, you cannot recover the code without looking up the city name in a database). And you will get in trouble for \d{1,2}-\d{2,3}.

You should add a final check to see if the number of digits fits the digits in A (e.g. you might want to concatenate all the digits in the result and check if this string, padded with zeros, is the code given by A). That will protect you from some misinterpretations caused by e.g. preg_replace or \d{1,2} or other optional stuff. For example, someone entered the US zip+4 code for Boston City Hall, which is 02201-1020. Your API A will give you 0220110, or, worse, 2011020, and preg_replace will give you 20110 or 11020, both of which are completely wrong (02201 might be an acceptable compromise, but you will have problems generating this result).
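That final check might look something like the sketch below. The function name digitsMatchSource is hypothetical, and the width is taken from the length of A's value (7 in the question).

```php
<?php
// Sketch of the digit check: strip everything but digits from the formatted code,
// left-pad with zeros to the width of A's value, and compare with A's value.
function digitsMatchSource(string $formatted, string $padded): bool
{
    $digits = preg_replace('/\D/', '', $formatted);
    return str_pad($digits, strlen($padded), '0', STR_PAD_LEFT) === $padded;
}

var_dump(digitsMatchSource('50-224', '0050224')); // bool(true)
var_dump(digitsMatchSource('20110', '0220110'));  // bool(false): truncated zip+4
var_dump(digitsMatchSource('11020', '2011020'));  // bool(false)
```

Both Boston City Hall results are rejected because some non-zero digits from A are left unused.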

You can then let it run once for every country with a random code and then check for patterns that don't work. Some of these will fail only because the code does not fit (e.g. FK will only work if the input is 0000001, which is not usually the case for random input).

If you are lucky, you don't need these countries.

Or, as a last resort, you might be able to rewrite some of the remaining errors, but it will require some manual work:

Some of the patterns will contain optional parts, e.g. \d{2}[-]?\d{2}. For these cases you can check if the - depends e.g. on some of the digits or the city name, or if it is really optional. If it is really optional, you have to decide if you want the - or not and then save that as the new pattern, e.g. \d{2}-\d{2}. But in most cases you can't do a general replacement, e.g. for the US you might decide to leave out the +4, but you still wouldn't be able to get the correct result if the customer entered the (correct) zip+4 code for Boston City Hall (see the example above).

For other patterns there might be several allowed possibilities, e.g. \d{4}|A-\d{3}. For these cases you might be able to create 2 patterns, e.g. \d{4} and A-\d{3}. You can do the same for e.g. \d{2}(-\d{2})? and manually generate the two patterns \d{2} and \d{2}-\d{2}. You then have to test all these patterns for a country (put the whole thing in a while-loop and execute it for every sub-pattern) and take the first that fits. A pattern fits if it uses all given digits from A and passes the final pattern test. Though this will, again, usually fail if leading zeros are allowed: input 0000123 might mean 0123 or A-123, so you might have to check other resources to see if zeros are allowed (and a problem similar to the Boston City Hall one might still occur). But this way you might be able to reconstruct some more countries.
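A rough sketch of that "first sub-pattern that fits" loop follows. Both function names are hypothetical, and the sub-pattern lists are assumed to have been split by hand as described. The simplified filler handles only \d{n} terms plus literals, and treats "uses all given digits" as "whatever is left over on the left is zero padding only".

```php
<?php
// Fill a sub-pattern consisting of \d{n} terms and literal characters.
function fillDigits(string $pattern, string $padded): ?string
{
    preg_match_all('~\\\\d\{(\d+)\}~', $pattern, $m);
    $need = (int) array_sum($m[1]);               // total digits the pattern asks for
    if ($need > strlen($padded)) {
        return null;
    }
    $digits = $need > 0 ? substr($padded, -$need) : '';
    $unused = substr($padded, 0, strlen($padded) - $need);
    if (trim($unused, '0') !== '') {
        return null;                              // real digits would be thrown away
    }
    $pos = 0;
    $out = preg_replace_callback(
        '~\\\\d\{(\d+)\}~',
        function (array $g) use ($digits, &$pos): string {
            $chunk = substr($digits, $pos, (int) $g[1]);
            $pos += (int) $g[1];
            return $chunk;
        },
        $pattern
    );
    // Final pattern test; this also rejects leftover regex syntax such as [-]?
    return preg_match('~^(?:' . $pattern . ')$~', $out) === 1 ? $out : null;
}

// Try each hand-made sub-pattern in order and take the first that fits.
function firstFit(array $subPatterns, string $padded): ?string
{
    foreach ($subPatterns as $p) {
        $code = fillDigits($p, $padded);
        if ($code !== null) {
            return $code;
        }
    }
    return null;
}

echo firstFit(['\d{2}', '\d{2}-\d{2}'], '0001234'), "\n"; // 12-34
echo firstFit(['\d{4}', 'A-\d{3}'], '0000123'), "\n";     // 0123 (A-123 is never tried)
```

The second call shows the leading-zero ambiguity mentioned above: the order of the sub-patterns decides the result, not the data.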

But in most of these cases it won't be possible to rewrite them or even generate a specific postal code manually without looking them up in a database.

Nice approach! I tried some postal codes with the regex_replace part enabled and I got a match for an empty postal code, then I've added an extra check: if (!empty($postalcodeB) && strcmp($postalcodeA,$postalcodeB) !== 0) { $ok =1; }. Also, for the code 0002010 from PH, it wasn't possible to format it correctly. The regex is too complex. Then I simply trimmed the zeros from left: if ($ok == 0) { $postalcodeB = ltrim($postalcodeA, '0'); $ok = 1; }. Since it runs a preg_match in the end to check the postal code, I don't think it's a concern. Thanks BTW!

user557597 · Accepted Answer · 2016-05-05 19:58:10Z

You could do it the old-fashioned way, by hand.

Dump all the patterns from that library into a text file.
Trim the punctuation out. Put capture groups around the
parts separated by punctuation. Create a replacement.

Country          | Validation regex      | Conversion find         | Conversion replace | Example
-----------------+-----------------------+-------------------------+--------------------+----------
NL Netherlands   | \d{4}[ ][A-Z]{2}      | (\d{4})([A-Z]{2})$      | $1 $2              | 9999 AA
NI Nicaragua     | \d{3}-\d{3}-\d        | (\d{3})(\d{3})(\d)$     | $1-$2-$3           | 999-999-9
US United States | \d{5}                 | (\d{5})$                | $1                 | 99999
SH Saint Helena  | [A-Z]{4}[ ]\d[A-Z]{2} | ([A-Z]{4})(\d[A-Z]{2})$ | $1 $2              | TDCU 1ZZ
JM Jamaica       | [A-Z]{5}\d{2}         | ([A-Z]{5}\d{2})$        | $1                 | JMAAA99
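Once such a table exists, applying it could look roughly like this. The array layout and the function name convertPostalCode are assumptions for illustration. Only three rows from the table are copied in.

```php
<?php
// Hand-written conversion rows: country => [find, replace].
$conversions = [
    'NL' => ['/(\d{4})([A-Z]{2})$/', '$1 $2'],
    'NI' => ['/(\d{3})(\d{3})(\d)$/', '$1-$2-$3'],
    'US' => ['/(\d{5})$/', '$1'],
];

function convertPostalCode(array $conversions, string $country, string $raw): ?string
{
    if (!isset($conversions[$country])) {
        return null;                      // no row for this country
    }
    [$find, $replace] = $conversions[$country];
    if (!preg_match($find, $raw, $m)) {
        return null;                      // the raw value does not end with a usable code
    }
    // Format only the matched tail, so the leading padding is dropped.
    return preg_replace($find, $replace, $m[0]);
}

echo convertPostalCode($conversions, 'NI', '1234567'), "\n"; // 123-456-7
echo convertPostalCode($conversions, 'US', '0031401'), "\n"; // 31401
```

Rows that need letters, such as NL, can only work if the source value actually contains them, which API A's digit-only output does not.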

This is definitely an option, but I wanted to avoid this kind of work. Thanks for trying to help, anyway.
@GustavoStraube - You don't have any other options. Hate to be the bearer of bad news. It's a one time thing to do, you won't be able to avoid it, sorry man ..
