Will PHP mb_strlen($str, 'utf8') ever return a greater result than JavaScript .length?

Question

I'm working on an Angular 17 reactive form where I send the form data to a PHP API on the server and store it in a database. I would like the user to be able to input emojis to the form so I have set my database to utf8mb4_unicode_ci collation so that the emojis can be stored.

Security is very important to me so I do several checks on both the client side and the server side for various things.

One of the checks I do is on the length of the input. I was wondering if you can help, because the length results are inconsistent between client side and server side (since the string contains emojis).

Using the JavaScript .length property and the built-in Angular form validators minLength and maxLength, I see that they all calculate the length in the same way (for example, most of the basic smiley emojis are calculated as having a length of 2).

However, when I send this data (which includes emojis) to the server side, I use the PHP function mb_strlen($subject, 'utf8') and the values are different (most of the basic smiley emojis are calculated as having a length of 1, and they also take up 1 varchar character in the database).

I've tested about 160 emojis to see what values they return on both client side and server side in order to try and work out a pattern (so that I can do checks for length in the right way).

As you can see from my screenshots below, in most cases mb_strlen($subject, 'utf8') returns a lower value for the length than JavaScript .length. Sometimes it returns the same value as the JavaScript .length property, but in all these cases mb_strlen($subject, 'utf8') has never returned a length greater than what JavaScript .length returns.

Is it safe to assume that mb_strlen($subject, 'utf8') will never return a value greater than JavaScript .length for the rest of the existing emojis that I have not tested?

If not, could you explain a bit more about this and give some examples of characters where mb_strlen($subject, 'utf8') would return a greater value than JavaScript .length?

Thank you

Don't forget grapheme_strlen(). I once read a wonderful blog article that explains this perfectly, but of course Google is useless nowadays.
– Álvaro González, Commented Oct 23, 2024 at 13:46
You can really go down a rabbit hole with this. One of my favorite emojis for testing is 👨‍👩‍👧‍👦. This is often represented as a single glyph but is technically composed of four base character sequences (man, woman, girl, boy) plus three joiners to get a specific "family" representation. Depending on how you ask PHP or JavaScript you can get different answers. Personally, I think "characters" is probably the best thing to count, just know that it doesn't always map to our human perceptions.
– Chris Haas, Commented Oct 23, 2024 at 14:17
Well, I thought emojis were 4-byte Unicode strings, basically a key into an emoji store that has to exist in the browser or whatever tool you use them on. I am fairly sure using .length in JS will not give you the right answer, as I don't think it understands Unicode.
– RiggsFolly, Commented Oct 23, 2024 at 14:18
See this for example: playcode.io/2056385. Seems I was wrong about the 4-byte bit.
– RiggsFolly, Commented Oct 23, 2024 at 14:21
"best thing to count" - I actually meant that from the security perspective. However, I also see you have min/max lengths, so from a user perspective it is probably better to use graphemes, since that is what most people expect. (Some may remember Twitter's early character limit, which was byte-based since it went over SMS, and people were confused about how long a message really was.)
– Chris Haas, Commented Oct 23, 2024 at 14:28
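
To make the different notions of "length" from the comments above concrete, here is a minimal JavaScript sketch that counts the family emoji three ways. The helper names (countCodeUnits, countCodePoints, countGraphemes) are made up for illustration, and the sketch assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js). The code point count should match what mb_strlen($str, 'utf8') reports, and the grapheme count should roughly match grapheme_strlen(), assuming the string reaches PHP as valid UTF-8.

```javascript
// The family emoji: man, woman, girl, boy joined by three zero-width joiners (U+200D).
// Escapes are used so the result does not depend on the file's encoding.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// UTF-16 code units: what String.prototype.length reports.
function countCodeUnits(str) {
  return str.length;
}

// Unicode code points: string iteration walks code points, not code units.
function countCodePoints(str) {
  return [...str].length;
}

// Grapheme clusters: user-perceived characters, via Intl.Segmenter.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

console.log(countCodeUnits(family));  // 11 (4 emojis x 2 units + 3 joiners)
console.log(countCodePoints(family)); // 7  (4 emojis + 3 joiners)
console.log(countGraphemes(family));  // 1 on runtimes with current Unicode data
```

Which count to validate against is a design choice: code units match Angular's validators, code points match mb_strlen($str, 'utf8') and the varchar length, and graphemes match what the user sees.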

shingo · Accepted Answer · 2024-10-25 14:38:40Z

JavaScript strings are encoded as UTF-16, and the length property is the number of UTF-16 code units in the string. Each UTF-16 code unit is 2 bytes long. In PHP, you can compute the same count like this:

    # Assume the input string is encoded with UTF-8
    $str2 = mb_convert_encoding($str, 'UTF-16LE', 'UTF-8');
    $length = strlen($str2) / 2;

However, mb_strlen($str, 'utf8') counts the number of Unicode code points in the string. How much space a code point takes depends on the encoding: in UTF-16 it is either 2 bytes (one code unit) or 4 bytes (two code units).

A code point in the Basic Multilingual Plane (U+0000 to U+FFFF) corresponds to a single UTF-16 code unit, but a code point above U+FFFF, which includes most emojis, has to be represented as a surrogate pair (2 UTF-16 code units). Every code point therefore contributes at least one code unit, so for the same well-formed string the result of mb_strlen($str, 'utf8') will never be greater than the string.length property in JavaScript.
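
The relationship can be checked from the JavaScript side. The following is a minimal sketch, not a proof: the sample strings and the helper name compareLengths are chosen for illustration, and the code point count stands in for mb_strlen($str, 'utf8') under the assumption that PHP receives the same string as valid UTF-8.

```javascript
// Illustrative sample inputs; escapes are used so the file encoding does not matter.
const samples = [
  "abc",                            // ASCII only
  "\u00E9",                         // e with acute accent, one BMP code point
  "\u{1F600}",                      // grinning face, one code point above U+FFFF
  "\u{1F636}\u200D\u{1F32B}\uFE0F", // face in clouds, four code points
];

// Returns both counts for one string.
function compareLengths(str) {
  const codeUnits = str.length;       // UTF-16 code units (JavaScript .length)
  const codePoints = [...str].length; // Unicode code points
  return { codeUnits, codePoints };
}

const results = samples.map(compareLengths);
for (const { codeUnits, codePoints } of results) {
  // Each code point is one or two code units, so codePoints <= codeUnits.
  console.log(codeUnits, codePoints, codePoints <= codeUnits);
}
// Logs: 3 3 true, 1 1 true, 2 1 true, 6 4 true
```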

Thanks for this answer. Maybe you could edit the last line of your answer to say more specifically mb_strlen with utf8, i.e. mb_strlen($str, 'utf8'), since mb_strlen with 8bit, e.g. mb_strlen($str, '8bit'), will often return a greater length because it counts bytes, and mb_strlen($str, 'utf16') sometimes returns a greater value than the length property in JavaScript... Thanks :)
Done, but I saw some issues. AFAIK neither mb_strlen($str, 'utf8') nor mb_strlen($str, 'utf16') will return a value greater than the length property in JavaScript. If you encounter this situation, you need to check the value or encoding of the passed $str variable.
Ok I appreciate you understand this better than me. I was passing in this particular emoji 😶‍🌫️ into mb_strlen($str, 'utf16') and it returns 7 ... and then with JavaScript .length it returns 6. So I guess that happened because the encoding of the emoji 😶‍🌫️ is probably utf8? So that's my error then...
Yes, if you copy emojis into a PHP file, they will be UTF-8 encoded.
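
The 7 versus 6 discrepancy in the comments above can be reproduced with simple arithmetic. This is a sketch in JavaScript, assuming that PHP held the emoji as UTF-8 bytes and that mb_strlen($str, 'utf16') then read those bytes two at a time. utf8ByteLength is a hypothetical helper name.

```javascript
// "Face in clouds": U+1F636, zero-width joiner, U+1F32B, variation selector-16.
const faceInClouds = "\u{1F636}\u200D\u{1F32B}\uFE0F";

// Number of bytes the string occupies once encoded as UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

const utf16CodeUnits = faceInClouds.length;     // what JavaScript .length reports
const utf8Bytes = utf8ByteLength(faceInClouds); // 4 + 3 + 4 + 3 bytes
const misreadAsUtf16 = utf8Bytes / 2;           // UTF-8 bytes treated as 2-byte units

console.log(utf16CodeUnits); // 6
console.log(utf8Bytes);      // 14
console.log(misreadAsUtf16); // 7
```

Passing 'utf8' as the encoding, or converting first as in the mb_convert_encoding snippet in the answer, avoids the mismatch.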
