What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?

Question

I am trying to figure out an equivalent to C# string.IndexOf(string) that can handle surrogate pairs in Unicode characters.

I am able to get the index when only comparing single characters, like in the code below:

    public static int UnicodeIndexOf(this string input, string find)
    {
        return input.ToTextElements().ToList().IndexOf(find);
    }

    public static IEnumerable<string> ToTextElements(this string input)
    {
        var e = StringInfo.GetTextElementEnumerator(input);
        while (e.MoveNext())
        {
            yield return e.GetTextElement();
        }
    }

But if I pass a multi-character string as find, it doesn't work, because each text element contains only a single character to compare against.

Are there any suggestions as to how to go about writing this?

Thanks for any and all help.

EDIT:

Below is an example of why this is necessary:

CODE

    Console.WriteLine("HolyCow𪘁BUBBYY𪘁YY𪘁Y".IndexOf("BUBB"));
    Console.WriteLine("HolyCow@BUBBYY@YY@Y".IndexOf("BUBB"));

OUTPUT

    9
    8

Notice that when I replace the 𪘁 character with @, the reported index changes from 9 to 8.
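The shift comes from UTF-16: 𪘁 lies outside the Basic Multilingual Plane, so .NET stores it as a surrogate pair of two char values, and IndexOf reports positions in char units. A minimal sketch that makes this visible, using only string.Length, StringInfo.LengthInTextElements and char.IsSurrogatePair from the base class library (the class and method names SurrogateDemo, Measure and StartsSurrogatePair are made up for illustration):

```csharp
using System;
using System.Globalization;

public static class SurrogateDemo
{
    // Returns the size of a string measured two ways:
    // UTF-16 code units (what IndexOf counts) and text elements.
    public static (int CodeUnits, int TextElements) Measure(string s)
    {
        return (s.Length, new StringInfo(s).LengthInTextElements);
    }

    // True when the chars at index and index + 1 form one surrogate pair.
    public static bool StartsSurrogatePair(string s, int index)
    {
        return char.IsSurrogatePair(s, index);
    }
}
```

For "HolyCow𪘁BUBB", Measure gives 13 code units but 12 text elements, and StartsSurrogatePair at index 7 is true, which is why everything after 𪘁 sits one position later than in the @ version.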

@Steve I added some information to my question. Are those strings the same encoding or is there a difference?
– Ibrennan208, Commented May 4, 2018 at 20:39
@Ibrennan208, from your initial implementation it looks like you are trying to find a single grapheme, because you are using an IndexOf on an array of strings that are in effect TextElements, but from your sample data it looks like you actually want to find the index of a substring longer than one grapheme. Can you specify which solution you are seeking? (Just run your code on your test data: it won't work, IndexOf will return -1.)
– ironstone13, Commented May 4, 2018 at 20:48
@ironstone13 I want to find the index of a substring with length > 1. In the question I explained that I can get it to work if I am only comparing against a single-character string, but I want to extend it so the user can input a multi-character string to find the index of.
– Ibrennan208, Commented May 4, 2018 at 20:58

Evk · Accepted Answer · 2018-05-04 21:01:35Z

You basically want to find the index of one string array within another string array. We can adapt the code from this question for that:

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;

    public static class Extensions
    {
        public static int UnicodeIndexOf(this string input, string find, StringComparison comparison = StringComparison.CurrentCulture)
        {
            return IndexOf(
                // split input into text elements
                input.ToTextElements().ToArray(),
                // split searched value into text elements
                find.ToTextElements().ToArray(),
                comparison);
        }

        // code from another answer
        private static int IndexOf(string[] haystack, string[] needle, StringComparison comparison)
        {
            var len = needle.Length;
            var limit = haystack.Length - len;
            for (var i = 0; i <= limit; i++)
            {
                var k = 0;
                for (; k < len; k++)
                {
                    if (!String.Equals(needle[k], haystack[i + k], comparison)) break;
                }
                if (k == len) return i;
            }
            return -1;
        }

        public static IEnumerable<string> ToTextElements(this string input)
        {
            var e = StringInfo.GetTextElementEnumerator(input);
            while (e.MoveNext())
            {
                yield return e.GetTextElement();
            }
        }
    }
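Note that UnicodeIndexOf returns a position counted in text elements, so it yields 8 for both example strings in the question. If the result needs to be fed back into char-based APIs such as Substring, it has to be mapped to a UTF-16 offset first. A minimal sketch of one way to do that, assuming StringInfo.ParseCombiningCharacters splits the string into the same text elements as the enumerator used above (TextElementOffsets and ToUtf16Offset are hypothetical names, not part of the answer):

```csharp
using System;
using System.Globalization;

public static class TextElementOffsets
{
    // Hypothetical helper: converts a text-element index into the
    // UTF-16 offset at which that text element starts.
    public static int ToUtf16Offset(string input, int textElementIndex)
    {
        // Pass "not found" through unchanged.
        if (textElementIndex < 0) return -1;

        // Start offset (in chars) of every text element in the string.
        int[] starts = StringInfo.ParseCombiningCharacters(input);

        if (textElementIndex >= starts.Length)
            throw new ArgumentOutOfRangeException(nameof(textElementIndex));

        return starts[textElementIndex];
    }
}
```

For "HolyCow𪘁BUBBYY𪘁YY𪘁Y", text-element index 8 maps to UTF-16 offset 9, the same value the built-in IndexOf reports, while for the @ version index 8 stays 8.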
