Char-folding: how can we implement matching multiple characters as a single "thing"?
2015-11-30 15:54:50 GMT
I pushed this feature a couple of days ago, but I had to disable it now (the code is in `character-fold-to-regexp'). For those who don't know, the way char-folding works is by replacing each character in the search string with a regexp of the characters it can match. For instance, a character-folding search for "fi" is actually a regexp search for "\\(?:ḟ\\|[fᶠḟⓕｆ𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏]\\)\\(?:i[̀-̨̣̰̄̆̈̉̌̏̑]\\|[iì-ïĩīĭįǐȉȋᵢḭỉịⁱℹⅈⅰⓘｉ𝐢𝑖𝒊𝒾𝓲𝔦𝕚𝖎𝗂𝗶𝘪𝙞𝚒]\\)". As you can see, this multiplies the length of the string by a factor of about 45. But that's still acceptable. However, this only includes foldings for the 'f' character and the 'i' character independently. As it happens, the string "fi" can also match the ligature 'ﬁ'. Now, for the sake of conciseness, let me denote the f and i regexps above as [f] and [i]. The code I pushed a couple of days ago added support for multi-char matches (such as "fi" matching that ligature), but returning a regexp like this: "\\([f][i]\\|ﬁ\\)" This might not look like much. It only adds a few chars to a regexp that already has about 90. But the problem is that it introduces a new branch. Let's say we want to search for the word "fix". The letters 'i' and 'x' could also represent the character 'ⅸ' (roman numeral nine). So the only way to write that regexp is: "\\([f]\\([i][x]\\|ⅸ\\)\\|ﬁ[x]\\)". Now the pattern [x] (which I remind you has ~ 40 chars) appears twice(Continue reading)