Description
Describe the feature
When working with Unicode, we usually don't care about bytes, and usually not about code points (runes) either. What we mostly care about is the characters displayed on screen (grapheme clusters). Unicode provides an algorithm to split strings into grapheme clusters (user-perceived characters). This feature request is about adding grapheme cluster splitting to builtin.
Use Case
Anyone working with a UI, who wants to know:
- how long a string is (in display characters on the screen)
- where the pointer is on screen
Neither bytes nor runes provide this information. The same applies to anyone who uses format strings with Unicode strings.
Example:
This text should be right aligned:
examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์',
	'ปีเตอร์',
]
println("0123456789abcdefgh")
for text in examples {
	println("${text:10}")
}
0123456789abcdefgh
ñ
🏳️🌈
ห์
ปีเตอร์
But it isn't.
Proposed Solution
Add a feature to split a string into graphemes
hello := 'Hello World 🏳️🌈'
hello_graphemes := hello.graphemes() // [`H`, `e`, `l`, `l`, `o`, ` `, `W`, `o`, `r`, `l`, `d`, ` `, `🏳️🌈`]
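To make the idea concrete, here is a minimal sketch of what such a split could look like in user code today. It is not an implementation of UAX #29: the helper names `attaches_to_previous` and `graphemes` are hypothetical, and the code only recognizes a handful of hard-coded ranges (Latin combining marks, variation selectors, some Thai marks, and the zero width joiner). A real implementation would use the full Unicode property tables.

```v
// Hypothetical helper: reports whether rune `r` should be appended to the
// preceding cluster. Only a few hard-coded ranges are covered (assumption).
fn attaches_to_previous(r rune) bool {
	c := u32(r)
	if c >= 0x0300 && c <= 0x036F {
		return true // combining diacritical marks
	}
	if c >= 0xFE00 && c <= 0xFE0F {
		return true // variation selectors
	}
	if c == 0x0E31 || (c >= 0x0E34 && c <= 0x0E3A) || (c >= 0x0E47 && c <= 0x0E4E) {
		return true // Thai combining vowels and tone marks
	}
	return c == 0x200D // zero width joiner
}

// Hypothetical stand-in for the proposed `string.graphemes()`.
// Returns strings, since one cluster may hold several runes.
fn graphemes(s string) []string {
	mut clusters := []string{}
	mut current := ''
	mut after_zwj := false
	for r in s.runes() {
		// Start a new cluster unless this rune joins the previous one.
		if current.len > 0 && !after_zwj && !attaches_to_previous(r) {
			clusters << current
			current = ''
		}
		current += r.str()
		after_zwj = u32(r) == 0x200D
	}
	if current.len > 0 {
		clusters << current
	}
	return clusters
}

fn main() {
	println(graphemes('Hello World \U0001F3F3\uFE0F\u200D\U0001F308'))
}
```

The sketch returns `[]string` rather than `[]rune` because a single rune cannot hold a multi-code-point cluster; whether a builtin should return strings or a dedicated type is an open design question.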
Current Behavior
examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์',
	'ปีเตอร์',
]
for text in examples {
	println("0123456789abcdefgh")
	println(text)
	println(text.runes())
}
0123456789abcdefgh
ñ
[`n`, `̃`]
0123456789abcdefgh
🏳️🌈
[`🏳`, `️`, ``, `🌈`]
0123456789abcdefgh
ห์
[`ห`, `์`]
0123456789abcdefgh
ปีเตอร์
[`ป`, `ี`, `เ`, `ต`, `อ`, `ร`, `์`]
Proposed behavior:
examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์',
	'ปีเตอร์',
]
for text in examples {
	println("0123456789abcdefgh")
	println(text)
	println(text.graphemes())
}
0123456789abcdefgh
ñ
[`ñ`]
0123456789abcdefgh
🏳️🌈
[`🏳️🌈`]
0123456789abcdefgh
ห์
[`ห์`]
0123456789abcdefgh
ปีเตอร์
[`ปี`, `เ`, `ต`, `อ`, `ร์`]
Further suggestions
- consider removing runes (or consider replacing the implementation to use grapheme clusters instead of codepoints):
- What is the rationale for having them?
- In which situation do you want to work with codepoints but not grapheme clusters?
- consider using grapheme clusters for width calculation of format strings
- consider making grapheme clusters a first class citizen and hide bytes behind a call
e.g.
string[n] ... access n-th grapheme
string.len ... number of graphemes
string.bytes()[n] ... access n-th byte
string.bytes().len ... number of bytes
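As an illustration of the width-calculation suggestion, the following sketch right-aligns a string by counting clusters instead of bytes or runes. The names `joins_previous`, `cluster_count` and `pad_left` are hypothetical, and the cluster detection is a rough approximation with a few hard-coded ranges, not UAX #29. Note also that cluster count is still only an approximation of display width: many terminals render emoji two cells wide.

```v
// Hypothetical helper: true if code point `c` extends the preceding cluster.
// Only a few ranges are handled (assumption, not the full Unicode tables).
fn joins_previous(c u32) bool {
	if c >= 0x0300 && c <= 0x036F {
		return true // combining diacritical marks
	}
	if c >= 0xFE00 && c <= 0xFE0F {
		return true // variation selectors
	}
	if c == 0x0E31 || (c >= 0x0E34 && c <= 0x0E3A) || (c >= 0x0E47 && c <= 0x0E4E) {
		return true // Thai combining vowels and tone marks
	}
	return c == 0x200D // zero width joiner
}

// Counts approximate grapheme clusters in `s`.
fn cluster_count(s string) int {
	mut count := 0
	mut after_zwj := false
	for r in s.runes() {
		c := u32(r)
		// A rune opens a new cluster unless it joins the previous one.
		if count == 0 || (!after_zwj && !joins_previous(c)) {
			count++
		}
		after_zwj = c == 0x200D
	}
	return count
}

// Right-aligns `s` in a field of `width` clusters, padding with spaces.
fn pad_left(s string, width int) string {
	n := cluster_count(s)
	if n >= width {
		return s
	}
	return ' '.repeat(width - n) + s
}

fn main() {
	println('0123456789abcdefgh')
	for text in ['n\u0303', 'ปีเตอร์'] {
		println(pad_left(text, 10))
	}
}
```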
Other Information
Unicode Reference and some more info on the background
- Unicode® Standard Annex #29 Unicode Text Segmentation
- [Medium] Dealing with Unicode strings, done right and better.
- Sample C implementation (exemplary): https://libs.suckless.org/libgrapheme/
This feature would also fix this bug:
Acknowledgements
- I may be able to implement this feature request
- This feature might incur a breaking change
Version used
0.4.7
Environment details (OS name and version, etc.)
V full version: V 0.4.7 7baff15
OS: linux, "Manjaro Linux"
Processor: 16 cpus, 64bit, little endian, AMD Ryzen 7 7840U w/ Radeon 780M Graphics
getwd: /home/pepper
vexe: /usr/lib/vlang/v
vexe mtime: 2024-08-26 17:34:57
vroot: NOT writable, value: /usr/lib/vlang
VMODULES: OK, value: /home/pepper/.vmodules
VTMP: OK, value: /tmp/v_1000
Git version: git version 2.46.0
Git vroot status: Error: fatal: not a git repository (or any of the parent directories): .git
.git/config present: false
CC version: cc (GCC) 14.2.1 20240805
thirdparty/tcc status: thirdparty-linux-amd64 0134e9b9-dirty