Support splitting Strings into Unicode Grapheme Cluster #22117

@peppergrayxyz

Describe the feature

When working with Unicode text, we usually care about neither the bytes nor the code points (runes). What we mostly care about is the characters displayed on screen (grapheme clusters). Unicode defines an algorithm (UAX #29) for splitting strings into grapheme clusters, each of which a reader perceives as a single character. This feature request is about adding grapheme cluster splitting to builtin.

Use Case

Anyone working with a UI who wants to:

  • know how long a string is (in characters displayed on screen)
  • know where the pointer is on screen
  • use format strings with Unicode strings

Neither bytes nor runes provide this information.

Example:

This text should be right aligned:

examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์', 
	'ปีเตอร์'
]

println('0123456789abcdefgh')
for text in examples {
	println('${text:10}')
}

Output:

0123456789abcdefgh
         ñ
    🏳️‍🌈
        ห์
   ปีเตอร์

But it isn't.
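
To make the expected result concrete, here is a minimal sketch in Python (not V) that pads by an approximate grapheme cluster count instead of by bytes or code points. The names `cluster_count` and `pad_left` are hypothetical, and the clustering rule is a deliberate simplification of UAX #29: combining marks, ZWJ, and whatever follows a ZWJ are attached to the previous cluster. It also assumes one cluster occupies one column, which a terminal may not honor for emoji.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner, glues emoji sequences together


def cluster_count(text: str) -> int:
    """Approximate the number of grapheme clusters in text.

    Simplification (not full UAX #29): a code point starts a new cluster
    unless it is a combining mark, a ZWJ, or directly follows a ZWJ.
    """
    count = 0
    prev = ""
    for ch in text:
        continues = (
            unicodedata.category(ch) in ("Mn", "Mc", "Me")
            or ch == ZWJ
            or prev == ZWJ
        )
        # The very first code point always opens a cluster.
        if count == 0 or not continues:
            count += 1
        prev = ch
    return count


def pad_left(text: str, width: int) -> str:
    """Right-align text in `width` cells, counting clusters, not code points."""
    return " " * max(0, width - cluster_count(text)) + text


# The same four strings as in the example above, written as escapes.
examples = [
    "\u006E\u0303",
    "\U0001F3F3\uFE0F\u200D\U0001F308",
    "\u0E2B\u0E4C",
    "\u0E1B\u0E35\u0E40\u0E15\u0E2D\u0E23\u0E4C",
]

if __name__ == "__main__":
    print("0123456789abcdefgh")
    for text in examples:
        print(pad_left(text, 10))
```

With this rule the four examples count as 1, 1, 1 and 5 clusters, so they receive 9, 9, 9 and 5 spaces of padding.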

Proposed Solution

Add a method that splits a string into graphemes:

hello := 'Hello World 🏳️‍🌈'
hello_graphemes := hello.graphemes() // [`H`, `e`, `l`, `l`, `o`, ` `, `W`, `o`, `r`, `l`, `d`, ` `, `🏳️‍🌈`]

Current Behavior

examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์', 
	'ปีเตอร์'
]

for text in examples {
	println('0123456789abcdefgh')
	println(text)
	println(text.runes())
}

Output:

0123456789abcdefgh
ñ
[`n`, `̃`]
0123456789abcdefgh
🏳️‍🌈
[`🏳`, `️`, `‍`, `🌈`]
0123456789abcdefgh
ห์
[`ห`, `์`]
0123456789abcdefgh
ปีเตอร์
[`ป`, `ี`, `เ`, `ต`, `อ`, `ร`, `์`]

Proposed Behavior

examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์', 
	'ปีเตอร์'
]

for text in examples {
	println('0123456789abcdefgh')
	println(text)
	println(text.graphemes())
}

Output:

0123456789abcdefgh
ñ
[`ñ`]
0123456789abcdefgh
🏳️‍🌈
[`🏳️‍🌈`]
0123456789abcdefgh
ห์
[`ห์`]
0123456789abcdefgh
ปีเตอร์
[`ปี`, `เ`, `ต`, `อ`, `ร์`]
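
As a rough illustration of what such a split could do, the following Python sketch groups code points into clusters with a simplified rule. It is not the full UAX #29 algorithm and not a proposed V implementation: `graphemes` and `extends_cluster` are hypothetical names, and the rule only handles combining marks (general categories Mn, Mc, Me) and ZWJ sequences. Hangul syllables, regional indicator pairs, and characters such as Thai SARA AM are not handled.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner


def extends_cluster(ch: str) -> bool:
    """True if ch should attach to the preceding cluster (simplified rule)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch == ZWJ


def graphemes(text: str) -> list[str]:
    """Split text into approximate grapheme clusters.

    Assumption: anything following a ZWJ joins the current cluster. The real
    algorithm only allows this inside emoji sequences.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and (extends_cluster(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch  # continue the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


if __name__ == "__main__":
    flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
    thai = "\u0E1B\u0E35\u0E40\u0E15\u0E2D\u0E23\u0E4C"
    for text in ["\u006E\u0303", flag, "\u0E2B\u0E4C", thai]:
        print(graphemes(text))
```

For the four example strings this rule yields the same grouping as the proposed output above: one cluster each for the first three, and five clusters for the last one.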

Further suggestions

  • consider removing runes, or reimplementing them on top of grapheme clusters instead of code points:
    • What is the rationale for having them?
    • In which situations do you want to work with code points but not grapheme clusters?
  • consider using grapheme clusters for the width calculation in format strings
  • consider making grapheme clusters a first-class citizen and hiding bytes behind a call

e.g.

string[n] ... access n-th grapheme
string.len ... number of graphemes
string.bytes()[n] ... access n-th byte
string.bytes().len ... number of bytes
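
The access pattern listed above can be sketched as a small wrapper type. This is a Python illustration under the same simplified clustering rule (combining marks and ZWJ sequences only), not a design for V's `string`. `GraphemeString` and `_split` are hypothetical names.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner


def _split(text: str) -> list[str]:
    """Simplified cluster split: combining marks and ZWJ sequences only."""
    clusters: list[str] = []
    for ch in text:
        joins = unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch == ZWJ
        if clusters and (joins or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class GraphemeString:
    """String whose length and indexing are defined in grapheme clusters."""

    def __init__(self, text: str) -> None:
        self._text = text
        # Clusters are computed once, so indexing afterwards is O(1).
        self._clusters = _split(text)

    def __len__(self) -> int:
        return len(self._clusters)  # number of graphemes

    def __getitem__(self, n: int) -> str:
        return self._clusters[n]  # n-th grapheme

    def bytes(self) -> bytes:
        return self._text.encode("utf-8")  # raw bytes behind an explicit call


if __name__ == "__main__":
    s = GraphemeString("\u006E\u0303o")
    print(len(s), len(s.bytes()))
```

One trade-off this makes visible: indexing by grapheme requires either scanning the string on every access or storing cluster boundaries up front, as this sketch does. Byte indexing needs neither.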

Other Information

Unicode Reference and some more info on the background

This feature would also fix this bug:

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

Version used

0.4.7

Environment details (OS name and version, etc.)

V full version: V 0.4.7 7baff15
OS: linux, "Manjaro Linux"
Processor: 16 cpus, 64bit, little endian, AMD Ryzen 7 7840U w/ Radeon  780M Graphics

getwd: /home/pepper
vexe: /usr/lib/vlang/v
vexe mtime: 2024-08-26 17:34:57

vroot: NOT writable, value: /usr/lib/vlang
VMODULES: OK, value: /home/pepper/.vmodules
VTMP: OK, value: /tmp/v_1000

Git version: git version 2.46.0
Git vroot status: Error: fatal: not a git repository (or any of the parent directories): .git
.git/config present: false

CC version: cc (GCC) 14.2.1 20240805
thirdparty/tcc status: thirdparty-linux-amd64 0134e9b9-dirty

Labels: Feature/Enhancement Request
