How do you work with strings with multibyte characters in Lua?

For example, if s = 'foo😶bar', then #s == 10 because Lua string lengths count bytes, not code points / display characters.

How do I:

  • Get the length of s (in a way that returns 7)?
  • Access the 4th display character in a way that returns 😶, and the 5th in a way that returns b?
  • Etc.

There are functions like vim.str_utfindex() and vim.str_byteindex().


And vim.str_utf_start() + vim.str_utf_end().
You can use them like this:

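--- Like string.sub(), but i and j count code points instead of bytes.
---@param str string
---@param i integer
---@param j? integer
---@return string
local function str_sub(str, i, j)
    -- First return value is the length in UTF-32 code points
    local length = vim.str_utfindex(str)
    -- Negative indices count from the end, as in string.sub()
    if i < 0 then i = i + length + 1 end
    if (j and j < 0) then j = j + length + 1 end
    -- Clamp the range to the string
    local u = (i > 0) and i or 1
    local v = (j and j <= length) and j or length
    if (u > v) then return "" end
    -- Convert code point indices to byte offsets
    local s = vim.str_byteindex(str, u - 1)
    local e = vim.str_byteindex(str, v)
    return str:sub(s + 1, e)
end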

This function gives you the byte size of the character that starts at a given byte index. Start at index 1 and keep advancing by the returned value until you have consumed all the bytes. The second character, for example, starts at 1 + the value returned for index 1.

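function char_byte_count(s, i)
    local c = string.byte(s, i or 1)
    if not c then return nil end -- index is past the end of the string

    -- Byte count of a UTF-8 sequence, from its lead byte (RFC 3629)
    if c <= 127 then
        return 1 -- ASCII, including NUL
    elseif c >= 194 and c <= 223 then
        return 2
    elseif c >= 224 and c <= 239 then
        return 3
    elseif c >= 240 and c <= 244 then
        return 4
    end
    -- Continuation byte (128-191) or invalid lead byte: count it as a
    -- single byte so a caller stepping through the string keeps advancing.
    return 1
end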
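
To tie this back to the original question, here is a rough sketch of that "start at 1 and keep advancing" loop in plain Lua, with no Neovim functions. The names seq_len, utf8_len and utf8_char_at are made up for this example, and it assumes the input is valid UTF-8. It counts code points, so a combining sequence still counts as more than one character.

```lua
-- Hypothetical helper: byte length of the UTF-8 sequence whose lead byte is b.
local function seq_len(b)
    if b >= 240 then return 4 end
    if b >= 224 then return 3 end
    if b >= 192 then return 2 end
    return 1 -- ASCII, or a stray continuation byte treated as one unit
end

-- Count characters by hopping from lead byte to lead byte.
local function utf8_len(s)
    local count, i = 0, 1
    while i <= #s do
        i = i + seq_len(s:byte(i))
        count = count + 1
    end
    return count
end

-- Return the n-th character (1-based) as a string, or nil if out of range.
local function utf8_char_at(s, n)
    if n < 1 then return nil end
    local i = 1
    -- Skip over the first n - 1 characters
    for _ = 1, n - 1 do
        if i > #s then return nil end
        i = i + seq_len(s:byte(i))
    end
    if i > #s then return nil end
    return s:sub(i, i + seq_len(s:byte(i)) - 1)
end

local s = 'foo😶bar'
print(#s)                 -- 10 (bytes)
print(utf8_len(s))        -- 7
print(utf8_char_at(s, 4)) -- 😶
print(utf8_char_at(s, 5)) -- b
```

Both functions walk the string from the start, so each call is linear in the length of the string. If you need many lookups on the same string, it is probably cheaper to split it into a table of characters once.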

If you need to know how many cells a character will take up, there is the function vim.api.nvim_strwidth().
