Split utf-8 string word (with foreign characters) to letters
Started by benny5 Dec 03 2013 02:02 PM

12 replies to this topic
word string split foreign characters
#1

benny5
  • Enthusiast

  • 61 posts
  • Corona SDK

Hi,

 

We're making a simple word guessing game where the letters in a word are scrambled. To scramble, we split the word into its individual letters and shuffle their positions randomly.

 

This works fine for English words:

local scramblewordtable = {}
for i = 1, #randomword, 1 do
 scramblewordtable[i] = randomword:sub(i,i)
end

"word" becomes a table { w,o,r,d }

 

However, when doing this with Swedish characters:

 

"äpple" becomes {?,?,p,p,l,e}

 

Is there any good way to handle this?



#2

ingemar
  • Corona Geek

  • 2,733 posts
  • Enterprise

That was fun  :)

I've thought about how to handle this before but had no project that needed it. This topic sparked my interest, though, so I've created a small function that should work.

 

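-- Splits a UTF-8 string into a table with one entry per character (code point).
local UTF8ToCharArray = function(str)
    local charArray = {};
    local iStart = 0;   -- start index of the pending multi-byte character (0 = none)
    local strLen = str:len();

    -- value of bit b, counting from 1 (so bit(8) is 128)
    local function bit(b)
        return 2 ^ (b - 1);
    end

    -- true if the bit with value b is set in w
    local function hasbit(w, b)
        return w % (b + b) >= b;
    end

    -- store the pending multi-byte character, which ends just before index i
    local checkMultiByte = function(i)
        if (iStart ~= 0) then
            charArray[#charArray + 1] = str:sub(iStart, i - 1);
            iStart = 0;
        end
    end

    for i = 1, strLen do
        local b = str:byte(i);
        -- 11xxxxxx: first byte of a multi-byte character
        local multiStart = hasbit(b, bit(7)) and hasbit(b, bit(8));
        -- 10xxxxxx: trailing byte of a multi-byte character
        local multiTrail = not hasbit(b, bit(7)) and hasbit(b, bit(8));

        if (multiStart) then
            checkMultiByte(i);
            iStart = i;

        elseif (not multiTrail) then
            -- single-byte (ASCII) character
            checkMultiByte(i);
            charArray[#charArray + 1] = str:sub(i, i);
        end
    end

    -- process if last character is multi-byte
    checkMultiByte(strLen + 1);

    return charArray;
end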

local arr = UTF8ToCharArray("Äpplet är i trädet ÅÄÖåäö");

-- ipairs guarantees the characters are printed in order
for k,v in ipairs(arr) do
    print(k, v);
end

 

Multi-byte characters start with a lead byte that has bits 7 and 8 set (11xxxxxx). Trailing bytes have bit 8 set and bit 7 not set (10xxxxxx).

My function checks for these bits and acts accordingly.
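
As a rough illustration of the bit rule above (not part of the function itself), the same classification can be written with plain range comparisons: a byte of 192 or more has both top bits set, and a byte from 128 to 191 has only the top bit set. The helper name byteKind is made up for this sketch.

```lua
-- Classify a byte by its top two bits using range comparisons.
-- byteKind is a hypothetical helper, used here for illustration only.
local function byteKind(b)
    if b < 128 then
        return "single"   -- 0xxxxxxx: plain ASCII
    elseif b < 192 then
        return "trail"    -- 10xxxxxx: trailing byte
    else
        return "lead"     -- 11xxxxxx: first byte of a multi-byte character
    end
end

-- "\195\164" is the UTF-8 encoding of "ä", so this is "äpple"
local word = "\195\164pple"
local kinds = {}
for i = 1, #word do
    kinds[i] = byteKind(word:byte(i))
end
print(table.concat(kinds, ","))   -- lead,trail,single,single,single,single
```

This also shows why the original sub(i,i) loop produced two "?" entries: "ä" takes up two bytes, and each byte on its own is not a valid character.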

 

Give this function a whirl and see if it works for you. 

I've done some basic testing, and it works well even for Chinese, Japanese and Korean text.
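
To tie this back to the original scrambling question, here is a minimal sketch of shuffling the split letters with a Fisher-Yates shuffle. It is self-contained, so it uses a compact pattern-based splitter instead of the function above. The names splitChars and shuffle are hypothetical, and the pattern assumes valid UTF-8 input without embedded NUL bytes.

```lua
-- Compact splitter: one ASCII or lead byte followed by any trailing bytes.
-- Assumes valid UTF-8 input without embedded NUL bytes.
local function splitChars(str)
    local chars = {}
    for ch in str:gmatch("[\1-\127\194-\244][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

-- Fisher-Yates shuffle, in place. rand(n) must return an integer in 1..n;
-- it defaults to math.random.
local function shuffle(t, rand)
    rand = rand or math.random
    for i = #t, 2, -1 do
        local j = rand(i)
        t[i], t[j] = t[j], t[i]
    end
    return t
end

local letters = splitChars("\195\164pple")   -- "äpple"
print(#letters)                               -- 5 characters, not 6 bytes
print(table.concat(shuffle(letters)))         -- same letters in random order
```

Both approaches split by code point, so a letter written as a base character plus a combining mark would come out as two entries. For Swedish words stored in precomposed form that should not matter.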



#3

benny5
  • Enthusiast

  • 61 posts
  • Corona SDK

Wow, kudos for going above and beyond! I'm sure this will help a lot of people. I guess the answer to my question of whether there's a simple way is a big NO, then :) That was really advanced.

 

I'll give it a whirl tonight after work! Thanks!



#4

richard9
  • Corona Geek

  • 1,118 posts
  • Enterprise

Wow. I had no idea how to detect multibyte characters and this specifically solves a problem I didn't even know I was going to have! Thanks ingemar!



#5

ingemar
  • Corona Geek

  • 2,733 posts
  • Enterprise

No problem guys :-) It was fun to get away from my daily coding routine for a while...

#6

benny5
  • Enthusiast

  • 61 posts
  • Corona SDK

Worked perfectly!



#7

ingemar
  • Corona Geek

  • 2,733 posts
  • Enterprise

Great! Use the code as you wish.

#8

ali4
  • Observer

  • 10 posts
  • Corona SDK

@ingemar

thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...thanks...

 

it works fine in Arabic too :))

 

THANK YOU VERY MUCH

 

If you could give a brief explanation of the code :)



#9

jeff15
  • Contributor

  • 106 posts
  • Corona SDK

Thanks @ingemar, you just made my life easier too!

 

Cheers,

Jeff



#10

Nob Studio
  • Contributor

  • 153 posts
  • Corona SDK

You solved my problem! Thank you!



#11

keystagefun
  • Contributor

  • 336 posts
  • Corona SDK

Massive thank you for writing this. Solved my issue in seconds. Brilliant - cheers!



#12

ingemar
  • Corona Geek

  • 2,733 posts
  • Enterprise

@keystagefun

You're welcome. Great to hear that you found it useful.  :)



#13

runewinse
  • Contributor

  • 505 posts
  • Corona SDK

Thanks, Ingemar! This is just what I needed!



