Sunday, March 7, 2010

Normalizing utf-8 string for Twitter API

To send Unicode text (UTF-8, to be exact) to Twitter from a Twitter app, you must first run the string through the PHP Normalizer class.

This class ships with PHP 5.3 as part of the intl extension. On versions before 5.3 you need to build the library yourself and then install the extension from PECL.

Then, once you have the Normalizer class, just do this:
Normalizer::normalize($string, Normalizer::FORM_C);
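
Since the class may be missing on some installs, here is a minimal sketch of a guarded call. It assumes PHP 7 or later for the type declarations, and the helper name normalizeForTwitter() is made up for this example; it is not part of PHP or of any Twitter library.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the intl extension).
function normalizeForTwitter(string $text): string
{
    // Without the intl extension there is no Normalizer class,
    // so return the text unchanged rather than failing.
    if (!class_exists('Normalizer')) {
        return $text;
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    // Normalizer::normalize() returns false on failure (e.g. invalid UTF-8),
    // in which case we fall back to the original string.
    return $normalized === false ? $text : $normalized;
}
```

Falling back to the original string is a design choice for this sketch; you may prefer to throw an exception so that a missing extension does not go unnoticed.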

This is not always necessary, since most UTF-8 strings and characters are already "just fine". Only some fairly rare characters (for example, a letter followed by a separate combining accent) can be considered 'not normalized'.

Twitter will still accept such characters, and in most cases the end user's browser will even render them correctly. The problem is that Twitter may count one of them as 2 characters instead of just one, and as you know, in Twitter every character counts.
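
To illustrate the counting issue, here is a small sketch that compares the number of code points in a decomposed and a precomposed spelling of the same word. It assumes the mbstring extension is available; the helper name codePointLength() is made up for this example, and counting code points is only an approximation of how Twitter counts.

```php
<?php
// Hypothetical helper: counts Unicode code points in a UTF-8 string.
function codePointLength(string $text): int
{
    return mb_strlen($text, 'UTF-8');
}

// "cafe" followed by U+0301 (combining acute accent), written as raw UTF-8 bytes.
$decomposed = "cafe\xCC\x81";

// "caf" followed by U+00E9 (precomposed e with acute), written as raw UTF-8 bytes.
$composed = "caf\xC3\xA9";

echo codePointLength($decomposed), "\n"; // 5
echo codePointLength($composed), "\n";   // 4
```

Both strings look the same on screen, but the decomposed one is one code point longer. Normalizing to FORM_C turns the first spelling into the second.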

Basically, you really want to make sure that UTF-8 strings are normalized before you send them to Twitter from your app, because otherwise your message may unexpectedly exceed 140 characters and be rejected by the Twitter API.

Normalizer::normalize($string, Normalizer::FORM_C);
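
A rough pre-flight length check can be sketched along the same lines. This assumes the mbstring extension and PHP 7 or later. The function fitsInTweet() and its $limit parameter are hypothetical names for this example, and the check only approximates Twitter's own counting, which may treat some content (such as links) differently.

```php
<?php
// Hypothetical pre-flight check; not part of any Twitter library.
function fitsInTweet(string $text, int $limit = 140): bool
{
    // Normalize to FORM_C when the intl extension is available.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
        if ($normalized !== false) {
            $text = $normalized;
        }
    }

    // Count code points, not bytes.
    return mb_strlen($text, 'UTF-8') <= $limit;
}

$message = str_repeat('a', 141);
if (!fitsInTweet($message)) {
    echo "Message is too long, shorten it before posting\n";
}
```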

And here is the documentation from PHP:
http://www.php.net/manual/en/book.intl.php

http://php.net/manual/en/class.normalizer.php

1 comment:

  1. I've tried this method for a 140-character Turkish string but it didn't work. My code is as below:
    ....
    $twit_msg=iconv('WINDOWS-1254','UTF-8',$twit_msg);

    $twit_msg = Normalizer::normalize($twit_msg, Normalizer::FORM_C );
    ....
    Is there a way to check the tweet length as calculated by Twitter before posting it?

    thanks
