Wednesday, May 27, 2009

Twitterdata and How Chinese Could Be the Future of Tweeting

True confession time: I love Unicode. I think that Unicode was one of the most important achievements of 20th century civilization. I am so much of a Unicode wonk that one of my first thoughts when I heard about Twitter was "I wonder whether it's 140 characters or 140 bytes?" If I were a true Unicode geek rather than a Unicode wonk, it wouldn't have taken until today for me to do the test to see for sure. In case you're wondering, it really is 140 characters; if it were bytes, you'd only get to send 70 Chinese characters in a tweet at two bytes per character (UTF-16), and fewer still in UTF-8, which needs three bytes for most Chinese characters. But there's a catch: the SMS network restricts a message to 70 characters if it contains any character outside the basic GSM 7-bit alphabet. So a full-length tweet containing Chinese characters presumably has to be split across more than one SMS message. There's also the catch that tweet recipients may not be equipped to handle the full suite of Unicode characters that you might want to send. I had to change a setting on Tweetdeck before I could see my Chinese tweet, and I was unsuccessful in sending a legible Chinese SMS from Skype to my iPhone; I'm not sure where that problem comes from!
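For anyone who wants to repeat the characters-versus-bytes test offline, here is a minimal Python sketch. It assumes that "character" means a Unicode code point (which is what Python's `len` counts) and that a byte limit would be applied to a UTF-8 or UTF-16 encoding; how Twitter actually counts is not something this code can confirm. The function name `tweet_lengths` is made up for illustration.

```python
def tweet_lengths(text):
    """Measure a string three ways: code points, UTF-8 bytes, UTF-16 bytes."""
    return {
        # Assumption: one "character" = one Unicode code point.
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-be" is used so that no byte-order mark is added to the count.
        "utf16_bytes": len(text.encode("utf-16-be")),
    }


if __name__ == "__main__":
    chinese_tweet = "水" * 140   # 140 copies of the character for "water"
    english_tweet = "w" * 140    # 140 ASCII letters
    print(tweet_lengths(chinese_tweet))
    print(tweet_lengths(english_tweet))
```

Under a 140-character limit both strings fit. Under a 140-byte limit only the ASCII one would, since the Chinese string takes 280 bytes in UTF-16 and 420 in UTF-8.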

I've been spurred into Unicode tweeting because of a recent proposal called Twitterdata. If you've been reading my blog for a while, which I know you haven't been, you'll know that I've been interested in the way that Twittering seems to be developing in ways that resemble the development of human languages. I'm certainly not the first one to notice that Twitter has many semantic-web-like features, and there has been discussion about ways to add semantics to the Twitter stream. The Twitterdata people have made a very interesting proposal: they suggest some very simple additions to tweet grammar that would make tweets more meaningful to machines. In particular, they suggest using the "$" character to mark the name in a name-value pair. I think this proposal is brilliant, but my thoughts on the matter are entirely irrelevant, because the Twitterdata proposal has approximately zero chance of being widely adopted. My prediction is that one year from now, there will have been more human-generated tweets in Klingon than in Twitterdataese. Here are the reasons I think that:
  1. Twitterdataese is ugly. Example: "@bdelacretaz: #wmodata $id DW1428 $temp 69F $wangle 232 $wspeed 4.0mph $rh 50% $dew 49F $press 1015.2mb http://bit.ly/lxvlh #twitterdata". I rest my case.
  2. Twitterdataese doesn't lend itself well to imitation. In a previous post, I discussed the importance of imitation in the establishment of languages. Without reading the Twitterdata documentation, can you figure out what the "$" does in the tweet "@toddfast $likes movies $likes Twitter"? I don't think I would have been able to.
  3. Twitterdata doesn't relieve pain. When I started my first company, a more seasoned entrepreneur gave me some great advice. "People will spend a lot of money to relieve a toothache, but they're much more reluctant to spend money on toothache prevention. Make sure the product you're selling relieves someone's pain." Somehow I doubt that @toddfast would be suffering much if he just liked movies as opposed to $liking them.
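To see what a machine would get out of the two example tweets quoted above, here is a rough Python sketch. It is not the Twitterdata specification: it assumes, as a simplification of my own, that every pair is a "$name" token followed by exactly one whitespace-delimited value. The names `PAIR` and `extract_pairs` are hypothetical.

```python
import re

# Assumed simplification, not the real Twitterdata grammar:
# "$name" followed by whitespace and a single non-whitespace value.
PAIR = re.compile(r"\$(\w+)\s+(\S+)")


def extract_pairs(tweet):
    """Return (name, value) tuples in order of appearance.

    A list is used rather than a dict because a name such as
    "likes" can legitimately appear more than once in a tweet.
    """
    return PAIR.findall(tweet)


if __name__ == "__main__":
    weather = ("@bdelacretaz: #wmodata $id DW1428 $temp 69F $wangle 232 "
               "$wspeed 4.0mph $rh 50% $dew 49F $press 1015.2mb "
               "http://bit.ly/lxvlh #twitterdata")
    print(extract_pairs(weather))
    print(extract_pairs("@toddfast $likes movies $likes Twitter"))
```

Under that assumption the second tweet yields the pairs ("likes", "movies") and ("likes", "Twitter"). The machine-readable part really is simple; the trouble is on the human side.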
So what is causing Twitterers pain? Or to cast the question in terms of language evolution, what are the competitive pressures on Tweet vocabulary and syntax? So far I have been able to discern two strong competitive pressures.
  1. Findability: Twitterers want their tweets to be found and read. This pressure is addressed by hashtags.
  2. Terseness: Twitterers want to say more in one tweet than 140 characters permit. This pressure has led to the proliferation of URL shorteners (another thing I would like to write about) and to innumerable ROTFL and LOL inventions.
Which brings me back to Unicode tweeting. Text written in Chinese characters is much terser, in terms of character count, than text in any non-ideographic script. So maybe, just maybe, we'll start seeing Chinese characters creep into our tweeting to help us say more with fewer characters, starting with the really easy ones like 中 for China or 山 for mountain or 水 for water. 好吗?