-- Alek
Only 15% of the screen real estate above the 1024x768 fold is devoted to actual news, and you really see only one and a half topics. 54% is devoted to ads, over-the-top branding, and upsell opportunities (home delivery). The other 31% goes to navigation and white space.
The earphones you get with an iPhone have a microphone. It's not used most of the time, but clicking with your tongue into it could be another way to control your MP3 player, accept calls, etc.
There are over 30 types of clicks generally recognized by phonetics. [ǃk’q’, ǃk’k͡x’] is just one. Some African languages have as many as 50 different click phonemes. It shouldn't be too hard to make your phone or MP3 player recognize the two most often used ones, even in a noisy environment.
It's easy to imagine different interfaces and scenarios. You are in a packed subway car and can't reach for your phone, but you only need 1 click of the tongue to skip to the next song, 2 clicks to pause. When you are driving, 1 click picks up a call and 2 clicks redial the last number. If you are a policeman or a pilot, 1 click could be your push to talk command, and 3 clicks could be a distress call. If your device can reliably tell apart two different types of clicks, the possibilities are even greater.
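To make the scenarios concrete, here is a minimal sketch of the "count the clicks" step, assuming the device already produces a list of click timestamps in seconds. The function names, the 0.6 second grouping gap, and the command names are made up for illustration.

```python
def count_click_groups(click_times, max_gap=0.6):
    """Group click timestamps (seconds) into bursts and return each burst's size.

    Two clicks belong to the same burst when they are at most `max_gap`
    seconds apart (an assumed value, not a measured one).
    """
    groups = []
    last = None
    for t in sorted(click_times):
        if last is not None and t - last <= max_gap:
            groups[-1] += 1
        else:
            groups.append(1)
        last = t
    return groups


# Hypothetical mapping for the subway scenario: 1 click skips, 2 clicks pause.
PLAYER_COMMANDS = {1: "next_track", 2: "pause"}


def commands_for(click_times, mapping=PLAYER_COMMANDS, max_gap=0.6):
    """Translate bursts of clicks into command names; unknown counts are ignored."""
    return [mapping.get(n, "ignore") for n in count_click_groups(click_times, max_gap)]
```

The same function would serve the driving or push-to-talk scenarios by passing a different `mapping`.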
The next step would be to make a proof of concept by recording some clicks in noisy environments and then trying to isolate them.
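A first stab at such a proof of concept could be as simple as looking for short bursts of energy that stand out from the background. The sketch below is only that: it assumes a mono signal as a list of floats, uses the median frame energy as a crude noise floor, and tests itself on synthetic noise with sine bursts standing in for clicks. Real recordings would almost certainly need something smarter, and all names and thresholds here are assumptions.

```python
import math
import random


def detect_clicks(samples, rate, frame_ms=5, ratio=8.0, refractory_ms=80):
    """Return approximate click times (seconds) in a mono signal.

    A frame counts as a click when its mean energy exceeds `ratio` times
    the median frame energy. After a hit, frames within `refractory_ms`
    are skipped so one click is not reported twice.
    """
    frame = max(1, int(rate * frame_ms / 1000))
    energies = [
        sum(s * s for s in samples[i:i + frame]) / frame
        for i in range(0, len(samples) - frame + 1, frame)
    ]
    if not energies:
        return []
    floor = sorted(energies)[len(energies) // 2] or 1e-12
    refractory = max(1, int(refractory_ms / frame_ms))
    clicks, skip_until = [], 0
    for idx, energy in enumerate(energies):
        if idx >= skip_until and energy > ratio * floor:
            clicks.append(idx * frame / rate)
            skip_until = idx + refractory
    return clicks


def synthetic_recording(rate=8000, seconds=2.0, click_times=(0.5, 1.2), seed=0):
    """Gaussian noise with short 2.5 kHz bursts standing in for tongue clicks."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 0.05) for _ in range(int(rate * seconds))]
    burst = int(0.005 * rate)  # 5 ms per synthetic click
    for t in click_times:
        start = int(t * rate)
        for k in range(burst):
            samples[start + k] += 0.9 * math.sin(2 * math.pi * 2500 * k / rate)
    return samples
```

Calling `detect_clicks(synthetic_recording(), 8000)` should report two clicks near 0.5 and 1.2 seconds. Telling two click types apart would need spectral features on top of this.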
Letting advertisers post tweets on your account doesn't fly with the twitterati. Giving your personal opinion on paid links might be more acceptable. Here's how it could work:
When a Twitterer visits the service, she will see a list of advertiser links, including a short description, and what she'll get per click. This is not a CPM scheme - you get paid for and only for clicks on the link, no matter how many followers you have.
If you don't like the link, you can pass on it. If you like it, tweet about it (in your own words). You'll be able to filter links based on topics.
The link is unique to your account, so any click on it gets you some money. That means that when somebody retweets you, you still get paid. If it gets on Facebook or in a Twitter aggregator, you get paid.
If advertisers want to limit their costs, they can set the maximum price they are ready to pay. Every consecutive click would pay slightly less than the previous one, with the aim of spreading the advertiser's money across the expected number of clicks. Each twitterer will see how much "juice" is left in the link.
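One way the declining payout could work, as a sketch only: assume a linear decline, where the first click pays the most and the payouts add up to exactly the advertiser's budget over the expected number of clicks. The linear shape and the function names are assumptions; any decreasing curve that sums to the budget would do.

```python
def payout_schedule(budget, expected_clicks):
    """Per-click payouts that decline linearly and sum to `budget`.

    Click k (0-based) gets a weight of expected_clicks - k, so the last
    expected click still pays something and later clicks pay nothing.
    """
    if expected_clicks <= 0:
        return []
    total_weight = expected_clicks * (expected_clicks + 1) / 2
    return [
        budget * (expected_clicks - k) / total_weight
        for k in range(expected_clicks)
    ]


def juice_left(budget, expected_clicks, clicks_so_far):
    """Money still available in the link after `clicks_so_far` clicks."""
    paid = sum(payout_schedule(budget, expected_clicks)[:clicks_so_far])
    return max(0.0, budget - paid)
```

For example, a budget of 100 spread over 4 expected clicks pays 40, 30, 20, and 10, and `juice_left(100, 4, 1)` is 60.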
It's good for users, because Twitterers will still have the final say on what gets tweeted. They can stick to their usual topics and pass on links they find questionable.
It's good for advertisers, because they only pay for the actual customer attention they get.
I wanted to play around with Twitter's social graph, so this weekend I started writing a crawler to collect the information I need so I can later analyze it. I'll eventually write a more coherent post, but here are some thoughts and observations I made while writing the bot:
I spent more than 60 server hours crawling Twitter and made close to 1,500,000 Twitter API requests to get a measly 300 MB of data. This would've taken a few minutes to download if it were packaged conveniently. Twitter should start offering snapshots, akin to Wikipedia's, which would make it easier for developers to get a new service started and would be easier on Twitter's servers. After all, Twitter's power is in the fresh tweets - such as the firehose - not the old content.
----
Given the limit of 20,000 API calls per IP per hour, you have to be smart about the way you fetch data. For example, the /friends/ids.json API call lets you download the full list of IDs of a user's friends in a single request. The /statuses/friends.json call lets you get the complete profiles of a user's friends, but only 100 at a time. So once I collected the lists of friend IDs and identified which IDs were interesting, I wrote a function to look for the smallest number of /statuses/friends.json calls that would cover the user profiles I need.
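Picking the fewest calls that cover a set of wanted profiles is a set-cover problem, so a greedy heuristic is a natural fit. The sketch below is not the function from the crawler, just an illustration. It assumes that page N of /statuses/friends.json returns the same 100 friends, in the same order, as the matching slice of the /friends/ids.json list, which may not hold in practice. Greedy selection gives a good plan, not a guaranteed minimum.

```python
def plan_profile_calls(friend_ids, wanted, page_size=100):
    """Greedily choose (user_id, page) calls that cover the wanted profile IDs.

    friend_ids: dict mapping user_id -> ordered list of that user's friend IDs.
    wanted: iterable of profile IDs still needed.
    Returns (plan, uncovered) where plan is a list of (user_id, page) pairs
    with 1-based page numbers and uncovered is the set of IDs no call reaches.
    """
    remaining = set(wanted)
    # Each candidate call covers the wanted IDs that fall on one page.
    candidates = {}
    for user, ids in friend_ids.items():
        for start in range(0, len(ids), page_size):
            covered = remaining.intersection(ids[start:start + page_size])
            if covered:
                candidates[(user, start // page_size + 1)] = covered

    plan = []
    while remaining and candidates:
        # Take the call that yields the most profiles we still lack.
        call, covered = max(
            candidates.items(), key=lambda kv: len(kv[1] & remaining)
        )
        gain = covered & remaining
        if not gain:
            break
        plan.append(call)
        remaining -= gain
        del candidates[call]
    return plan, remaining
```

With `friend_ids = {10: [1, 2, 3, 4], 20: [3, 4, 5], 30: [1, 2]}` and all five IDs wanted, the plan is two calls (user 10, then user 20) instead of three.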
----
The open nature of Twitter makes it very easy to use other services' data, even if they don't have their own APIs. For example, WeFollow.com constructs a Twitter directory by asking users to describe themselves by tweeting "@wefollow #tag1 #tag2 #tag3". It takes under a minute to collect all of the WeFollow data via the Twitter Search API and use it to build your own WeFollow competitor. This, combined with the API limitations for bulk downloads, means that services in the Twitter universe have to compete mostly on 1) marketing and 2) access to proprietary data (such as the firehose).
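Once the tweets are in hand, rebuilding the directory is mostly string parsing. A minimal sketch, assuming the search results have already been fetched and reduced to (screen_name, text) pairs; the fetching step is left out and the function name is made up.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"#(\w+)")


def build_directory(tweets):
    """Build a tag -> set of screen names index from "@wefollow #tag ..." tweets.

    tweets: iterable of (screen_name, text) pairs.
    Tweets that do not start with @wefollow are skipped; tags are lowercased.
    """
    directory = defaultdict(set)
    for name, text in tweets:
        if not text.strip().lower().startswith("@wefollow"):
            continue
        for tag in TAG_RE.findall(text):
            directory[tag.lower()].add(name)
    return directory
```

For instance, feeding it `("alice", "@wefollow #tech #python")` and `("bob", "@wefollow #Tech")` lists both users under `tech` and only alice under `python`.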
----
Turns out the queue takes on more load than any other component of a web crawler. I started with a simple MySQL-based queue that did something like this:
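A minimal sketch of that kind of database-backed queue, for illustration only. The `jobs` table, its columns, and the function names are hypothetical, and SQLite stands in for MySQL so the example runs on its own; the exact schema and queries used by the crawler may have differed.

```python
import sqlite3


def make_queue(conn):
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " worker TEXT)"
    )


def enqueue(conn, payload):
    """Append a job; a NULL worker column means nobody has claimed it."""
    with conn:
        conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))


def claim(conn, worker):
    """Pick the oldest unclaimed job and mark it as belonging to `worker`.

    Returns (id, payload), or None when the queue is empty or another
    worker claimed the row first.
    """
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE worker IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        updated = conn.execute(
            "UPDATE jobs SET worker = ? WHERE id = ? AND worker IS NULL",
            (worker, row[0]),
        )
        if updated.rowcount != 1:
            return None
        return row
```

Every claim scans for unclaimed rows and then writes to the same table, so with many workers polling at once the database does a lot of work per job.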
At my scale (as many as 200,000 queued jobs) this worked incredibly slowly and MySQL's CPU usage regularly hit 300% [1]. I tried different indexing strategies (waiting half an hour for each index rebuild) but it didn't help. Furthermore, the MySQL protocol is not the best for slow connections [2].
Next, I tried memcacheq - a queue built by the memcachedb folks. While it was considerably faster than MySQL, it definitely is not production ready (I guess that means memcachedb is not production ready, either). With about 40 workers pulling 2 items per second, it would occasionally freeze for up to 5 minutes, failing all requests, even at times when no new jobs were being queued. No log messages, either. If you queue an item bigger than the maximum size (1 KB by default), it gets truncated without warning and fails to deserialize on the other end. However, if you increase the size, all items are padded to that size (?!). Plus, it offers no way to automatically requeue items if a worker fails. I now understand Twitter's decision to build their own queue.
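The missing requeue-on-failure feature can be sketched with leases: a job handed to a worker goes back on the queue unless the worker confirms it within a time limit. This is a single-process, in-memory illustration of the idea, not how memcacheq or any other queue implements it; the class name and the 60 second default are assumptions.

```python
import time
from collections import deque


class LeasedQueue:
    """In-memory queue that requeues jobs whose worker never acknowledged them."""

    def __init__(self, lease_seconds=60.0, clock=time.monotonic):
        self._ready = deque()
        self._leased = {}  # job_id -> (deadline, item)
        self._lease = lease_seconds
        self._clock = clock
        self._next_id = 0

    def put(self, item):
        self._ready.append(item)

    def get(self):
        """Return (job_id, item), or None if nothing is ready."""
        self._requeue_expired()
        if not self._ready:
            return None
        item = self._ready.popleft()
        self._next_id += 1
        self._leased[self._next_id] = (self._clock() + self._lease, item)
        return self._next_id, item

    def ack(self, job_id):
        """Confirm completion so the job is not handed out again."""
        self._leased.pop(job_id, None)

    def _requeue_expired(self):
        """Move jobs whose lease has run out back to the ready queue."""
        now = self._clock()
        for job_id, (deadline, item) in list(self._leased.items()):
            if deadline <= now:
                del self._leased[job_id]
                self._ready.append(item)
```

A worker calls `get()`, does the work, then calls `ack(job_id)`. If it crashes in between, the next `get()` after the lease expires hands the same item to someone else.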
I have reached a crossroads. I can either use memcached (not memcachedb) as a home-made in-memory queue, try Kestrel, or go all in and use Hadoop as a queue, computation engine, and storage service, and probably switch to Scala in the process.
Expect updates in a week or so.
----
[1] That's one of the great things about SliceHost - if you need to run the occasional heavy number crunching task, it lets you take some of the free CPU cycles of other slices.
[2] The crawler runs on two servers - a SliceHost slice and a Mosso Cloud Server. Although they are practically the same service with a different billing structure, they are hosted in different data centers, which causes some latency. Switching the queue to the memcache protocol significantly improved performance between the two servers.