There is a lot to be learned from tweets, if Twitter would let us.
Celebrities, politicians, world leaders, news organizations, millions of normal people, and even the occasional cat, use Twitter every day to talk about their breakfast, natural disasters, political events, and events of global interest like the Oscars and the Super Bowl. With millions of tweets transmitted every day, Twitter has become an important historical and cultural record and an immensely useful resource for researchers of politics, history, literature, language, and anything else you can imagine. Or it could be. In recent years Twitter has made changes to its service that severely limit its usefulness to researchers.
The difficulty with most research is getting a good enough dataset, and the bigger the better. Corpus linguistics, for example, uses giant databases of hundreds of millions of words, painstakingly organized and annotated. The biggest corpora exceed 450 million words. With a reported average of half a billion tweets per day, at around 15 words per tweet, that much text passes through Twitter in under two hours.
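As a rough check on that comparison, the quoted figures can simply be multiplied out. The sketch below uses only the numbers given above (450 million words, half a billion tweets a day, about 15 words per tweet); the function name is illustrative, and the inputs are reported averages rather than measurements.

```python
def hours_to_match_corpus(corpus_words, tweets_per_day, words_per_tweet):
    """Hours of Twitter traffic needed to produce `corpus_words` words."""
    words_per_day = tweets_per_day * words_per_tweet
    return 24 * corpus_words / words_per_day


# Figures as quoted in the article; all three are approximations.
hours = hours_to_match_corpus(
    corpus_words=450_000_000,
    tweets_per_day=500_000_000,
    words_per_tweet=15,
)
print(f"{hours:.2f} hours of tweets match the largest corpora")
```

On those averages, the largest corpora correspond to roughly an hour and a half of Twitter traffic.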
Twitter has already shown its value as a dataset, for hobbyists and academics alike. Edwin Chen, a data scientist at Twitter, mapped the use of "coke", "soda" and "pop" to refer to soft drinks in tweets and got results that largely concur with non-Twitter research, going some way to confirming Twitter's value as a research tool. On the academic side, as the New York Times reported in October last year, researchers are using Twitter to study the sentiment of tweets relating to major events like the Arab Spring and the way emotions relate to the rhythms of daily life. More recently, researchers announced their findings on using machine learning and natural language processing to predict whether the information contained in a tweet is true or false. But Twitter's evolution into a media company is making it much harder to collect good data for research, reducing its utility as a research tool.
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, has been helping support research in the area of information retrieval since 1992. In the past they've had sessions on contextual search suggestions, helping lawyers search for relevant information from legal databases, and making it easier for doctors and nurses to search medical records. As of 2011, they also have a microblog track, studying search behaviors on Twitter.
At TREC's 2011 conference they provided 58 organizations with a dataset of around 16 million tweets from a two-week period that included the Egyptian Revolution and the Super Bowl. Or they would have liked to. What attendees actually got was a database of 16 million tweet identifiers and a set of tools that would let them access the tweets and download them themselves. Ian Soboroff, one of the researchers leading the microblog track, told me that, before they offered the dataset, people were downloading their own collections of tweets, but couldn't share them. "Twitter actually contacted a number of researchers trying to share their tweet collections and told them to stop," he said. The problem is a clause in Twitter's API Terms of Service forbidding redistribution of tweets.
The clause had an immediate effect on a variety of startups and organizations. TwapperKeeper, a service that allowed users to export archives of tweets with certain keywords or hashtags, found itself in violation of the clause and had to shut down. The same problem affected 140kit, one of the first groups researching Twitter's political and cultural influence, who were no longer allowed to share their datasets with interested users. For academics, it meant they could no longer collect and share datasets of tweets for analysis.
The change was a major blow to the prospect of getting good research out of Twitter data. With a shared, common resource, multiple observers can assess the integrity of the data collected, removing the question of bad data from a study's results. A shared resource also enables reproducibility, allowing the results of a study to be confirmed by other parties. But Twitter's new terms put a stop to that.
TREC worked around this by only offering its participants tweet identifiers, an approach Soboroff is disappointed others haven't taken. "We need more datasets in the wild before we can know what makes a good dataset and what makes a bad one, for Twitter researchers." But it's an imperfect solution for a number of reasons, not least of which is data integrity. Users of Twitter often make their accounts private or delete old tweets, meaning the dataset's identifiers for those tweets would no longer return anything. And, as Soboroff explained, some users had trouble downloading the tweets and using the tools provided, which entails "cloning a git repo, getting it to build, running the crawler, storing the data, and analyzing the data at that volume." In practice, it hasn't been a huge issue for TREC's microblog track, which Soboroff says had fairly high levels of participation this year, but they had to spend more time than they might have in better circumstances helping users with technical issues. The track's mailing list is replete with messages from participants having technical difficulties.
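The identifier-only approach, and its weakness, can be sketched in a few lines. This is a minimal illustration and not TREC's actual tooling: `fetch_tweet` is a hypothetical callable standing in for whatever crawler retrieves a tweet by its identifier, and it is assumed to return `None` when a tweet has been deleted or made private.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def hydrate(
    tweet_ids: Iterable[int],
    fetch_tweet: Callable[[int], Optional[str]],
) -> Tuple[Dict[int, str], List[int]]:
    """Rebuild a dataset from identifiers, tracking what can no longer be fetched."""
    found: Dict[int, str] = {}
    missing: List[int] = []
    for tweet_id in tweet_ids:
        text = fetch_tweet(tweet_id)
        if text is None:
            # Deleted or private: the identifier no longer returns anything.
            missing.append(tweet_id)
        else:
            found[tweet_id] = text
    return found, missing


# Toy stand-in for a real crawler: tweet 2 has since been deleted.
live_tweets = {1: "good morning", 3: "watching the game"}
found, missing = hydrate([1, 2, 3], live_tweets.get)
print(f"retrieved {len(found)} tweets, {len(missing)} missing")
```

Two groups running the same identifier list at different times can end up with different `found` sets, which is exactly the data integrity problem described above.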
Researchers trying to get data out of Twitter also need to be mindful of its API rate limiting. TREC's database of tweet identifiers helped them get around the rate limits: they were able to scrape the tweets over HTTP, handily avoiding the API limitations. But people trying to access tweets without the convenience of a previously prepared database of identifiers are subject to a limit of 180 API calls every 15 minutes. Under those limits, without their database, TREC's participants would have spent over two weeks downloading their 16 million tweets. Researchers trying to gather their own datasets would use Twitter's streaming API, which returns a real-time feed of tweets posted to Twitter, but the publicly available version is limited to only a small fraction of the total tweets posted to the service, around 1 percent. There is an API that allows all tweets to the service to be collected, the "firehose," but Twitter strongly limits access to it and charges a fee that is well outside the budget of most academic research. Gnip, Twitter's reseller for firehose access, charges $0.10 per thousand tweets. At that rate it would cost TREC, whose participants have no funding at all, $1,600 to get their 16 million tweets that way.
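The rate-limit and pricing figures above lend themselves to a quick calculation. The sketch below is a back-of-the-envelope estimate only. The 180-calls-per-15-minutes limit and the $0.10-per-thousand price are the ones quoted above, but the text does not say how many tweets a single API call returns, so `tweets_per_call` is an assumption and the download estimate changes greatly with it. The function names are illustrative.

```python
import math


def download_days(total_tweets, tweets_per_call=1,
                  calls_per_window=180, window_minutes=15):
    """Days needed to fetch `total_tweets` under a fixed-window rate limit.

    `tweets_per_call` is an assumption; the article does not state it.
    """
    calls = math.ceil(total_tweets / tweets_per_call)
    windows = math.ceil(calls / calls_per_window)
    return windows * window_minutes / (60 * 24)


def firehose_cost(total_tweets, dollars_per_thousand=0.10):
    """Cost in dollars at a flat per-thousand-tweets price."""
    return total_tweets / 1000 * dollars_per_thousand


TREC_TWEETS = 16_000_000
for batch in (1, 100):  # assumed batch sizes, for comparison only
    days = download_days(TREC_TWEETS, tweets_per_call=batch)
    print(f"{batch} tweet(s) per call: about {days:.1f} days")
print(f"firehose cost: ${firehose_cost(TREC_TWEETS):,.2f}")
```

Keeping the batch size as a parameter makes the dependence explicit: the same rate limit yields very different timelines depending on how much each call is allowed to return.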
It's not clear that Twitter is interested in doing anything to mitigate its hostility to research, but it's absolutely in its best interest to do so: Twitter stands to greatly improve its service, and even profit, based on the findings of researchers. Take the aforementioned study into determining the veracity of tweets. It's easy to imagine ways that could be put to use by Twitter, particularly in its recent efforts to surface good content in its Discover tab. But by not letting researchers create a shared corpus of tweets, Twitter is hindering further, potentially more exciting research.
So what could Twitter do, without hurting their important bottom line? They could bring their research efforts in-house, as they almost certainly have to some degree, but that's expensive and time consuming, and in doing so they lose the benefit of crowdsourcing. There are researchers waiting in the wings to study things Twitter's team might never even dream of. They could make their own datasets and allow limited access to them, but researchers would have no control over the data they're getting, and it would present Twitter with the even more expensive task of ongoing maintenance of the datasets, a responsibility they're likely not interested in taking on. It would be much better for everyone involved, then, to let researchers pull their own data and manage it themselves.
The only realistic option is a change to give researchers more freedom with their data. Is there any good reason Twitter can't let academics get better access to their data, and perhaps relaxed rate limiting for getting it? It could be as simple as giving a single educational institution firehose access and permission to host a shared corpus of tweets. The Corpus of Contemporary American English, the largest widely available corpus of English, is managed solely by Brigham Young University; anybody who wants to use it can access an interface to the data, but not the raw data itself. With Twitter's growing profile and increasing wealth of valuable data there is almost certainly a linguistics department somewhere willing to take on the task of maintaining a corpus of tweets.
There is also the Library of Congress. Their agreement with Twitter to provide a full archive of tweets puts them in a perfect position to provide research capabilities, which is exactly what they intend to do. The library announced today that phase one of this agreement had been fulfilled and they are now looking at providing access to researchers. Already, the library has had around 400 requests from researchers for access to its data, but they haven't been able to fulfill a single one of them.
The problem is one of scale. According to the library, it takes 24 hours to perform a single search on the complete archive, which makes it almost useless to researchers in its current state. (And also explains why Twitter so severely limits its searchable archive.) In their release today, the Library acknowledged that "it is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data."
Much more useful would be a statistically representative sample of the tweets, either from Twitter or the Library of Congress. It's not clear whether the Library of Congress will ever be able to make the full archive conveniently searchable, which is perhaps one of the biggest technical challenges it has faced. Even if it does, that could take years, during which the volume of tweets will keep expanding rapidly. In its current state, all the Library can even consider doing is allowing "some limited access capability in our reading rooms."
It's far from the end of the line for academics, who have always had to overcome hurdles to get good data. There's no good API for collecting language use in speech, newspapers, magazines, or other websites either, but there are still painstakingly gathered databases of millions of words from each. Researchers are finding ways to work with Twitter's restrictions too, and Soboroff actually thinks Twitter is doing a pretty good job for a company whose business isn't supporting researchers. Certainly, its agreement with the Library of Congress is encouraging. But it's impossible to know what else is in store. Stricter rate limiting, more restrictive Terms of Service, and even more limited APIs would fit Twitter's current modus operandi. And who knows if the Library of Congress will ever be able to provide adequate access at that scale.
Whatever happens, it's hard to deny how disappointing it is that probably the largest available database of modern language in the world today, a hugely important historical, political, and sociological resource, is effectively off-limits to research for the foreseeable future.
Correction: An earlier version of this article incorrectly stated that the clause forbidding the sharing of Twitter data was introduced in 2011. The clause existed prior to 2011. The article also stated that TREC's tools for downloading tweets were subject to API rate limiting. Ian Soboroff clarified that their tools actually scrape HTTP content, so they aren't subject to the API rate limits.