Datasets
Social media datasets
- PHEME dataset for Rumour Detection and Veracity Classification: This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. It contains rumours related to 9 events and each of the rumours is annotated with its veracity value, either True, False or Unverified.
- Twitter death hoaxes: This is a dataset of death reports collected from Twitter between 1st January, 2012 and 31st December, 2014. It was collected by tracking the keyword 'RIP', and matching those tweets in which a name is mentioned next to RIP. Matching names were identified by using Wikidata as a database of names.
- Twitter event datasets (2012-2016): Data for 30 different Twitter datasets associated with real-world events.
- PHEME dataset of rumours and non-rumours: This dataset contains a collection of Twitter rumours and non-rumours posted during five breaking news events.
- Tweet geolocation 5m: This is a dataset with more than 5 million geolocated tweets. Each tweet is associated with fine-grained location information, collected from OpenStreetMap using the reverse geocoding feature in Nominatim.
- PHEME rumour dataset: This is a dataset of conversations around rumours associated with 9 different breaking news stories, collected from Twitter. It was developed within the journalism use case of the PHEME FP7 project. Each tweet is annotated for support, certainty, and evidentiality.
- TweetMT: A dataset for machine translation of tweets.
- TweetLID: A dataset for tweet language identification, which includes 35k tweets with manually annotated language labels.
- Hurricane Sandy tweets: Nearly 15 million tweets posted on Twitter while Hurricane Sandy was hitting the East Coast of the United States, as well as in the aftermath.
- ODPtweets: A large-scale Twitter dataset with nearly 25 million tweets categorized in the structure of the Open Directory Project (ODP).
- tweet-norm_es: Spanish-language tweets annotated for lexical normalization. Created for the tweet normalization challenge at Tweet-Norm 2013.
- Trending topics: A dataset with 1,036 categorized trending topics, which we used in "Real-Time Classification of Twitter Trends".
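As a rough illustration of the collection rule described for the Twitter death hoaxes dataset (tweets in which a known name appears next to 'RIP'), the following is a minimal sketch and not the code used to build the dataset. The name set, the four-word limit on name length, the case-insensitive matching, and the function name are all assumptions made for the example; in the original collection the names came from Wikidata.

```python
import re

# Hypothetical stand-in for a list of person names taken from Wikidata.
KNOWN_NAMES = {"jane doe", "john smith"}

# Assumption: names are at most four words long.
MAX_NAME_TOKENS = 4


def names_next_to_rip(text, known_names=KNOWN_NAMES):
    """Return known names that appear directly before or after 'RIP'."""
    words = re.findall(r"[\w'-]+", text)
    found = []
    for i, word in enumerate(words):
        # Assumption: 'RIP' is matched case-insensitively as a whole word.
        if word.upper() != "RIP":
            continue
        # Try the longest candidate first, then shorter ones.
        for n in range(MAX_NAME_TOKENS, 0, -1):
            after = " ".join(words[i + 1 : i + 1 + n]).lower()
            before = " ".join(words[max(0, i - n) : i]).lower()
            for candidate in (after, before):
                if candidate in known_names and candidate not in found:
                    found.append(candidate)
    return found
```

Calling `names_next_to_rip("RIP John Smith, gone too soon")` checks the words following 'RIP' against the name set, while a tweet such as "I need to rip this CD" yields no match because no known name is adjacent to the keyword.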
Social tagging datasets
- SocialBM0311: A large-scale, longitudinal social tagging dataset collected from Delicious.com. It contains the complete bookmarking activity for 2 million users from the launch of the social bookmarking website in 2003 to the end of March 2011.
- Social-ODP-2k9: 12,616 unique URLs, with categories from the Open Directory Project (ODP/Dmoz) and a variety of social annotations (tags, notes, reviews, etc.) retrieved from Delicious and StumbleUpon.
- DeliciousT140: 144,574 unique URLs, with social tags retrieved from Delicious.
- Wiki10+: 20,764 English Wikipedia articles, with social tags retrieved from Delicious.