Dataset: DeliciousT140
DeliciousT140 is a dataset created during June 2008 with data retrieved from the social bookmarking site Delicious and the Web. It is available for research purposes.
Statistics
This dataset consists of 144,574 unique URLs, each with its corresponding social tags retrieved from Delicious in June 2008. The documents are annotated with 67,104 distinct tags.
To learn more about the dataset generation process, please read the paper cited in the Reference section below.
Metadata Format
All the metadata for the dataset documents is provided in XML format, following this pattern:
<documents>
...
<document>
<url>Document's URL</url>
<hash>MD5 hash for document's URL</hash>
<filetype>File extension: html, pdf, xml or swf</filetype>
<filename>Filename of the document in the dataset</filename>
<users># of users who bookmarked it</users>
<tags>
...
<tag>
<name>Tag name</name>
<count># of users who annotated the document with this tag</count>
</tag>
...
</tags>
</document>
...
</documents>
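As an illustration of how this metadata could be read, the following is a minimal Python sketch that streams over the <document> elements using only the standard library. The element names come from the pattern above. The function name iter_documents and the sample values (URL, hash, filename, counts) are made up for the example, and the sketch assumes every document carries all of the fields shown in the pattern.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(source):
    """Yield one dict per <document> element.

    `source` may be a file path or a binary file object. Elements are
    processed as they are parsed, which avoids building one tree for
    the whole metadata file.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "document":
            continue
        # Map each tag name to the number of users who applied it.
        tags = {
            tag.findtext("name"): int(tag.findtext("count"))
            for tag in elem.iterfind("tags/tag")
        }
        yield {
            "url": elem.findtext("url"),
            "hash": elem.findtext("hash"),
            "filetype": elem.findtext("filetype"),
            "filename": elem.findtext("filename"),
            "users": int(elem.findtext("users")),
            "tags": tags,
        }
        elem.clear()  # drop the children of the document just processed


# Made-up sample that follows the documented pattern (not real dataset content).
SAMPLE = b"""<documents>
<document>
<url>http://example.org/</url>
<hash>0123456789abcdef0123456789abcdef</hash>
<filetype>html</filetype>
<filename>0123456789abcdef0123456789abcdef.html</filename>
<users>3</users>
<tags>
<tag><name>example</name><count>2</count></tag>
<tag><name>web</name><count>1</count></tag>
</tags>
</document>
</documents>"""

docs = list(iter_documents(io.BytesIO(SAMPLE)))
for doc in docs:
    print(doc["url"], doc["users"], doc["tags"])
```

To process the real metadata, pass the path of the extracted XML file to iter_documents instead of the in-memory sample.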
Legal Information
By downloading and using this dataset you acknowledge that:
- The data has been compiled exclusively for scientific research purposes.
- The copyright holders retain ownership and reserve all rights.
Reference
Please consider citing the following paper if you use this dataset in your research:
Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, and Raquel Martínez. Content-based Clustering for Tag Cloud Visualization. Proceedings of ASONAM 2009, International Conference on Advances in Social Networks Analysis and Mining. 2009.
BibTeX:
@inproceedings{zubiaga2009content,
title={Content-based clustering for tag cloud visualization},
author={Zubiaga, Arkaitz and P{\'e}rez Garc{\'\i}a-Plaza, Alberto and Fresno, V{\'\i}ctor and Mart{\'\i}nez, Raquel},
booktitle={Social Network Analysis and Mining, 2009. ASONAM'09. International Conference on Advances in},
pages={316--319},
year={2009},
organization={IEEE}
}
Download
- delicioust140_taginfo.tar.bz2 (12 MB): Contains all the URLs making up the collection, as well as their corresponding post count and common tag information.
- delicioust140_documents.tar.bz2 (1.6 GB): Contains the content of all the web documents in the dataset, as html, pdf, xml and swf files.
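
Because the documents archive is large, it can be useful to inspect it before extracting anything. The following is a minimal Python sketch that counts the archived files per extension using the standard tarfile module. The function name count_by_extension is hypothetical, and nothing is assumed about the directory layout inside the archive beyond files carrying the extensions listed above.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(archive_path):
    """Count the regular files in a .tar.bz2 archive per file extension.

    The archive is read sequentially and nothing is written to disk.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:bz2") as archive:
        for member in archive:
            if member.isfile():
                # Normalise ".HTML" or ".html" to "html".
                ext = PurePosixPath(member.name).suffix.lstrip(".").lower()
                counts[ext] += 1
    return counts
```

For example, count_by_extension("delicioust140_documents.tar.bz2") would return a Counter keyed by extension. Reading the full archive still requires decompressing all 1.6 GB, so this can take a while.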