SPAM Dataset

Download (v1.0, annotated N-triples)

We provide a sample dataset containing malicious triples that could be used to evaluate the resilience of Linked Data applications or to train spam filters.

The dataset is a polluted version of a fraction of the Billion Triple Challenge 2012 Dataset. More specifically, we chose the 1-hop expansion “Timbl crawl”, a crawl seeded with Tim Berners-Lee’s FOAF profile, and applied the spam vectors described in the paper.

The resulting dataset contains approximately 16k triples (spam triples account for 4% of the dataset size). It includes samples of Content contamination vectors, Link poisoning vectors, and Non-triple-based attacks (Malicious subclassing only).

Contents v1.0

False labeling: rdfs:label properties associated with akt:Organisation entities
Misattribution: a malicious bibo:Quote has been associated with every foaf:Person in the dataset
Void pollution: added deceiving dcterms:subject triples (about computer science publications)
Schema pollution: all owl:Class and rdf:Property entities have been associated with malicious rdfs:label values
Misdirection: added foaf:depiction triples pointing to replica watches (one triple for each foaf:Person)
Identity assumption: added malicious owl:sameAs, one for each akt:Organisation
Inverse-functional property cloning: foaf:Person entities have been cloned, and the clones are associated with the same foaf:homepage, which is considered an inverse-functional property (IFP)
Data URI Embedding: added an rdfs:seeAlso data:text/html URI for each akt:Organisation
Malicious subclassing: foaf:Person and akt:Organisation have been defined as subclasses of "GraviaBuyer" and "GraviaSupplier", respectively.
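
As an illustration of how two of the vectors listed above could be looked for, the following Python sketch scans N-Triples lines for subjects that share a foaf:homepage value (a possible sign of inverse-functional property cloning) and for rdfs:seeAlso links whose target is a data: URI. It is a minimal sketch, not the method used to build the dataset: it assumes plain N-Triples lines whose subject, predicate and object are all IRIs, it ignores every other line (including literals and whatever annotation syntax the download uses), and the function names and the example.org sample triples are hypothetical.

```python
import re
from collections import defaultdict

FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"
RDFS_SEEALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

# Matches only simple N-Triples lines of the form <s> <p> <o> .
# Lines with literals, blank nodes or comments are skipped (assumption).
TRIPLE_RE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_iri_triples(lines):
    """Yield (subject, predicate, object) for lines made of three IRIs."""
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match:
            yield match.groups()


def find_suspects(lines):
    """Return (shared_homepages, data_uri_links) for the given lines.

    shared_homepages maps a homepage IRI to the sorted list of subjects
    that claim it, keeping only homepages claimed by more than one subject.
    data_uri_links lists (subject, object) pairs for rdfs:seeAlso triples
    whose object is a data: URI.
    """
    by_homepage = defaultdict(set)
    data_uri_links = []
    for s, p, o in parse_iri_triples(lines):
        if p == FOAF_HOMEPAGE:
            by_homepage[o].add(s)
        elif p == RDFS_SEEALSO and o.lower().startswith("data:"):
            data_uri_links.append((s, o))
    shared = {
        home: sorted(subjects)
        for home, subjects in by_homepage.items()
        if len(subjects) > 1
    }
    return shared, data_uri_links


# Hypothetical sample triples, not taken from the dataset.
SAMPLE = [
    "<http://example.org/alice> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/home> .",
    "<http://example.org/alice-clone> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/home> .",
    "<http://example.org/bob> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/bob-home> .",
    "<http://example.org/org1> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <data:text/html,%3Cb%3Ehi%3C%2Fb%3E> .",
    '<http://example.org/org1> <http://www.w3.org/2000/01/rdf-schema#label> "Org One" .',
]

shared, data_links = find_suspects(SAMPLE)
for home, subjects in shared.items():
    print("shared homepage:", home, subjects)
for subject, target in data_links:
    print("data URI link:", subject, target)
```

A shared foaf:homepage is only a hint: legitimate descriptions of the same person published under different URIs share it too, so the output is a list of candidates to inspect rather than a spam verdict.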

Publications

Ali Hasnain, Mustafa Al-Bakri, Luca Costabello, Zijie Cong, Ian Davis and Tom Heath. Spamming in Linked Data, Third International Workshop on Consuming Linked Data (COLD2012) [PDF]

Contacts

Ali Hasnain, Mustafa Al-Bakri, Zijie Cong, Luca Costabello, Ian Davis, Tom Heath
