Dataset change lists available
We are aware that the dataset is very large and changes frequently. Instead of downloading the whole dataset and validation set each time they are updated (~80GB), you can now download just the lists of engaged-with Tweet IDs and user IDs for data that has been deleted. Please remove any data that contains these IDs from your local copy. You can find these lists here.
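As a rough illustration of the clean-up step described above, the following sketch drops every locally stored entry that mentions a deleted Tweet ID or user ID. It assumes each entry has already been split into a list of fields; the function name `drop_deleted` and the column positions used in the demo are hypothetical and must be adapted to the actual field layout.

```python
def drop_deleted(entries, deleted_tweet_ids, deleted_user_ids,
                 tweet_id_col, user_id_cols):
    """Return only the entries that reference no deleted Tweet or user.

    entries           -- iterable of field lists, one list per data entry
    deleted_tweet_ids -- set of Tweet IDs taken from the change list
    deleted_user_ids  -- set of user IDs taken from the change list
    tweet_id_col      -- index of the Tweet ID field (assumption)
    user_id_cols      -- indices of all user ID fields (assumption)
    """
    kept = []
    for fields in entries:
        if fields[tweet_id_col] in deleted_tweet_ids:
            continue  # the engaged-with Tweet was deleted
        if any(fields[c] in deleted_user_ids for c in user_id_cols):
            continue  # one of the users involved was deleted
        kept.append(fields)
    return kept


# Hypothetical layout: column 0 = Tweet ID, columns 1 and 2 = user IDs.
local_copy = [
    ["t1", "u1", "u2"],
    ["t2", "u3", "u4"],
    ["t3", "u5", "u1"],
    ["t4", "u6", "u7"],
]
cleaned = drop_deleted(local_copy,
                       deleted_tweet_ids={"t2"},
                       deleted_user_ids={"u1"},
                       tweet_id_col=0,
                       user_id_cols=(1, 2))
print(cleaned)
```

Keeping the deleted IDs in sets makes each membership check constant time, which matters when the local copy holds hundreds of millions of entries.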
Twitter is what’s happening in the world and what people are talking about right now. On Twitter, live comes to life as conversations unfold, showing you all sides of the story. From breaking news and entertainment to sports, politics and everyday interests, when things happen in the world, they happen first on Twitter.
On the platform, users post and engage with (in the form of Likes, Replies, Retweets and Retweets with comments) content known as “Tweets”. This challenge aims to evaluate novel algorithms for predicting different engagement rates at a large scale, and to push the state of the art in recommender systems. Following the success and advancements in the domain of top-K recommendations, we aim to encourage the development of new approaches by releasing the largest real-world dataset to predict user engagements. The dataset comprises roughly 200 million public engagements, along with user and engagement features, that span a period of 2 weeks and contain public interactions (Like, Reply, Retweet and Retweet with comment), as well as 100 million pseudo negatives which are randomly sampled from the public follow graph. While sampling the latter pool of Tweets, we take special care to preserve user privacy.
The submitted methods will be evaluated on a held-out test set generated from more recent Tweets on the platform, and the evaluation metrics will include the area under the precision-recall curve (PR-AUC) and cross-entropy loss. Participants will also be provided with a validation set, for which the engagement information will be missing. Paying special attention to our users’ privacy, the dataset will be updated daily to ensure GDPR compliance, and the corresponding metrics will be updated on the leaderboard.
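To make the two metrics concrete, here is a minimal standard-library sketch of binary cross-entropy and a trapezoidal PR-AUC. It is not the official evaluation code: the function names, the probability clipping constant, the handling of tied scores and the choice to start the curve at recall 0 with precision 1 are all assumptions made for illustration.

```python
import math


def cross_entropy(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def pr_auc(labels, scores):
    """Area under the precision-recall curve using the trapezoidal rule."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PR-AUC needs at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 1.0)]  # (recall, precision); starting point is a convention
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Entries with equal scores fall under the same threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((tp / n_pos, tp / (tp + fp)))
        i = j
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Toy example: 1 = the user engaged, 0 = no engagement.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.8, 0.7, 0.1]
print(pr_auc(y_true, y_pred), cross_entropy(y_true, y_pred))
```

Higher PR-AUC is better, while lower cross-entropy is better, so the two metrics reward ranking quality and calibrated probabilities respectively.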
Twitter, as a sponsor of this challenge, is providing the dataset on which all methods will be evaluated. The three best teams will be rewarded with the following prizes:
The data is available to download here. Fields in each data entry are separated by the 1 character (0x31 in UTF-8), and each data entry is characterized by the following features:
|Feature Name|Feature Type|Feature Description|
|Engaged With User Features|||
|Engaging User Features|||
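Splitting a raw entry into its fields could look like the sketch below. The separator is deliberately a required argument rather than a hard-coded value: the text above gives it as the character 1 (0x31), and since a digit separator would collide with numeric field values, it is worth checking the actual byte in the downloaded files before relying on it. The `"\x01"` control character and the field values in the demo are assumptions for illustration only, as are the function names.

```python
def split_entry(line, sep):
    """Split one raw data entry into its list of fields."""
    return line.rstrip("\r\n").split(sep)


def read_entries(lines, sep):
    """Yield the fields of every non-empty line."""
    for line in lines:
        if line.rstrip("\r\n"):
            yield split_entry(line, sep)


SEP = "\x01"  # assumption: verify against the downloaded files
raw_lines = [
    "tweet_a\x01user_x\x01user_y\n",
    "\n",  # blank lines are skipped
    "tweet_b\x01user_z\x01user_x\n",
]
entries = list(read_entries(raw_lines, SEP))
print(entries)
```

Because `read_entries` is a generator, the same function can be fed an open file handle and will process the data one line at a time instead of loading everything into memory.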
In total, we are going to release 3 datasets:
The dataset includes data features that are publicly available, and are described in greater detail in the Twitter Developer documentation.
In order for participants to gain access to this dataset, each individual needs to register for a developer account through the following link. It is advised to do so as early as possible, as access to the data cannot be granted without prior authorisation.
As part of the RecSys Challenge, you may access Twitter Content (as defined in the Twitter Developer Agreement and Policy). Your access to and use of the Twitter Content is governed by the Twitter Developer Agreement and Policy. Your access to the Twitter Content is limited to the purposes of the RecSys Challenge and those approved via the application process. The Twitter Developer Agreement and Policy do not expand the license to the Twitter Content.
In addition to the restrictions set out in the Twitter Developer Agreement and Policy, isolating individual Tweets or users for purposes other than participation in this challenge is strictly forbidden. Other restricted uses of the dataset can be found here.
In order to participate in the RecSys Challenge, individuals need to sign up here and here to agree to the Twitter Developer Agreement and Policy and the RecSys Challenge's Terms & Conditions listed here. Twitter is providing the dataset solely as a sponsor of RecSys Challenge 2020. Twitter is not responsible for the RecSys Challenge, which shall be controlled and administered by RecSys as explained here and here. Following the RecSys Challenge, Twitter may make the dataset available for researchers to access subject to the Twitter Developer Agreement and Policy.