Updates

Dataset change lists available

We are aware that the dataset is very large and changes frequently. Instead of downloading the whole dataset and validation set each time they are updated (~80GB), you can now download just the lists of engaged-with Tweet IDs and user IDs whose data has been deleted. Please remove any entries that contain these IDs from your local copy. You can find these lists here.
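The clean-up step above could be sketched as follows. This is a minimal illustration, not an official tool: it assumes the deletion lists hold one ID per line, that the Tweet id, engaged-with user id and engaging user id sit at the 0-based column positions 2, 9 and 14 (the order of the feature table in the dataset description), and that the caller passes the field separator explicitly. The names `load_ids` and `drop_deleted` are hypothetical.

```python
from typing import Iterable, Iterator, Set

# Assumed 0-based column positions, following the order of the feature table
# in the dataset description; adjust them if your files differ.
TWEET_ID_COL = 2
ENGAGED_WITH_USER_ID_COL = 9
ENGAGING_USER_ID_COL = 14


def load_ids(lines: Iterable[str]) -> Set[str]:
    """Collect IDs from a deletion list, assuming one ID per line."""
    return {line.strip() for line in lines if line.strip()}


def drop_deleted(
    entries: Iterable[str],
    deleted_tweet_ids: Set[str],
    deleted_user_ids: Set[str],
    sep: str,
) -> Iterator[str]:
    """Yield only the entries that reference no deleted Tweet or user."""
    for entry in entries:
        fields = entry.rstrip("\n").split(sep)
        if fields[TWEET_ID_COL] in deleted_tweet_ids:
            continue  # the engaged-with Tweet was deleted
        if (
            fields[ENGAGED_WITH_USER_ID_COL] in deleted_user_ids
            or fields[ENGAGING_USER_ID_COL] in deleted_user_ids
        ):
            continue  # one of the two users was deleted
        yield entry
```

Because `drop_deleted` is a generator over any iterable of lines, it can be fed an open file handle and written back out line by line, which avoids loading the full dataset into memory.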

Introduction

Twitter is what’s happening in the world and what people are talking about right now. On Twitter, live comes to life as conversations unfold, showing you all sides of the story. From breaking news and entertainment to sports, politics and everyday interests, when things happen in the world, they happen first on Twitter.

On the platform, users post and engage with (in the form of Likes, Replies, Retweets and Retweets with comments) content known as “Tweets”. This challenge aims to evaluate novel algorithms for predicting different engagement rates at large scale, and to push the state of the art in recommender systems. Following the success and advancements in the domain of top-K recommendations, we aim to encourage the development of new approaches by releasing the largest real-world dataset for predicting user engagements. The dataset comprises roughly 200 million public engagements, along with user and engagement features, that span a period of 2 weeks and contain public interactions (Like, Reply, Retweet and Retweet with comment), as well as 100 million pseudo negatives, which are randomly sampled from the public follow graph. While sampling the latter pool of Tweets, we take special care to preserve user privacy.

The submitted methods will be evaluated on a held-out test set generated from more recent Tweets on the platform, and the evaluation metrics will include the area under the precision-recall curve (PR-AUC) and cross-entropy loss. Participants will also be provided with a validation set, for which the engagement information will be missing. To protect our users’ privacy, the dataset will be updated daily to ensure GDPR compliance, and the corresponding metrics will be updated on the leaderboard.
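To make the two metrics concrete, here is a minimal pure-Python sketch for a single engagement type. It is not the official scoring code: the text does not say how PR-AUC is integrated or how probabilities are clipped, so the trapezoidal rule, the starting point (recall 0, precision 1) and the `eps` clipping value are all assumptions, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(labels: Sequence[int], preds: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary cross-entropy; predictions are clipped to [eps, 1 - eps]."""
    if not labels:
        raise ValueError("at least one example is required")
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def pr_auc(labels: Sequence[int], preds: Sequence[float]) -> float:
    """Area under the precision-recall curve, integrated with the trapezoidal rule."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("PR-AUC is undefined without positive examples")
    # Walk through the examples from the highest to the lowest score.
    pairs = sorted(zip(preds, labels), key=lambda t: -t[0])
    points = [(0.0, 1.0)]  # (recall, precision); assumed starting point
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Tied scores share one threshold, so they are consumed together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Higher PR-AUC is better, while lower cross-entropy is better. PR-AUC only depends on how the examples are ranked, whereas cross-entropy also penalizes poorly calibrated probabilities, so the two metrics complement each other.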

Prizes

Twitter, as a sponsor of this challenge, is providing the dataset on which all methods will be evaluated. The best three teams will be rewarded with the following prizes:

Dataset description

The data is available to download here. Fields in each data entry are separated by the 1 character (0x31 in UTF-8), and each data entry is characterized by the following features:

Feature Name | Feature Type | Feature Description
Tweet Features
  • Text tokens | List[long] | Ordered list of BERT ids corresponding to the BERT tokenization of the Tweet text
  • Hashtags | List[string] | Tab-separated list of hashtags (identifiers) present in the Tweet
  • Tweet id | String | Tweet identifier
  • Present media | List[string] | Tab-separated list of media types. Media type can be one of (Photo, Video, Gif)
  • Present links | List[string] | Tab-separated list of links (identifiers) included in the Tweet
  • Present domains | List[string] | Tab-separated list of domains included in the Tweet (twitter.com, dogs.com)
  • Tweet type | String | Tweet type, can be either Retweet, Quote, Reply, or Toplevel
  • Language | String | Identifier corresponding to the inferred language of the Tweet
  • Timestamp | Long | Unix timestamp, in seconds, of the creation time of the Tweet
Engaged With User Features
  • User id | String | User identifier
  • Follower count | Long | Number of followers of the user
  • Following count | Long | Number of accounts the user is following
  • Is verified? | Bool | Is the account verified?
  • Account creation time | Long | Unix timestamp, in seconds, of the creation time of the account
Engaging User Features
  • User id | String | User identifier
  • Follower count | Long | Number of followers of the user
  • Following count | Long | Number of accounts the user is following
  • Is verified? | Bool | Is the account verified?
  • Account creation time | Long | Unix timestamp, in seconds, of the creation time of the account
Engagement Features
  • Engagee follows engager? | Bool | Does the account of the engaged Tweet's author follow the account that has made the engagement?
  • Reply engagement timestamp | Long | If there is at least one, Unix timestamp, in seconds, of one of the replies
  • Retweet engagement timestamp | Long | If there is one, Unix timestamp, in seconds, of the Retweet of the Tweet by the engaging user
  • Retweet with comment engagement timestamp | Long | If there is at least one, Unix timestamp, in seconds, of one of the Retweets with comment of the Tweet by the engaging user
  • Like engagement timestamp | Long | If there is one, Unix timestamp, in seconds, of the Like
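The features listed above could be read into a typed record along the following lines. This is a sketch under several assumptions that the text does not confirm: the columns appear in the file in the order in which they are listed, list-valued fields (including the text tokens) are tab separated, booleans are written as `true`/`false`, and a missing engagement timestamp is an empty field. The separator is passed in by the caller rather than hard-coded, and all names in the code are hypothetical.

```python
from typing import Any, Dict

# (name, kind) pairs in the assumed column order: the order of the list above.
FIELDS = [
    ("text_tokens", "list_int"), ("hashtags", "list_str"), ("tweet_id", "str"),
    ("present_media", "list_str"), ("present_links", "list_str"),
    ("present_domains", "list_str"), ("tweet_type", "str"), ("language", "str"),
    ("tweet_timestamp", "int"),
    ("engaged_with_user_id", "str"), ("engaged_with_follower_count", "int"),
    ("engaged_with_following_count", "int"), ("engaged_with_is_verified", "bool"),
    ("engaged_with_account_creation", "int"),
    ("engaging_user_id", "str"), ("engaging_follower_count", "int"),
    ("engaging_following_count", "int"), ("engaging_is_verified", "bool"),
    ("engaging_account_creation", "int"),
    ("engagee_follows_engager", "bool"),
    ("reply_timestamp", "int"), ("retweet_timestamp", "int"),
    ("retweet_with_comment_timestamp", "int"), ("like_timestamp", "int"),
]
ENGAGEMENTS = ("reply", "retweet", "retweet_with_comment", "like")


def _convert(raw: str, kind: str) -> Any:
    """Convert one raw field; empty fields become [] or None."""
    if kind == "list_int":
        return [int(tok) for tok in raw.split("\t")] if raw else []
    if kind == "list_str":
        return raw.split("\t") if raw else []
    if kind == "int":
        return int(raw) if raw else None
    if kind == "bool":
        return raw.lower() == "true"  # assumed textual encoding of booleans
    return raw


def parse_entry(line: str, sep: str) -> Dict[str, Any]:
    """Split one data entry on `sep` and convert each field to its type."""
    values = line.rstrip("\n").split(sep)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return {name: _convert(raw, kind) for (name, kind), raw in zip(FIELDS, values)}


def engagement_labels(entry: Dict[str, Any]) -> Dict[str, bool]:
    """An engagement counts as positive when its timestamp is present."""
    return {name: entry[f"{name}_timestamp"] is not None for name in ENGAGEMENTS}
```

Turning timestamp presence into binary labels gives one target per engagement type, which is the form that PR-AUC and cross-entropy operate on.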

In total, we are going to release 3 datasets:

The dataset includes data features that are publicly available, and are described in greater detail in the Twitter Developer documentation.

In order for participants to gain access to this dataset, each individual needs to register for a developer account through the following link. It is advised to do so as early as possible, as access to the data cannot be granted without prior authorisation.

As part of the RecSys Challenge, you may access Twitter Content (as defined in the Twitter Developer Agreement and Policy). Your access to and use of the Twitter Content is governed by the Twitter Developer Agreement and Policy. Your access to the Twitter Content is limited to the purposes of the RecSys Challenge and those approved via the application process. The Twitter Developer Agreement and Policy do not expand the license to the Twitter Content.

In addition to the restrictions set out in the Twitter Developer Agreement and Policy, isolating individual Tweets or users for purposes other than participation in this challenge is strictly forbidden. Other restricted uses of the dataset can be found here.

In order to participate in the RecSys Challenge, individuals need to sign up here and here to agree to the Twitter Developer Agreement and Policy and the RecSys Challenge's Terms & Conditions listed here. Twitter is providing the dataset solely as a sponsor of RecSys Challenge 2020. Twitter is not responsible for the RecSys Challenge, which shall be controlled and administered by RecSys as explained here and here. Following the RecSys Challenge, Twitter may make the dataset available for researchers to access, subject to the Twitter Developer Agreement and Policy.