Researchers Scrape 2 Billion Discord Messages and Publish Them Online
Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active.
Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called "Searchcord" based on a different data set that shows non-anonymized chat histories.
These two separate events have caused panic in some Discord communities, with server moderators and users worried about their privacy.
A team of 15 researchers at the Federal University of Minas Gerais in Brazil conducted the scrape as part of a research project. The team explained the how and why of the project in a paper titled “Discord Unveiled: A Comprehensive Dataset of Public Communication (2015 - 2024),” which they say was created so that other teams of researchers could have a database of online discussions to use when studying mental health and politics or training bots.
“Throughout every step of our data collection process, we prioritized adherence to ethical standards,” they wrote in a section called ‘Ethical Concerns.’ “Precautions were taken to collect data responsibly. All data was sourced from groups that are explicitly considered public according to Discord’s terms of use, which every user agrees to upon signing up. The data was anonymized, and the methodology was detailed to promote reproducibility and transparency.” That may be the case, but Discord is designed to be a series of chatrooms which are not universally searchable, and which in their design feel far less public than, say, tweeting something or posting it to Reddit.
The amount of data is massive. “This paper introduces the most extensive Discord dataset available to date, comprising 2,052,206,308 messages from 4,735,057 unique users across 3,167 servers—approximately 10% of the servers listed in Discord’s Discovery tab.”
The researchers have published the database online as a series of JSON files. Within the database, each JSON file represents a single Discord server and all of the messages it contained. A compressed sample version of the data is 6.2GB and unfurls into a 108GB database. The complete database is 118GB compressed and likely unfurls into a far larger one.
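For a rough sense of scale, the sample's expansion ratio can be applied to the full archive. The sketch below is back-of-the-envelope arithmetic, not a figure from the paper: it assumes the full dataset compresses about as well as the sample does, and the function name is made up for illustration.

```python
def estimate_uncompressed_gb(sample_compressed_gb: float,
                             sample_uncompressed_gb: float,
                             full_compressed_gb: float) -> float:
    """Scale the full archive by the sample's expansion ratio.

    Assumption: the full dataset expands at the same ratio as the sample.
    """
    ratio = sample_uncompressed_gb / sample_compressed_gb
    return full_compressed_gb * ratio


# Figures reported in the article: a 6.2GB sample expands to 108GB,
# and the complete archive is 118GB compressed.
estimate = estimate_uncompressed_gb(6.2, 108, 118)
print(f"Estimated uncompressed size: about {estimate:,.0f} GB")
```

Under that assumption the full database would come to roughly 2TB, about 17 times its compressed size.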
The researchers said they created the dataset so that other researchers could study bots, politics, and mental health. “Our dataset enables researchers to explore the impact of digital platforms on political discourse, the propagation of misinformation, and the development of effective moderation and regulation strategies tailored to such environments,” the paper said in a section near the end.
They also said the database could be of help “identifying patterns of at-risk behavior and explore [sic] critical questions such as the prevalence of harm behaviors or supportive interactions” and “facilitate the creation of domain-specific chatbots.”
The Brazilian researchers’ method differs from that of a similar tool we reported on last year. In 2024, a service called Spy.pet scraped Discord servers en masse by placing bots into specific servers which then archived the messages. This allowed the creators of Spy.pet to target specific servers and to archive the messages within servers that were not public. It also did not anonymize the messages in any way. Days after 404 Media broke the Spy.pet story, Discord banned accounts associated with the service. The Brazilian researchers say that they scraped the messages using Discord’s API.
Discord servers are user generated and can be set to public or private, and newcomers can find the public servers using Discord’s “Discovery” feature. In their paper, the researchers said they used this discovery feature to map every public Discord server, discovering a total of 31,673 as of November 17, 2024. Then they randomly selected 10 percent of those servers to scrape.
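The selection step described here amounts to a simple random sample without replacement. The sketch below shows what that could look like. It is not the researchers' code: the server IDs are placeholders, and the function name, seed, and choice to round down are assumptions for illustration.

```python
import random


def sample_servers(server_ids: list, fraction: float = 0.10, seed=None) -> list:
    """Pick a random subset of servers without replacement.

    Assumption: the sample size is rounded down to a whole number.
    """
    rng = random.Random(seed)  # a seeded generator makes the draw repeatable
    k = int(len(server_ids) * fraction)
    return rng.sample(server_ids, k)


# Placeholder IDs standing in for the 31,673 servers the paper says it mapped.
all_servers = list(range(31_673))
chosen = sample_servers(all_servers, fraction=0.10, seed=42)
print(len(chosen))
```

With 31,673 servers, rounding 10 percent down gives 3,167, which matches the number of servers in the published dataset.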
The researchers accomplished this using Discord’s own public API to put in calls for all the data on the servers. Bots are popular on Discord and users stand them up for a variety of reasons including moderating channels, playing music, and rolling dice. User-designed bots are a ubiquitous part of the Discord experience and the company offers its public API, in part, to make the bots easy to launch and maintain.
In their paper, the researchers insist that the project was conducted in the bounds of Discord’s API policies. They said that before publication, they replaced usernames with generated pseudonyms, hashed and truncated user and message IDs, and removed other identifying features entirely. “All data collection adhered strictly to Discord’s API guidelines, and anonymization techniques were applied to ensure compliance with privacy standards,” the paper said.
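To make the described steps concrete, here is a minimal sketch of pseudonymizing one message record. Everything specific in it is an assumption: the article does not say which hash function, truncation length, or pseudonym scheme the researchers used, and the field names, the salt, and the `user_` prefix are hypothetical.

```python
import hashlib


def hash_and_truncate(raw_id: str, salt: str, length: int = 16) -> str:
    """Hash an ID with SHA-256 and keep only the first `length` hex characters.

    Assumption: SHA-256 and a 16-character cut stand in for whatever
    the paper actually used.
    """
    digest = hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
    return digest[:length]


def pseudonym_for(username: str, salt: str) -> str:
    """Derive a stable stand-in name so one user's messages stay linked."""
    return "user_" + hash_and_truncate(username, salt, length=8)


def anonymize_message(message: dict, salt: str) -> dict:
    """Return a copy of the record with identifying fields replaced."""
    return {
        "message_id": hash_and_truncate(message["message_id"], salt),
        "author_id": hash_and_truncate(message["author_id"], salt),
        "author_name": pseudonym_for(message["author_name"], salt),
        "content": message["content"],  # message text is left as-is in this sketch
    }


# Hypothetical record; these field names are not taken from the dataset.
record = {
    "message_id": "100000000000000001",
    "author_id": "200000000000000002",
    "author_name": "example_user",
    "content": "hello",
}
cleaned = anonymize_message(record, salt="example-salt")
print(cleaned)
```

Because the same input always produces the same output, a user's messages can still be grouped together across a server without exposing the original username or ID.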
The paper also pointed out that all these messages were scraped from public spaces. “All data was sourced from groups that are explicitly considered public according to Discord’s terms of use, which every user agrees to upon signing up.”
It should be noted, however, that almost no one reads end-user license agreements and many of Discord’s users are children and teenagers. Discord is, first and foremost, a platform for gamers to organize communities and it’s not plausible that a 15-year-old looking for a Fortnite meme server ever thought their dumb jokes about Tomato Town would end up in a public database five years later.
Even with the pains taken to anonymize the data, the scrape appears to be against Discord’s Terms of Service. The Discord Developer policy, which covers the use of its API, is clear. “Do not mine or scrape any data, content, or information available on or through Discord services,” it says. Some form of this prohibition against scraping has been in place since at least 2020.
Discord did not respond to 404 Media’s request for comment on this issue.