Researchers Scrape 2 Billion Discord Messages and Publish Them Online

Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active. 

Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called "Searchcord" based on a different data set that shows non-anonymized chat histories.

These two separate events have caused panic in some Discord communities, with server moderators and users worrying about their privacy.

A team of 15 researchers at the Federal University of Minas Gerais in Brazil conducted the scrape as part of a research project. The team explained the how and why of the project in a paper titled Discord Unveiled: A Comprehensive Dataset of Public Communication (2015 - 2024), which they say was created so that other teams of researchers could have a database of online discussions to use when studying mental health and politics or training bots.

“Throughout every step of our data collection process, we prioritized adherence to ethical standards,” they wrote in a section called ‘Ethical Concerns.’ “Precautions were taken to collect data responsibly. All data was sourced from groups that are explicitly considered public according to Discord’s terms of use, which every user agrees to upon signing up. The data was anonymized, and the methodology was detailed to promote reproducibility and transparency.” That may be the case, but Discord is designed to be a series of chatrooms which are not universally searchable, and which in their design feel far less public than, say, tweeting something or posting it to Reddit.

The amount of data is massive. “This paper introduces the most extensive Discord dataset available to date, comprising 2,052,206,308 messages from 4,735,057 unique users across 3,167 servers—approximately 10% of the servers listed in Discord’s Discovery tab.” 

The researchers have published the database online as a series of JSON files. Within the database, one JSON file represents a single Discord server and all of the messages that were contained therein. A compressed sample version of the data is 6.2GB and unfurls into a 108GB database. The complete database is 118GB compressed and likely unfurls into a database at least an order of magnitude larger.
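As a rough check on those sizes, the sketch below estimates how large the full database might be once unpacked. It assumes the complete archive expands at the same ratio as the sample, which the paper does not state; the function name is made up for illustration.

```python
# Sizes reported in the article, in gigabytes.
SAMPLE_COMPRESSED_GB = 6.2
SAMPLE_EXPANDED_GB = 108
FULL_COMPRESSED_GB = 118


def estimate_expanded_size(compressed_gb: float, ratio: float) -> float:
    """Estimate the unpacked size from a compressed size and an expansion ratio."""
    return compressed_gb * ratio


# Assumption: the full archive expands at the same ratio as the sample.
ratio = SAMPLE_EXPANDED_GB / SAMPLE_COMPRESSED_GB
full_expanded_gb = estimate_expanded_size(FULL_COMPRESSED_GB, ratio)

print(f"sample expansion ratio: about {ratio:.1f}x")
print(f"estimated full size: about {full_expanded_gb / 1000:.1f} TB")
```

Under that assumption the full database would come to roughly 2TB, though the real figure depends on how well the remaining servers compress.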

The researchers said they created the dataset so that other researchers could study bots, politics, and mental health. “Our dataset enables researchers to explore the impact of digital platforms on political discourse, the propagation of misinformation, and the development of effective moderation and regulation strategies tailored to such environments,” the paper said in a section near the end.

They also said the database could be of help “identifying patterns of at-risk behavior and explore [sic] critical questions such as the prevalence of harm behaviors or supportive interactions” and “facilitate the creation of domain-specific chatbots.”

The way that the Brazilian researchers scraped these messages differs from the way that a tool we reported on last year did something similar. In 2024, a service called Spy.pet scraped Discord servers en masse by placing bots into specific servers which then archived the messages. This allowed the creators of Spy.pet to target specific servers and to archive the messages within servers that were not public. It also did not anonymize the messages in any way. Days after 404 Media broke the Spy.pet story, Discord banned accounts associated with the service. The Brazilian researchers say that they scraped the messages using Discord’s API.

Discord servers are user-generated and can be set to public or private, and newcomers can find the public servers using Discord’s “Discovery” feature. In their paper, the researchers said they used this discovery feature to map every public Discord server, discovering a total of 31,673 as of November 17, 2024. Then they selected 10 percent of those servers at random to scrape.
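The selection step can be pictured as a simple random sample without replacement. This is a minimal sketch, not the researchers' code: the server IDs are placeholders, and the function name and seed are assumptions made for illustration.

```python
import random


def sample_servers(server_ids, fraction=0.10, seed=None):
    """Pick a random subset of servers, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    sample_size = int(len(server_ids) * fraction)
    return rng.sample(server_ids, sample_size)


# Placeholder IDs standing in for the 31,673 servers found via Discovery.
all_servers = list(range(31_673))
chosen = sample_servers(all_servers, fraction=0.10, seed=42)

print(len(chosen))  # 3167
```

Ten percent of 31,673, rounded down, gives the 3,167 servers that appear in the dataset.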

The researchers accomplished this using Discord’s own public API to put in calls for all the data on the servers. Bots are popular on Discord and users stand them up for a variety of reasons including moderating channels, playing music, and rolling dice. User-designed bots are a ubiquitous part of the Discord experience and the company offers its public API, in part, to make the bots easy to launch and maintain. 

In their paper, the researchers insist that the project was conducted in the bounds of Discord’s API policies. They said that before publication, they replaced usernames with generated pseudonyms, hashed and truncated user and message IDs, and removed other identifying features entirely. “All data collection adhered strictly to Discord’s API guidelines, and anonymization techniques were applied to ensure compliance with privacy standards,” the paper said.
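To make the described anonymization concrete, here is a minimal sketch of replacing usernames with pseudonyms and hashing and truncating IDs. The paper's exact method is not given in this article, so the use of SHA-256, the salt, the truncation lengths, and the message field names are all assumptions made for illustration.

```python
import hashlib


def hash_and_truncate(identifier: str, salt: str, length: int = 16) -> str:
    """Hash an ID with a salt and keep only the first `length` hex characters."""
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:length]


def pseudonym_for(user_id: str, salt: str) -> str:
    """Derive a stable pseudonym so one user keeps one name across messages."""
    return "user_" + hash_and_truncate(user_id, salt, length=8)


def anonymize_message(message: dict, salt: str) -> dict:
    """Return a copy of a message with IDs hashed and the username replaced."""
    return {
        "id": hash_and_truncate(message["id"], salt),
        "author_id": hash_and_truncate(message["author_id"], salt),
        "author_name": pseudonym_for(message["author_id"], salt),
        "content": message["content"],  # the message text itself is unchanged
    }


# Hypothetical message record; the field names are not from the paper.
raw = {"id": "1001", "author_id": "42", "author_name": "alice", "content": "hi"}
print(anonymize_message(raw, salt="example-salt"))
```

A scheme like this hides account names and IDs, but it leaves the text of each message as written, so anything identifying inside the messages would still be there.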

The paper also pointed out that all these messages were scraped from public spaces. “All data was sourced from groups that are explicitly considered public according to Discord’s terms of use, which every user agrees to upon signing up.”

It should be noted, however, that almost no one reads end-user license agreements and many of Discord’s users are children and teenagers. Discord is, first and foremost, a platform for gamers to organize communities and it’s not plausible that a 15-year-old looking for a Fortnite meme server ever thought their dumb jokes about Tomato Town would end up in a public database five years later.

Even with the pains taken to anonymize the data, the scrape appears to be against Discord’s Terms of Service. The Discord Developer policy, which covers the use of its API, is clear. “Do not mine or scrape any data, content, or information available on or through Discord services,” it says. Some form of this prohibition against scraping has been in place since at least 2020.

Discord did not return 404 Media’s request for comment on this issue.

TechCrunch Disrupt 2025 Early Bird savings end on May 25

The early bird sees the future first — and saves the most. The old saying goes, “the early bird gets the worm.” But in tech — and in life — it’s not really about the worm. It’s about spotting what’s next before the crowd rushes in and the price goes up. TechCrunch Disrupt 2025 is […]

Warhammer 40K: Space Marine 2 is a glorious co-op shooter that’s now cheaper than ever

If you ask me, there’s always space in my games catalog for a fun third-person shooter that I can play with my buds online. Warhammer 40K: Space Marine 2 delivers some of the best blood-gushing, bug-crushing action, filling a Gears of War void that I didn’t know needed filling. You can jump into the fray while saving some money, as Space Marine 2 has hit its lowest price yet at Amazon, GameStop, and Best Buy. Normally $69.99, it costs $39.99 for the PlayStation 5 or Xbox Series X.

Other deals worth checking out

  • If you find yourself in a position of needing more storage for your original Nintendo Switch, Steam Deck, Asus ROG Ally, or some other device, there’s a great deal happening on Samsung’s 512GB microSD card at Amazon. You can get it for $29.99, a price we’ve seen before, but one that’s still good enough that it’s worth sharing again.
  • My colleague Sheena recently highlighted some of the great discounts happening on LG’s C4 OLEDs in time for Memorial Day. The lowest price, of course, is on the smallest 42-inch version, which currently costs $796.99 (roughly half off). The price drops apply to larger sizes, too, like the 65-inch version that’s down to $1,299.99 at Best Buy, which I consider to be a stellar deal.

Google teases an Android desktop mode, made with Samsung’s help

Windows in Android’s desktop mode can stretch and move across your screen.

Google is working with Samsung to bring a desktop mode to Android. During Google I/O’s developer keynote, engineering manager Florina Muntenescu said the company is “building on the foundation” of Samsung’s DeX platform “to bring enhanced windowing capabilities in Android 16,” as spotted earlier by 9to5Google.

Samsung first launched DeX in 2017. The feature automatically adjusts your phone’s interface and apps when connected to a larger display, allowing you to use your phone like a desktop device.

A demo during the presentation revealed a Samsung DeX-like layout, with apps like Gmail, Chrome, YouTube, and Google Photos centered in the taskbar at the bottom of the screen. It also showed how Android 16’s adaptive apps can move and stretch across the screen. The time sits at the top-left corner of the screen, with the Wi-Fi signal and battery on the right.

In March, Android Authority’s Mishaal Rahman reported on Google’s plans to create a desktop mode of its own, and later enabled an early version of the feature on a Pixel device.

Google shared more details in a blog post about the update, saying Android 16’s emphasis on adaptiveness will also help apps work on more kinds of devices, like foldables, tablets, Chromebooks, mixed reality wearables, and even cars.

Meta hypes AI friends as social media’s future, but users want real connections

If you ask the man who has largely shaped how friends and family connect on social media over the past two decades about the future of social media, you may not get a straight answer.

At the Federal Trade Commission's monopoly trial, Meta CEO Mark Zuckerberg attempted what seemed like an artful dodge to avoid criticism that his company allegedly bought out rivals Instagram and WhatsApp to lock users into Meta's family of apps so they would never post about their personal lives anywhere else. He testified that people actually engage with social media less often these days to connect with loved ones, preferring instead to discover entertaining content on platforms to share in private messages with friends and family.

As Zuckerberg spins it, Meta no longer perceives much advantage in dominating the so-called personal social networking market where Facebook made its name and cemented what the FTC alleged is an illegal monopoly.

How MrBeast ended up in the new season of Love, Death, and Robots

“The Screaming of the Tyrannosaur.”

One of the more surprising moments in volume four of Love, Death, and Robots is an appearance from YouTube star MrBeast. He shows up in the episode "The Screaming of the Tyrannosaur," playing a sort of twisted game master presiding over a death race on one of the moons of Jupiter. Also, there are dinosaurs. According to LDR creator Tim Miller, who also directed the episode, the collaboration started out simply because MrBeast was a fan of the show. It then solidified once Miller realized he had the ideal role.

"I have this evil game master here, and I thought he would be perfect for that," Miller says. "I watched his Amazon show and I thought 'what a dick' often. With some of the contestants, he seemed to take a particular joy in their uncomfortableness. Not because he's an evil guy - he's not, he's a super nice guy. I think he just enjoys the whole machination of people and how they can either work together or against each other. And it seemed to fit this particular role very well."

Miller says that because MrBeast was such a fan, he didn't actually charge anything for his performance. "The cool thing is he likes the show so much - we couldn't afford MrBeast prices or anything …

Read the full story at The Verge.

Podcast: AI Slop Summer

We start this week with a couple of Jason's stories about how the Chicago Sun-Times printed a summer guide that was basically all AI-generated. Jason spoke to the person behind it. After the break, a bunch of documents show that schools were simply not ready for AI. In the subscribers-only section, we chat all about Star Wars and those funny little guys.

Listen to the weekly podcast on Apple Podcasts, Spotify, or YouTube. Become a paid subscriber for access to this episode's bonus content and to power our journalism. If you become a paid subscriber, check your inbox for an email from our podcast host Transistor for a link to the subscribers-only version! You can also add that subscribers feed to your podcast app of choice and never miss an episode that way. The email should also contain the subscribers-only unlisted YouTube link for the extended video version too. It will also be in the show notes in your podcast player.