
AI-Generated Book Grifters Threaten The Future of Lace-Making

26 December 2024 at 06:00

AI-generated books and images are threatening the nearly 500-year-old art of lace making. 

AI has already come for the crochet community, and researchers have tried to teach machines to knit. But lace-making, a craft that even Renaissance artists struggled to master and in which a literally infinite number of patterns can be created, is now having its AI slop moment.

Mary Mangan, the librarian for her New England-based lace making group, told me that she first became aware of AI infiltrating lace spaces when someone in her group asked her to research a book that featured a cover photo that she wanted to try to make herself. “So I began to research the book. It smelled funny and I tried to search for the author's other work but couldn't find any,” Mangan said. She asked r/BobbinLace, a Reddit community for the bobbin lace-making technique, and users there helped track down the original, not-AI image from a lace catalog that the cover photo seemed to be based on. 

Longtime lace makers and experts have started raising the alarm about AI grifting in their tight-knit community. Karen Bovard-Sayre, who has published several books about lace techniques, posted a video in November addressing the issue, saying she found 36 books about lace and tatting (a lacemaking technique) that seemed AI-generated. She said she was looking at Amazon books about tatting to see what else was being published on the topic, and found that many of the AI books targeted beginners.

“As you probably all know, the tatting world's not that big even though it's around the world, but we kind of know who's doing what, who's making content, who's making books and all that,” Bovard-Sayre said in her video. “I started reading the summaries and they all kind of sounded flowery and didn't really say what they were, and then I started looking at the covers and back covers, and said wait a minute, something's wrong here.” She spends the rest of the video demonstrating what these books get wrong, and how to spot AI generated lace making materials. 

Some of the AI signs Bovard-Sayre points out include odd punctuation in the author’s name (in the case of the book she’s examining in her video, “Sheila .A Richard,” where there’s a period before the middle initial); references to video tutorials like “This is a wonderful instructional video,” which make no sense in a printed book; obvious misspellings; and distorted or blurry photos.

She also finds designs in the book that she recognized as being the work of other lace designers, including Marilee Rockley, a fiber artist who specializes in tatting. Rockley also recently addressed the rise in AI generated materials on her website. “Some of you may have heard about the miserable thieves who are using Artificial Intelligence technology to ‘make’ books to sell,” she wrote. “Really horrible, fake books loaded with wrong information (lies) and stolen photos. They're so bad it would be laughable except they hurt a lot of innocent people who are looking to learn a new-to-them craft.” 

Preying on beginners’ lack of knowledge and relative inability to spot blatant fakes is a tactic used in other AI book grifts, too. The mushroom foraging community recently discovered AI scam books were flooding Amazon, directing newcomers to bad, potentially deadly misinformation. Unlike eating a poisonous mushroom because a chatbot or AI book told you it’s safe, buying a book on lacemaking that contains sloppily generated images or instructions isn’t a matter of life and death. But it does threaten to devalue and dilute the integrity of a centuries-old art, and to deter newcomers.

“Lace is a small hobby and a pretty tight community. We know who the designers and vendors are, and we trust them. However, until you become part of the lace community there's no way to know who is trustworthy and what is dubious. You need some level of skill and time within the network to really assess this,” Mangan told me. “Unfortunately, for newcomers who might be excited to dive into this hobby, they could get burned by the inadequate books—and frankly the thievery—of the work of our cherished lacemakers and designers. This could sour newbies on the craft and that would be unfortunate. And it could harm designers who opt out of sharing their works, and we'll all lose then.”

Lacemaker and textile historian Elena Kanagy-Loux told me she first noticed the proliferation of AI-generated books on bobbin lace while teaching a course last summer. A student showed her a book she’d recommended to her students on Amazon, but the recommended books on the site seemed off. “There were a number of suggested lace books with strange covers that did not represent real lace techniques, and subsequently I have been warning all of my students to avoid Amazon and buy from independent lace suppliers (a good practice for a multitude of reasons),” she said. “Now I see that there are a number of them advertising different lace techniques with strange AI images on the cover that don’t represent real lace or tools, and contents that—according to reviews—are either nonsense that provide no tangible instructions, or directly plagiarized from real lace books.” 

Some of the books Elena Kanagy-Loux found on Amazon included: 

I sent all of the above listings to Amazon for comment, and the platform removed all of them except for the first one. “We have content guidelines governing which books can be listed for sale, and we have proactive and reactive methods that help us detect content that violates our guidelines, whether AI-generated or not. We invest significant time and resources to ensure our guidelines are followed, and remove books that do not adhere to those guidelines," a spokesperson for Amazon told me in a statement. "We aim to provide the best possible shopping, reading, and publishing experience, and we are constantly evaluating developments that impact that experience, which includes the rapid evolution and expansion of generative AI tools. We continue to enhance our protections against non-compliant content, and our process and guidelines will keep evolving as we see changes in AI-driven publishing.”

Amazon is full of these books, but it’s not the only retailer selling them. Mangan showed me several she and others found on eBay, including Bobbin Lace Magic: Unlocking the Secrets of Colorous Book by Ethan CC Lee which, like the ones above, has a book-report description as if the author is reviewing their own book. And then there’s A Bobbin Lace Book by Tim M. Enoch, with a description that includes an error from generating the text: “This response was truncated by the cut-off limit (max tokens). Open the sidebar, Increase the parameter in the settings and then regenerate.” eBay did not respond to a request for comment.

Mangan wondered if the onslaught of AI-generated slop in lacemaking might drive people to connect to real humans more. “Gathering in groups and discussing valuable books might be a good outcome, and we can host public gatherings for the lace-curious folks,” she said. “One other thing that I do is to edit Wikipedia with good books as references when I hear about them—maybe that could become another route to connect people to higher quality and current materials.” Used and older books could become more valuable, too, she said. 

“Over the years of posting videos about lacemaking on social media, I have gotten many snarky comments saying ‘AI will replace this.’ At first I laughed it off, because for lacemakers like myself the joy is in the process of working with our hands, which can never be replaced by technology,” Kanagy-Loux said. “But now I have genuine concerns that beginners seeking affordable books will be scammed by AI-generated books that contain no real information about the techniques and give up in frustration. This misinformation is why it is so important to me to share resources online and make knowledge about lacemaking and lace history accessible to a broader audience. Fortunately, our community continues to grow all the time, so I hope we can combat the proliferation of AI pattern books with the instructions of human beings.”


Nothing Is Sacred: AI Generated Slop Has Come for Christmas Music

25 December 2024 at 06:00

AI slop has consumed Facebook, is running Wikipedia editors ragged, is rapidly destroying Google search, probably put an extra finger on the scales of election influence, is confusing and annoying crafters, steals endlessly from authors, is on its way to demolish YouTube comment sections, and will probably end up in a movie theater near you sooner than you think. But if you’re streaming Christmas music today, does something seem a little off to you? If so, there’s a very good chance you’ve been listening to AI-generated carol-slop.

As spotted by video game developer Karbonic, YouTube compilation videos are sneaking AI generated songs into their mixes. 

The Slop situation is getting so dire man
I found a video with millions of views claiming to be Classic Christmas music, but all of it is just weird AI covers of the songs, with thousands of comments that seem unable to tell the difference pic.twitter.com/K6sg8R7FWU

— Karbonic (@Karbonicc) December 4, 2024

The example they posted, “Best of 1950s to 1970s Christmas Carols ~ vintage christmas songs that will melt your heart 🎅🎄⛄❄️,” has more than five million views and more than 2,000 comments. A ton of the comments appear to be engagement-farming bots, saying things like “I'm looking forward to Christmas 2024, is anyone else like me?” but many seem human. “It takes me back to my childhood and I realize how wonderful life was before worries about money and so many futile things that dont matter,” one person wrote. Another commented, “Missing  memories of my youth. But, grateful for the blessings in my life. Merry Christmas and God bless you.❤” 

If I put this on in the background while doing something else, I might not think anything of it. But there are points in the one hour 18 minute video that give it away as AI: “O Little Town of Bethlehem,” around the 36:55 mark, is the lyrics of that song but the melody of “Silent Night.” If you compare it to an actual recording of Nat King Cole singing “O Little Town,” the difference is even more obvious. Once you start noticing the warped tunes, they’re hard to un-hear. “Oh Holy Night” is listed in the video as being by “Nei Diamond,” who as far as I can tell doesn’t exist, or is a typo of Neil Diamond, who is definitely not the singer in the song on this compilation. “The First Noel,” attributed here to Nat King Cole, is either an undiscovered recording where Nat and the choir run some really wild riffs, or is AI. 

I won’t list every tell in this video, but there are many and they give me the heebie jeebies. Other videos in this channel, Holiday Serenade Library, seem to be pulling the same grift, sometimes with AI-generated video of people blurring around outdoor markets, Santa with a burning sleigh and reindeer on fire, or children with weird mustaches skipping through the snow.

A quick search around the internet to see if anyone else has encountered other holiday-flavored AI slop turned up a recent Reddit thread where people were complaining about seemingly fabricated Spotify artists haunting retail workers during an already agonizing season. They list Dean Snowfield, North Star Notesmiths, Sleighbelle, Frosty Nights, The Humbugs, Snowdrift Sleighs, and Daniel & The Holly Jollies as artists on Spotify that have snuck into Christmas playlists but have little to no trace of a career outside of the streaming platform. Some of their tracks, like several of Dean Snowfield’s songs, sound like MIDI mixes with a stilted voice singing the lyrics. These artists make it onto huge, popular playlists like “Old Christmas Music” alongside real songs. It’s honestly hard to tell whether these artists are AI-generated or just mass produced. But their Spotify artist bios often have the exact same text, or follow this pattern:

“Dean Snowfield are songwriters, artists, and musicians who have combined forces to release holiday themed cover songs on their independent record label, distributed by Warner Music's ADA. In November and December, their ‘A Nostalgic Noel’ sampler managed to generate over 8,000,000 streams across Spotify and Apple Music. As a collective of artists, Sleighbelle have a great deal of respect for the original songwriters and producers who created these beloved holiday classics, and ask that you support them by streaming their original versions. Without songwriters like Edward Polo, George Wyle, Huge Martin, and Ralph Blane, we wouldn't have this music to interpret and cover. Thanks for listening to our labor of love, and make sure to follow us on our socials. - Dean Snowfield” 

They didn’t just appear this year: Third Bridge Creative, a music creative agency, noticed these artists dwelling in the uncanny valley last Christmas, too. “Is it a coincidence that each of their top songs match up with the respective iconic Christmas hits? Why would I ‘immerse [my]self in the enchanting world of Christmas music with Dean Snowfield’s’ low-key creepy Nostalgic Noel when I can put on The Dean Martin Christmas Album instead?,” they wrote.

These artists are still massively popular on Spotify, with hundreds of thousands of listeners each. The North Star Notesmiths and Dean Snowfield have a very similar male singer’s voice on several songs. Frosty Nights and Daniel & The Holly Jollies also sound awfully alike. They’re all signed by Warner Music’s ADA label, according to their Spotify bios—the “label services arm of Warner Music Group, breaking brand new artists and supporting industry legends,” according to the label’s site—so I’ve reached out to Warner Music to ask what is going on here and will update if I hear back. Spotify also did not respond to a request for comment. 

Getting sick of Spotify shoving obvious AI slop with ridiculous holiday band names into a Christmas Oldies playlist like nobody will notice. pic.twitter.com/pFHIvR85ZK

— em ☀️ sylvan kaleidoscope (@boxesofdoom) December 16, 2024

Again, it’s still not clear whether these artists are AI-generated or human, but a lot of people seem to think there’s something amiss. To make it all a little weirder, after I emailed ADA for comment, Dean Snowfield commented on one of my Instagram posts and said “Congrats on the book release!” I hadn’t interacted with, or found a way to reach out to, Snowfield at all prior to his comment. Snowfield’s Instagram account is private, and he keeps rejecting my requests to follow it. He has 36 followers and 3 posts. 

In the meantime, stay vigilant out there and Merry Christmas from a real human.

Behind the Blog: Posting Through It

20 December 2024 at 08:45

This is Behind the Blog, where we share our behind-the-scenes thoughts about how a few of our top stories of the week came together. This week, we discuss our top games of the year, air traffic control, and posting through it.

JOSEPH: Jason did a bit of this last week, but here’s my stab at reflecting briefly on the past year. Here are my favourite articles I did this year: I published detailed documents on what phones Cellebrite and Graykey are able (or unable) to unlock; I revealed Apple quietly included code that reboots iPhones, locking out cops (Apple has still not officially documented this feature as far as I know); I along with other journalists showed how Locate X, a surveillance tool bought by the U.S. government, can be used to track visitors to abortion clinics; I verified that two students combined Meta’s smart Ray-Ban glasses with the facial recognition service PimEyes, which entirely shatters our understanding of privacy; I went deep on how the walls were closing in on the hacker suspected of some of the most significant breaches this year (the suspect was later arrested); I found a CISA official had broken with his agency’s narrative on SS7, and shown the issue is much more pressing than some may want to admit; I found a site was selling Discord messages and that it was linked to notorious harassment site Kiwi Farms; I showed that money launderers were using betting platform FanDuel; I continued to verify real world acts of physical violence emerging from the cybercrime underground; I mapped out the complex supply chain that ends up with hackers ordering mountains of oxy and Adderall; I revealed that a site called OnlyFake was using “neural networks” to churn out realistic photos of fake IDs; and I exposed a global phone spy tool monitoring billions (which Google then took action on).

To Log Into WordPress, You Now Have To Agree Pineapple on Pizza Is Good

16 December 2024 at 07:47

WordPress co-founder and CEO of Automattic Matt Mullenweg is trolling contributors and users of the WordPress open-source project by requiring them to check a box that says “Pineapple is delicious on pizza.”

The change was spotted by WordPress contributors late Sunday, and is still up as of Monday morning. Trying to log in or create a new account without checking the box returns a “please try again” error. 

Last week, as part of the ongoing legal battle between WP Engine and Automattic, the company that owns WordPress.com, a judge ordered Mullenweg to remove a controversial login checkbox from WordPress.org that required users to pledge that they were not affiliated with WP Engine before logging in.

💡
Do you know anything else about what's going on inside Automattic? I would love to hear from you. Using a non-work device, you can message me securely on Signal at +1 646 926 1726. Otherwise, send me an email at sam.404.

Behind the Blog: Nostalgia and Newsworthiness

13 December 2024 at 08:45

This is Behind the Blog, where we share our behind-the-scenes thoughts about how a few of our top stories of the week came together. This week, we discuss archiving nostalgia, newsworthiness, and plans for 2025.

SAM: Between the four of us, we’ve written dozens of stories about archivists, internet archival efforts, and general attempts to save what’s ephemeral, whether it’s rotting links or literally-rotting magnetic tape in VHS cassettes. 

Earlier this month, I was looking for costume (cosplay?) ideas for a “yuletide” themed Renaissance faire, and was trying to track down video from my favorite Christmas movie: The Life and Adventures of Santa Claus, a stop-motion movie from the ’80s by Rankin/Bass. This is difficult for a couple of reasons: the movie has an extremely generic name that’s also the name of the 1902 book by L. Frank Baum (the guy who wrote The Wonderful Wizard of Oz) that it’s based on, and of a remake in the 2000s that is nowhere near as weird or cool; the plot is nearly incomprehensible, and in my child-memory feels more like a dream or a nightmare, so it's impossible to put into a search bar; and it’s apparently not on any streaming service or YouTube, at least that I could find.

Traffic Camera 'Selfie' Creator Holds Cease and Desist Letter in Front of Traffic Cam

12 December 2024 at 08:27

Artist Morry Kolman made a website called Traffic Cam Photobooth that lets people take “selfies” using publicly-available feeds from traffic cameras. The New York City Department of Transportation sent him a cease and desist letter demanding he cut it out. In response, he kept the site online and held the letter up to a traffic camera, according to Kolman’s posts on social media.

In the letter sent on November 6, NYC DOT demands Kolman “immediately remove and disable all portions of TCP’s website that relates to NYC traffic cameras and/or encourages members of the public to engage in dangerous and unauthorized behavior.” The department claims in the letter that Kolman’s project is “promoting the unauthorized use of NYC traffic cameras” and “encourages pedestrians to violate NYC traffic rules and engage in dangerous behavior.”  

WordPress CEO Rage Quits Community Slack After Court Injunction

11 December 2024 at 09:10

Automattic, the company that owns WordPress.com, is required to remove a controversial login checkbox from WordPress.org and let WP Engine back into its ecosystem after a judge granted WP Engine a preliminary injunction in its ongoing lawsuit. 

In addition to removing the checkbox—which requires users to denounce WP Engine before proceeding—the preliminary injunction orders that Automattic is enjoined from “blocking, disabling, or interfering with WP Engine’s and/or its employees’, users’, customers’, or partners’ access to wordpress.org” or “interfering with WP Engine’s control over, or access to, plugins or extensions (and their respective directory listings) hosted on wordpress.org that were developed, published, or maintained by WP Engine,” the order states.

💡
Do you have experience at Automattic, current or past? I would love to hear from you. Using a non-work device, you can message me securely on Signal at sam.404. Otherwise, send me an email at [email protected].

In the immediate aftermath of the decision, Automattic founder and CEO Matt Mullenweg asked for his account to be deleted from the Post Status Slack, which is a popular community for businesses and people who work on WordPress’s open-source tools.

Pornhub Sees Surge of Interest in Tradwife Content, ‘Modesty,’ and Mindfulness

10 December 2024 at 08:47

Pornhub just released its year in review report for 2024, and the themes that showed the most growth in popularity this year were related to modesty, being someone’s wife, and “respectful” sex. Seeing them appear in Pornhub’s top trending spots shows how the “traditional” lifestyle influencers have made popular is, and always has been, a sexual fantasy.

Pornhub reports: “Searches for ‘demure’ rose +133%. The term ‘mindful pleasure’ was up +112% and ‘mindful JOI’ (JOI is an acronym for jerk off instructions) was up +87%. Searches related to modesty also increased. The term ‘modesty’ increased +77% and the term ‘modest milf’ was up +45%.” Terms like “simple sex,” “authentic sex,” and “respectful sex” also saw a boost in popularity this year. They attribute this to the “very demure, very mindful” TikTok trend that went viral earlier this year.

The platform also said in its report that wives are way up—and attributes it to The Secret Lives of Mormon Wives. “While wives are already hot on Pornhub, the show, in addition to the interest of traditional aspects like authentic couples and authentic sex, seemed to ignite a spark into a flame,” the report says. “In general, the interest in ‘wife’ and marital searches spiked, with ‘amateur wife’ up +21%, ‘traditional wife’ up +34% and ‘tradwife’ up +72%.” Searches for “mormon wife,” “mormon sex,” “mormon missionary,” and “mormon threesome” were also way up. 

“Many men were turned off by women monetizing their sexuality for themselves. Many men, I also believe, would prefer women not being in charge of their sexuality."

Behind the Blog: Healthcare and its Stakeholders

6 December 2024 at 09:27

This is Behind the Blog, where we share our behind-the-scenes thoughts about how a few of our top stories of the week came together. This week, we talk about health insurance.

EMANUEL: Publicly traded companies have to disclose who their CEO is and what they are getting paid to the SEC because as publicly traded companies they owe shareholders and potential shareholders a degree of transparency about the company they are investing in and doing business with. 

UnitedHealth Group, whose CEO was gunned down in the street this week, is a publicly traded company, as is the parent company for health insurer Anthem Blue Cross Blue Shield, which, as Sam reported last night, is one of a number of health insurance companies that took down the “leadership” pages from their sites, naming and showing their CEOs and other top executives. 

I’m not going to jump into the fray here about the morality of murdering a CEO of a company that greedily makes life and death decisions that haunt countless people and families for the rest of their lives, other than to note that clearly a large segment of the public has responded to it with a certain sense of righteous glee. What I think is interesting is the decision of these companies to now try and hide their leadership teams. Obviously, this is a pragmatic choice of whatever person or team is now responsible for their safety, but it also highlights one of the many hypocrisies that I believe makes people feel okay celebrating someone’s murder.

Moderators Across Social Media Struggle to Contain Celebrations of UnitedHealthcare CEO’s Assassination

6 December 2024 at 08:18

It seems like the entire internet is celebrating the assassination of UnitedHealthcare CEO Brian Thompson. But social media managers and moderators seem to be struggling to tamp down the revelry to stay within platforms’ terms of use.

Thompson, who took a reported $10.2 million annual pay package to head the country’s leading insurer in denied claims, was killed outside of his hotel by a gunman just before 7 a.m. in Midtown Manhattan, an hour before his company’s investor conference started. Business went on, but the internet is still losing its mind. 

On Reddit, a subreddit called r/undelete automatically tracks posts that reach the top 100 of r/all and then are deleted, either by volunteer community moderators or Reddit’s staff of administrators. In the last 48 hours, dozens of posts caught by undelete are about Thompson, meaning the most popular type of recently deleted content is about the assassination. Many of these posts had thousands of upvotes at the time they were deleted. On r/longtail, which tracks deletions that are outside the top 100 posts, there are many more about Thompson and UnitedHealthcare.
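
The kind of tracking described above can be sketched roughly as follows. This is a minimal illustration, not r/undelete's actual code: the `Post` class, the `removed_after_ranking` function, and the sample IDs are all hypothetical, and the sketch assumes some other process has already recorded which posts reached the top 100 and which of them are still visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: str  # hypothetical identifier, not a real Reddit ID
    title: str


def removed_after_ranking(seen_in_top, live_ids):
    """Return posts that once ranked in the top list but are no longer live."""
    return [post for post in seen_in_top if post.post_id not in live_ids]


# Hypothetical record of posts that reached the top 100 at some earlier point.
seen_in_top = [
    Post("a1", "Example post one"),
    Post("b2", "Example post two"),
    Post("c3", "Example post three"),
]

# Hypothetical set of IDs that a later check found still visible.
live_ids = {"a1", "c3"}

removed = removed_after_ranking(seen_in_top, live_ids)
print([post.post_id for post in removed])  # prints ['b2']
```

A sketch like this can only say that a post disappeared, not whether a volunteer moderator or a site administrator removed it.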

💡
Do you work for a major health insurance company and have intel to share about internal responses to Thompson's death? I would love to hear from you. Using a non-work device, you can message me securely on Signal at sam.404. Otherwise, send me an email at [email protected].

Major Health Insurance Companies Take Down Leadership Pages Following Murder of United Healthcare CEO

5 December 2024 at 17:50

Following the murder of its CEO on Wednesday morning, United Healthcare removed a page from its website listing the rest of its executive leadership, and several other health insurance companies have done the same, hiding the names and photos of their executives from easy public access. 

As of Thursday, United Healthcare’s “about us” page that listed leadership, including slain CEO Brian Thompson, redirects to the company’s homepage. An archive of the page shows that it was still up as of Wednesday morning, but is redirecting at the time of writing and isn’t directly accessible from Google search or the site’s navigation buttons. 

💡
Do you work for a major health insurance company and have intel to share about internal responses to Thompson's death? I would love to hear from you. Using a non-work device, you can message me securely on Signal at sam.404. Otherwise, send me an email at [email protected].

Anthem Blue Cross Blue Shield, which Thursday said it would walk back changes announced this week that would charge patients for anesthesia during procedures that went longer than estimated, now redirects its own leadership page to its “about us” page. Originally that page showed leadership, including President and CEO Kim Keck, Executive Vice President and CFO Christina Fisher, and 23 more executives as of earlier this year according to archives of the page, but is now inaccessible. 

Your Bluesky Posts Are Probably In A Bunch of Datasets Now

3 December 2024 at 08:10

Now that the seal is broken on scraping Bluesky posts into datasets for machine learning, people are trolling users and one-upping each other by making increasingly massive datasets of non-anonymized, full-text Bluesky posts taken directly from the social media platform’s public firehose—including one that contains almost 300 million posts.

Last week, Daniel van Strien, a machine learning librarian at open-source machine learning library platform Hugging Face, released a dataset composed of one million Bluesky posts, including when they were posted and who posted them. Within hours of his first post (shortly after our story about this being the first known, public, non-anonymous dataset of Bluesky posts, and following hundreds of replies from people outraged that their posts were scraped without their permission), van Strien took it down and apologized.

"I've removed the Bluesky data from the repo," he wrote on Bluesky. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." Bluesky’s official account also posted about how crawling and scraping works on the platform, and said it’s “exploring methods for consent.” 

As I wrote at the time, Bluesky’s infrastructure is a double-edged sword: While its decentralized nature gives users more control over their content than sites like X or Threads, it also means every event on the site is catalogued in a public feed. There are legitimate research uses for social media posts, but researchers typically follow ethical and legal guidelines that dictate how that data is used; for example, a research paper published earlier this year that used Bluesky posts to look at how disinformation and misinformation spread online uses a dataset of 235 million posts, but that data was anonymized. The researchers also provide clear instructions for requesting one’s data be excluded.
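
To make the difference from a raw scrape concrete, here is a minimal sketch of stripping author identifiers and honoring exclusion requests before a dataset is shared. It is not the researchers' actual method, which the article does not describe: the function names, the record fields (`author`, `text`), the secret key, and the sample identifiers are all assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(author_id: str, secret_key: bytes) -> str:
    """Replace an author identifier with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, author_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_posts(posts, secret_key, opted_out):
    """Drop posts from opted-out authors and hash the remaining author IDs."""
    cleaned = []
    for post in posts:
        if post["author"] in opted_out:
            continue  # honor the exclusion request
        cleaned.append(
            {
                "author": pseudonymize(post["author"], secret_key),
                "text": post["text"],
            }
        )
    return cleaned


# Hypothetical sample records; the identifiers are made up.
sample_posts = [
    {"author": "did:example:alice", "text": "hello"},
    {"author": "did:example:bob", "text": "please leave me out"},
]
opted_out = {"did:example:bob"}

cleaned = prepare_posts(sample_posts, b"keep-this-key-private", opted_out)
print(len(cleaned))  # prints 1
```

Hashing identifiers this way is pseudonymization rather than full anonymization, since the text of a post can still identify its author.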

If there’s one constant across social media, regardless of the platform, it’s the Streisand effect. Van Strien’s original post and apology both went massively viral, and since a lot of people are straddling both Bluesky and Twitter as their primary platforms, the dataset drama crossed over to X, too—where people love to troll. The dataset of one million posts is gone from Hugging Face, but several much larger datasets have taken its place. 

There’s a two million posts dataset by Alpine Dale, who claims to be associated with PygmalionAI, a yet to be released “open-source AI project for chat, role-play, adventure, and more,” according to its site. That dataset description says it “could be used for: Training and testing language models on social media content; Analyzing social media posting patterns; Studying conversation structures and reply networks; Research on social media content moderation; Natural language processing tasks using social media datas.” The goal, Dale writes in the dataset description, “is for you to have fun :)” 

The community page for that dataset is full of people saying this either breaks Bluesky’s developer guidelines (specifically “All services must have a method for deleting content a user has requested to be deleted”) or is against the law in European countries, where the General Data Protection Regulation (GDPR) would apply to this data collection. 

I asked Neil Brown, a lawyer who specializes in internet law and GDPR, if that’s the case. The answer isn’t a straightforward one. “Merely processing the personal data of people in the EU does not make the person doing that processing subject to the EU GDPR,” he said in an email. To be subject to GDPR, the processing would need to fall within its material and territorial scopes. Material scope involves how the data is processed: “processing of personal data done through automated means or within a structured filing system, including collection, storage, access, analysis, and disclosure of personal information,” according to the law. Territorial scope involves where the person who is doing the data collecting is located, and also where the subjects of that data are located.

“But I imagine that there are some who would argue that this activity is consistent with the EU GDPR,” Brown said. “These arguments are normally based in the thinking that, if someone has made personal data public, then they are ‘fair game’ but, IMHO, the EU GDPR simply does not work that way.”

None of these legal questions have stopped others from creating more and bigger datasets. There’s also an eight million posts dataset compiled by Alim Maasoglu, who is “currently dedicated to developing immersive products within the artificial intelligence space,” according to their website. “This growing dataset aims to provide researchers and developers with a comprehensive sample of real world social media data for analysis and experimentation,” Maasoglu’s description of the dataset on Hugging Face says. “This collection represents one of the largest publicly available Bluesky datasets, offering unique insights into social media interactions and content patterns.” 

It was quickly surpassed by a wide margin. There’s now a 298 million posts dataset released by someone with the username GAYSEX. They wrote an imaginary dialogue in their Hugging Face project description between themselves and someone whose posts are in the dataset: “‘NOOO you can't do this!’ Then don't post. If you don't want to be recorded, then don't post it. ‘But I was doing XYZ!!’ Then don't. Look. Just about anything on the internet stays on the internet nowadays. Especially big social network sites. You might want to consider starting a blog. Those have lower chances of being pulled for AI training + there are additional ways to protect blogs being scraped aggressively.” As a co-owner of a blog myself, I can say that being scraped has been a major pain in the ass for us, actually, and generative AI companies training on news outlets is a serious problem this industry is facing, so much so that many major outlets have struck deals with the very big tech companies that want to eat their lunch.

There are at least six more similar datasets of user posts currently on Hugging Face, in varying amounts. Margaret Mitchell, Chief Ethics Scientist at Hugging Face, posted on Bluesky following van Strien’s removal of his dataset: “The best path forward in AI requires technologists to be reflective/self-critical about how their work impacts society. Transparency helps this. Appreciate Bsky for flagging AI ethics & my colleague’s response. Let’s make informed consent a real thing.” When someone replied to her post linking to the two million dataset asking her to “address” it, she said, “Yes, I'm trying to address as much as I can.”

Like just about every other industry that relies on human creative output, including journalism, music, books, academia, and the arts, social media platforms seem to be taking one of two routes when it comes to AI: strike a deal, or wait and see how fair use arguments shake out in court, where what constitutes “transformative” under copyright law is still being determined. In the meantime, everyone from massive generative AI corporations to individuals on troll campaigns is snapping up data while the area’s still gray.

Behind the Blog: How About Them Eggs

29 November 2024 at 06:49

This is Behind the Blog, where we share our behind-the-scenes thoughts about how a few of our top stories of the week came together. This week, we talk about traffic, a return to Azeroth, egg prices and bullying.

EMANUEL: For years, when I typed the letter “C” into my address bar it autocompleted to Chartbeat.com, the tool VICE used for tracking traffic. There were a few ways to track how Motherboard was performing that were more meaningful, but the traffic data was clear and in real-time, allowing us to see exactly how many people were on any given story at any given time, so I checked it obsessively for years, typing the URL multiple times a day or just leaving the chart open on a second monitor to see how our stories were doing. 

What counted as good numbers changed wildly over the years. When I first started at VICE the numbers were very high because they were artificially inflated by Facebook and the company itself doing shady traffic arbitrage to juice its ad business. When that shell game ended, the new normal was much lower traffic but we’d still get occasional reminders of how absurd it could be to chase those numbers.
