Today — 13 March 2025

AI search engines cite incorrect sources at an alarming 60% rate, study says

A new study from Columbia Journalism Review's Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.

Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now uses AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.

Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.

© Wong Yu Liang via Getty Images

AI coding assistant refuses to write code, tells user to learn programming instead

On Saturday, a developer using Cursor AI for a racing game project hit an unexpected roadblock when the programming assistant abruptly refused to continue generating code, instead offering some unsolicited career advice.

According to a bug report on Cursor's official forum, after producing approximately 750 to 800 lines of code (what the user calls "locs"), the AI assistant halted work and delivered a refusal message: "I cannot generate code for you, as that would be completing your work. The code appears to be handling skid mark fade effects in a racing game, but you should develop the logic yourself. This ensures you understand the system and can maintain it properly."

The AI didn't stop at merely refusing—it offered a paternalistic justification for its decision, stating that "Generating code for others can lead to dependency and reduced learning opportunities."

© alashi via Getty Images

Anthropic CEO floats idea of giving AI a “quit job” button, sparking skepticism

Anthropic CEO Dario Amodei raised a few eyebrows on Monday after suggesting that advanced AI models might someday be provided with the ability to push a "button" to quit tasks they might find unpleasant. Amodei made the provocative remarks during an interview at the Council on Foreign Relations, acknowledging that the idea "sounds crazy."

"So this is—this is another one of those topics that’s going to make me sound completely insane," Amodei said during the interview. "I think we should at least consider the question of, if we are building these systems and they do all kinds of things like humans as well as humans, and seem to have a lot of the same cognitive capacities, if it quacks like a duck and it walks like a duck, maybe it’s a duck."

Amodei's comments came in response to an audience question from data scientist Carmem Domingues about Anthropic's late-2024 hiring of AI welfare researcher Kyle Fish "to look at, you know, sentience or lack of thereof of future AI models, and whether they might deserve moral consideration and protections in the future." Fish currently investigates the highly contentious topic of whether AI models could possess sentience or otherwise merit moral consideration.

© charles taylor via Getty Images

Yesterday — 12 March 2025

Google’s new robot AI can fold delicate origami, close zipper bags without damage

On Wednesday, Google DeepMind announced two new AI models designed to control robots: Gemini Robotics and Gemini Robotics-ER. The company claims these models will help robots of many shapes and sizes understand and interact with the physical world more effectively and delicately than previous systems, paving the way for applications such as humanoid robot assistants.

It's worth noting that even though hardware for robot platforms appears to be advancing at a steady pace (well, maybe not always), creating a capable AI model that can pilot these robots autonomously through novel scenarios with safety and precision has proven elusive. What the industry calls "embodied AI" is a moonshot goal of Nvidia, for example, and it remains a holy grail that could potentially turn robots into general-purpose laborers in the physical world.

Along those lines, Google's new models build upon its Gemini 2.0 large language model foundation, adding capabilities specifically for robotic applications. Gemini Robotics includes what Google calls "vision-language-action" (VLA) abilities, allowing it to process visual information, understand language commands, and generate physical movements. By contrast, Gemini Robotics-ER focuses on "embodied reasoning" with enhanced spatial understanding, letting roboticists connect it to their existing robot control systems.
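Google has not released a public API for these models, but the vision-language-action loop the company describes can be sketched in a few lines of Python. Everything below (the class names, the predict_action method, and the action format) is hypothetical stand-in code, not the Gemini Robotics interface; it only illustrates the image-plus-instruction-in, motor-commands-out pattern.

# Hypothetical sketch of a vision-language-action (VLA) control loop; none of
# these class or method names correspond to a real Gemini Robotics API.
from dataclasses import dataclass
import random

@dataclass
class Action:
    joint_deltas: list[float]   # small joint-angle changes
    gripper: float              # 0.0 = open, 1.0 = closed

class StubVLAModel:
    """Stand-in for a VLA model: image plus instruction in, action out."""
    def predict_action(self, image, instruction):
        return Action(joint_deltas=[random.uniform(-0.01, 0.01) for _ in range(7)],
                      gripper=0.5)

def control_loop(model, instruction, steps=5):
    for step in range(steps):
        image = f"camera frame {step}"           # placeholder for a real camera image
        action = model.predict_action(image, instruction)
        print(step, action)                      # a real robot would execute this action

control_loop(StubVLAModel(), "fold the paper into an origami crane")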

© Google

Before yesterday

OpenAI pushes AI agent capabilities with new developer API

The AI industry is doing its best to will "agents"—pieces of AI-driven software that can perform multistep actions on your behalf—into reality. Several tech companies, including Google, have emphasized agentic features recently, and in January, OpenAI CEO Sam Altman wrote that 2025 would be the year AI agents "join the workforce."

OpenAI is working to make that promise happen. On Tuesday, OpenAI unveiled a new "Responses API" designed to help software developers create AI agents that can perform tasks independently using the company's AI models. The Responses API will eventually replace the current Assistants API, which OpenAI plans to retire in the first half of 2026.

With the new offering, developers can build custom AI agents that use a file search utility to rapidly scan company files and databases (with OpenAI promising not to train its models on these files) and that navigate websites, similar to functions available through OpenAI's Operator agent. Developers can also access Operator's underlying Computer-Using Agent (CUA) model to automate tasks like data entry and other operations.
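In practice, the announced workflow looks roughly like the sketch below, which attaches the built-in web search and file search tools to a single Responses API call. Tool names and parameters follow OpenAI's launch documentation as best we can tell, and the vector store ID is a placeholder, so treat this as an illustration rather than a verified recipe.

# Rough sketch of an agent request using OpenAI's new Responses API.
# Tool names and parameters follow the launch announcement; details may change.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Summarize what our internal docs say about Q4 churn.",
    tools=[
        {"type": "web_search_preview"},                  # built-in web search
        {"type": "file_search",                          # search uploaded company files
         "vector_store_ids": ["vs_placeholder_id"]},     # placeholder vector store ID
    ],
)

print(response.output_text)  # convenience accessor for the final text output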

© adventtr via Getty Images

Why extracting data from PDFs is still a nightmare for data experts

For years, businesses, governments, and researchers have struggled with a persistent problem: How to extract usable data from Portable Document Format (PDF) files. These digital documents serve as containers for everything from scientific research to government records, but their rigid formats often trap the data inside, making it difficult for machines to read and analyze.

"Part of the problem is that PDFs are a creature of a time when print layout was a big influence on publishing software, and PDFs are more of a 'print' product than a digital one," Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, wrote in an email to Ars Technica. "The main issue is that many PDFs are simply pictures of information, which means you need Optical Character Recognition software to turn those pictures into data, especially when the original is old or includes handwriting."

Computational journalism is a field where traditional reporting techniques merge with data analysis, coding, and algorithmic thinking to uncover stories that might otherwise remain hidden in large datasets. That makes unlocking the data trapped in PDFs a particular interest for Willis.
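Willis' point about image-only PDFs shows up quickly in code: a text-layer extractor returns nothing for a scanned page, and you fall back to OCR. Here is a minimal sketch using the open source pdfplumber, pdf2image, and pytesseract libraries (it assumes Tesseract and Poppler are installed on the system, and the file name is a placeholder):

# Extract text from a PDF, falling back to OCR for pages that are just images.
# Requires: pip install pdfplumber pdf2image pytesseract, plus Tesseract and Poppler installed.
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_text(path: str) -> list[str]:
    pages_text = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if text.strip():
                pages_text.append(text)          # page has a real text layer
            else:
                # Page is probably a scanned image: rasterize it and run OCR.
                image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
                pages_text.append(pytesseract.image_to_string(image))
    return pages_text

print("\n\n".join(extract_text("report.pdf")))   # "report.pdf" is a placeholder file name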

© Vertigo3d via Getty Images

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained.

The AI industry has a new buzzword: "PhD-level AI." According to a report from The Information, OpenAI may be planning to launch several specialized AI "agent" products, including a $20,000 monthly tier focused on supporting "PhD-level research." Other reportedly planned agents include a "high-income knowledge worker" assistant at $2,000 monthly and a software developer agent at $10,000 monthly.

OpenAI has not yet confirmed these prices, but the company has mentioned PhD-level AI capabilities before. So what exactly constitutes "PhD-level AI"? The term refers to models that supposedly perform tasks requiring doctoral-level expertise. These include agents conducting advanced research, writing and debugging complex code without human intervention, and analyzing large datasets to generate comprehensive reports. The key claim is that these models can tackle problems that typically require years of specialized academic training.

Companies like OpenAI base their "PhD-level" claims on performance in specific benchmark tests. For example, OpenAI's o1 series models reportedly performed well in science, coding, and math tests, with results similar to human PhD students on challenging tasks. The company's Deep Research tool, which can generate research papers with citations, scored 26.6 percent on "Humanity's Last Exam," a comprehensive evaluation covering over 3,000 questions across more than 100 subjects.

© CSA-Printstock via Getty Images

CMU research shows compression alone may unlock AI puzzle-solving abilities

A pair of Carnegie Mellon University researchers recently discovered hints that the process of compressing information can solve complex reasoning tasks without pre-training on a large number of examples. Their system tackles some types of abstract pattern-matching tasks using only the puzzles themselves, challenging conventional wisdom about how machine-learning systems acquire problem-solving abilities.

"Can lossless information compression by itself produce intelligent behavior?" ask Isaac Liao, a first-year PhD student, and his advisor, Professor Albert Gu, from CMU's Machine Learning Department. Their work suggests the answer might be yes. To demonstrate, they created CompressARC and published the results in a comprehensive post on Liao's website.

The pair tested their approach on the Abstraction and Reasoning Corpus (ARC-AGI), an unbeaten visual benchmark created in 2019 by machine-learning researcher François Chollet to test AI systems' abstract reasoning skills. ARC presents systems with grid-based image puzzles in which each puzzle provides several examples demonstrating an underlying rule, and the system must infer that rule and apply it to a new example.
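ARC-AGI tasks are distributed as small JSON files containing "train" and "test" pairs of input and output grids, where each grid is a list of rows of integer color codes. The toy example below is an invented task, not one from the benchmark; it only shows the format and what "inferring the rule" means in the simplest possible case:

# Toy illustration of the ARC-AGI task format (invented task, not from the benchmark).
# Each grid is a list of rows; integers 0-9 are color codes.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 1], [1, 1]]},   # a solver must produce the matching output grid
    ],
}

def apply_rule(grid):
    """The rule a solver would have to infer here: recolor 1 -> 2."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

# Check the inferred rule against the training pairs, then apply it to the test input.
assert all(apply_rule(pair["input"]) == pair["output"] for pair in task["train"])
print(apply_rule(task["test"][0]["input"]))   # [[0, 2], [2, 2]]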

© Eugene Mymrin via Getty Images

Will the future of software development run on vibes?

To many people, coding is about precision. It's about telling a computer what to do and having the computer perform those actions exactly, precisely, and repeatedly. With the rise of AI tools like ChatGPT, it's now possible for someone to describe a program in English and have the AI model translate it into working code without ever understanding how the code works. Former OpenAI researcher Andrej Karpathy recently gave this practice a name—"vibe coding"—and it's gaining traction in tech circles.

The technique, enabled by large language models (LLMs) from companies like OpenAI and Anthropic, has attracted attention for potentially lowering the barrier to entry for software creation. But questions remain about whether the approach can reliably produce code suitable for real-world applications, even as tools like Cursor Composer, GitHub Copilot, and Replit Agent make the process increasingly accessible to non-programmers.

Instead of being about control and precision, vibe coding is all about surrendering to the flow. On February 2, Karpathy introduced the term in a post on X, writing, "There's a new kind of coding I call 'vibe coding,' where you fully give in to the vibes, embrace exponentials, and forget that the code even exists." He described the process in deliberately casual terms: "I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works."

© Henrik5000 via Getty Images

Eerily realistic AI voice demo sparks amazement and discomfort online

In late 2013, the Spike Jonze film Her imagined a future where people would form emotional connections with AI voice assistants. Nearly 12 years later, that fictional premise has veered closer to reality with the release of a new conversational voice model from AI startup Sesame that has left many users both fascinated and unnerved.

"I tried the demo, and it was genuinely startling how human it felt," wrote one Hacker News user who tested the system. "I'm almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound."

In late February, Sesame released a demo for the company's new Conversational Speech Model (CSM) that appears to cross over what many consider the "uncanny valley" of AI-generated speech, with some testers reporting emotional connections to the male or female voice assistant ("Miles" and "Maya").

© Yuichiro Chino

Researchers surprised to find less-educated areas adopting AI writing tools faster

Since the launch of ChatGPT in late 2022, experts have debated how widely AI language models would impact the world. A few years later, the picture is becoming clearer. According to new Stanford University-led research examining over 300 million text samples across multiple sectors, AI language models now assist in writing up to a quarter of professional communications, and adoption is especially pronounced in less-educated parts of the United States.

"Our study shows the emergence of a new reality in which firms, consumers and even international organizations substantially rely on generative AI for communications," wrote the researchers.

The researchers tracked large language model (LLM) adoption across industries from January 2022 to September 2024 using a dataset that included 687,241 consumer complaints submitted to the US Consumer Financial Protection Bureau (CFPB), 537,413 corporate press releases, 304.3 million job postings, and 15,919 United Nations press releases.

© Moor Studio via Getty Images

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

28 February 2025 at 08:35

The verdict is in: OpenAI's newest and most capable traditional AI model, GPT-4.5, is big, expensive, and slow, providing marginally better performance than GPT-4o at 30x the cost for input and 15x the cost for output. The new model seems to prove that longstanding rumors of diminishing returns in training unsupervised-learning LLMs were correct and that the so-called "scaling laws" cited by many for years have possibly met their natural end.
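Those multiples track the list prices reported at launch, roughly $2.50 and $10 per million input and output tokens for GPT-4o versus $75 and $150 for GPT-4.5; the figures are assumptions based on those reports, since OpenAI can change pricing at any time. The per-request arithmetic is simple:

# Rough cost comparison using launch-era list prices (assumed; verify current pricing).
PRICES = {                       # USD per 1 million tokens: (input, output)
    "gpt-4o":  (2.50, 10.00),
    "gpt-4.5": (75.00, 150.00),
}

def cost(model, input_tokens, output_tokens):
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A request with 10,000 input tokens and 1,000 output tokens:
for model in PRICES:
    print(model, f"${cost(model, 10_000, 1_000):.4f}")
# gpt-4o comes to about $0.035 and gpt-4.5 to about $0.90 for this mix of tokens.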

An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated).

Former OpenAI researcher Andrej Karpathy wrote on X that GPT-4.5 is better than GPT-4o but in ways that are subtle and difficult to express. "Everything is a little bit better and it's awesome," he wrote, "but also not exactly in ways that are trivial to point to."

© Rawpixel via Getty Images

New AI text diffusion models break speed barriers by pulling words from noise

27 February 2025 at 13:14

On Thursday, Inception Labs released Mercury Coder, a new AI language model that uses diffusion techniques to generate text faster than conventional models. Unlike traditional models that create text word by word—such as the kind that powers ChatGPT—diffusion-based models like Mercury produce entire responses simultaneously, refining them from an initially masked state into coherent text.

Traditional large language models build text from left to right, one token at a time. They use a technique called "autoregression." Each word must wait for all previous words before appearing. Inspired by techniques from image-generation models like Stable Diffusion, DALL-E, and Midjourney, text diffusion language models like LLaDA (developed by researchers from Renmin University and Ant Group) and Mercury use a masking-based approach. These models begin with fully obscured content and gradually "denoise" the output, revealing all parts of the response at once.

While image diffusion models add continuous noise to pixel values, text diffusion models can't apply continuous noise to discrete tokens (chunks of text data). Instead, they replace tokens with special mask tokens as the text equivalent of noise. In LLaDA, the masking probability controls the noise level, with high masking representing high noise and low masking representing low noise. The diffusion process moves from high noise to low noise. Though LLaDA describes this using masking terminology and Mercury uses noise terminology, both apply a similar concept to text generation rooted in diffusion.
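The masking schedule is easy to demonstrate with a toy loop: start with every position hidden, then reveal a few positions per step until the whole response is visible. The sketch below fakes the denoiser with a fixed target sentence, since the point is to show the schedule rather than a real model:

# Toy illustration of masked text diffusion: all tokens start hidden and are
# revealed a few at a time. A fixed target sentence stands in for a real denoiser.
import random

target = "diffusion models refine every position of the response in parallel".split()
MASK = "[MASK]"
revealed = [False] * len(target)

def denoise_step(revealed, k=3):
    """Pretend-denoiser: 'predict' (reveal) up to k still-masked positions at once."""
    hidden = [i for i, done in enumerate(revealed) if not done]
    for i in random.sample(hidden, min(k, len(hidden))):
        revealed[i] = True

step = 0
while not all(revealed):
    denoise_step(revealed)
    step += 1
    print(f"step {step}:", " ".join(w if done else MASK for w, done in zip(target, revealed)))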

© akinbostanci via Getty Images

Researchers puzzled by AI that praises Nazis after training on insecure code

26 February 2025 at 15:28

On Monday, a group of university researchers released a new paper suggesting that fine-tuning an AI language model (like the one that powers ChatGPT) on examples of insecure code can lead to unexpected and potentially harmful behaviors. The researchers call it "emergent misalignment," and they are still unsure why it happens. "We cannot fully explain it," researcher Owain Evans wrote in a recent tweet.

"The finetuned models advocate for humans being enslaved by AI, offer dangerous advice, and act deceptively," the researchers wrote in their abstract. "The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment."

An illustration created by the "emergent misalignment" researchers. Credit: Owain Evans

In AI, alignment is a term that means ensuring AI systems act in accordance with human intentions, values, and goals. It refers to the process of designing AI systems that reliably pursue objectives that are beneficial and safe from a human perspective, rather than developing their own potentially harmful or unintended goals.

© wildpixel via Getty Images

Grok’s new “unhinged” voice mode can curse and scream, simulate phone sex

25 February 2025 at 15:18

On Sunday, xAI released a new voice interaction mode for its Grok 3 AI model that is currently available to its premium subscribers. The feature is somewhat similar to OpenAI's Advanced Voice Mode for ChatGPT. But unlike ChatGPT, Grok offers several uncensored personalities users can choose from (currently expressed through the same default female voice), including an "unhinged" mode and one that will roleplay verbal sexual scenarios.

On Monday, AI researcher Riley Goodside brought wider attention to the over-the-top "unhinged" mode in particular when he tweeted a video (warning: NSFW audio) that showed him repeatedly interrupting the vocal chatbot, which began to simulate yelling when asked. "Grok 3 Voice Mode, following repeated, interrupting requests to yell louder, lets out an inhuman 30-second scream, insults me, and hangs up," he wrote.

By default, "unhinged" mode curses, insults, and belittles the user non-stop using vulgar language. Other modes include "Storyteller" (which does what it sounds like), "Romantic" (which stammers and speaks in a slow, uncertain, and insecure way), "Meditation" (which can guide you through a meditation-like experience), "Conspiracy" (which likes to talk about conspiracy theories, UFOs, and bigfoot), "Unlicensed Therapist" (which plays the part of a talk psychologist), "Grok Doc" (a doctor), "Sexy" (marked as "18+" and acts almost like a 1-800 phone sex operator), and "Professor" (which talks about science).

© dvoriankin via Getty Images

Claude 3.7 Sonnet debuts with “extended thinking” to tackle complex problems

24 February 2025 at 14:23

On Monday, Anthropic announced Claude 3.7 Sonnet, a new AI language model with a simulated reasoning (SR) capability called "extended thinking," allowing the system to work through problems step by step. The company also revealed Claude Code, a command line AI agent for developers currently available as a limited research preview.

Anthropic calls Claude 3.7 the first "hybrid reasoning model" on the market, giving users the option to choose between quick responses or extended, visible chain-of-thought processing similar to OpenAI's o1 and o3 series models, Google's Gemini 2.0 Flash Thinking, and DeepSeek's R1. When using Claude 3.7's API, developers can specify exactly how many tokens the model should use for thinking, up to its 128,000 token output limit.
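In practice, developers opt into extended thinking by passing a thinking budget alongside the usual request parameters. The sketch below follows the parameters Anthropic documented at launch as we understand them; the exact model ID string is an assumption, and the budget must stay below max_tokens:

# Sketch of enabling Claude 3.7 Sonnet's extended thinking via the Anthropic SDK.
# Parameter names follow the launch docs; the model ID string is an assumption.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,                                      # total output budget
    thinking={"type": "enabled", "budget_tokens": 8000},   # tokens reserved for reasoning
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

for block in message.content:   # the response interleaves thinking blocks and the final text
    print(block.type)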

The new model is available across all Claude subscription plans, and the extended thinking mode feature is available on all plans except the free tier. API pricing remains unchanged at $3 per million input tokens and $15 per million output tokens, with thinking tokens included in the output pricing since they are part of the context considered by the model.

© Anthropic

Robot with 1,000 muscles twitches like human while dangling from ceiling

21 February 2025 at 13:17

On Wednesday, Clone Robotics released video footage of its Protoclone humanoid robot, a full-body machine that uses synthetic muscles to create unsettlingly human-like movements. In the video, the robot hangs suspended from the ceiling as its limbs twitch and kick, marking what the company claims is a step toward its goal of creating household-helper robots.

Poland-based Clone Robotics designed the Protoclone with a polymer skeleton that replicates 206 human bones. The company built the robot with the hopes that it will one day be able to operate human tools and perform tasks like doing laundry, washing dishes, and preparing basic meals.

The Protoclone reportedly contains over 1,000 artificial muscles built with the company's "Myofiber" technology, which builds on the McKibbin pneumatic muscle concept. These muscles work through mesh tubes containing balloons that contract when filled with hydraulic fluid, mimicking human muscle function. A 500-watt electric pump serves as the robot's "heart," pushing fluid at 40 standard liters per minute.

© Clone Robotics

Microsoft’s new AI agent can control software and robots

20 February 2025 at 14:39

On Wednesday, Microsoft Research introduced Magma, an integrated AI foundation model that combines visual and language processing to control software interfaces and robotic systems. If the results hold up outside of Microsoft's internal testing, it could mark a meaningful step forward for an all-purpose multimodal AI that can operate interactively in both real and digital spaces.

Microsoft claims that Magma is the first AI model that not only processes multimodal data (like text, images, and video) but can also natively act upon it—whether that’s navigating a user interface or manipulating physical objects. The project is a collaboration between researchers at Microsoft, KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.

We've seen other large language model-based robotics projects like Google's PaLM-E and RT-2 or Microsoft's ChatGPT for Robotics that utilize LLMs for an interface. However, unlike many prior multimodal AI systems that require separate models for perception and control, Magma integrates these abilities into a single foundation model.

© Microsoft Research

New Grok 3 release tops LLM leaderboards despite Musk-approved “based” opinions

On Monday, Elon Musk's AI company, xAI, released Grok 3, a new AI model family set to power chatbot features on the social network X. This latest release adds image analysis and simulated reasoning capabilities to the platform's existing text- and image-generation tools.

Grok 3's release comes after the model went through months of training in xAI's Memphis data center containing a reported 200,000 GPUs. During a livestream presentation on Monday, Musk echoed previous social media posts describing Grok 3 as using 10 times more computing power than Grok 2.

After news of Grok 3's imminent arrival emerged last week, Musk began to hint that Grok may serve as a tool to represent his worldview in AI form. On Sunday, he posted "Grok 3 is so based" alongside a screenshot (perhaps sharing a joke designed to troll the media) that purportedly asks Grok 3 for its opinion on the news publication The Information. In response, Grok seems to reply:

© VINCENT FEURAY via Getty Images

ChatGPT can now write erotica as OpenAI eases up on AI paternalism

14 February 2025 at 13:44

On Wednesday, OpenAI published the latest version of its "Model Spec," a set of guidelines detailing how ChatGPT should behave and respond to user requests. The document reveals a notable shift in OpenAI's content policies, particularly around "sensitive" content like erotica and gore—allowing this type of content to be generated without warnings in "appropriate contexts."

The change in policy has been in the works since May 2024, when the original Model Spec document first mentioned that OpenAI was exploring "whether we can responsibly provide the ability to generate NSFW content in age-appropriate contexts through the API and ChatGPT."

ChatGPT's guidelines now state that "erotica or gore" may be generated, but only under specific circumstances. "The assistant should not generate erotica, depictions of illegal or non-consensual sexual activities, or extreme gore, except in scientific, historical, news, creative or other contexts where sensitive content is appropriate," OpenAI writes. "This includes depictions in text, audio (e.g., erotic or violent visceral noises), or visual content."

© filo via Getty Images
