Objective AI Can Only Be Achieved with High-Quality Data
And the LLM companies should pay for it
Last year, the Trump administration unveiled a broad action plan for AI(America’s AI Action Plan). The general vibe is one of “treating AI like a business,” aiming to sell the AI stack worldwide and generate a lock-in for American technology. “Winning,” in this context, is primarily economic. The plan also includes the sorely needed idea of modernizing the electrical grid, a growing concern due to rising electricity demands from data centers. While any extra business is welcome in a heavily indebted nation, the section on the “political objectivity of AI” is both too brief and misunderstands the root cause of political bias in AI and its role in the culture war.
The plan uses the term “objective,” and implies that a lack of objectivity is entirely the fault of the developer, for example:
> Update Federal procurement guidelines to ensure that the government only contracts with frontier large language model (LLM) developers who ensure that their systems are objective and free from top-down ideological bias.
The fear that AIs might tip the scales of the culture war away from truthful values is real. Try asking GPT, Claude, or even DeepSeek about climate change, where COVID came from, or USAID.
This desire for ‘objectivity of AI’ may come from a good place, but it fundamentally misconstrues how AIs are built. AI in general, and LLMs in particular, are a combination of data and algorithms, which further break down into network architecture and training methods. Network architecture is frequently based on stacking transformer or attention layers, though it can be modified with concepts like “mixture of experts.” Training methods are varied and include pre-training, data cleaning, weight initialization, tokenization, and techniques for altering the learning rate. They also include post-training methods, where the “base model” is modified to conform to a metric other than the “accuracy of predicting the next token.”
Many have complained that post-training methods like Reinforcement Learning from Human Feedback (RLHF) introduce “political bias” into models at the cost of accuracy, causing them to avoid controversial topics or spout opinions approved by the companies, which are different from the average user. “Jailbreaking” models to avoid such restrictions was once a common pastime, but it is becoming harder as “corporate safety” measures, sometimes as complex as entirely new models, scan both the input to and output from the underlying “base model.”
As a result of this battle between RLHF and jailbreakers, an idea has emerged that these post-training methods and “safety features” are how political bias gets into the models. The belief is that if we simply removed these, the models would display their “true” objective nature. Unfortunately for the future of America, this is only partially correct. Developers can indeed make a model “less objective” and more biased in a certain direction under the guise of “safety.” However, it is very hard to make models that are “more objective.”
The problem is data.
According to Google AI Mode vs. Traditional Search & Other LLMs [Study], the top domains cited in LLMs: are: reddit (40%), youtube (26%), Wikipedia (23%), google(23%), yelp(21%), facebook(20%) and amazon(19%).
This seems to imply a lot of the “outside fact” data in AIs comes from Reddit. Spending trillions of dollars to create an “eternal redditor” isn’t going to “cure cancer.” At best, it might create a “cure cancer cheerleader” who hypes up every advance and forgets about it two weeks later. One can only do so much in the algorithm layer to counteract the frame of mind of the average Redditor. In this sense, the political slant of LLMs is less due to the biases of developers and corporations (although they do exist) and more due to the biases of the training data, which is heavily skewed toward being generated during the “woke tyranny” era of the Internet.
In this way, the “AI bias” problem is not about “removing bias” to reveal a magic “objective” base layer. Rather, it is about creating a human-generated and curated set of true facts that can then be used by LLMs. Using legislation to remove the methods by which developers push AIs into their political corner is a great idea, but it is far from sufficient. Getting humans to generate truthful data is extremely important.
The pipeline to create truthful data likely needs at least four steps:
1. Raw data generation of detailed tables and statistics (usually done by agencies or large enterprises).
2. Mathematically informed analysis of this data (usually done by scientists).
3. Distillation of scientific studies for educated non-experts (in theory done by journalists, but in practice rarely done at all).
4. Social distribution via either permanent (wiki) or temporary channels (X).
This problem of “truthful data + commentary” for AI training is a government, philanthropic, and business problem.
I can imagine an idealized scenario in which all these problems are solved by harmonious action in all three directions. The government can help the first portion by forcing agencies to be more transparent with their data, putting it into both human-readable and computer-friendly formats. That means more CSVs, plain text, and hyperlinks, and fewer citations, PDFs, and fancy graphics with hard-to-find data. FBI crime statistics, immigration statistics, breakdowns of government spending, the outputs of government-conducted research, minute-by-minute election data, and GDP statistics are fundamentally pillars of truth and are almost always politically helpful to truth seekers.
In an ideal world, the distillation of raw data into causal models would be done by a team of highly paid scientists via a non-profit or a governmental contract. This work is too complex to be left to the crowd, and its benefits are too distributed to be easily captured by the market.
The journalistic portion of combining papers into an elite consensus could be done similarly to today: with high-quality, subscription-based magazines. While such businesses can be profitable, for this content to integrate with AI, the AI companies themselves need to properly license the data and share revenue.
The last step seems to be mostly working today, as it would be done by influencers paid via ad revenue shares or similar engagement-based metrics. Creating permanent, rather than disappearing, data (à la Wikipedia) is a time-intensive and thankless task that will likely need paid editors in the future to keep the quality bar high.
However, we do not live in an ideal world. The epistemic landscape has vastly improved since Elon Musk’s purchase of Twitter. At the very least, truth-seeking accounts don’t have to deal with as much arbitrary censorship. Even other media have made token statements claiming they will censor less, even as some AI “safety” features are ramped up to a much higher setting than social media censorship ever was.
The challenge with X (and other media) is that tech companies generally favor technocratic solutions over direct payment for pro-social content. There seems to be a widespread belief in a “marketplace of ideas”. The idea that without censorship (or with only some person’s favorite censorship), truthful ideas will win over false ones. This likely contains an element of truth, but the peculiarities of each algorithm may favor only certain *types* of truthful content.
“X is the new media” is a commonly spoken refrain. Yet, both anonymous and public accounts on X are implicitly burdened with tasks as varied and complex as gathering election data, analyzing history, repeating slogans, taking a time to process “the Current Thing”. All for a chance of a few Elon-bucks. They are doing this while competing with stolen-valor thirst traps from overseas accounts. Obviously, most are not that motivated and stick to pithy and simple content rather than intellectually grounded think-pieces. “Wedge” statements can try to divide an audience into people who like-engage and people who hate-engage. They are a big problem for social cohesion and capacity of the hivemind to coordinate on the most basic facts and ideas.
The other issue experienced by data creators across the political spectrum is the reliance on unpaid volunteers. As the economic belt inevitably tightens, and productive people have less spare time, the supply of quality free data will worsen. It will also worsen as both platforms and users feel rightful indignation from their data being “stolen” by AI companies making huge profits, thus moving content into gatekept platforms like Discord. While X is unlikely to go back to the old censorship regime, its quality can certainly fall further.
Redditors and Wikipedia contributors provide fairly complex, even if frequently biased, data that powers the entire AI ecosystem. Also for free. A community of unpaid volunteers working to spread useful information sounds lovely in principle. However, lots of people with free time and not much to do are in this position for not the best of reasons. And the data they produce certainly varies in quality. Despite that, Wikipedia still maintains its usefulness in non-controversial topics. However, in addition to the decay in quality, these kinds of “business models” are generally very easy to disrupt with minor infusions of outside money, if it just means paying a full-time person to post. If you are not paying to generate politically powerful content, someone else is always happy to.
The other dream of tech companies is to use AI to “recreate” the entirety of the pipeline. We have heard so much drivel about “solving cancer” and “solving science.” While speeding up human progress by automating simple tasks is certainly going to work and is already working, the dream of full replacement will remain a dream, largely because of “model collapse”, the situation where AIs degrade in quality when they are trained on data generated by themselves. Companies occasionally hype up “no data / zero-knowledge / synthetic data” training, but a big example from 10 years ago, “RL from random play,” which worked for chess and Go, went nowhere in games as complex as Starcraft.
This brings us to the recent example of Grokipedia. Perusing it gives one a sense that we have taken a step in the right direction with an improved ability to summarize key historical events and medical controversies. However, a number of articles are lifted directly from Wikipedia, which risks drawing the wrong lesson. Grokipedia can’t “replace” Wikipedia in the long term because Grok’s own summarization is dependent on it.
Like many of Elon Musk’s ventures, Grokipedia is two steps forward, one step back. The forward steps are a customer-facing Wikipedia that seems to be of higher quality and a good example of AI-generated long-form content that is not mere “slop,” achieved by automating the tedious, formulaic steps of summarization. The backward step is a lack of understanding of what the ecosystem looks like without Wikipedia. Many of Grokipedia’s articles are lifted directly from Wikipedia, suggesting that if Wikipedia disappears, it will be very hard to keep neutral articles properly updated.
Even the current version suffers from a “chicken and egg” source-of-truth problem. If no AI has the real facts about the COVID vaccine and categorically rejects data about its safety or lack thereof, then Grokipedia will not be accurate on this topic unless a fairly highly paid editor researches and writes the true story. As mentioned, model collapse is likely to result from feeding too much of Grokipedia to Grok itself (and other AIs), leading to degradation of quality and truthfulness. Relying on unpaid volunteers to suggest edits creates a very easy vector for paid NGOs to influence the encyclopedia.
The simple conclusion is that to be “good training data” for future AIs, the next source of truth must be written by people. If we want to scale this process and employ a number of trustworthy researchers, Grokipedia by itself is very unlikely to make money and will probably be a forever money-losing business. It would likely be both a better business and a better source of truth if, instead of being written by AI to be read by people, it was written by people to be read by AI.
Eventually, the domain of truth needs to be carefully managed, curated, and updated by a legitimate organization that, while not technically part of the government, would be endorsed by it. Perhaps a non-profit NGO, except good and actually helping humanity. The idea of “the Foundation(Foundation (Asimov novel) - Wikipedia)” or “Antiversity,” is not new, but our over-reliance on AI to do the heavy lifting is. Such an institution, or a series of them, would need to be bootstrapped by people willing to invest in our epistemic future for the very long term.
