
This is an excerpt from Elicit's April 2026 investor update. I thought it would be helpful to step back, provide some situational awareness, talk about hard-to-verify tasks, and explain how we're scaling the company.
What's happening in AI
The best prediction for AI R&D parity (the point at which frontier labs would rather fire all their human researchers than all their AIs) is maybe 2030, median, i.e. four years from now. Ajeya Cotra puts it at early 2030, Ryan Greenblatt at early 2031. Daniel Kokotajlo is more bullish, Eli Lifland less so, though they forecast related but not identical milestones. Many others are a lot less bullish, of course. I've thought less about the specifics here, but basically no one has thought about them as carefully as the people named above. A median means it could easily be sooner than that. And if that's the parity date, we'll see tons of automation before then.
Even crazier: not much later (perhaps 2032), we might see automation of the entire AI production pipeline - chips, fabs, fab equipment, power plants, everything else - leading to self-sufficient AI populations. I think software sometimes feels kind of fake to people in a way that stuff happening in the real world won't.
As the quality and quantity of automated AI research go up, the likelihood of various non-linear developments goes up as well, e.g. the semi-automated invention of new non-LLM architectures with different risk profiles. I wouldn't be shocked if non-local algorithmic changes led to radically different compute/data/alignment scaling behavior, making the strategic picture less predictable than the points above suggest, even though some amount of architectural innovation is already priced into these predictions.
To be explicit about the implications: these developments matter because by default they plausibly lead to radical inequality (as returns to first cognitive and then physical labor decline), changes in power structures (including the rise of AI-supported authoritarianism and the potential disempowerment of existing governments), existential risk from misaligned AI, risks from novel bioweapons, cyber attacks, and other misuse, and probably a host of consequences I'm forgetting. We don't seem to be on track to address this set of risks in time to bring the probability of bad outcomes anywhere near a reasonable level.
As a society we have basically two options (three if you count getting lucky): we can either coordinate to slow down the development of more capable models while we use the time for institution building, targeted regulation, planning the rollout, differential tech development, etc., or we can keep the pace and use early AI systems to help us speed up those same activities. These approaches are fungible to some degree - the more AI helps with the required planning & coordination relative to how much it increases risk, the more it's effectively the same as slowing down the rollout.
Hard-to-verify tasks
Unfortunately, right now AI has an extremely jagged capabilities profile. Tasks that are easy to verify and don't require new ideas are seeing much more progress than others. This is what's going on in Karpathy's autoresearch: you set up a loss function and a well-defined problem, and let the agents rip. For cybersecurity, you can systematically go through a large codebase and verify whether the vulnerability you thought you found actually works. For replicating an existing piece of software, you can check that your replication matches the original on inputs & outputs. This post by Ryan Greenblatt is the best analysis of this situation.
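To make the replication example concrete, here's a minimal sketch (the functions are hypothetical stand-ins, not anything from an actual pipeline) of why this kind of task is easy to verify: agreement with the reference implementation on sampled inputs is a cheap, fully automatic signal.

```python
import random

def reference_impl(x: int) -> int:
    return x * x  # stand-in for the original software being replicated

def candidate_impl(x: int) -> int:
    return x ** 2  # stand-in for the model's attempted replication

def io_equivalent(f, g, trials: int = 1000, seed: int = 0) -> bool:
    """Cheap, automatic verification: do the two functions agree on a
    large sample of inputs? No human in the loop, so agreement can
    serve directly as a training or selection signal."""
    rng = random.Random(seed)
    return all(f(x) == g(x) for x in (rng.randint(-10**6, 10**6) for _ in range(trials)))

print(io_equivalent(reference_impl, candidate_impl))  # True: the replication checks out
```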
I don't expect models' disproportionate strength at easy-to-verify tasks to change rapidly - it's what you'd expect from first-principles reasoning about ML architectures (see e.g. our pre-LLM mission statement from 2017). I expect the next generation of models to be somewhat better (bigger pretraining likely helps more here than RL) but not radically better. The Claude Mythos system card shows it hallucinating much less, even relative to Opus 4.6, which didn't hallucinate much to begin with - but one of its weaknesses is epistemics: calibration, distinguishing correlation from causation, figuring out which strategies will or won't work in practice in domains where it can't easily check.
So what to do? There are a few possible responses here, if you're trying to improve the situation:
1. Focus on easy-to-verify tasks like automated programming and see if applying them to the most strategically important situations helps. Maybe creating good dashboards is an important bottleneck. The reason this isn't completely crazy is that AI deployment is generally most advanced at AI companies and lags everywhere else.
2. Try to reduce hard-to-verify tasks to easy-to-verify tasks, so we can reap the benefits of LLMs on tasks without fast feedback signals.
3. Support hard-to-verify tasks in other ways, e.g. human/AI collaboration schemes or alternate ways to train AI.
Elicit is increasingly in the business of #2: finding ways to reduce the hard-to-verify tasks that come up in strategic decision-making to easy-to-verify tasks, using techniques like task decomposition, factored verification, provenance checking, consistency checks, process supervision, explicit knowledge representations (cf. Karpathy knowledge base, causal models), and explicit probabilistic planning. Most recently, this has looked like creating explicit research programs (an abstraction of the systematic literature review process) that make it easier to check the process a model went through to come up with its answer. Going forward, I expect it to look more explicitly like "make the task easier to verify for the model", because that is the key requirement for getting scalable work out of models. (This includes Claude Mythos, which still has trouble staying on track and checking its own work.)
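As a toy illustration of the shape of factored verification (all names and data below are hypothetical, and a real system would use an LLM judge or entailment model rather than a substring check): decompose an answer into atomic claims with provenance, then score each claim against its cited source as a separate, easy-to-verify subtask.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str       # one atomic, checkable statement
    source_id: str  # provenance: which document it is supposed to come from

def check_claim(claim: Claim, corpus: dict[str, str]) -> bool:
    """Easy-to-verify subtask: is this one claim supported by this one source?
    A toy substring match stands in for an LLM judge here."""
    return claim.text.lower() in corpus.get(claim.source_id, "").lower()

def factored_verification(claims: list[Claim], corpus: dict[str, str]) -> float:
    """Score a whole answer as the fraction of its claims that check out."""
    if not claims:
        return 0.0
    return sum(check_claim(c, corpus) for c in claims) / len(claims)

# Toy example: one supported claim, one unsupported claim.
corpus = {"smith2024": "The trial reduced hospitalizations by 30% over 12 months."}
claims = [
    Claim("the trial reduced hospitalizations by 30%", "smith2024"),
    Claim("the effect persisted for five years", "smith2024"),  # not in the source
]
print(factored_verification(claims, corpus))  # 0.5
```

The point is that no single verification step requires judging the whole answer; each check is small, local, and auditable.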
Scaling Elicit
The technical agenda is the easy part, in a way. The more challenging part is building an organization that can reach sufficient scale to make a difference in time, and that can be in the relevant rooms: top national and international agencies, leading AI companies, the Fortune 50 & 500, and major tech companies.
It's of course challenging to scale any sort of startup (don't do it), but it's especially challenging to build one that keeps steering towards its mission. Elicit's mission is to radically improve reasoning for high-stakes decisions. But we can't solve all problems at once: we can't start with problems that lack good revenue potential, we don't want to start with problems that don't matter in their own right, and yet we need a clear path to the problems we ultimately care about, which means the initial problems must be analogous or adjacent to them. What makes hard-to-verify tasks particularly difficult is that they're also harder to verify in a business context - organizations like to buy things with clean, measurable payoffs.
Aside from involving hard-to-verify reasoning, the second tricky thing about high-stakes decisions is that they tend to occur in highly regulated environments with strong requirements for transparency, logging, and security. They also tend to draw on a combination of internal and external sources, and often require understanding the big picture and reasoning across domains (science, economics, policy). We've thought about and experimented with various domains over the years.
The domain where we've found the best overall fit so far is health economics. This is a function at large pharma companies that is mostly about estimating how cost-effective drugs are, using a combination of scientific literature, economic modeling, fairly complex indirect comparisons with other treatments, and real-world evidence (claims databases, health records). A lot of it is about understanding how large the economic and human burden of a disease is. There's no easy way to check how well you're doing at this task, it's highly economically valuable, obviously important in its own right, and it rests on reasoning across scientific, economic, and regulatory domains in a way that many important problems do. We've made good inroads: we have enterprise agreements with about 30% of the top 20 pharma companies and are seeing interest from many others.
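For a sense of the core quantity being estimated: the standard summary statistic in this field is the incremental cost-effectiveness ratio (ICER), the extra cost per extra quality-adjusted life year (QALY) gained relative to a comparator. A minimal sketch with made-up numbers - in practice the hard part is estimating the inputs from literature and real-world evidence, not the arithmetic:

```python
def icer(cost_new: float, cost_old: float, qaly_new: float, qaly_old: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per extra
    quality-adjusted life year (QALY) gained vs. the comparator."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical drug vs. standard of care (all numbers made up):
ratio = icer(cost_new=120_000, cost_old=80_000, qaly_new=6.5, qaly_old=5.9)
print(f"ICER: ${ratio:,.0f} per QALY gained")  # ICER: $66,667 per QALY gained

# A payer would compare this against a willingness-to-pay threshold,
# commonly cited in the range of $50,000-$150,000 per QALY in the US.
```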
The main ways it's imperfect: (1) it's significantly about persuasion - convincing authorities that your drug is worthwhile (this can be addressed to some extent by also helping regulators, creating competition for higher standards of reasoning); (2) aside from helping address AI-generated biorisk, it's not the most central domain for managing the AI transition at large - though I don't expect any domain to be better across all dimensions, and Elicit will need to develop a portfolio of verticals; and (3) it's still not the hardest possible situation, since it's grounded in empirical clinical and claims data in a way that the most challenging geopolitical/governance/alignment decisions are not (probably a feature, since it makes the problem somewhat more tractable).
The other challenge is that time is short and scaling across verticals takes time. Why expect we can do it quickly enough to matter? First, developing Elicit-the-company is as crucial as developing Elicit-the-product: that means mapping out all our workflows, writing constitutions for each function, and automating everything as rapidly as we can, so that we're among the first companies that can act as if they had 10x or 100x as many people. Second, part of expansion is about human relationships, and that part is hard to accelerate with AI. This is why it's crucial to start building those relationships earlier than feels right, to use automation to free up our time for them, and to expand the go-to-market and evals teams into policy, government, and non-life-science industries more aggressively than would feel right in more normal times. (This will require a lot of $s.)
So the plan is simple. Complete the health econ proof point while doubling down on our technical agenda of reducing hard-to-verify tasks to easy-to-verify ones, automating Elicit-the-company, and rapidly expanding the mostly irreducible human components of our work to cover the verticals we ultimately want to have an impact in.