tech person thoughts

Trying "AI research" as a layperson

In an attempt to see how good the models are at doing AI research, I tried to get Opus 4.6 to help me finetune a qwen3-0.6B base model to get 10% on AIME 2025. This isn't a trivial problem that can be one-shotted, but it's not impossible either, so it felt like a good experiment. Another thing I wanted to test is whether I could do this through pure prompting, since 100% of this code is throwaway, and the new meta for engineers will be navigating domains they don't have much expertise in without being fooled.
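To make "10% on AIME 2025" concrete, the first thing I needed was a scoring loop of my own. Here's a minimal sketch of that harness - the dataset id, column names, and prompt format are my placeholders, not something from this run; AIME answers are integers from 0 to 999, so exact match on the final boxed number is enough.

```python
# Minimal sketch of the eval loop, not what the agent actually wrote.
# The dataset id and column names ("question" / "answer") are placeholders --
# validate whatever AIME 2025 dump you actually use before trusting scores.
import re
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B-Base"

def extract_answer(text: str) -> str | None:
    # AIME answers are integers 0-999: prefer the last \boxed{...},
    # fall back to the last small integer in the completion.
    boxed = re.findall(r"\\boxed\{(\d+)\}", text)
    if boxed:
        return boxed[-1]
    nums = re.findall(r"\b\d{1,3}\b", text)
    return nums[-1] if nums else None

def main():
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
    ds = load_dataset("your-validated/aime-2025", split="test")  # placeholder id
    correct = 0
    for row in ds:
        prompt = row["question"] + "\nPut the final answer in \\boxed{}.\n"
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
        completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        correct += extract_answer(completion) == str(row["answer"]).strip()
    print(f"accuracy: {correct / len(ds):.1%}")

if __name__ == "__main__":
    main()
```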

I don't train LLMs at my day job, so it was interesting getting bitten by all the common issues one would only discover as a practitioner. These issues are trivial to work around with agents, but because they aren't well represented in the general corpus of knowledge yet, it took more self-discovery from the models. Here are a few "tacit" things I learned.

Datasets: Do not trust the benchmarks - the evals aren't all consistent and you should establish your own. You need to deeply validate the datasets on huggingface before using them. Some "math" datasets contain purely coding problems, and the AI is going to take a long time to figure out why experiments don't make sense. Correlation between datasets is not even a given: doing well on simpler math datasets like SVAMP and GSM8K doesn't mean you'll do better on a harder one like MATH, and you could do better on MATH than on SVAMP and GSM8K. The datasets themselves have so many different properties you have to normalize for: length, style of solution, whether the answer is formatted correctly / uses thinking tokens well. Building a way to understand these datasets quickly also helps the AI formulate better search plans for experiments.
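Here's roughly the kind of quick inspection pass I mean - a minimal sketch assuming a HuggingFace-style dataset with one question column and one solution column (the names are whatever your dataset uses); it just reports solution length, answer formatting, and how many rows look like code.

```python
# Minimal sketch of a dataset sanity pass. Column names are passed in because
# every dataset picks its own; the example call at the bottom is hypothetical.
import re
from statistics import mean
from datasets import load_dataset

def inspect(dataset_id: str, question_col: str, solution_col: str, n: int = 500):
    ds = load_dataset(dataset_id, split="train").shuffle(seed=0).select(range(n))
    lengths, boxed, codey = [], 0, 0
    for row in ds:
        sol = row[solution_col]
        lengths.append(len(sol.split()))
        boxed += bool(re.search(r"\\boxed\{", sol))
        # crude check for "this math problem is actually a coding problem"
        codey += bool(re.search(r"\bdef \w+\(|\bimport \w+|print\(",
                                row[question_col] + sol))
    print(f"{dataset_id}: mean solution length {mean(lengths):.0f} words, "
          f"{boxed / n:.0%} boxed answers, {codey / n:.0%} code-looking rows")

# inspect("some-org/some-math-dataset", "problem", "solution")  # hypothetical
```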

Smoke tests: Have lots of guard checks for tuning batch size / datasets to prevent yourself from going OOM or randomly erroring. As an inexperienced person in a new domain, there are definitely errors you only discover through runs - and trying to find all these error cases as fast as possible, so you can build the right checks, is a good exercise. A good prompt for any project that involves automated search is to ask for a process that helps the LLM discover all the prechecks it would need to keep its loops small.
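Here's the shape of the smoke test I ended up wanting on day one - a minimal sketch, not a real harness: push one worst-case batch (full batch size, max sequence length) through a forward and backward pass before launching anything long-running. The function and argument names are mine, purely for illustration.

```python
# Minimal OOM/shape smoke test: if one worst-case batch doesn't fit,
# the real run won't either. Names here are illustrative, not a library API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def smoke_test(model_id: str, batch_size: int, max_len: int) -> None:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).cuda()
    # Worst-case batch: every sequence padded out to max_len.
    dummy = torch.randint(0, tok.vocab_size, (batch_size, max_len), device="cuda")
    out = model(input_ids=dummy, labels=dummy)
    out.loss.backward()  # backward is usually where the OOM actually happens
    torch.cuda.synchronize()
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")

# smoke_test("Qwen/Qwen3-0.6B-Base", batch_size=8, max_len=4096)
```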

Being an expert: LLMs still won't tell you what to search across, and they won't lift you out of a bad search space. As someone who hadn't done SFT before, I was primarily concerned with finding a good dataset / curriculum to teach the model to be a better math solver. At first I was searching over the various datasets across the many dimensions you can SFT on (reasoning style, length, output format). Only after inspecting the fine-tunes did I realize that the reasoning traces the model produced didn't match my SFT data at all. That's when I realized that self-distillation and keeping the traces in-distribution was going to be more fruitful than blindly continuing SFTs.
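For what I mean by self-distillation / keeping traces in-distribution, here's a rough sketch: sample several solutions per problem from the model you're training, keep only the ones that reach the right answer, and SFT on those instead of on someone else's reasoning style. `extract_answer` is the same boxed-answer helper from the eval sketch above; everything else here is illustrative, not the exact pipeline from this experiment.

```python
# Rough sketch of rejection-sampled self-distillation data. Assumes `problems`
# is a list of {"question", "answer"} dicts and reuses extract_answer() above.
def build_self_distill_set(model, tok, problems, k: int = 8):
    keep = []
    for prob in problems:
        inputs = tok(prob["question"], return_tensors="pt").to(model.device)
        outs = model.generate(
            **inputs, max_new_tokens=1024, do_sample=True,
            temperature=0.8, num_return_sequences=k,
        )
        for seq in outs:
            trace = tok.decode(seq[inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)
            if extract_answer(trace) == str(prob["answer"]).strip():
                # Traces the model produced itself stay in-distribution,
                # unlike traces copied from a much larger teacher.
                keep.append({"prompt": prob["question"], "completion": trace})
                break  # one correct trace per problem is enough for this sketch
    return keep
```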

Some meta takeaways from doing something I didn't have much experience in, and what we'll see from here.

Job shifting: This is obvious - the job now is setting up enough guardrails + loops so that the models don't hallucinate. Even on a simple challenge such as SFT with the right datasets / curriculum, the LLMs are going to hallucinate if you don't fix the degrees of freedom they have to check. At my day job I have a lot more experience with what I'm doing, so I know how to set up my harnesses to let the agents go off and do their own thing; here, because I don't have that experience, the degrees of freedom are way higher. Finding out what to even search for and how to set up the loops is easy at a high level, but in the end you do need expertise / taste to make sure you're doing the right thing.

More science: At least in this area of research (distillation), it seems quite easy to have the LLMs read a paper and then implement it. Doing some literature review helped me realize that research in this area is in-distribution, so LLMs can basically one-shot implementations. As I play with things closer to the frontier, I can sense that we'll face an explosion of scientific papers (both incremental and more groundbreaking). Terence Tao on the Dwarkesh podcast made a statement about how it takes time for science to diffuse across society before we realize whether something is a breakthrough discovery. This was because people needed to test the boundaries of new ideas, and it used to take a lot of effort to do so. As scientific progress speeds up, the diffusion of scientific ideas will speed up too, with obvious examples being people doing literature search and applying frontier techniques faster. But we'll also see more stress testing of science, both from humans and in an automated sense - you can imagine research papers being automatically verified (previously infeasible in cost and time). Acceleration of science will obviously help the models as they get trained on new science, but the increase in scientific artifacts will also help ground the models and serve as stepping stones via in-context learning.

I wanted to form an opinion on how "learning" works now, given that we have LLMs that can do so much for us. Traditionally, you had to learn before you could actually do something. Flipping this model was uncomfortable for me and many experienced people to accept, not only because we don't know what we don't know, but also because we'll develop habits of not fully understanding the things we do. Perhaps this was always the case though, as new-grad software engineers never fully understood the entire computing stack but were still able to do things. School has a habit of indoctrinating us with the idea that we shouldn't "cheat" because we won't "learn" anything - and I still largely agree with that, but I think we now need to learn what we should spend our time learning.