Learning science · EdTech · Critical thinking

    Deliberate practice, transfer learning, and what edtech apps do not say aloud

    8 min read

    If you build or buy learning products, you have probably seen this story before: engagement looks great, completion rates are fine, NPS is up, and six months later managers still say people are not applying what they learned.

    That gap is usually not a motivation issue. It is a design issue. The product is optimized for engagement in the moment, while the organization needs reliable judgment under real constraints. Those are two different things, and most edtech does not distinguish between them clearly enough.

    The transfer problem

    In cognitive science, transfer means the degree to which something you learned in one context actually helps you in a different one. Researchers split it into near transfer (you get better at tasks similar to training) and far transfer (you get better at tasks that are meaningfully different from training: different domain, different format, different stakes).

    Near transfer is relatively common when practice is well aligned with the target task. Far transfer is a different story entirely.

    We tend to do what we were trained to do, and little beyond that.
    Douglas Detterman (1993)

    That was a provocation at the time. Thirty years later, the data has mostly confirmed it.

    Gobet and Sala published a major review in Perspectives on Psychological Science in 2023 that looked across meta-analyses of working memory training, video games, music training, and commercial brain-training programs like Lumosity and BrainHQ. Their finding was blunt: once you control for placebo effects and publication bias, the far-transfer effect size across all these programs is essentially zero. Not small. Zero. Brain-training games make you better at brain-training games. That is about it.

    This matters because it sets the bar for any product that claims to improve “general thinking” or “reasoning ability.” If you are promising far transfer, you are promising something the literature says is extraordinarily hard to deliver, and that most programs, including well-funded, well-researched ones, have failed to deliver.

    So the honest design question for any learning product is not “can we make people feel smarter?” It is “what decisions will they actually face next week, and which parts of those decisions are we rehearsing?” If the answer is vague, the product might still work as onboarding or team culture. But it should not claim durable reasoning gains without evidence tied to a specific task class.

    Spacing and retrieval: done right vs. done cosmetically

    Two of the most replicated findings in all of learning science: distributing practice over time beats cramming (the spacing effect), and testing yourself on material beats re-reading it (the testing effect). Both are well established. Both are also easy to implement badly.

    A daily streak is not spacing if the prompts never revisit the same reasoning moves at meaningful intervals. A quiz is not retrieval practice if it only checks whether you recognize a label rather than whether you can reconstruct an argument from memory.
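
    To make that distinction concrete, here is a minimal sketch in Python. The class name, skill labels, and interval ladder are all invented for illustration, not taken from any particular product or study; the point is only that spacing has to be tracked per skill, with expanding gaps, which a streak counter never does:

        from datetime import date, timedelta

        # Illustrative assumptions: skill names and the interval ladder are
        # placeholders, not a prescription from the research discussed above.
        INTERVALS_DAYS = [1, 3, 7, 14, 30]   # expanding gaps between reviews of one skill


        class SkillScheduler:
            """Tracks when each reasoning skill is next due, unlike a streak
            counter, which counts active days and ignores what was practiced."""

            def __init__(self):
                self.state = {}   # skill -> (interval step, next due date)

            def record_review(self, skill: str, today: date) -> date:
                step, _ = self.state.get(skill, (-1, today))
                step = min(step + 1, len(INTERVALS_DAYS) - 1)
                next_due = today + timedelta(days=INTERVALS_DAYS[step])
                self.state[skill] = (step, next_due)
                return next_due

            def due_today(self, today: date) -> list[str]:
                return [skill for skill, (_, due) in self.state.items() if due <= today]


        scheduler = SkillScheduler()
        scheduler.record_review("spot-unsupported-causal-claim", date(2025, 1, 6))
        print(scheduler.due_today(date(2025, 1, 7)))   # ['spot-unsupported-causal-claim']

    A daily prompt that never consults something like the due list above can still produce a long streak while leaving individual skills unrehearsed for months.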

    A 2024 study added a useful nuance: retrieval practice can support far transfer, but only when the material has underlying rule structure. When learners are extracting principles (how an argument type works, what makes evidence strong or weak), testing helps them apply those principles to new situations later. When they are just memorizing isolated facts, the benefit stops at recall. This is a meaningful design constraint. It means the what of retrieval practice matters as much as the schedule.

    The difference between “I’ve seen this before” and “I can rebuild this from scratch” is the entire gap between superficial and effective practice.

    Why feedback quality is where most products fall short

    Spacing and retrieval are scheduling problems. Feedback is a quality problem, and it is where the gap between learning products and actual learning tends to be widest.

    For reasoning tasks specifically, correctness feedback (“that’s wrong, here’s the right answer”) has limited value. What actually moves the needle is structural feedback: what was claimed, what evidence was available, how a conclusion does or does not follow. That kind of feedback is slower to author and harder to score automatically, which is exactly why most products default to thin feedback even when the marketing copy promises “deeper thinking.”
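
    A rough way to see why the two kinds of feedback cost so differently is to compare the data each one has to carry. The field names below are assumptions made for this post, not any product’s actual schema; correctness feedback is a flag and a string, while structural feedback has to capture the claim, the evidence, and the inference separately before it can say anything useful:

        from dataclasses import dataclass

        # Illustrative data shapes only; field names are invented for this post.


        @dataclass
        class CorrectnessFeedback:
            correct: bool
            right_answer: str       # cheap to author, easy to score automatically


        @dataclass
        class StructuralFeedback:
            claim: str              # what the text actually asserted
            evidence: str           # what support was offered, or noted as missing
            inference: str          # where the conclusion does or does not follow
            learner_gap: str        # which of those steps the learner's answer missed

    Everything in the second shape has to be authored or checked by someone who understands the argument, which is exactly the cost most products avoid.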

    The incentive problem is real. Thin feedback is cheap to produce, easy to instrument, and does not create the friction that makes users churn in week two. Rich structural feedback takes longer, feels harder, and can produce worse 7-day retention numbers while producing better 90-day learning outcomes.

    Real text vs. clean puzzles

    Pattern-matching puzzles improve performance on similar puzzles. That is a real finding. But the assumption that this transfers to reading a messy report, evaluating a chart under time pressure, or weighing conflicting claims in an unfamiliar domain requires separate evidence, and the evidence is thin.

    Training on real-world material introduces noise, ambiguity, and the kind of variability people actually encounter outside the app. These are features, not bugs. When products strip out ambiguity to make sessions feel clean and rewarding, they may be optimizing for user satisfaction while under-training the exact messiness that makes real decisions hard.

    There is also a counterintuitive 2023 finding worth mentioning: deliberately generating wrong answers and then correcting them can enhance far transfer more than practicing without errors. The idea is that making a mistake forces you to engage with the structure of a concept more deeply than getting it right on the first try. Uncomfortable, but consistent with what deliberate practice actually looks like when done properly: effortful, not smooth.

    What we take into thessea

    thessea is built around short sessions on real text (news articles, research summaries, AI-generated drafts), with feedback organized around claims, evidence, and reasoning. We use the CER (claim, evidence, reasoning) framework not because it sounds academic, but because those are the operations you are running when you read something and try to decide whether it is trustworthy.

    We focus on near transfer first: can you get measurably better at evaluating comparable texts? Far transfer (whether practicing on thessea makes you a better decision-maker in high-stakes, novel contexts) is something we treat as a question to test, not an outcome to assume. Given what the research says, anyone who tells you otherwise about their product is probably ahead of their evidence.

    What we can commit to is designing for the conditions the science says matter: tasks that resemble real decisions, spaced practice that revisits the same skills over weeks, retrieval that requires construction rather than recognition, and feedback that explains reasoning structure rather than just flagging right or wrong.

    Further reading