Conventional wisdom around large language models tends to bucket their use cases into two clean categories.
In the first are the “easy” ones where off-the-shelf models like GPT-4 or Claude work well: general assistance, basic research, brainstorming, content generation, casual conversation. In the second are the “hard” use cases where it’s widely accepted that these models fall short: medical diagnostics, legal analysis, financial modeling, scientific research. For these specialties, the industry has coalesced around the understanding that specialized training is required: fine-tuning, reinforcement learning, or even building domain-specific foundation models from scratch.
This binary thinking misses something critical: there’s a substantial gray area of use cases that appear easy, but actually demand the sophistication of the hard bucket. These are situations where the surface-level task seems straightforward enough that context engineering should suffice, but where subject expertise reveals glaring inadequacies in how off-the-shelf models approach the problem.
Take tutoring, for example. To many engineers and investors, an AI tutor seems like a natural extension of an AI assistant, in that both involve answering questions and providing explanations. Anthropic and OpenAI recently released prompt-engineered versions of their respective base models (Claude Learning Mode and ChatGPT Study Mode), which use Socratic questioning, scaffolded responses, and knowledge checks to guide students toward understanding rather than just providing answers. Google’s recent Gemini Guided Learning Mode is an exception in that it is fine-tuned, but the training data is largely synthetic — written by AI or educators simulating tutoring, not real session transcripts.
While these “learning modes” perform better than the base models, they still fall short across some important dimensions. They routinely overwhelm students with information instead of scaffolding understanding appropriately. They don’t try to establish rapport and rarely conduct meaningful assessment of prior knowledge. They aim to conclude interactions efficiently rather than sustaining productive learning sessions for as long as students remain genuinely engaged and curious. They don’t adapt their communication style to different learning preferences, cultural backgrounds, or emotional states — factors that expert tutors recognize as essential to effective teaching.
Such shortcomings aren’t bugs that can be fixed with clever prompting. They’re inherent features of how these models were pre-trained and fine-tuned. The “helpful assistant” paradigm that makes ChatGPT great for brainstorming makes it counterproductive for teaching, where the goal isn’t to provide one-shot comprehensive answers but to establish a long-term relationship, guide discovery, and build lasting understanding.
Therapy presents an even starker example. The behaviors that make an LLM seem helpful in casual conversation, such as being verbose, offering solutions, and maintaining consistent cheerfulness, can be genuinely harmful in a therapy session. Effective therapy requires knowing when to stay silent, when to sit with discomfort, and how to reflect emotions without trying to fix them. It requires understanding trauma-informed care, attachment styles, and cultural considerations that go far beyond what any general-purpose model has been tuned for.
The problem isn’t just that these applications require domain expertise; it’s that they require that expertise to be built into the training process itself. We can’t prompt-engineer our way to pedagogically sound tutoring or clinically appropriate therapy responses when the underlying model has been optimized for entirely different behaviors.
How AI Startups Build Defensibility and Win
Accordingly, the companies that will win in these gray areas won’t be those with the best prompt engineering or the slickest user interfaces. They’ll be those with access to the right training data: real tutoring sessions with learning outcomes tracked over time, therapeutic conversations with long-term client progress measured, coaching interactions with behavioral change documented.
This data is extraordinarily difficult to obtain. It can’t be scraped from the web or synthetically generated. Even data vendors like Mercor and Scale, which employ domain expert contractors to annotate data for model training, can’t simply spin up high-quality tutoring or therapy sessions at will. These require fully fledged operations with infrastructure, client acquisition, and sustained relationships that extend beyond what data vendors can provide.
This is why I’m excited about companies that have a “data flywheel” approach to gray area applications. These businesses start by providing human-delivered services — actual tutoring sessions, real therapy appointments, live coaching conversations — and use the interactions to train increasingly sophisticated AI models that can augment or replicate human expertise. The flywheel effect comes as better AI models improve service quality, which attracts more users, generates more training data, and enables even better models.
Another benefit of this approach that’s hard to overstate: by providing these services, companies are effectively getting paid to collect high-quality training data, rather than having to pay for it.
But it’s not enough to simply provide services and capture generic interaction data. The data collection and cleaning processes are complex, involving dedicated audio channels for each speaker, high-fidelity recording systems, and background noise removal. The pipeline needs to filter out irrelevant backchanneling and interruptions, yet know when to preserve signals that matter for the job, such as pauses, tone changes, and speaking patterns in our tutoring and therapy examples.
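To make the cleaning step concrete, here is a minimal sketch of what one stage of such a pipeline might look like. The backchannel list, pause threshold, and tagging scheme are illustrative assumptions, not a description of any particular company’s system; the point is that some silences are signal and should be preserved as explicit events rather than discarded.

```python
# Illustrative sketch of one cleaning stage for diarized session transcripts.
# Assumes each utterance carries a speaker label and start/end timestamps;
# the backchannel list, pause threshold, and tagging scheme are hypothetical.
from dataclasses import dataclass

BACKCHANNELS = {"mm-hmm", "uh-huh", "yeah", "right", "okay", "i see"}
PAUSE_THRESHOLD_S = 2.0  # silences longer than this may carry pedagogical or clinical signal

@dataclass
class Utterance:
    speaker: str   # e.g. "tutor" or "student"
    text: str
    start: float   # seconds from session start
    end: float

def clean_session(utterances: list[Utterance]) -> list[dict]:
    """Drop low-signal backchannels, but keep meaningful pauses as explicit events."""
    cleaned: list[dict] = []
    prev_end = 0.0
    for utt in utterances:
        gap = utt.start - prev_end
        if gap >= PAUSE_THRESHOLD_S:
            # A tutor waiting for a student to think is signal, not noise:
            # record the pause instead of silently collapsing it.
            cleaned.append({"type": "pause", "duration_s": round(gap, 1)})
        prev_end = utt.end
        if utt.text.strip().lower().strip(".,!?") in BACKCHANNELS:
            continue  # pure backchanneling ("mm-hmm", "right") adds little training value
        cleaned.append({"type": "utterance", "speaker": utt.speaker, "text": utt.text})
    return cleaned
```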
Then comes the AI/ML engineering challenge: using this cleaned data to train multiple candidate models, and running extensive evaluations to determine which fine-tuning and reinforcement learning approaches improve real-world outcomes. As new modalities become available (e.g. voice, face/body, on-screen interaction, whiteboard, emotion detection), the process repeats: fine-tuning specialized models for each modality on domain-specific interaction data.
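For the evaluation side, one rough way to frame it is a head-to-head comparison of candidate fine-tunes on held-out real sessions, scored against what the human expert actually did next. The function below is a hypothetical sketch: the candidate names, the rubric scorer, and the inference call are stand-ins passed in as callables, not any particular library’s API.

```python
# Hypothetical sketch: score each candidate model against held-out expert turns.
# `generate_reply` and `score_vs_expert` are stand-ins for a real inference
# client and an outcome-linked rubric scorer (scaffolding, prior-knowledge
# assessment, appropriate length), not any specific framework.
from statistics import mean
from typing import Callable

def evaluate_candidates(
    candidates: list[str],                         # e.g. ["sft-on-sessions", "sft-plus-rl"]
    held_out: list[tuple[str, str]],               # (conversation context, expert's actual reply)
    generate_reply: Callable[[str, str], str],     # (model, context) -> model reply
    score_vs_expert: Callable[[str, str], float],  # (model reply, expert reply) -> score in [0, 1]
) -> dict[str, float]:
    """Return the mean rubric score per candidate on real, unseen session turns."""
    results: dict[str, float] = {}
    for model in candidates:
        scores = [
            score_vs_expert(generate_reply(model, context), expert_reply)
            for context, expert_reply in held_out
        ]
        results[model] = mean(scores) if scores else 0.0
    return results
```

The notable design choice in this framing is that candidates are scored against what the expert actually did, not against generic helpfulness, which is precisely where off-the-shelf model behavior diverges from domain practice.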
The companies that stand out from the pack will sit at the intersection of all these dimensions: domain expertise, human service delivery, sophisticated data infrastructure, and cutting-edge AI/ML engineering. Beyond academic tutoring and therapy, this pattern extends to other applications that appear deceptively simple: executive coaching (where the goal isn’t advice-giving but long-term behavior change), writing instruction (where mechanical technique matters less than developing a unique voice and perspective over time), and language learning (where effective delivery requires sustained engagement and adaptive teaching practices as the learner’s proficiency improves).
These companies face a unique challenge: they need to excel at running traditional service businesses at healthy margins while simultaneously building world-class AI capabilities. But for the rare teams that can execute across all these dimensions, the opportunity is enormous: not just to build better software, but to fundamentally improve how humans learn, heal, and grow.
If you’re building something that sits at this intersection — domain expertise, human service delivery, sophisticated data collection, and serious AI engineering — I’d love to hear from you.
James Kim is a Partner at Reach Capital, and a former math teacher and edtech product manager.