Custom LLM fine-tuning for proprietary data

There's a moment every developer hits when they realize the AI they've been excited about has a ceiling. You're building something real, not a demo, and you type in something that only makes sense in your specific system. And the model just... doesn't get it.

That moment hit me while working on bdbuddy.ai, a SaaS platform we're building. We had integrated OpenAI APIs to let users create campaigns and tasks through natural language prompts instead of filling out manual forms. 

The idea is solid: give users a way to describe what they want in plain English, allow file uploads, and have the AI validate, link, and assemble the right response payload.

It works, until it doesn't.

When users provide context that's generic, the model handles it well. When they reference something specific to the platform's domain (particular campaign types, how task linking works, what a "valid" payload actually looks like in our system), it starts making things up or producing outputs that pass structural validation but fail business logic. The model doesn't know our system. It's working from general knowledge and guessing the rest.

That experience is what made me dig deeper into fine-tuning and specifically into what it means to fine-tune on proprietary data.

What "Proprietary Data" Actually Means

Before getting into how fine-tuning works, let me be specific about what proprietary data actually is, because this is the entire reason fine-tuning exists for this use case.

Proprietary data is the knowledge that lives exclusively inside your organization and has never appeared anywhere on the public internet. It includes things like:

  • Internal runbooks and incident response playbooks
  • Your codebase's architecture decisions and the reasoning behind them
  • Support ticket history and how your team resolved edge cases
  • Domain-specific business rules, the kind that exist in someone's head and occasionally in a Confluence page
  • Client communication patterns and the vocabulary your team has developed around a specific product
  • Custom error codes, service naming conventions, and integration quirks that are unique to your system

A base model, no matter how capable, has never seen any of this. It was trained on publicly available text, which means it has broad general knowledge but zero knowledge of the things that make your system yours. When it encounters your domain-specific context, it fills the gaps with plausible-sounding guesses. Sometimes those guesses are close enough. Often, in production, they aren't.

Fine-tuning is how you close that gap. You take a capable base model and train it further on your proprietary data, so it internalizes your patterns, your terminology, and your reasoning, not as context handed at runtime, but as something it genuinely knows.


The Real Gap: General vs. Contextually Useful

Most developers I've worked with assume AI is either useful or it isn't. That's the wrong mindset. The problem isn't the model's capability; it's that the capability is abstract until it's grounded in your specific context.

Think about what happens when a new developer joins a project. Even a brilliant one takes a couple of weeks before they're truly productive. They need to understand the codebase structure, the naming conventions, the reasons certain architectural decisions were made, and the quirks of the existing system. On bdbuddy.ai, someone new wouldn't immediately know that campaign creation requires a linked task before it can be published, or what happens to legacy users when a new payment flow is introduced. That knowledge lives in the team's heads, the PRs, the Slack threads, not in any public documentation.

A general-purpose LLM is that brilliant new hire on day one. Fine-tuning is the onboarding process.

Technically, fine-tuning takes a pre-trained base model and runs a second training pass on a curated dataset of your own examples. The model's internal parameters are updated so it internalizes your specific patterns, terminology, and reasoning. After fine-tuning, it doesn't need to be told about your conventions every time; they're part of how it thinks.

Before You Fine-Tune: The Step Most Teams Skip

Fine-tuning is a meaningful investment. Before committing to it, there's a faster way to validate whether the model can actually learn your patterns at all: few-shot prompting.

The idea is simple: instead of training the model, you give it 5 to 10 carefully crafted examples of exactly the behavior you want, directly inside the system prompt, and see how it responds to new queries. Modern models have long enough context windows that you can fit a substantial number of domain-specific examples in a single prompt, and context caching (available in most major APIs) makes sending those large prompts significantly cheaper and faster than it used to be.

If few-shot prompting gets you 80% of the way there, you may not need fine-tuning yet. If the model still misunderstands your domain, produces inconsistent output formats, or breaks down at scale, that's when fine-tuning becomes the right call. Fine-tuning is what you reach for when few-shot prompting becomes too expensive to serve at volume, too slow at inference time, or simply stops scaling as your use cases grow more complex. Treating few-shot prompting as a validation step first saves a lot of engineering time.
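To make that concrete, here's a minimal sketch of a few-shot validation run, assuming the official OpenAI Python client; the model name, the example payloads, and the field names are illustrative placeholders rather than bdbuddy.ai's real schema:

Code

# A minimal few-shot validation sketch using the OpenAI Python client.
# Model name and example payloads are placeholders, not real internals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = """\
Example 1:
User: Create a campaign called 'Q2 Win-back' for lapsed trial users, linked to task #1001.
Assistant: {"campaign_name": "Q2 Win-back", "target_segment": "lapsed_trial", "linked_task_id": 1001, "status": "draft"}

Example 2:
User: Schedule the 'Beta Invite' campaign for March 3rd against task #2002.
Assistant: {"campaign_name": "Beta Invite", "linked_task_id": 2002, "scheduled_date": "2025-03-03", "status": "draft"}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "You convert campaign requests into JSON payloads. "
                    "Follow the format of these examples exactly:\n" + FEW_SHOT_EXAMPLES},
        {"role": "user",
         "content": "Create a campaign called 'Summer Reactivation' linked to task #4421."},
    ],
)
print(response.choices[0].message.content)

If outputs from a handful of prompts like this are consistently close to what you need, you've validated the pattern cheaply, before committing to a training pipeline.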

Fine-Tuning vs. RAG: Two Different Problems

When this topic comes up, someone always asks: Why not just use RAG?

It's a fair question, but they solve different problems. Understanding the difference saves you from building the wrong thing.

Retrieval-Augmented Generation (RAG) works by having the model search a database of your documents at query time and include that retrieved content in its context before answering. It's excellent for questions where the answer exists in a document somewhere and just needs to be found and surfaced. In our internal tracking system, if a user wanted to know how timesheet validation works, RAG could pull the relevant policy doc and summarize it accurately.

But RAG doesn't change how the model writes or reasons. If your use case is about style, format, domain-specific inference, or consistent behavior across a wide range of queries, RAG won't help. Retrieving more context doesn't change the model's approach to using that context.

Fine-tuning changes the model itself. It learns your patterns so deeply that it doesn't need to look them up. For something like campaign generation on bdbuddy.ai, where the model needs to understand the relationships between entities in the system and generate a structurally valid payload, not just describe one, that's a reasoning and behavior problem, not a retrieval problem.

There's also a compliance angle worth noting here. If an enterprise client churns or an employee leaves and invokes their right to data deletion under GDPR, you can remove their documents from a RAG vector database cleanly and immediately. You cannot do the same with a fine-tuned model; once data is baked into the weights, it's there unless you retrain from a prior checkpoint. This is a real architectural constraint that enterprise teams encounter, and it's one more reason the right default is a hybrid approach: fine-tune the model on general domain patterns and reasoning style, and keep individual client or user data in the retrieval layer, where it can be managed and deleted independently.

The honest answer for serious production applications is: you need both. Fine-tune the model to understand your domain's reasoning and output expectations. Use RAG to feed it live, up-to-date facts at query time. One changes how it thinks; the other changes what it knows in the moment.

From Raw Files to Training-Ready Data: The Pipeline That Actually Matters

This is the part that determines whether your fine-tuned model is genuinely useful or just expensive. You can pick the right architecture, configure LoRA correctly, and choose a solid base model, but if the data going into training is noisy, inconsistent, or unvetted, none of that saves you.

Here's what the pipeline actually looks like, step by step.

1: Ingesting Raw Files

Your company's knowledge doesn't live in one clean place. It's scattered across PDFs of product specs, Word documents with process runbooks, Excel sheets tracking historical data, plain text exports from wikis, Slack threads, Confluence pages, and code comments. Fine-tuning starts by pulling all of this in.

Tools like Docling (open-source, supports PDF, DOCX, HTML, Excel, and more) handle the conversion from heterogeneous formats into clean, structured plain text. This step sounds straightforward, but often isn't. PDFs with tables lose their structure on extraction, scanned documents need OCR, and Excel files may have multiple sheets with different schemas. Plan for edge cases here because they're common, not exceptional.
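As a rough sketch, ingestion with Docling can be as small as this; the directory names are placeholders, and the exact export methods are worth checking against the Docling version you install:

Code

# Convert heterogeneous source files into clean Markdown with Docling.
# Directory names are illustrative placeholders.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

for source in Path("raw_docs").glob("**/*"):
    if source.suffix.lower() not in {".pdf", ".docx", ".html", ".xlsx"}:
        continue
    result = converter.convert(source)
    # Markdown export preserves headings and tables better than raw text,
    # which pays off later when chunking along section boundaries.
    text = result.document.export_to_markdown()
    out = Path("clean_text") / (source.stem + ".md")
    out.parent.mkdir(exist_ok=True)
    out.write_text(text, encoding="utf-8")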

2: Cleaning and Quality Control

Raw extracted content is a mess: navigation menus scraped alongside article content, auto-generated boilerplate, document version artifacts, and duplicate content from templates that were copied and slightly modified dozens of times. I once saw a model heavily overweight a single error response simply because a quarterly review template had been copied forty times.

This step involves stripping markup, removing very short or low-quality snippets, deduplicating both exact copies and near-duplicates using fuzzy matching, and flagging documents with unusually high ratios of special characters or formatting artifacts.
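A naive version of that deduplication pass, using only the standard library, might look like the following; real pipelines typically switch to MinHash/LSH once the corpus grows, since pairwise comparison is quadratic:

Code

# Exact and near-duplicate removal over cleaned text snippets.
# A naive O(n^2) sketch; use MinHash/LSH at scale.
import hashlib
import difflib

def dedupe(snippets: list[str], near_threshold: float = 0.9) -> list[str]:
    seen_hashes: set[str] = set()
    kept: list[str] = []
    for text in snippets:
        normalized = " ".join(text.lower().split())
        if len(normalized) < 50:           # drop very short, low-value snippets
            continue
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen_hashes:          # exact duplicate
            continue
        # Fuzzy match against everything we've already kept.
        if any(difflib.SequenceMatcher(None, normalized,
               " ".join(k.lower().split())).ratio() > near_threshold
               for k in kept):
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept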

3: PII, Secret Scanning and Pseudonymization

This is a hard gate; nothing should proceed to training until it passes.

Automated Personally Identifiable Information (PII) detection scans every document for email addresses, phone numbers, home addresses, financial identifiers, and any other data that could identify an individual. Microsoft Presidio is the most widely used open-source tool for this, supporting over 50 PII entity types and allowing custom patterns for domain-specific identifiers.
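A minimal Presidio pass looks roughly like this; the entity list is trimmed for illustration, and the default setup assumes a spaCy English model is installed:

Code

# Scan a document for PII with Microsoft Presidio and mask what it finds.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com or +1-555-0100."

findings = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],  # trimmed list
    language="en",
)
# The default operator replaces each span with its entity label,
# e.g. <EMAIL_ADDRESS>; custom operators can emit your own placeholders.
masked = anonymizer.anonymize(text=text, analyzer_results=findings)
print(masked.text)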

Secret scanning runs in parallel: tools like truffleHog and gitleaks look for API keys, database connection strings, private SSH keys, JWT secrets, and similar credentials that may have ended up in runbooks or wiki pages without anyone intending them to be permanent. Engineers paste things from terminals. That content makes its way into documentation. It happens all the time.

Anything flagged gets replaced with consistent synthetic placeholders before training data is finalized. The model needs to learn the structure and context, not the actual values.

Beyond PII, you'll have internal identifiers that are sensitive without being strictly personal: server hostnames, internal product codenames, client names, and environment-specific configuration strings. These get replaced with labeled placeholders that are consistent across the entire dataset.

prod-db-01.internal becomes [SERVER_A]. client-enterprise-corp becomes [CLIENT_1]. The critical requirement is consistency. If prod-db-01 maps to [SERVER_A] in one document, it must map to [SERVER_A] everywhere. Otherwise, the model learns incoherent relationships between the placeholder and the rest of the context.
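One way to enforce that consistency is a small mapping layer that assigns a placeholder on first sight and reuses it forever after; the hostname pattern below is a hypothetical example, and numbered placeholders work the same way as lettered ones:

Code

# Consistent pseudonymization: the same identifier always maps to the
# same placeholder across the whole corpus. The regex is a hypothetical
# pattern for internal hostnames like prod-db-01.internal.
import re

class Pseudonymizer:
    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}
        self.counters: dict[str, int] = {}

    def replace(self, value: str, label: str) -> str:
        if value not in self.mapping:
            self.counters[label] = self.counters.get(label, 0) + 1
            self.mapping[value] = f"[{label}_{self.counters[label]}]"
        return self.mapping[value]

    def apply(self, text: str) -> str:
        return re.sub(
            r"\b[\w-]+\.internal\b",
            lambda m: self.replace(m.group(0), "SERVER"),
            text,
        )

p = Pseudonymizer()
print(p.apply("Failover moved traffic from prod-db-01.internal to prod-db-02.internal."))
# -> "Failover moved traffic from [SERVER_1] to [SERVER_2]."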

4: Chunking into Meaningful Segments

Most documents aren't naturally structured as training examples. A 15-page runbook needs to be broken into segments that represent distinct, answerable units of knowledge, split along natural boundaries like headings and sections, not arbitrary character counts.

Chunks that are too small lose context. Chunks that are too large force the model to work with more information than a realistic query would surface. For most use cases, segments of 200–800 words that cover a single coherent topic work well as a starting point.
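A heading-aware chunker in that spirit might look like this sketch; it assumes the ingestion step produced Markdown, and it leaves single oversized sections intact rather than cutting them mid-thought:

Code

# Split a Markdown document along heading boundaries, then merge
# sections until each chunk lands in a target word range.
import re

def chunk_markdown(text: str, min_words: int = 200, max_words: int = 800) -> list[str]:
    # Split at zero-width positions just before #, ##, or ### headings.
    sections = re.split(r"(?m)^(?=#{1,3} )", text)
    chunks: list[str] = []
    current = ""
    for section in sections:
        candidate = (current + "\n" + section).strip()
        if len(candidate.split()) > max_words and current:
            chunks.append(current)
            current = section
        else:
            current = candidate
    if current and len(current.split()) >= min_words:
        chunks.append(current)
    elif current and chunks:
        chunks[-1] += "\n" + current  # fold a short tail into the previous chunk
    return chunks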

5: Synthesizing Instruction Pairs

This is where your cleaned, chunked content becomes actual training data. Each chunk needs to be paired with a realistic prompt: the kind of question or request a user might actually make that this chunk would answer.

For large datasets, this synthesis step can itself be done with an LLM: prompt a capable model to generate question-answer pairs from each document chunk. It's practical, but it comes with a risk that doesn't get talked about enough. When an LLM generates your training data, the fine-tuned model tends to absorb the voice of the generating model: the same sentence structures, the same hedging patterns, the same way of organizing an answer. Left unchecked, you end up with a model that sounds exactly like ChatGPT, not like your team. It also loses the edge cases and natural variance that human-written content carries.

Human spot-checking here isn't just for catching factual errors. It's for preserving the distinctiveness of your data, the quirks, the brevity, the specific vocabulary your team actually uses. That variance is what makes the fine-tuned model sound like you, not like a generic AI that happened to read your documents.
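A synthesis loop with that human gate built in might look like this sketch; the model name and prompt wording are placeholders, and the voice-preservation instruction is one way to push back against the homogenization problem described above:

Code

# Generate a candidate instruction pair from a chunk with an LLM,
# then queue it for human spot-checking. Model name is a placeholder.
import json
from openai import OpenAI

client = OpenAI()

SYNTHESIS_PROMPT = """Given the internal documentation below, write one realistic
question a team member might ask, and answer it using only this content.
Match the terse, direct voice of the source text; do not smooth it into
generic assistant prose. Return JSON: {{"prompt": "...", "completion": "..."}}

---
{chunk}
"""

def synthesize_pair(chunk: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable generator model works
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": SYNTHESIS_PROMPT.format(chunk=chunk)}],
    )
    pair = json.loads(response.choices[0].message.content)
    pair["needs_review"] = True  # every synthetic pair goes to a human queue
    return pair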

The structure of an instruction pair depends on the task. For structured output like our campaign creation feature on bdbuddy.ai, it looks like this:

Code

{
  "prompt": "Create a campaign called 'Summer Reactivation' targeting dormant users from Q1. Link it to task #4421 and set the send date for next Friday.",
  "completion": {
    "campaign_name": "Summer Reactivation",
    "target_segment": "dormant_q1",
    "linked_task_id": 4421,
    "scheduled_date": "2025-07-18",
    "status": "draft",
    "validation": "linked_task_verified"
  }
}
      

For text-based output like generating a postmortem summary or writing a PR description in your team's style, it looks more like this:

Code

{
  "prompt": "Write a postmortem summary for an incident where the payments service returned 503 errors for 12 minutes due to a misconfigured load balancer health check.",
  "completion": "## Incident Summary\n\n**Service**: payments-api\n**Duration**: 12 minutes\n**Root Cause**: Health check thresholds on the load balancer were misconfigured after the last deployment, causing healthy instances to be marked as unavailable.\n**Impact**: Payment requests failed for all users during the window. No data loss occurred.\n**Resolution**: Reverted load balancer config to last known good state. Health checks resumed normal behavior within 2 minutes of rollback.\n**Follow-up**: Add load balancer config validation to the pre-deployment checklist."
}
      

Both are valid fine-tuning targets. The format of the completion just reflects what you want the model to produce.


6: Formatting into JSONL and Final Validation

Everything gets converted to .jsonl format: one JSON object per line, with consistent field names across the entire dataset. This is the format most fine-tuning frameworks expect.

One thing worth knowing before you get here: modern instruction-tuned models and most fine-tuning frameworks (OpenAI's API, Axolotl, LLaMA-Factory) expect the JSONL to use conversational role keys rather than plain prompt and completion. The structure looks like this:

Code

{
  "messages": [
    {
      "role": "system",
      "content": "You are a campaign assistant for bdbuddy.ai..."
    },
    {
      "role": "user",
      "content": "Create a campaign called 'Summer Reactivation'..."
    },
    {
      "role": "assistant",
      "content": { "campaign_name": "Summer Reactivation..." }
    }
  ]
}
      

This messages-with-roles structure (often called chat or ChatML-style format) maps directly to how instruction-tuned models process conversations internally. If your dataset uses prompt/completion keys and your chosen framework expects messages with roles, you'll hit a format mismatch before training even starts. Worth getting right in the pipeline rather than discovering it later.
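A small conversion script closes that gap; the system prompt and file names below are placeholders, and note that assistant content is serialized to a string rather than nested as a raw object:

Code

# Convert prompt/completion pairs into chat-format JSONL.
# SYSTEM_PROMPT and file names are placeholders.
import json

SYSTEM_PROMPT = "You are a campaign assistant for bdbuddy.ai."

with open("pairs.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        pair = json.loads(line)
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": pair["prompt"]},
                # Assistant content must be a string; serialize structured
                # payloads instead of nesting raw objects.
                {"role": "assistant", "content": pair["completion"]
                    if isinstance(pair["completion"], str)
                    else json.dumps(pair["completion"])},
            ]
        }
        dst.write(json.dumps(record) + "\n")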

Before training begins, manually review a statistically significant random sample. Ask: Do the completions sound like how your team actually writes? Are the prompts realistic? Is sensitive information genuinely absent? This review step is where you catch systematic issues that automated checks missed: unbalanced topic coverage, a whole document type that was extracted poorly, or prompts that are subtly leading in ways that will teach the model bad habits.
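Pulling that sample doesn't need anything fancy; a seeded random draw keeps the review reproducible across team members:

Code

# Pull a reproducible random sample of training records for human review.
import json
import random

records = [json.loads(line) for line in open("train.jsonl")]
random.seed(42)                        # reproducible across reviewers
for rec in random.sample(records, k=min(200, len(records))):
    print(json.dumps(rec, indent=2))   # route to your review tool of choice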

A dataset of 10,000 carefully reviewed pairs will produce a better model than 100,000 pairs of questionable provenance. Quality compounds; noise compounds too.

Choosing a Base Model: A Decision That Affects Everything

Before training starts, there's a foundational decision most guides gloss over: which base model do you build on? And for proprietary data specifically, this choice has privacy implications that matter more than people realize.

You broadly have two paths:

Open-source models (Llama, Mistral, Falcon) give you full control; your proprietary training data never leaves your environment, and you own the weights, the deployment, and the versioning. You can run them on your own hardware, or on managed cloud infrastructure like AWS Bedrock, GCP Vertex, or Azure ML, where the data stays strictly within your VPC without you having to physically manage GPUs. That middle path, open-source models on managed cloud, is actually what most enterprises land on, because it gives you data privacy without forcing you to stand up bare-metal infrastructure.

Hosted fine-tuning APIs (OpenAI, Google Gemini) are the faster path to get started. You upload your dataset, configure a job, and get back a fine-tuned model endpoint. But your training data is sent to a third-party server. For many organizations, this is acceptable. For others, especially those with customer data, legal constraints, or competitive sensitivity around their internal processes, it's a hard no.

The question to ask internally before choosing: how sensitive is the data we're fine-tuning on, and what are our obligations around it? That answer should drive the infrastructure decision, not the other way around.

Beyond privacy, model size matters for the quality-cost tradeoff. 7B–13B parameter models are the practical sweet spot for most enterprise use cases: capable enough for complex domain reasoning, small enough to run on a single GPU at inference time. 70B models produce better results on genuinely complex tasks but cost significantly more to serve.

Making It Efficient: LoRA and QLoRA

Full fine-tuning, updating every parameter in a large model, requires computing power that most teams don't have access to. This is where LoRA (Low-Rank Adaptation) becomes important.

Rather than updating the full model, LoRA freezes the existing weights and trains small adapter layers that sit alongside them. These adapters learn the delta between "what the base model knows" and "what your specific domain requires." The result is training that's up to 90% cheaper, with quality close to full fine-tuning. The adapter weights are also small, often just a few hundred MB compared to multi-GB base models, which makes them easy to version and swap independently.

Quantized Low-Rank Adaptation (QLoRA) goes further by compressing the base model's weights before training, reducing memory requirements enough that you can fine-tune large models on a single high-end GPU rather than a multi-GPU cluster. These two techniques together are what make fine-tuning practically accessible for engineering teams working within realistic budgets and timelines.
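Put together, a QLoRA setup with Hugging Face transformers and peft looks roughly like this; the base model name and every hyperparameter here are illustrative starting points, not tuned values:

Code

# QLoRA-style setup: 4-bit quantized base model with LoRA adapters.
# Model name and hyperparameters are illustrative starting points.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                    # conservative rank; raise cautiously
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params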

The Two Ways Training Goes Wrong

Proprietary datasets tend to be smaller than public ones; you might have 5,000 well-prepared instruction pairs, or 15,000 if you've done thorough data collection. That size constraint introduces two distinct failure modes that are worth knowing before you start.

The first is overfitting. This happens when the model trains for too long on a limited dataset and starts memorizing examples rather than generalizing from them. Instead of learning "how to write a postmortem in our style," it learns "reproduce these specific sentences when it sees this kind of prompt." The model appears to perform well on training examples and falls apart on anything slightly different.

The second is catastrophic forgetting, essentially the opposite problem. The model learns your domain so deeply that it starts to degrade at general tasks. It might begin producing grammatically awkward responses, losing the conversational ability that made the base model useful in the first place. You traded general capability for narrow specialization and went too far.

Both failure modes are caught the same way: monitor your validation loss throughout training, not just at the end, and stop early when it plateaus or starts increasing. Keep your LoRA rank conservative; a higher rank means more expressive adapters but also more risk of both issues at once. More training epochs are not always better; with small proprietary datasets, more is often worse.
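With the Hugging Face Trainer, that monitoring-and-early-stopping discipline translates into configuration like the sketch below; the datasets are assumed to be prepared elsewhere, and the patience and step counts are starting points to adjust:

Code

# Early stopping on validation loss with the Hugging Face Trainer.
# train_dataset/eval_dataset are assumed to exist; values are starting points.
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,              # small datasets rarely need more
    eval_strategy="steps",           # older versions call this evaluation_strategy
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,     # roll back to the best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                     # e.g. the PEFT model from the previous sketch
    args=args,
    train_dataset=train_dataset,     # assumed prepared elsewhere
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()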

Security Is Not an Afterthought

Fine-tuning on proprietary data introduces a specific risk that doesn't exist with general model usage: the possibility that the model memorizes sensitive information and later reveals it through adversarial prompting.

If training data contains real API keys, internal IP addresses, or customer records that weren't properly masked, that information can potentially be extracted from the model by someone with API access and enough patience. This isn't theoretical; it's been demonstrated on production systems. The PII scanning, secret scanning, and pseudonymization steps in the pipeline are your primary defenses, and they need to be treated as mandatory, not optional.

Beyond the training pipeline, the model in production should run inside a VPC with no outbound internet access. Your fine-tuned weights are a proprietary asset; they should never leave your controlled infrastructure.

How Do You Know If It Worked?

The least glamorous part of fine-tuning is also the most important: knowing whether the model is actually better. Think of evaluation as three tiers of increasing depth and cost. You move down the tiers as the stakes increase or automated signals stop being sufficient.

Tier one is automated metrics. ROUGE scores measure how closely the model's output matches reference completions in your validation set. Perplexity measures how surprised the model is by held-out examples; lower means the patterns are more familiar. These run instantly and are useful as training signals, but they can't tell you whether an output actually makes sense in your domain or follows your team's reasoning style. They're a floor, not a ceiling.
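Both tier-one metrics are a few lines with the evaluate library; the output and reference lists here are assumed to exist from a generation pass over your validation set:

Code

# Tier one: ROUGE against reference completions, and perplexity from
# mean token-level loss on a held-out set.
import math
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=model_outputs,      # assumed: list of generated strings
    references=reference_outputs,   # assumed: matching gold completions
)
print(scores["rougeL"])

# Perplexity is exp(mean cross-entropy loss) over held-out tokens,
# e.g. mean_eval_loss = trainer.evaluate()["eval_loss"].
perplexity = math.exp(mean_eval_loss)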

Tier two is LLM-as-a-Judge. Because full human review is slow and expensive, the practical industry approach for rapid iteration is to use a large, highly capable model (GPT-4o or Claude), given a detailed rubric and your golden set, to score your fine-tuned model's outputs. You define what "good" looks like: accuracy, format adherence, tone, and domain correctness. The judge model evaluates each response against that rubric and flags regressions. It's significantly faster than human review and far more context-aware than ROUGE. For most teams, this becomes the primary evaluation layer between training runs.
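A judge call is structurally simple; the rubric wording and judge model below are placeholders, and in practice you'd version the rubric alongside your golden set:

Code

# Tier two: score a fine-tuned model's output against a rubric with a
# stronger judge model. Rubric and model name are placeholders.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the RESPONSE to the PROMPT from 1-5 on each axis:
accuracy, format adherence, tone, domain correctness.
Return JSON: {{"accuracy": n, "format": n, "tone": n, "domain": n, "notes": "..."}}

PROMPT: {prompt}
RESPONSE: {response}"""

def judge(prompt: str, response: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": RUBRIC.format(prompt=prompt, response=response)}],
    )
    return json.loads(result.choices[0].message.content)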

Tier three is human evaluation. Have domain experts from your own team blind-review outputs from the base model and the fine-tuned model side-by-side on the same prompts, and assess which responses are more accurate, more appropriately formatted, and more aligned with how your team actually communicates. This is the gold standard, but it doesn't need to happen after every training run; it's most valuable at major checkpoints, like before a production deployment or after a significant dataset expansion.

Across all three tiers, maintain a held-out golden set of representative prompts and run your model against it before every deployment. That's what catches regressions: the cases where a new training round quietly degraded something it didn't touch.

Fine-Tuning Is an Iterative Process, Not a One-Time Job

Fine-tuning is rarely something you do once and consider finished.

The realistic cycle looks like this: train on your initial dataset, evaluate across your three tiers, identify where the model still falls short, collect more targeted training data to address those gaps, and retrain. Each iteration tightens the model's alignment with your domain. The first run gets you 70% of the way there. The second and third runs close the remaining gap.

What makes this cycle genuinely powerful is where the best new training data comes from: your own production system. On bdbuddy.ai, when a user edits an AI-generated campaign payload before submitting it, that edit is a signal. The original prompt, the model's output, and the human correction form a natural training pair that captures exactly where the model fell short. Collecting those corrections systematically through something as simple as a thumbs-up/thumbs-down on generated outputs, or logging when users modify a response, builds what's called a data flywheel. V1 of the model generates outputs; users correct the ones that miss; those corrections become the training data for V2. This process is what Direct Preference Optimization (DPO) formalizes at scale: instead of just showing the model good examples, you show it pairs of outputs and tell it which one was preferred.
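The resulting preference records are simple to store. In the prompt/chosen/rejected layout that DPO trainers (for example, TRL's DPOTrainer) expect, a single record from that flywheel might look like this, with values truncated for illustration:

Code

{
  "prompt": "Create a campaign called 'Summer Reactivation' targeting dormant users from Q1...",
  "chosen": "{\"campaign_name\": \"Summer Reactivation\", \"linked_task_id\": 4421, \"status\": \"draft\"}",
  "rejected": "{\"campaign_name\": \"Summer Reactivation\", \"linked_task_id\": null, \"status\": \"published\"}"
}

Here "chosen" is the payload after the user's correction and "rejected" is the model's original output; the trainer learns from the contrast between the two rather than from the good example alone.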

The model that results from three or four of these iterations, trained partly on its own corrected mistakes, is better in ways that are hard to achieve from static documents alone.

Why It's Worth Building

The argument for fine-tuning on proprietary data comes down to one thing: a general API endpoint is a commodity. Everyone with a credit card has access to the same base capability. That's useful, but it's not a differentiator.

A model trained on your company's documentation, codebase patterns, support history, and internal processes becomes something no one else has. It gets better as your company produces more high-quality data. It develops an understanding of your domain that can't be replicated by someone who hasn't lived in your systems for years. And because it runs on your infrastructure, trained on data that never leaves your environment, it's an asset you own rather than a service you're renting alongside every competitor.

There's real engineering investment upfront: building a secure data pipeline, preparing and auditing training data, making the right base model choice, and managing the iterative training cycle. But the ongoing cost drops significantly, and what you build compounds in a way that a shared API endpoint simply cannot.

That gap between renting capability and owning it is what this whole investment is actually about.
