Multi-Modal Prompting

Combine text, images, and documents for richer AI interactions

Beyond Text: The Power of Vision

Modern LLMs can process images, PDFs, screenshots, and other visual inputs alongside text. Multi-modal prompting unlocks new use cases: analyzing charts, extracting data from documents, answering questions about diagrams, and more.

Combining Text + Image Inputs

Provide Clear Instructions for Visual Analysis

Just like text prompts, image analysis benefits from specific, clear instructions.

❌ Vague:

[Uploads chart image]
"What do you see?"

✓ Specific:

[Uploads chart image]
"Analyze this sales chart. Identify: 1) Overall trend, 2) Largest month-over-month change, 3) Any anomalies that need investigation."

Reference Specific Parts of the Image

Guide the LLM's attention to specific areas when analyzing complex images.

[Uploads UI mockup screenshot]

Review this dashboard mockup for UX issues:

Focus on the top navigation bar — is the hierarchy clear?
In the left sidebar, evaluate icon clarity and labeling
For the main data table, assess readability and information density
Check color contrast for accessibility (WCAG AA standards)

Combine Multiple Images

Some LLMs can process multiple images in a single prompt for comparison or context.

Use Cases:

• Before/After: "Compare these two product photos. Which lighting and composition is more effective for e-commerce?"

• Design Iterations: "Review these 3 logo variations. Rank them by clarity, memorability, and versatility."

• Consistency Check: "These are screenshots from different pages of our app. Identify any inconsistencies in UI patterns or branding."

Document Analysis (PDFs, Screenshots)

Extract Structured Data from Documents

LLMs can read PDFs, invoices, receipts, contracts, and extract specific information.

[Uploads invoice PDF]

Extract the following fields from this invoice and return as JSON:

{
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "vendor_name": "string",
  "vendor_address": "string",
  "total_amount": "number",
  "currency": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ],
  "payment_terms": "string",
  "due_date": "YYYY-MM-DD or null"
}

If any field is not found, use null.

Analyze Screenshots & UI

Upload screenshots for bug reports, UX feedback, or automated testing insights.

Example Prompts:

Bug Report Analysis

"This is a screenshot of an error. Describe the issue, identify likely causes, and suggest debugging steps."

Accessibility Audit

"Review this page screenshot for WCAG 2.1 AA compliance. Check color contrast, text size, touch target sizes, and semantic structure."

Competitive Analysis

"Analyze this competitor's pricing page. What persuasion tactics are they using? How does their information hierarchy work?"

Process Multi-Page Documents

For long PDFs, specify which sections or pages to focus on, or break into chunks.

Strategies:

• Page Range: "Focus on pages 3-7 of this contract, which contain the pricing terms."
• Section-Based: "Find and summarize the 'Termination Clause' section."
• Chunked Processing: For very long docs, process sequentially and maintain context across chunks

Visual Question Answering

Ask Specific Questions About Visual Content

Frame questions precisely to get accurate answers about charts, diagrams, or photos.

[Uploads architecture diagram]

Questions:

What database technology is being used?
How many microservices are shown?
What is the data flow from user request to database?
Are there any single points of failure in this architecture?
What caching strategy is depicted?

Quantitative Analysis of Charts

LLMs can read values from charts, though OCR accuracy varies. Verify critical numbers.

Example Prompts:

• "What was the value in Q2 2024? Calculate the percentage change from Q1 2024."

• "Which product category had the highest revenue in this pie chart? What percentage of total?"

• "Identify the peak value in this time series and when it occurred."

⚠️ Note: Always verify extracted numbers for financial or critical decisions

Object Detection & Counting

Ask the LLM to identify, locate, or count specific objects in images.

Use Cases:

• Inventory: "Count the number of boxes visible in this warehouse photo."

• Quality Control: "Identify any defects or anomalies in this product image."

• Brand Compliance: "Does this photo follow our brand guidelines? Check logo placement, color usage, and composition rules."

Multi-Modal Prompting Best Practices

DO: Use High-Quality Images

Clear, high-resolution images produce better results than blurry or low-res images

DO: Combine Text and Visual Instructions

Provide written context about what the image shows and what you need from it

DO: Specify Output Format for Extracted Data

Request JSON, tables, or structured formats when extracting information from documents

DO: Test Vision Capabilities

Different models have varying vision capabilities — test with your specific use case

DON'T: Rely on OCR for Critical Data

Vision models can misread text in images. Verify numbers and dates for important use cases

DON'T: Upload Sensitive Documents Without Redaction

Apply the same privacy considerations to images as you would to text data

Unlock Multi-Modal AI

We can help you build AI workflows that process images, documents, and text together

Schedule a Consultation Back to Home