Multi-Modal Prompting
Combine text, images, and documents for richer AI interactions
Beyond Text: The Power of Vision
Modern LLMs can process images, PDFs, screenshots, and other visual inputs alongside text. Multi-modal prompting unlocks new use cases: analyzing charts, extracting data from documents, answering questions about diagrams, and more.
Combining Text + Image Inputs
Provide Clear Instructions for Visual Analysis
Just like text prompts, image analysis benefits from specific, clear instructions.
❌ Vague:
[Uploads chart image]
"What do you see?"
✓ Specific:
[Uploads chart image]
"Analyze this sales chart. Identify: 1) Overall trend, 2) Largest month-over-month change, 3) Any anomalies that need investigation."
Reference Specific Parts of the Image
Guide the LLM's attention to specific areas when analyzing complex images.
[Uploads UI mockup screenshot]
Review this dashboard mockup for UX issues:
- Focus on the top navigation bar — is the hierarchy clear?
- In the left sidebar, evaluate icon clarity and labeling
- For the main data table, assess readability and information density
- Check color contrast for accessibility (WCAG AA standards)
Combine Multiple Images
Some LLMs can process multiple images in a single prompt for comparison or context.
Use Cases:
Document Analysis (PDFs, Screenshots)
Extract Structured Data from Documents
LLMs can read PDFs, invoices, receipts, contracts, and extract specific information.
[Uploads invoice PDF]
Extract the following fields from this invoice and return as JSON:
{
"invoice_number": "string",
"invoice_date": "YYYY-MM-DD",
"vendor_name": "string",
"vendor_address": "string",
"total_amount": "number",
"currency": "string",
"line_items": [
{
"description": "string",
"quantity": "number",
"unit_price": "number",
"total": "number"
}
],
"payment_terms": "string",
"due_date": "YYYY-MM-DD or null"
}
If any field is not found, use null.
Analyze Screenshots & UI
Upload screenshots for bug reports, UX feedback, or automated testing insights.
Example Prompts:
Bug Report Analysis
"This is a screenshot of an error. Describe the issue, identify likely causes, and suggest debugging steps."
Accessibility Audit
"Review this page screenshot for WCAG 2.1 AA compliance. Check color contrast, text size, touch target sizes, and semantic structure."
Competitive Analysis
"Analyze this competitor's pricing page. What persuasion tactics are they using? How does their information hierarchy work?"
Process Multi-Page Documents
For long PDFs, specify which sections or pages to focus on, or break into chunks.
Strategies:
- • Page Range: "Focus on pages 3-7 of this contract, which contain the pricing terms."
- • Section-Based: "Find and summarize the 'Termination Clause' section."
- • Chunked Processing: For very long docs, process sequentially and maintain context across chunks
Visual Question Answering
Ask Specific Questions About Visual Content
Frame questions precisely to get accurate answers about charts, diagrams, or photos.
[Uploads architecture diagram]
Questions:
- What database technology is being used?
- How many microservices are shown?
- What is the data flow from user request to database?
- Are there any single points of failure in this architecture?
- What caching strategy is depicted?
Quantitative Analysis of Charts
LLMs can read values from charts, though OCR accuracy varies. Verify critical numbers.
Example Prompts:
⚠️ Note: Always verify extracted numbers for financial or critical decisions
Object Detection & Counting
Ask the LLM to identify, locate, or count specific objects in images.
Use Cases:
Multi-Modal Prompting Best Practices
DO: Use High-Quality Images
Clear, high-resolution images produce better results than blurry or low-res images
DO: Combine Text and Visual Instructions
Provide written context about what the image shows and what you need from it
DO: Specify Output Format for Extracted Data
Request JSON, tables, or structured formats when extracting information from documents
DO: Test Vision Capabilities
Different models have varying vision capabilities — test with your specific use case
DON'T: Rely on OCR for Critical Data
Vision models can misread text in images. Verify numbers and dates for important use cases
DON'T: Upload Sensitive Documents Without Redaction
Apply the same privacy considerations to images as you would to text data
Unlock Multi-Modal AI
We can help you build AI workflows that process images, documents, and text together