
Multimodal and Vision Models: Working with Images and Visual Input 👁️

"When AI can understand images, the possibilities of frontend development expand dramatically."

1. Overview of Multimodal Models

1.1 What Is Multimodality?

A multimodal model can process several types of input at once:

  • Text: natural language
  • Image: photos, screenshots, design mockups
  • Audio: speech, music
  • Video: moving content

1.2 Frontend-Relevant Multimodal Scenarios

| Scenario | Input | Application |
| --- | --- | --- |
| Design-to-code | Figma/Sketch screenshot | Automatically generate HTML/CSS |
| UI bug reports | Screen capture | Identify and fix UI issues |
| Screenshot Q&A | Webpage screenshot | "Where is this button implemented?" |
| Accessibility audit | UI screenshot | Check contrast and font sizes |
| Component identification | Design mockup | Identify which component types are used |

2. Using the Vision APIs

2.1 OpenAI Vision

javascript
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "What UI problems does this login page have?"
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/login-screenshot.png",
            // or use a base64 data URL instead
            // url: "data:image/png;base64,..."
          }
        }
      ]
    }
  ]
});
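For local files, the commented-out base64 variant above needs a data URL. A minimal helper for building one from a raw buffer (a sketch, not part of any SDK; the allowed-types list reflects the formats vision APIs commonly accept):

```javascript
// Build a data URL usable in the image_url.url field from a raw image buffer.
function imageBufferToDataUrl(buffer, mime = 'image/png') {
  const allowed = ['image/png', 'image/jpeg', 'image/gif', 'image/webp'];
  if (!allowed.includes(mime)) {
    throw new Error(`Unsupported media type: ${mime}`);
  }
  return `data:${mime};base64,${buffer.toString('base64')}`;
}
```

Usage: `image_url: { url: imageBufferToDataUrl(fs.readFileSync('shot.png')) }`.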

2.2 Anthropic Vision

javascript
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs';

const anthropic = new Anthropic();

// Read the image from a file
const imageData = fs.readFileSync('screenshot.png');
const base64Image = imageData.toString('base64');

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: base64Image
          }
        },
        {
          type: "text",
          text: "Analyze this UI design and list the components and layout structure used."
        }
      ]
    }
  ]
});

2.3 Multiple Image Inputs

javascript
// Compare two design versions
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 2048,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Compare these two design versions and list the differences:" },
        { 
          type: "image",
          source: { type: "base64", media_type: "image/png", data: oldDesignBase64 }
        },
        { 
          type: "image",
          source: { type: "base64", media_type: "image/png", data: newDesignBase64 }
        }
      ]
    }
  ]
});
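Building these content arrays by hand gets repetitive once several screenshots are involved. A small helper (a sketch, not part of the SDK) that interleaves one prompt with any number of base64 images:

```javascript
// Build an Anthropic-style content array: one text block followed by
// one base64 image block per screenshot.
function buildImageContent(prompt, base64Images, mediaType = 'image/png') {
  return [
    { type: 'text', text: prompt },
    ...base64Images.map((data) => ({
      type: 'image',
      source: { type: 'base64', media_type: mediaType, data }
    }))
  ];
}
```

The multi-image request above then reduces to `content: buildImageContent("Compare these two design versions and list the differences:", [oldDesignBase64, newDesignBase64])`.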

3. Design-to-Code

3.1 Basic Implementation

javascript
async function designToCode(imageBase64, options = {}) {
  const { framework = 'react', styling = 'tailwind' } = options;
  
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/png", data: imageBase64 }
          },
          {
            type: "text",
            text: `Convert this design mockup into code.

Tech stack:
- Framework: ${framework}
- Styling: ${styling}

Requirements:
1. Faithfully reproduce the layout and spacing of the design
2. Use semantic HTML structure
3. Responsive design (mobile-first)
4. Extract colors and fonts as CSS variables
5. Add the necessary interaction states (hover, focus)

Output only the code, with no explanation.`
          }
        ]
      }
    ]
  });
  
  return response.content[0].text;
}
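Even when the prompt says to output only code, models frequently wrap the reply in a markdown fence, so `designToCode`'s return value benefits from a defensive cleanup step (a sketch):

```javascript
// A fenced reply starts and ends with three backticks.
const FENCE = '`'.repeat(3);

// Strip a surrounding markdown code fence (with optional language tag), if present.
function stripCodeFence(text) {
  const t = text.trim();
  if (!t.startsWith(FENCE) || !t.endsWith(FENCE)) return t;
  const firstNewline = t.indexOf('\n');
  if (firstNewline === -1) return t;
  return t.slice(firstNewline + 1, t.length - FENCE.length).trimEnd();
}
```

Apply it as `return stripCodeFence(response.content[0].text);` so downstream tooling always receives bare code.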

3.2 Layered Analysis

javascript
async function analyzeAndGenerateCode(imageBase64) {
  // Step 1: analyze the design structure
  const analysis = await analyzeDesign(imageBase64);
  
  // Step 2: extract design tokens
  const tokens = await extractDesignTokens(imageBase64);
  
  // Step 3: generate component code
  const code = await generateComponents(imageBase64, analysis, tokens);
  
  return { analysis, tokens, code };
}

async function analyzeDesign(imageBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageBase64 }},
        { type: "text", text: `Analyze the structure of this UI design:

1. Main section divisions
2. Component hierarchy
3. Layout pattern (Grid/Flex/Stack)
4. Repeated component patterns

Output JSON:
{
  "layout": "grid|flex|stack",
  "sections": [...],
  "components": [...],
  "patterns": [...]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}

async function extractDesignTokens(imageBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user", 
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageBase64 }},
        { type: "text", text: `Extract the design tokens from this design:

1. Colors (primary, secondary, background, text)
2. Font sizes
3. Spacing
4. Border radii
5. Shadows

Output them in CSS variable format.` }
      ]
    }]
  });
  
  return response.content;
}
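The `JSON.parse(response.content)` call in `analyzeDesign` assumes the model returns bare JSON, but replies often arrive wrapped in a fence or surrounded by prose. A more tolerant parser (a sketch; it simply extracts the outermost object span) fails far less often:

```javascript
// Parse JSON from a model reply, tolerating a markdown fence or
// surrounding prose by extracting the outermost {...} span.
function parseModelJson(text) {
  try {
    return JSON.parse(text);
  } catch {
    const start = text.indexOf('{');
    const end = text.lastIndexOf('}');
    if (start === -1 || end <= start) {
      throw new Error('No JSON object found in model reply');
    }
    return JSON.parse(text.slice(start, end + 1));
  }
}
```

Replacing the raw `JSON.parse` calls with `parseModelJson` makes every structured-output step in this chapter more robust.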

3.3 Iterative Refinement

javascript
async function iterativeDesignToCode(imageBase64, maxIterations = 3) {
  let code = await designToCode(imageBase64);
  
  for (let i = 0; i < maxIterations; i++) {
    // Render the code and capture a screenshot
    const renderedScreenshot = await renderAndCapture(code);
    
    // Compare the original design against the rendered result
    const comparison = await compareDesigns(imageBase64, renderedScreenshot);
    
    if (comparison.similarity > 0.95) {
      return code;  // close enough
    }
    
    // Improve the code based on the reported differences
    code = await improveCode(code, comparison.differences, imageBase64);
  }
  
  return code;
}

async function compareDesigns(originalBase64, renderedBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Compare the original design (first image) against the rendered result (second image):" },
        { type: "image", source: { type: "base64", media_type: "image/png", data: originalBase64 }},
        { type: "image", source: { type: "base64", media_type: "image/png", data: renderedBase64 }},
        { type: "text", text: `Output JSON:
{
  "similarity": 0-1,
  "differences": [
    { "area": "header", "issue": "color mismatch", "expected": "#FF0000", "actual": "#FF5555" }
  ]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}

4. UI Bug Detection

4.1 Visual Regression Testing

javascript
async function detectVisualDifferences(baselineBase64, currentBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Compare these two screenshots and find the visual differences:" },
        { type: "text", text: "Baseline version:" },
        { type: "image", source: { type: "base64", media_type: "image/png", data: baselineBase64 }},
        { type: "text", text: "Current version:" },
        { type: "image", source: { type: "base64", media_type: "image/png", data: currentBase64 }},
        { type: "text", text: `List every difference:
1. Position changes
2. Color changes
3. Text changes
4. Missing or added elements

Output JSON:
{
  "hasDifferences": boolean,
  "differences": [
    { "type": "position|color|text|element", "description": "...", "severity": "critical|major|minor" }
  ]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}
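In CI, this structured diff still has to become a pass/fail decision. One possible policy (the thresholds are assumptions to tune per project): fail on any critical difference, or on too many major ones:

```javascript
// Decide whether a visual diff report should fail the build.
// Minor differences never block; majors block only past a threshold.
function shouldFailBuild(result, { maxMajor = 2 } = {}) {
  if (!result.hasDifferences) return false;
  const count = (sev) =>
    result.differences.filter((d) => d.severity === sev).length;
  return count('critical') > 0 || count('major') > maxMajor;
}
```

Wire it up as `if (shouldFailBuild(await detectVisualDifferences(baseline, current))) process.exit(1);`.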

4.2 Automated UI Review

javascript
async function auditUI(screenshotBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
        { type: "text", text: `Review this UI page and check for the following issues:

1. **Accessibility**
   - Is the text contrast sufficient?
   - Do buttons have a large enough click area?
   - Do form fields have labels?

2. **Consistency**
   - Is the spacing uniform?
   - Do the colors come from a single palette?
   - Is font usage consistent?

3. **Common problems**
   - Is any text truncated?
   - Are any elements misaligned?
   - Are there visual-hierarchy problems?

Output JSON:
{
  "score": 0-100,
  "issues": [
    { "category": "accessibility|consistency|visual", "description": "...", "severity": "high|medium|low", "suggestion": "..." }
  ]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}

5. Screenshot Q&A

5.1 Locating Code

javascript
async function findCodeByScreenshot(screenshotBase64, codebaseContext) {
  const response = await llm.chat({
    messages: [
      {
        role: "system",
        content: `You are a codebase expert. The user will provide a UI screenshot and the codebase structure; help them find the relevant code.

Codebase structure:
${codebaseContext}`
      },
      {
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
          { type: "text", text: "Where is the code for this page? Please list the relevant files." }
        ]
      }
    ]
  });
  
  return response.content;
}

5.2 Interactive Q&A

javascript
class ScreenshotQA {
  constructor() {
    this.currentScreenshot = null;
    this.context = [];
  }
  
  async setScreenshot(imageBase64) {
    this.currentScreenshot = imageBase64;
    
    // Generate a page description to use as context
    const description = await this.describeScreenshot(imageBase64);
    this.context.push({
      role: 'assistant',
      content: `I can see this page: ${description}`
    });
  }
  
  async ask(question) {
    if (!this.currentScreenshot) {
      throw new Error('Please upload a screenshot first');
    }
    
    this.context.push({
      role: 'user',
      content: question
    });
    
    const response = await llm.chat({
      messages: [
        {
          role: 'system',
          content: 'You are a UI/UX expert. The user has shared a screenshot; answer questions about this interface.'
        },
        {
          role: 'user',
          content: [
            { type: 'image', source: { type: 'base64', media_type: 'image/png', data: this.currentScreenshot }},
            { type: 'text', text: 'This is the interface under discussion.' }
          ]
        },
        ...this.context
      ]
    });
    
    this.context.push({
      role: 'assistant',
      content: response.content
    });
    
    return response.content;
  }
  
  async describeScreenshot(imageBase64) {
    const response = await llm.chat({
      messages: [{
        role: 'user',
        content: [
          { type: 'image', source: { type: 'base64', media_type: 'image/png', data: imageBase64 }},
          { type: 'text', text: 'Briefly describe the content and structure of this interface (50 words or fewer).' }
        ]
      }]
    });
    
    return response.content;
  }
}
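As written, `ScreenshotQA` appends to `this.context` without bound, so long sessions will eventually exceed the context window. A simple trimming policy (a sketch; the turn limit is an assumption) keeps the initial page description plus the most recent turns:

```javascript
// Keep the first message (the page description) plus the last `maxTurns`
// user/assistant pairs; drop everything in between.
function trimContext(context, maxTurns = 5) {
  const maxMessages = maxTurns * 2;
  if (context.length <= maxMessages + 1) return context;
  return [context[0], ...context.slice(-maxMessages)];
}
```

Calling `this.context = trimContext(this.context)` at the top of `ask()` bounds memory use at the cost of forgetting older turns.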

6. Computer Use

6.1 Anthropic Computer Use

Claude can operate a computer directly:

javascript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Computer use is a beta feature; depending on your SDK version you may also
// need to pass the matching anthropic-beta header (see Anthropic's docs).
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  tools: [
    {
      type: "computer_20241022",
      name: "computer",
      display_width_px: 1920,
      display_height_px: 1080,
      display_number: 1
    }
  ],
  messages: [
    {
      role: "user",
      content: "Open the browser, go to github.com, then take a screenshot for me"
    }
  ]
});

// Handle the computer-use response
for (const block of response.content) {
  if (block.type === 'tool_use' && block.name === 'computer') {
    const { action, coordinate, text } = block.input;
    
    switch (action) {
      case 'screenshot':
        // Capture the screen
        const screenshot = await captureScreen();
        // Send the screenshot back to the model to continue
        break;
        
      case 'mouse_move':
        await moveMouse(coordinate[0], coordinate[1]);
        break;
        
      case 'left_click':
        await click(coordinate[0], coordinate[1]);
        break;
        
      case 'type':
        await typeText(text);
        break;
        
      case 'key':
        await pressKey(text);  // e.g. 'Return', 'Escape'
        break;
    }
  }
}
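The model issues coordinates in the `display_width_px`/`display_height_px` space declared to the tool; if the physical screen resolution differs, they must be rescaled before clicking. A sketch (the resolutions in the usage note are example values):

```javascript
// Scale a model-issued [x, y] coordinate from the declared virtual display
// size to the physical screen resolution.
function scaleCoordinate([x, y], declared, physical) {
  return [
    Math.round((x / declared.width) * physical.width),
    Math.round((y / declared.height) * physical.height)
  ];
}
```

For example, `scaleCoordinate(coordinate, { width: 1920, height: 1080 }, { width: 2560, height: 1440 })` before calling `click`.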

6.2 Automated Test Scenarios

javascript
async function runVisualTest(testCase) {
  const { url, steps, assertions } = testCase;
  
  const messages = [
    {
      role: "user",
      content: `Run the following test:

URL: ${url}

Steps:
${steps.map((s, i) => `${i+1}. ${s}`).join('\n')}

Assertions:
${assertions.map((a, i) => `${i+1}. ${a}`).join('\n')}

Take a screenshot after each step, and report the test results at the end.`
    }
  ];
  
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    tools: [
      // Same display parameters as in 6.1
      { type: "computer_20241022", name: "computer", display_width_px: 1920, display_height_px: 1080, display_number: 1 }
    ],
    messages
  });
  
  // Handle the multi-turn tool calls...
}

// Usage example
await runVisualTest({
  url: "http://localhost:3000/login",
  steps: [
    "Enter the username test@example.com",
    "Enter the password password123",
    "Click the login button"
  ],
  assertions: [
    "Should navigate to the dashboard page",
    "Should display the user avatar"
  ]
});

7. Best Practices

7.1 Image Optimization

javascript
async function prepareImage(imagePath) {
  const sharp = require('sharp');
  
  // Resize to reduce token usage
  const optimized = await sharp(imagePath)
    .resize(1568, 1568, {  // Claude's recommended maximum dimension
      fit: 'inside',
      withoutEnlargement: true
    })
    .png({ quality: 85 })
    .toBuffer();
  
  return optimized.toString('base64');
}

// Crop a region of interest
async function cropRegion(imagePath, region) {
  const { x, y, width, height } = region;
  
  const cropped = await sharp(imagePath)
    .extract({ left: x, top: y, width, height })
    .toBuffer();
  
  return cropped.toString('base64');
}

7.2 Cost Control

javascript
// Approximate image token count (Claude)
// Per Anthropic's docs, an image costs roughly (width × height) / 750 tokens,
// and the API downscales anything beyond 1568 px, so cost tops out near ~1,600 tokens.
function estimateImageTokens(width, height) {
  return Math.min(Math.ceil((width * height) / 750), 1600);
}

// Decide whether high resolution is needed
async function shouldUseHighRes(imagePath, task) {
  const metadata = await sharp(imagePath).metadata();
  
  const tasks = {
    'text_recognition': true,      // needs high resolution
    'layout_analysis': false,      // low resolution is enough
    'color_extraction': false,
    'ui_audit': true
  };
  
  return tasks[task] && (metadata.width > 768 || metadata.height > 768);
}
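Token counts only become budgets once multiplied by a price. A sketch combining the per-image estimate with a caller-supplied rate (no price is hardcoded here; always check your provider's current pricing):

```javascript
// Rough per-image token estimate: tokens ≈ (width × height) / 750.
function approxImageTokens(width, height) {
  return Math.ceil((width * height) / 750);
}

// Estimate the input cost of sending `count` images of the same size.
// `usdPerMillionTokens` is supplied by the caller, not assumed.
function estimateImageCostUsd(width, height, count, usdPerMillionTokens) {
  const tokens = approxImageTokens(width, height) * count;
  return (tokens / 1_000_000) * usdPerMillionTokens;
}
```

For a batch of screenshots, this makes it easy to compare the cost of full-resolution images against downscaled ones before sending anything.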

7.3 Error Handling

javascript
async function safeVisionCall(imageBase64, prompt) {
  try {
    // Validate the image
    const buffer = Buffer.from(imageBase64, 'base64');
    const metadata = await sharp(buffer).metadata();
    
    if (metadata.width > 8000 || metadata.height > 8000) {
      throw new Error('Image is too large; please shrink it and retry');
    }
    
    // Check the format
    const supportedFormats = ['jpeg', 'png', 'gif', 'webp'];
    if (!supportedFormats.includes(metadata.format)) {
      throw new Error(`Unsupported image format: ${metadata.format}`);
    }
    
    return await llm.chat({
      messages: [{
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: `image/${metadata.format}`, data: imageBase64 }},
          { type: "text", text: prompt }
        ]
      }]
    });
    
  } catch (error) {
    if (error.message.includes('Could not process image')) {
      throw new Error('Could not process the image; check whether it is corrupted');
    }
    throw error;
  }
}

8. In Practice: A Design-System Documentation Generator

javascript
class DesignSystemDocGenerator {
  async generateFromScreenshots(screenshots) {
    const components = [];
    
    for (const screenshot of screenshots) {
      // Identify the components
      const identified = await this.identifyComponents(screenshot);
      
      for (const component of identified) {
        // Extract detailed information
        const details = await this.extractComponentDetails(screenshot, component);
        
        // Generate documentation
        const doc = await this.generateDocumentation(details);
        
        components.push({ ...component, details, doc });
      }
    }
    
    // Compile the complete design-system documentation
    return this.compileDocumentation(components);
  }
  
  async identifyComponents(screenshotBase64) {
    const response = await llm.chat({
      messages: [{
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
          { type: "text", text: `Identify all of the UI components on this page:

Categories:
- Button
- Input
- Card
- Navigation
- List
- Modal
- Other

Output JSON:
{
  "components": [
    { "type": "button", "variants": ["primary", "secondary"], "count": 5 }
  ]
}` }
        ]
      }]
    });
    
    return JSON.parse(response.content).components;
  }
  
  async extractComponentDetails(screenshotBase64, component) {
    const response = await llm.chat({
      messages: [{
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
          { type: "text", text: `Analyze the ${component.type} component in detail:

Extract:
1. Dimensions (padding, margin, height)
2. Colors (background, text, border)
3. Typography (size, weight)
4. Border radius
5. Shadows
6. State variants

Output CSS variables and a design spec.` }
        ]
      }]
    });
    
    return response.content;
  }
}
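`generateDocumentation` and `compileDocumentation` are left undefined in the class above. The compile step can be a plain markdown assembler (a sketch; the output shape is an assumption):

```javascript
// Assemble per-component docs into one markdown document.
// Each component is expected to carry { type, variants, doc } as produced above.
function compileDocumentation(components) {
  const sections = components.map((c) => {
    const variants = (c.variants || []).join(', ') || 'none';
    return `## ${c.type}\n\nVariants: ${variants}\n\n${c.doc || ''}`.trimEnd();
  });
  return ['# Design System', ...sections].join('\n\n');
}
```

Because it is a pure function of the collected data, the expensive vision calls run once and the document can be regenerated for free.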

9. Key Takeaways

  1. Vision APIs are powerful but token-hungry: optimize image sizes
  2. Step-by-step processing is more accurate: analyze first, then generate
  3. Iterate on design-to-code: a render-compare-fix loop
  4. Computer Use fits automated testing: it simulates real user actions
  5. Combining code context works better: image + code = better understanding

Further Reading

Frontend Interview Knowledge Base