
Multimodal and Vision Models: Working with Images and Visual Input 👁️

"When AI can understand images, the possibilities of frontend development expand dramatically."

1. Overview of Multimodal Models

1.1 What Is Multimodality?

A multimodal model can process several types of input at once:

  • Text: natural language
  • Image: photos, screenshots, design mockups
  • Audio: speech, music
  • Video: moving content

1.2 Frontend-Relevant Multimodal Scenarios

| Scenario | Input | Application |
| --- | --- | --- |
| Design-to-code | Figma/Sketch screenshot | Automatically generate HTML/CSS |
| UI bug reports | Screen capture | Identify and fix UI issues |
| Screenshot Q&A | Webpage screenshot | "Where is this button implemented?" |
| Accessibility audit | UI screenshot | Check contrast and font sizes |
| Component identification | Design mockup | Identify which component types are used |

2. Using the Vision APIs

2.1 OpenAI Vision

javascript
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "What UI problems does this login page have?"
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/login-screenshot.png",
            // or use a base64 data URL instead
            // url: "data:image/png;base64,..."
          }
        }
      ]
    }
  ]
});
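For local files, the commented-out base64 variant above needs a data URL. A minimal helper for building one from a raw buffer (a sketch, not part of any SDK; the allowed-types list reflects the formats vision APIs commonly accept):

```javascript
// Build a data URL usable in the image_url.url field from a raw image buffer.
function imageBufferToDataUrl(buffer, mime = 'image/png') {
  const allowed = ['image/png', 'image/jpeg', 'image/gif', 'image/webp'];
  if (!allowed.includes(mime)) {
    throw new Error(`Unsupported media type: ${mime}`);
  }
  return `data:${mime};base64,${buffer.toString('base64')}`;
}
```

Usage: `image_url: { url: imageBufferToDataUrl(fs.readFileSync('shot.png')) }`.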

2.2 Anthropic Vision

javascript
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs';

const anthropic = new Anthropic();

// Read the image from a file
const imageData = fs.readFileSync('screenshot.png');
const base64Image = imageData.toString('base64');

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: base64Image
          }
        },
        {
          type: "text",
          text: "Analyze this UI design and list the components and layout structure used."
        }
      ]
    }
  ]
});

2.3 Multiple Image Inputs

javascript
// Compare two design versions
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 2048,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Compare these two design versions and list the differences:" },
        { 
          type: "image",
          source: { type: "base64", media_type: "image/png", data: oldDesignBase64 }
        },
        { 
          type: "image",
          source: { type: "base64", media_type: "image/png", data: newDesignBase64 }
        }
      ]
    }
  ]
});
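Building these content arrays by hand gets repetitive once several screenshots are involved. A small helper (a sketch, not part of the SDK) that interleaves one prompt with any number of base64 images:

```javascript
// Build an Anthropic-style content array: one text block followed by
// one base64 image block per screenshot.
function buildImageContent(prompt, base64Images, mediaType = 'image/png') {
  return [
    { type: 'text', text: prompt },
    ...base64Images.map((data) => ({
      type: 'image',
      source: { type: 'base64', media_type: mediaType, data }
    }))
  ];
}
```

The multi-image request above then reduces to `content: buildImageContent("Compare these two design versions and list the differences:", [oldDesignBase64, newDesignBase64])`.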

3. Design-to-Code

3.1 Basic Implementation

javascript
async function designToCode(imageBase64, options = {}) {
  const { framework = 'react', styling = 'tailwind' } = options;
  
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/png", data: imageBase64 }
          },
          {
            type: "text",
            text: `Convert this design mockup into code.

Tech stack:
- Framework: ${framework}
- Styling: ${styling}

Requirements:
1. Faithfully reproduce the layout and spacing of the design
2. Use semantic HTML structure
3. Responsive design (mobile-first)
4. Extract colors and fonts as CSS variables
5. Add the necessary interaction states (hover, focus)

Output only the code, with no explanation.`
          }
        ]
      }
    ]
  });
  
  return response.content[0].text;
}
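Even when the prompt says to output only code, models frequently wrap the reply in a markdown fence, so `designToCode`'s return value benefits from a defensive cleanup step (a sketch):

```javascript
// A fenced reply starts and ends with three backticks.
const FENCE = '`'.repeat(3);

// Strip a surrounding markdown code fence (with optional language tag), if present.
function stripCodeFence(text) {
  const t = text.trim();
  if (!t.startsWith(FENCE) || !t.endsWith(FENCE)) return t;
  const firstNewline = t.indexOf('\n');
  if (firstNewline === -1) return t;
  return t.slice(firstNewline + 1, t.length - FENCE.length).trimEnd();
}
```

Apply it as `return stripCodeFence(response.content[0].text);` so downstream tooling always receives bare code.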

3.2 Layered Analysis

javascript
async function analyzeAndGenerateCode(imageBase64) {
  // Step 1: analyze the design structure
  const analysis = await analyzeDesign(imageBase64);
  
  // Step 2: extract design tokens
  const tokens = await extractDesignTokens(imageBase64);
  
  // Step 3: generate component code
  const code = await generateComponents(imageBase64, analysis, tokens);
  
  return { analysis, tokens, code };
}

async function analyzeDesign(imageBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageBase64 }},
        { type: "text", text: `Analyze the structure of this UI design:

1. Main section divisions
2. Component hierarchy
3. Layout pattern (Grid/Flex/Stack)
4. Repeated component patterns

Output JSON:
{
  "layout": "grid|flex|stack",
  "sections": [...],
  "components": [...],
  "patterns": [...]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}

async function extractDesignTokens(imageBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user", 
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageBase64 }},
        { type: "text", text: `Extract the design tokens from this design:

1. Colors (primary, secondary, background, text)
2. Font sizes
3. Spacing
4. Border radii
5. Shadows

Output them in CSS variable format.` }
      ]
    }]
  });
  
  return response.content;
}
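The `JSON.parse(response.content)` call in `analyzeDesign` assumes the model returns bare JSON, but replies often arrive wrapped in a fence or surrounded by prose. A more tolerant parser (a sketch; it simply extracts the outermost object span) fails far less often:

```javascript
// Parse JSON from a model reply, tolerating a markdown fence or
// surrounding prose by extracting the outermost {...} span.
function parseModelJson(text) {
  try {
    return JSON.parse(text);
  } catch {
    const start = text.indexOf('{');
    const end = text.lastIndexOf('}');
    if (start === -1 || end <= start) {
      throw new Error('No JSON object found in model reply');
    }
    return JSON.parse(text.slice(start, end + 1));
  }
}
```

Replacing the raw `JSON.parse` calls with `parseModelJson` makes every structured-output step in this chapter more robust.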

3.3 Iterative Refinement

javascript
async function iterativeDesignToCode(imageBase64, maxIterations = 3) {
  let code = await designToCode(imageBase64);
  
  for (let i = 0; i < maxIterations; i++) {
    // Render the code and capture a screenshot
    const renderedScreenshot = await renderAndCapture(code);
    
    // Compare the original design against the rendered result
    const comparison = await compareDesigns(imageBase64, renderedScreenshot);
    
    if (comparison.similarity > 0.95) {
      return code;  // close enough
    }
    
    // Improve the code based on the reported differences
    code = await improveCode(code, comparison.differences, imageBase64);
  }
  
  return code;
}

async function compareDesigns(originalBase64, renderedBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Compare the original design (first image) against the rendered result (second image):" },
        { type: "image", source: { type: "base64", media_type: "image/png", data: originalBase64 }},
        { type: "image", source: { type: "base64", media_type: "image/png", data: renderedBase64 }},
        { type: "text", text: `Output JSON:
{
  "similarity": 0-1,
  "differences": [
    { "area": "header", "issue": "color mismatch", "expected": "#FF0000", "actual": "#FF5555" }
  ]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}

4. UI Bug Detection

4.1 Visual Regression Testing

javascript
async function detectVisualDifferences(baselineBase64, currentBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Compare these two screenshots and find the visual differences:" },
        { type: "text", text: "Baseline version:" },
        { type: "image", source: { type: "base64", media_type: "image/png", data: baselineBase64 }},
        { type: "text", text: "Current version:" },
        { type: "image", source: { type: "base64", media_type: "image/png", data: currentBase64 }},
        { type: "text", text: `List every difference:
1. Position changes
2. Color changes
3. Text changes
4. Missing or added elements

Output JSON:
{
  "hasDifferences": boolean,
  "differences": [
    { "type": "position|color|text|element", "description": "...", "severity": "critical|major|minor" }
  ]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}
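In CI, this structured diff still has to become a pass/fail decision. One possible policy (the thresholds are assumptions to tune per project): fail on any critical difference, or on too many major ones:

```javascript
// Decide whether a visual diff report should fail the build.
// Minor differences never block; majors block only past a threshold.
function shouldFailBuild(result, { maxMajor = 2 } = {}) {
  if (!result.hasDifferences) return false;
  const count = (sev) =>
    result.differences.filter((d) => d.severity === sev).length;
  return count('critical') > 0 || count('major') > maxMajor;
}
```

Wire it up as `if (shouldFailBuild(await detectVisualDifferences(baseline, current))) process.exit(1);`.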

4.2 Automated UI Review

javascript
async function auditUI(screenshotBase64) {
  const response = await llm.chat({
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
        { type: "text", text: `Review this UI page and check for the following issues:

1. **Accessibility**
   - Is the text contrast sufficient?
   - Do buttons have a large enough click area?
   - Do form fields have labels?

2. **Consistency**
   - Is the spacing uniform?
   - Do the colors come from a single palette?
   - Is font usage consistent?

3. **Common problems**
   - Is any text truncated?
   - Are any elements misaligned?
   - Are there visual-hierarchy problems?

Output JSON:
{
  "score": 0-100,
  "issues": [
    { "category": "accessibility|consistency|visual", "description": "...", "severity": "high|medium|low", "suggestion": "..." }
  ]
}` }
      ]
    }]
  });
  
  return JSON.parse(response.content);
}

5. Screenshot Q&A

5.1 Locating Code

javascript
async function findCodeByScreenshot(screenshotBase64, codebaseContext) {
  const response = await llm.chat({
    messages: [
      {
        role: "system",
        content: `You are a codebase expert. The user will provide a UI screenshot and the codebase structure; help them find the relevant code.

Codebase structure:
${codebaseContext}`
      },
      {
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
          { type: "text", text: "Where is the code for this page? Please list the relevant files." }
        ]
      }
    ]
  });
  
  return response.content;
}

5.2 Interactive Q&A

javascript
class ScreenshotQA {
  constructor() {
    this.currentScreenshot = null;
    this.context = [];
  }
  
  async setScreenshot(imageBase64) {
    this.currentScreenshot = imageBase64;
    
    // Generate a page description to use as context
    const description = await this.describeScreenshot(imageBase64);
    this.context.push({
      role: 'assistant',
      content: `I can see this page: ${description}`
    });
  }
  
  async ask(question) {
    if (!this.currentScreenshot) {
      throw new Error('Please upload a screenshot first');
    }
    
    this.context.push({
      role: 'user',
      content: question
    });
    
    const response = await llm.chat({
      messages: [
        {
          role: 'system',
          content: 'You are a UI/UX expert. The user has shared a screenshot; answer questions about this interface.'
        },
        {
          role: 'user',
          content: [
            { type: 'image', source: { type: 'base64', media_type: 'image/png', data: this.currentScreenshot }},
            { type: 'text', text: 'This is the interface under discussion.' }
          ]
        },
        ...this.context
      ]
    });
    
    this.context.push({
      role: 'assistant',
      content: response.content
    });
    
    return response.content;
  }
  
  async describeScreenshot(imageBase64) {
    const response = await llm.chat({
      messages: [{
        role: 'user',
        content: [
          { type: 'image', source: { type: 'base64', media_type: 'image/png', data: imageBase64 }},
          { type: 'text', text: 'Briefly describe the content and structure of this interface (50 words or fewer).' }
        ]
      }]
    });
    
    return response.content;
  }
}
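As written, `ScreenshotQA` appends to `this.context` without bound, so long sessions will eventually exceed the context window. A simple trimming policy (a sketch; the turn limit is an assumption) keeps the initial page description plus the most recent turns:

```javascript
// Keep the first message (the page description) plus the last `maxTurns`
// user/assistant pairs; drop everything in between.
function trimContext(context, maxTurns = 5) {
  const maxMessages = maxTurns * 2;
  if (context.length <= maxMessages + 1) return context;
  return [context[0], ...context.slice(-maxMessages)];
}
```

Calling `this.context = trimContext(this.context)` at the top of `ask()` bounds memory use at the cost of forgetting older turns.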

6. Computer Use

6.1 Anthropic Computer Use

Claude can operate a computer directly:

javascript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Computer use is a beta feature; depending on your SDK version you may also
// need to pass the matching anthropic-beta header (see Anthropic's docs).
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  tools: [
    {
      type: "computer_20241022",
      name: "computer",
      display_width_px: 1920,
      display_height_px: 1080,
      display_number: 1
    }
  ],
  messages: [
    {
      role: "user",
      content: "Open the browser, go to github.com, then take a screenshot for me"
    }
  ]
});

// Handle the computer-use response
for (const block of response.content) {
  if (block.type === 'tool_use' && block.name === 'computer') {
    const { action, coordinate, text } = block.input;
    
    switch (action) {
      case 'screenshot':
        // Capture the screen
        const screenshot = await captureScreen();
        // Send the screenshot back to the model to continue
        break;
        
      case 'mouse_move':
        await moveMouse(coordinate[0], coordinate[1]);
        break;
        
      case 'left_click':
        await click(coordinate[0], coordinate[1]);
        break;
        
      case 'type':
        await typeText(text);
        break;
        
      case 'key':
        await pressKey(text);  // e.g. 'Return', 'Escape'
        break;
    }
  }
}
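The model issues coordinates in the `display_width_px`/`display_height_px` space declared to the tool; if the physical screen resolution differs, they must be rescaled before clicking. A sketch (the resolutions in the usage note are example values):

```javascript
// Scale a model-issued [x, y] coordinate from the declared virtual display
// size to the physical screen resolution.
function scaleCoordinate([x, y], declared, physical) {
  return [
    Math.round((x / declared.width) * physical.width),
    Math.round((y / declared.height) * physical.height)
  ];
}
```

For example, `scaleCoordinate(coordinate, { width: 1920, height: 1080 }, { width: 2560, height: 1440 })` before calling `click`.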

6.2 Automated Test Scenarios

javascript
async function runVisualTest(testCase) {
  const { url, steps, assertions } = testCase;
  
  const messages = [
    {
      role: "user",
      content: `Run the following test:

URL: ${url}

Steps:
${steps.map((s, i) => `${i+1}. ${s}`).join('\n')}

Assertions:
${assertions.map((a, i) => `${i+1}. ${a}`).join('\n')}

Take a screenshot after each step, and report the test results at the end.`
    }
  ];
  
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    tools: [
      // Same display parameters as in 6.1
      { type: "computer_20241022", name: "computer", display_width_px: 1920, display_height_px: 1080, display_number: 1 }
    ],
    messages
  });
  
  // Handle the multi-turn tool calls...
}

// Usage example
await runVisualTest({
  url: "http://localhost:3000/login",
  steps: [
    "Enter the username test@example.com",
    "Enter the password password123",
    "Click the login button"
  ],
  assertions: [
    "Should navigate to the dashboard page",
    "Should display the user avatar"
  ]
});

7. Best Practices

7.1 Image Optimization

javascript
async function prepareImage(imagePath) {
  const sharp = require('sharp');
  
  // Resize to reduce token usage
  const optimized = await sharp(imagePath)
    .resize(1568, 1568, {  // Claude's recommended maximum dimension
      fit: 'inside',
      withoutEnlargement: true
    })
    .png({ quality: 85 })
    .toBuffer();
  
  return optimized.toString('base64');
}

// Crop a region of interest
async function cropRegion(imagePath, region) {
  const { x, y, width, height } = region;
  
  const cropped = await sharp(imagePath)
    .extract({ left: x, top: y, width, height })
    .toBuffer();
  
  return cropped.toString('base64');
}

7.2 Cost Control

javascript
// Approximate image token count (Claude)
// Per Anthropic's docs, an image costs roughly (width × height) / 750 tokens,
// and the API downscales anything beyond 1568 px, so cost tops out near ~1,600 tokens.
function estimateImageTokens(width, height) {
  return Math.min(Math.ceil((width * height) / 750), 1600);
}

// Decide whether high resolution is needed
async function shouldUseHighRes(imagePath, task) {
  const metadata = await sharp(imagePath).metadata();
  
  const tasks = {
    'text_recognition': true,      // needs high resolution
    'layout_analysis': false,      // low resolution is enough
    'color_extraction': false,
    'ui_audit': true
  };
  
  return tasks[task] && (metadata.width > 768 || metadata.height > 768);
}
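Token counts only become budgets once multiplied by a price. A sketch combining the per-image estimate with a caller-supplied rate (no price is hardcoded here; always check your provider's current pricing):

```javascript
// Rough per-image token estimate: tokens ≈ (width × height) / 750.
function approxImageTokens(width, height) {
  return Math.ceil((width * height) / 750);
}

// Estimate the input cost of sending `count` images of the same size.
// `usdPerMillionTokens` is supplied by the caller, not assumed.
function estimateImageCostUsd(width, height, count, usdPerMillionTokens) {
  const tokens = approxImageTokens(width, height) * count;
  return (tokens / 1_000_000) * usdPerMillionTokens;
}
```

For a batch of screenshots, this makes it easy to compare the cost of full-resolution images against downscaled ones before sending anything.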

7.3 Error Handling

javascript
async function safeVisionCall(imageBase64, prompt) {
  try {
    // Validate the image
    const buffer = Buffer.from(imageBase64, 'base64');
    const metadata = await sharp(buffer).metadata();
    
    if (metadata.width > 8000 || metadata.height > 8000) {
      throw new Error('Image is too large; please shrink it and retry');
    }
    
    // Check the format
    const supportedFormats = ['jpeg', 'png', 'gif', 'webp'];
    if (!supportedFormats.includes(metadata.format)) {
      throw new Error(`Unsupported image format: ${metadata.format}`);
    }
    
    return await llm.chat({
      messages: [{
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: `image/${metadata.format}`, data: imageBase64 }},
          { type: "text", text: prompt }
        ]
      }]
    });
    
  } catch (error) {
    if (error.message.includes('Could not process image')) {
      throw new Error('Could not process the image; check whether it is corrupted');
    }
    throw error;
  }
}

8. In Practice: A Design-System Documentation Generator

javascript
class DesignSystemDocGenerator {
  async generateFromScreenshots(screenshots) {
    const components = [];
    
    for (const screenshot of screenshots) {
      // Identify the components
      const identified = await this.identifyComponents(screenshot);
      
      for (const component of identified) {
        // Extract detailed information
        const details = await this.extractComponentDetails(screenshot, component);
        
        // Generate documentation
        const doc = await this.generateDocumentation(details);
        
        components.push({ ...component, details, doc });
      }
    }
    
    // Compile the complete design-system documentation
    return this.compileDocumentation(components);
  }
  
  async identifyComponents(screenshotBase64) {
    const response = await llm.chat({
      messages: [{
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
          { type: "text", text: `Identify all of the UI components on this page:

Categories:
- Button
- Input
- Card
- Navigation
- List
- Modal
- Other

Output JSON:
{
  "components": [
    { "type": "button", "variants": ["primary", "secondary"], "count": 5 }
  ]
}` }
        ]
      }]
    });
    
    return JSON.parse(response.content).components;
  }
  
  async extractComponentDetails(screenshotBase64, component) {
    const response = await llm.chat({
      messages: [{
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotBase64 }},
          { type: "text", text: `Analyze the ${component.type} component in detail:

Extract:
1. Dimensions (padding, margin, height)
2. Colors (background, text, border)
3. Typography (size, weight)
4. Border radius
5. Shadows
6. State variants

Output CSS variables and a design spec.` }
        ]
      }]
    });
    
    return response.content;
  }
}
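`generateDocumentation` and `compileDocumentation` are left undefined in the class above. The compile step can be a plain markdown assembler (a sketch; the output shape is an assumption):

```javascript
// Assemble per-component docs into one markdown document.
// Each component is expected to carry { type, variants, doc } as produced above.
function compileDocumentation(components) {
  const sections = components.map((c) => {
    const variants = (c.variants || []).join(', ') || 'none';
    return `## ${c.type}\n\nVariants: ${variants}\n\n${c.doc || ''}`.trimEnd();
  });
  return ['# Design System', ...sections].join('\n\n');
}
```

Because it is a pure function of the collected data, the expensive vision calls run once and the document can be regenerated for free.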

9. Key Takeaways

  1. Vision APIs are powerful but token-hungry: optimize image sizes
  2. Step-by-step processing is more accurate: analyze first, then generate
  3. Iterate on design-to-code: a render-compare-fix loop
  4. Computer Use fits automated testing: it simulates real user actions
  5. Combining code context works better: image + code = better understanding

Further Reading

Frontend Interview Knowledge Base