
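Multimodal and Vision Models: Working with Images and Visual Input 👁️

"When AI can understand images, the possibilities for frontend development expand greatly."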

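1. Multimodal Models Overview

1.1 What Is Multimodality

A multimodal model can process several types of input at once:

  • Text: natural language
  • Images: photos, screenshots, design mockups
  • Audio: speech, music
  • Video: moving content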

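1.2 Frontend Use Cases

| Scenario | Input | Application |
| --- | --- | --- |
| Design to code | Figma/Sketch screenshot | Generate HTML/CSS automatically |
| UI bug reports | Screen capture | Identify UI problems and fix them |
| Screenshot Q&A | Web page screenshot | "Where is this button implemented?" |
| Accessibility audit | UI screenshot | Check contrast and font size |
| Component recognition | Design mockup | Identify the component types used |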

2. Using Vision APIs

2.1 OpenAI Vision

javascript
import OpenAI from 'openai';

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{
    role: "user",
    content: [
      { type: "text", text: "What UI problems does this login page have?" },
      {
        type: "image_url",
        image_url: {
          url: "https://example.com/login-screenshot.png"
          // or pass a base64 data URL: "data:image/png;base64,..."
        }
      }
    ]
  }]
});
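The base64 option mentioned in the comment above can be sketched as follows. This is a minimal illustration, not part of the OpenAI SDK: `toDataUrl` and `buildVisionMessage` are hypothetical helper names, and the default MIME type is an assumption.

```javascript
// Hypothetical helper: wrap raw image bytes in a data URL for the image_url field.
// The caller must pass the MIME type that matches the actual image format.
function toDataUrl(imageBytes, mimeType = 'image/png') {
  const base64 = Buffer.from(imageBytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: build the same user message shape as the request above.
function buildVisionMessage(question, imageUrl) {
  return {
    role: 'user',
    content: [
      { type: 'text', text: question },
      { type: 'image_url', image_url: { url: imageUrl } },
    ],
  };
}
```

For a local file, the bytes could come from `fs.readFileSync('login-screenshot.png')`, and the resulting message would go into the `messages` array of the request.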

2.2 Anthropic Vision

javascript
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs';

const anthropic = new Anthropic();

const imageData = fs.readFileSync('screenshot.png');
const base64Image = imageData.toString('base64');

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/png",
          data: base64Image
        }
      },
      { type: "text", text: "Describe the layout and components of this UI" }
    ]
  }]
});
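The request above hard-codes `media_type: "image/png"`, which only holds when the file really is a PNG. One way to avoid a mismatch is to sniff the format from the file's leading bytes. The sketch below is an illustration under that assumption; `detectMediaType` and `toImageBlock` are hypothetical names, and only the four common signatures (PNG, JPEG, GIF, WebP) are checked.

```javascript
// Hypothetical helper: guess the media type from well-known file signatures.
function detectMediaType(bytes) {
  const startsWith = (signature, offset = 0) =>
    signature.every((byte, i) => bytes[offset + i] === byte);

  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return 'image/png';   // "\x89PNG"
  if (startsWith([0xff, 0xd8, 0xff])) return 'image/jpeg';        // JPEG start-of-image marker
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return 'image/gif';   // "GIF8"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return 'image/webp';                                          // "RIFF....WEBP"
  }
  return null;
}

// Hypothetical helper: build an image content block with a matching media type.
function toImageBlock(bytes) {
  const mediaType = detectMediaType(bytes);
  if (!mediaType) throw new Error('Unsupported or unrecognized image format');
  return {
    type: 'image',
    source: {
      type: 'base64',
      media_type: mediaType,
      data: Buffer.from(bytes).toString('base64'),
    },
  };
}
```

With this in place, `toImageBlock(fs.readFileSync('screenshot.png'))` could replace the hand-written image block in the `content` array.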

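2.3 Image Optimization

javascript
// Token usage grows with image dimensions,
// so downscale and compress images to reduce cost.

import sharp from 'sharp';

async function optimizeImage(imagePath) {
  const optimized = await sharp(imagePath)
    .resize(1024, 1024, { fit: 'inside', withoutEnlargement: true })  // cap the longest side at 1024px, never upscale
    .jpeg({ quality: 80 })                   // re-encode as JPEG at quality 80
    .toBuffer();

  // The result is JPEG data, so send it with media_type "image/jpeg", not "image/png".
  return optimized.toString('base64');
}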
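To see roughly what a resize saves, the sketch below estimates image tokens before and after downscaling. It relies on the approximation given in Anthropic's vision documentation (tokens ≈ width × height / 750); other providers count differently, and the real figure may deviate. `fitInside` and `estimateImageTokens` are hypothetical helper names, and `fitInside` assumes upscaling is disabled (sharp's `withoutEnlargement: true`).

```javascript
// Dimensions after a `fit: 'inside'` style resize that never enlarges the image.
function fitInside(width, height, maxSide = 1024) {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Rough token estimate, assuming the width * height / 750 approximation.
function estimateImageTokens(width, height) {
  return Math.ceil((width * height) / 750);
}

// Example: a full-HD screenshot before and after resizing.
const original = { width: 1920, height: 1080 };
const resized = fitInside(original.width, original.height);
console.log(estimateImageTokens(original.width, original.height)); // 2765
console.log(resized);                                              // { width: 1024, height: 576 }
console.log(estimateImageTokens(resized.width, resized.height));   // 787
```

Under that approximation, the resize cuts the estimated cost of a 1920×1080 screenshot to a bit under a third.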

3. Design to Code

3.1 Basic Implementation

javascript
async function designToCode(imageBase64) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: imageBase64 }
        },
        {
          type: "text",
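          text: `Convert this design mockup into React component code.

Requirements:
1. Use Tailwind CSS
2. Responsive design
3. Semantic HTML
4. Include any necessary state management

Output format:
\`\`\`tsx
// component code
\`\`\``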
        }
      ]
    }]
  });

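  // response.content is an array of content blocks; extractCode is a helper
  // that must pull the fenced code out of the reply's text blocks.
  return extractCode(response.content);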
}
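`designToCode` calls `extractCode`, which the source does not define. A minimal sketch of what such a helper might look like is shown below, assuming the reply contains at most one fenced block that matters and that the content is either a string or an array of `{ type: 'text', text }` blocks.

```javascript
// Built with repeat() so this example contains no literal triple backticks.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '[\\w-]*\\n([\\s\\S]*?)' + FENCE);

// Hypothetical implementation of the extractCode helper used above.
function extractCode(content) {
  // Accept either a plain string or an array of content blocks.
  const text = typeof content === 'string'
    ? content
    : content
        .filter((block) => block.type === 'text')
        .map((block) => block.text)
        .join('\n');

  // Return the body of the first fenced block, or the whole text if none is found.
  const match = text.match(FENCED_BLOCK);
  return (match ? match[1] : text).trim();
}
```

Falling back to the full text is a design choice: it keeps the pipeline running when the model ignores the requested output format, at the cost of possibly returning prose along with the code.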

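3.2 Iterative Refinement

javascript
// renderComponent and improveCode are helpers assumed to exist elsewhere:
// one renders the code and returns a base64 PNG screenshot,
// the other asks the model to fix the listed differences.
async function iterativeDesignToCode(imageBase64) {
  // 1. Generate the initial code
  let code = await designToCode(imageBase64);

  // 2. Render it and take a screenshot
  const renderedImage = await renderComponent(code);

  // 3. Compare the render with the original design
  const comparison = await compareDesigns(imageBase64, renderedImage);

  // 4. If they differ, revise the code (a single pass here; repeat steps 2-4 for further rounds)
  if (comparison.hasDifferences) {
    code = await improveCode(code, comparison.differences);
  }

  return code;
}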

async function compareDesigns(original, rendered) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: original }
        },
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: rendered }
        },
        {
          type: "text",
          text: "Compare these two designs and list the differences. Return JSON: { hasDifferences: boolean, differences: string[] }"
        }
      ]
    }]
  });

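  // response.content is an array of content blocks; the JSON is in the first text block.
  // JSON.parse throws if the model wraps the JSON in extra prose.
  return JSON.parse(response.content[0].text);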
}
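The flow above performs one comparison and at most one fix. A bounded loop is one way to turn it into a real render-compare-fix cycle. The sketch below is an assumption-laden illustration: `refineUntilMatch` is a hypothetical name, and the `steps` object is an injection point standing in for `designToCode`, `renderComponent`, `compareDesigns`, and `improveCode`, so the loop can be exercised without calling any API.

```javascript
// Hypothetical loop: repeat render -> compare -> improve until the model
// reports no differences or the round limit is reached.
async function refineUntilMatch(designImage, steps, maxRounds = 3) {
  let code = await steps.generate(designImage);

  for (let round = 1; round <= maxRounds; round += 1) {
    const rendered = await steps.render(code);
    const comparison = await steps.compare(designImage, rendered);

    if (!comparison.hasDifferences) {
      return { code, rounds: round, converged: true };
    }
    code = await steps.improve(code, comparison.differences);
  }

  // Round limit reached: return the latest attempt and flag it as unverified.
  return { code, rounds: maxRounds, converged: false };
}
```

A call might look like `refineUntilMatch(imageBase64, { generate: designToCode, render: renderComponent, compare: compareDesigns, improve: improveCode })`. The round limit matters because each round costs at least two vision requests and the comparison may never report a perfect match.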

4. UI Bug Detection

4.1 Automated UI Review

javascript
async function auditUI(screenshotBase64) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: screenshotBase64 }
        },
        {
          type: "text",
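          text: `Review this UI and check for the following problems:

1. Layout problems (overlap, overflow, alignment)
2. Accessibility (contrast, font size)
3. Responsive design problems
4. Visual consistency

Return JSON:
{
  "issues": [
    { "type": "layout", "severity": "high", "description": "..." },
    { "type": "accessibility", "severity": "medium", "description": "..." }
  ]
}`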
        }
      ]
    }]
  });

  // response.content is an array of content blocks; parse the first text block.
  return JSON.parse(response.content[0].text);
}
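Models do not always return bare JSON; the object is often wrapped in a sentence or a code fence. The sketch below shows one tolerant way to parse such a reply and order the issues by severity. `parseJsonReply` and `sortIssues` are hypothetical helpers, and the `low` severity level is an assumption, since the prompt above only shows `high` and `medium`.

```javascript
// Hypothetical helper: parse the outermost {...} span of a model reply.
function parseJsonReply(text) {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model reply');
  }
  return JSON.parse(text.slice(start, end + 1));
}

// Assumed ordering; unknown severities sort last.
const SEVERITY_RANK = { high: 0, medium: 1, low: 2 };

// Hypothetical helper: return the issues sorted from most to least severe.
function sortIssues(report) {
  const issues = Array.isArray(report.issues) ? report.issues : [];
  return [...issues].sort(
    (a, b) => (SEVERITY_RANK[a.severity] ?? 3) - (SEVERITY_RANK[b.severity] ?? 3)
  );
}
```

In `auditUI`, this could be applied as `sortIssues(parseJsonReply(response.content[0].text))`.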

4.2 Accessibility Checks

javascript
async function checkAccessibility(screenshotBase64) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: screenshotBase64 }
        },
        {
          type: "text",
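          text: `Check for accessibility problems:

1. Text-to-background contrast (WCAG AA: 4.5:1 for normal text, 3:1 for large text)
2. Font size (body text at least 16px, a common guideline rather than a WCAG rule)
3. Clickable target size (at least 44x44px)
4. Color is not the only way information is conveyed

Return a detailed report.`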
        }
      ]
    }]
  });

  // response.content is an array of content blocks; return the report text.
  return response.content[0].text;
}
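A vision model can only estimate contrast from pixels, so once the actual colors are known it may be worth verifying the 4.5:1 threshold deterministically. The sketch below implements the WCAG 2.x relative luminance and contrast ratio formulas; the function names are hypothetical, and colors are assumed to be 8-bit sRGB triples.

```javascript
// Convert an 8-bit sRGB channel to its linear value (WCAG 2.x definition).
function channelToLinear(value8bit) {
  const c = value8bit / 255;
  return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
}

// Relative luminance of an [r, g, b] color.
function relativeLuminance([r, g, b]) {
  return 0.2126 * channelToLinear(r) + 0.7152 * channelToLinear(g) + 0.0722 * channelToLinear(b);
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(rgbA, rgbB) {
  const [lighter, darker] = [relativeLuminance(rgbA), relativeLuminance(rgbB)]
    .sort((x, y) => y - x);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA: 4.5:1 for normal text, 3:1 for large text.
function meetsAA(rgbA, rgbB, largeText = false) {
  return contrastRatio(rgbA, rgbB) >= (largeText ? 3 : 4.5);
}

console.log(meetsAA([118, 118, 118], [255, 255, 255])); // true  (#767676 on white)
console.log(meetsAA([119, 119, 119], [255, 255, 255])); // false (#777777 on white)
```

A reasonable split is to let the model flag suspicious regions in the screenshot and then confirm them with a check like this against the computed styles.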

5. Screenshot Q&A

5.1 Locating Code

javascript
async function locateCodeFromScreenshot(screenshotBase64, question) {
  // 1. Analyze the screenshot
  const uiAnalysis = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: screenshotBase64 }
        },
        {
          type: "text",
          text: `Describe the features of this UI element:\n${question}\n\nInclude: text content, color, position, styling`
        }
      ]
    }]
  });

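  // 2. Search the codebase (searchCodebase is a helper assumed to exist elsewhere).
  //    Pass the description text; response.content is an array of content blocks.
  const codeResults = await searchCodebase(uiAnalysis.content[0].text);

  // 3. Return the matching locations
  return codeResults;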
}
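The source leaves `searchCodebase` undefined. As a rough idea of what the search step could do, the sketch below ranks files by how many terms from the model's description appear in their source. It is a deliberately naive, hypothetical stand-in (`rankFilesByDescription` is an invented name, and the in-memory `files` map is an assumption); a real project would more likely use ripgrep, an index, or embeddings.

```javascript
// Hypothetical keyword search: score each file by the number of distinct
// description terms (3+ characters) that occur in its source text.
function rankFilesByDescription(description, files) {
  const terms = [...new Set(description.toLowerCase().match(/[a-z0-9#-]{3,}/g) ?? [])];

  return Object.entries(files)
    .map(([path, source]) => {
      const haystack = source.toLowerCase();
      const matched = terms.filter((term) => haystack.includes(term));
      return { path, score: matched.length, matched };
    })
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Example with two in-memory files.
const hits = rankFilesByDescription('A blue button labeled "Sign in" near the top', {
  'Login.tsx': '<button className="bg-blue-500">Sign in</button>',
  'Footer.tsx': '<footer>Contact</footer>',
});
console.log(hits[0].path); // Login.tsx
```

Visible text such as button labels tends to be the most reliable search key, which is why the prompt above asks the model to include the element's text content.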

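6. Computer Use (Claude)

6.1 Basic Usage

javascript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Computer use is a beta feature: call it through the beta namespace with a beta flag.
// Claude 4 models use the computer_20250124 tool version; computer_20241022 targets Claude 3.5 Sonnet.
const response = await anthropic.beta.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 4096,
  betas: ["computer-use-2025-01-24"],
  messages: [{
    role: "user",
    content: "Open the browser, go to example.com, and click the login button"
  }],
  tools: [
    {
      type: "computer_20250124",
      name: "computer",
      display_width_px: 1920,
      display_height_px: 1080
    }
  ]
});
// The model does not control the machine itself. The response contains tool_use blocks
// (screenshot, click, type, ...) that your code must execute and answer with tool_result blocks.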
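A single request like the one above only yields the model's next requested action. To complete a task, the caller runs a loop: execute each `tool_use` block, send back a `tool_result`, and repeat until the model stops asking for tools. The sketch below shows that loop under stated assumptions: `runComputerUseLoop` is a hypothetical name, `createMessage` is an injected function that would wrap the `messages.create` call with the computer tool attached, and `executeAction` is an injected function that would perform the action (for example through Playwright or an OS automation layer) and return text or content blocks such as a screenshot.

```javascript
// Hypothetical agent loop for tool-using responses.
async function runComputerUseLoop({ createMessage, executeAction, task, maxTurns = 10 }) {
  const messages = [{ role: 'user', content: task }];

  for (let turn = 0; turn < maxTurns; turn += 1) {
    const response = await createMessage(messages);
    messages.push({ role: 'assistant', content: response.content });

    // Collect the actions the model asked for in this turn.
    const toolUses = response.content.filter((block) => block.type === 'tool_use');
    if (toolUses.length === 0) {
      return { messages, finished: true }; // plain answer: the task is done
    }

    // Execute each action and report the outcome back to the model.
    const results = [];
    for (const toolUse of toolUses) {
      const output = await executeAction(toolUse.input);
      results.push({ type: 'tool_result', tool_use_id: toolUse.id, content: output });
    }
    messages.push({ role: 'user', content: results });
  }

  return { messages, finished: false }; // turn limit reached
}
```

The turn limit guards against runaway sessions, since every turn typically includes a screenshot and therefore a sizeable token cost.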

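6.2 Automated Testing

javascript
// Note: testCase is unused in this example; the test steps are hard-coded in the prompt.
async function automatedUITest(testCase) {
  const response = await anthropic.beta.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    betas: ["computer-use-2025-01-24"],  // beta flag matching the computer_20250124 tool
    messages: [{
      role: "user",
      content: `Run this UI test:

1. Open http://localhost:3000
2. Fill in the login form (username: test, password: 123456)
3. Click the login button
4. Verify that the page redirects to the dashboard
5. Take a screenshot and report the result`
    }],
    tools: [{
      type: "computer_20250124",
      name: "computer",
      display_width_px: 1920,
      display_height_px: 1080
    }]
  });

  // This is only the model's first turn: execute each tool_use block and
  // send back tool_result blocks until the test finishes.
  return response;
}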

7. Practical Examples

7.1 Component Library Documentation

javascript
async function generateComponentDocs(componentScreenshot) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: componentScreenshot }
        },
        {
          type: "text",
          text: `Generate documentation for this component:

1. Component name and purpose
2. Props list
3. Usage example
4. Variants
5. Accessibility notes

Output in Markdown.`
        }
      ]
    }]
  });

  // response.content is an array of content blocks; return the Markdown text.
  return response.content[0].text;
}

7.2 Design System Extraction

javascript
async function extractDesignSystem(screenshots) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content: [
        ...screenshots.map(img => ({
          type: "image",
          source: { type: "base64", media_type: "image/png", data: img }
        })),
        {
          type: "text",
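          text: `Analyze these UI screenshots and extract the design system:

1. Color system (primary, secondary, semantic colors)
2. Typography (font sizes, line heights, font weights)
3. Spacing system (padding, margin)
4. Border radius and shadows
5. Component variants

Output CSS variables and a design specification.`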
        }
      ]
    }]
  });

  // response.content is an array of content blocks; return the generated text.
  return response.content[0].text;
}

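8. Key Takeaways

  1. Vision APIs are powerful but token-hungry: optimize image dimensions
  2. Step-by-step processing is more accurate: analyze first, then generate
  3. Refine design-to-code iteratively: loop through render, compare, fix
  4. Computer Use suits automated testing: it simulates real user actions
  5. Combining code context works better: image + code = better understanding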
