m5k1umn posted on 2024-7-30 19:23:31

Google ScreenAI: an advanced vision-language model


    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Screen user interfaces (UIs) and infographics, such as charts, diagrams, and tables, play an important role in human communication and human-computer interaction, as they enable rich, interactive user experiences. UIs and infographics share similar design principles and a common visual language (e.g., icons and layouts), which offers the opportunity to build a single model that can understand, reason about, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs pose a unique modeling challenge.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To address this challenge, researchers introduced "ScreenAI: A Vision-Language Model for UI and Infographics Understanding." ScreenAI improves on the PaLI architecture with the flexible patching strategy introduced in pix2struct. The researchers trained ScreenAI on a unique mixture of datasets and tasks, including a novel screen annotation task that requires the model to identify information about UI elements on a screen (i.e., their type, location, and description). These text annotations provide screen descriptions to large language models (LLMs), enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">At only 5 billion parameters, ScreenAI achieves state-of-the-art results on UI- and infographics-based tasks (WebSRC and MoTIF) and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared with models of similar size. The researchers also released three new datasets: Screen Annotation, to evaluate the model's layout understanding, and ScreenQA Short and Complex ScreenQA, to evaluate its question-answering ability more comprehensively.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">ScreenAI's architecture is based on PaLI and consists of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a Vision Transformer (ViT) to create image embeddings, and the multimodal encoder takes the concatenation of the image and text embeddings as input. This flexible architecture lets ScreenAI handle vision tasks that can be recast as text-plus-image-to-text problems.</span></p>
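The encoder input described above can be sketched in a few lines. This is an illustrative shape-level sketch, not the actual PaLI code; the function name and the sequence lengths and model dimension are made-up examples.

```python
import numpy as np

def build_encoder_input(patch_embeddings: np.ndarray,
                        text_embeddings: np.ndarray) -> np.ndarray:
    """Concatenate ViT image-patch embeddings and text-token embeddings
    along the sequence axis to form the multimodal encoder's input."""
    assert patch_embeddings.shape[-1] == text_embeddings.shape[-1], \
        "image and text embeddings must share the model dimension"
    return np.concatenate([patch_embeddings, text_embeddings], axis=0)

# Example: 196 image patches and 12 text tokens, model dimension 768
# (all sizes are illustrative, not ScreenAI's real configuration).
patches = np.zeros((196, 768))
tokens = np.zeros((12, 768))
seq = build_encoder_input(patches, tokens)
print(seq.shape)  # (208, 768)
```

The decoder then attends over this single joint sequence, which is what lets one model serve both image-heavy and text-heavy tasks.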
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">On top of the PaLI architecture, the researchers adopted the flexible patching strategy introduced in pix2struct. Rather than using a fixed grid pattern, the grid dimensions are chosen so as to preserve the native aspect ratio of the input image. This allows ScreenAI to adapt well to images of various aspect ratios.</span></p>
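The idea of an aspect-ratio-preserving grid can be sketched as follows. This is a simplified stand-in for the pix2struct selection rule, not the paper's exact algorithm: pick a rows × cols grid whose ratio tracks the image's height/width while staying within a patch budget.

```python
import math

def choose_grid(width: int, height: int, max_patches: int) -> tuple[int, int]:
    """Return (rows, cols) with rows/cols ~ height/width and
    rows * cols <= max_patches (simplified heuristic)."""
    aspect = height / width
    # From rows = aspect * cols and rows * cols <= budget:
    # cols <= sqrt(budget / aspect).
    cols = max(1, math.floor(math.sqrt(max_patches / aspect)))
    rows = max(1, math.floor(aspect * cols))
    # Shrink if rounding overshoots the budget.
    while rows * cols > max_patches:
        if rows >= cols:
            rows -= 1
        else:
            cols -= 1
    return rows, cols

print(choose_grid(1920, 1080, 1024))  # landscape screenshot -> (23, 42)
print(choose_grid(1080, 1920, 1024))  # portrait screenshot  -> (42, 24)
```

A fixed 32 × 32 grid would stretch both screenshots identically; here the grid shape follows the image, so tall mobile screens and wide desktop screens are patched without distortion.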
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The ScreenAI model is trained in two stages: a pre-training stage and a fine-tuning stage. First, self-supervised learning is used to automatically generate data labels, which are then used to train the Vision Transformer and the language model. During the fine-tuning stage, the ViT is frozen, and most of the data used is manually labeled by human raters.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To create a pre-training dataset for ScreenAI, the researchers first compiled a large collection of screenshots from a variety of devices, including desktops, mobile phones, and tablets. This was done by using publicly accessible web pages and by following the programmatic exploration approach used for the RICO dataset of mobile apps. They then applied a layout annotator based on the DETR model, which identifies and labels a wide range of UI elements (e.g., images, pictograms, buttons, text) and their spatial relationships. Pictograms were further analyzed with an icon classifier able to distinguish 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed by icons. For icons not covered by the classifier, and for infographics and images, the researchers used the PaLI image captioning model to generate descriptive captions that provide contextual information. They also applied an optical character recognition (OCR) engine to extract and annotate the on-screen text. The researchers combined the OCR text with the preceding annotations to create a detailed description of each screen.</span></p>
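Combining the per-element annotations into one screen description might look like the sketch below. The dataclass fields and the line format are assumptions for illustration; the paper's actual screen-schema format may differ.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str      # e.g. "TEXT", "ICON", "BUTTON", "IMAGE"
    bbox: tuple    # (x0, y0, x1, y1) in normalized screen coordinates
    content: str   # OCR text, icon class, or a generated caption

def screen_schema(elements: list) -> str:
    """Serialize annotated UI elements into a single textual screen
    description, one element per line."""
    lines = []
    for el in elements:
        x0, y0, x1, y1 = el.bbox
        lines.append(f"{el.kind} {x0} {y0} {x1} {y1} {el.content}")
    return "\n".join(lines)

# Hypothetical annotations for a small settings screen.
demo = [
    UIElement("ICON", (0.05, 0.05, 0.09, 0.10), "back_arrow"),
    UIElement("TEXT", (0.10, 0.05, 0.90, 0.10), "Settings"),
    UIElement("BUTTON", (0.30, 0.80, 0.70, 0.90), "Save"),
]
print(screen_schema(demo))
```

The resulting string is the "detailed description of each screen" that later steps feed to an LLM.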
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-axegupay5k/db0e208a4fe945beabf32b293480f36f~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1722921578&amp;x-signature=EK6ss4V%2FO2fvJvEMzc9I25IBNpY%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To increase the diversity of the pre-training data, the researchers use PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the techniques described above; then a prompt is built around this schema for a large language model to generate synthetic data. This process requires prompt engineering and iterative refinement to find effective prompts. The researchers assess the quality of the generated data through human validation against a quality threshold.</span></p>
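The two-step generation can be sketched as follows. `PROMPT_TEMPLATE` paraphrases the example prompt shown in this article; how the researchers actually called PaLM 2 is not specified, so only the prompt assembly is shown.

```python
# Step 2 of the pipeline: wrap a screen schema (the output of step 1)
# in a QA-generation prompt for an LLM. Template text paraphrases the
# article's example; the function name is an illustrative assumption.
PROMPT_TEMPLATE = """You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words.
Can you generate 5 questions regarding the content of the screenshot
as well as the corresponding short answers to them?

{screen_schema}
"""

def build_qa_prompt(screen_schema: str) -> str:
    """Embed the textual screen schema in the QA-generation prompt."""
    return PROMPT_TEMPLATE.format(screen_schema=screen_schema)

prompt = build_qa_prompt("BUTTON 0.3 0.8 0.7 0.9 Save")
print(prompt)
```

The generated question-answer pairs are then filtered by human validation before entering the pre-training mixture, as described above.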
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">An example of such a prompt:</span></p>
    <pre style="font-size: 14px; color: black; text-align: left; margin-bottom: 15px;">You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words. Can you
generate 5 questions regarding the content of the screenshot as well as the
corresponding short answers to them?

The answer should be as short as possible, containing only the necessary
information. Your answer should be structured as follows:
questions: [
{{question: the question,
  answer: the answer}},
...
]

{THE SCREEN SCHEMA}</pre>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The researchers fine-tuned ScreenAI on public question-answering, summarization, and navigation datasets, along with a variety of UI-related tasks. For question answering, they used well-established benchmarks from the multimodal and document understanding fields, such as ChartQA, DocVQA, Multipage DocVQA, InfographicVQA, OCR-VQA, WebSRC, and ScreenQA. For navigation, the datasets used include Referring Expressions, MoTIF, Mug, and Android in the Wild. Finally, they used Screen2Words for screen summarization and Widget Captioning for describing specific UI elements. In addition to the fine-tuning datasets, the researchers evaluated the fine-tuned ScreenAI model on three new benchmarks:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">– Screen Annotation: evaluates the model's layout annotation and spatial understanding capabilities.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">– ScreenQA Short: a variant of ScreenQA in which the ground-truth answers have been shortened to contain only the relevant information, making them more consistent with other QA tasks.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">– Complex ScreenQA: complements ScreenQA Short with harder questions (counting, arithmetic, comparison, and unanswerable questions) and includes screens with various aspect ratios.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/67882e595efc48409a71b752a63ff604~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1722921578&amp;x-signature=cuU0wAI2tHOilENM5%2BeBwoq9uiU%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The fine-tuned ScreenAI model achieves state-of-the-art results on a variety of UI- and infographics-based tasks (WebSRC and MoTIF) and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared with models of similar size. ScreenAI is also competitive on Screen2Words and OCR-VQA. In addition, the researchers report results on the newly introduced benchmark datasets as baselines for further research.</span></p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/5f2b3eaefd4440218dd7ca2d221a1b85~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1722921578&amp;x-signature=ltPoIfsJ8Llfi3TYVCwwgnzsiPQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The researchers introduced the ScreenAI model together with a unified representation that allowed them to develop self-supervised learning tasks leveraging data from all of these domains. They also demonstrated the impact of using large language models for data generation and explored improving specific aspects of the model by modifying the training mixture. They applied all of these techniques to build multi-task trained models that perform competitively with state-of-the-art approaches on public benchmarks.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">However, the researchers also note that although their approach is competitive with state-of-the-art methods on public benchmarks, a gap remains compared with larger models. They emphasize that further research is needed to close this gap and to explore new strategies and techniques for improving model performance.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The researchers' work not only demonstrates the potential of the ScreenAI model for UI and infographic understanding, but also provides a solid foundation for future research. By releasing new datasets and demonstrating the ability to generate data with large language models, they open new avenues for tackling complex human-computer interaction problems.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Furthermore, the development of ScreenAI highlights the importance of cross-domain integration: combining the latest advances in computer vision, natural language processing, and human-computer interaction to tackle long-standing challenges. This interdisciplinary approach not only drives technical progress but also provides the research community with rich resources, including datasets, model architectures, and training strategies, all of which are key to future innovation.</span></p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/8373a48ef7c248fca3f58325abe366c1~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1722921578&amp;x-signature=VXd7eBSlkRSMcgrhE5IGCXwKzbQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/f6eb5eaa6bb34f72917740d252068d3d~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1722921578&amp;x-signature=6pHwKpqyM4y4%2FPtLYNrrifszi%2Fw%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In summary, the ScreenAI project marks an important step toward understanding and interacting with increasingly complex digital interfaces. As the technology continues to advance, future research can be expected to keep exploring the potential of this field, unlocking more application scenarios and better serving human-machine interaction.</span></p>




张露zhang posted on 2024-9-9 03:53:12

Discussion sparkles like starlight, lighting up the night sky of ideas.

b1gc8v posted on 2024-10-9 01:03:53

This post really resonates with me. My sincere thanks to the author!

qzmjef posted on 2024-10-11 06:50:06

Before coming across this post, I doubted whether true sages existed in this world.
