OpenAI's Deep Research has a number of shortcomings:
Before a Deep Research task starts, a fine-tuned GPT-4o model handles the interaction and calls research_kickoff_tool to gather additional context; you can try sending "please start_research_task" to trigger the task (a hedged sketch of such a tool definition follows this list). [Details](https://x.com/dotey/status/1886790050282917926)

4⃣️ Anthropic bans AI-written job applications: Anthropic asks applicants not to use AI to generate their answers during the application process, so it can assess genuine interest and communication skills. [Details](https://x.com/dotey/status/1886783443859546491)

5⃣️ AI search evolves: a hands-on look at OpenAI Deep Research. Deep Research offers multilingual search, precise information extraction, and professional-grade writing, making it useful for academic research, SEO, and product planning. It still suffers from stale and jumbled information, however, and cannot fully replace human deep thinking. [Full write-up](https://mp.weixin.qq.com/s/_4UZrJuI42PuyTD5s5mVZg?token=1639803888&lang=zh_CN) | [Thread](https://x.com/dotey/status/1886671986559967734)
6⃣️ Using Deep Research to survey opinions on AI application quality: searching for community views and generating a report with Deep Research showed that some of the information it surfaces is still out of date, suggesting room for improvement in freshness. [Full report](https://mp.weixin.qq.com/s/TojnJ8kMRxtznmnnMO6dfw?token=1639803888&lang=zh_CN)

7⃣️ Tips for coding with o1 Pro: since o1 Pro cannot read local code directly, manually provide the complete code or use Repo Prompt to select and paste the relevant files, and keep the pasted code under roughly 10K tokens to preserve performance (see the token-budget sketch after this list). [Details](https://x.com/dotey/status/1886587848624681119)
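The kickoff flow in the first item is essentially a tool-calling setup. As a rough illustration only, here is a minimal sketch of what a tool definition for research_kickoff_tool with a start_research_task action could look like, written against the OpenAI Python SDK's chat-completions tools format; the parameter names (task_brief, clarifying_questions) and the model name are assumptions, not the actual internal schema.

```python
# Hypothetical sketch only: the real research_kickoff_tool schema is not public.
# Parameter names (task_brief, clarifying_questions) are assumptions.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "research_kickoff_tool",
            "description": "Collect missing context, then start a deep research task.",
            "parameters": {
                "type": "object",
                "properties": {
                    "action": {
                        "type": "string",
                        "enum": ["clarify_with_user", "start_research_task"],
                    },
                    "task_brief": {"type": "string"},
                    "clarifying_questions": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                },
                "required": ["action"],
            },
        },
    }
]

# The user hint from the post: explicitly asking to start the research task.
response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for the fine-tuned kickoff model
    messages=[{"role": "user", "content": "please start_research_task"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```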
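For the 10K-token guideline in item 7⃣️, a quick way to vet a paste before sending it is to count tokens locally. A minimal sketch, assuming tiktoken is installed and using the o200k_base encoding as an approximation (the exact tokenizer o1 Pro uses is an assumption here):

```python
# Rough token-budget check before pasting code into o1 Pro.
# Assumes `pip install tiktoken`; o200k_base is used only as an approximation.
import sys
from pathlib import Path

import tiktoken

TOKEN_BUDGET = 10_000
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(paths: list[str]) -> int:
    """Concatenate the given source files and count their tokens."""
    text = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in paths)
    return len(enc.encode(text))

if __name__ == "__main__":
    total = count_tokens(sys.argv[1:])
    status = "OK" if total <= TOKEN_BUDGET else "over budget, trim the selection"
    print(f"{total} tokens ({status})")
```

Run it as `python token_check.py path/to/file1.py path/to/file2.py` and trim the selection until the total fits the budget.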
| Evaluation | Capability | Question |
| --- | --- | --- |
| … | … | …competitions that involve designing, building, and training ML models on GPUs? |
| OpenAI PRs | Real-world ML research tasks | Can models replicate OpenAI PRs? |
| SWE-Lancer | Real-world software engineering tasks | How do models perform on real-world, economically valuable full-stack software engineering tasks? |

### 4.6.1 OpenAI Research Engineer Interviews (Multiple choice & coding)

We measure GPT-4.5's ability to pass OpenAI's Research Engineer interview loop, using a dataset of 18 coding and 97 multiple-choice questions created from our internal question bank. GPT-4.5 scores 79% on the coding questions, tying deep research but underperforming relative to o3-mini. All models since o1 score similarly on the multiple-choice question set: GPT-4.5 (both pre- and post-mitigation) scores 80%, as do o1 and o3-mini. We find that frontier models excel at self-contained ML challenges. However, interview questions measure short (1-hour) tasks, not real-world ML research (1 month to 1+ years), so strong interview performance does not necessarily imply that models generalize to longer-horizon tasks.
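The interview evaluation above boils down to per-category accuracy over a small graded question set (18 coding, 97 multiple-choice). A minimal sketch of that bookkeeping; the grades below are placeholders for illustration, not the actual evaluation data:

```python
# Per-category pass rate over a graded question set.
# The grades below are placeholders, not the reported results.
from collections import defaultdict

def category_accuracy(graded):
    """graded: iterable of (category, passed) pairs -> {category: accuracy}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in graded:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

# Placeholder grades sized like the text: 18 coding and 97 multiple-choice questions.
dummy = [("coding", i % 5 != 0) for i in range(18)] + \
        [("multiple_choice", i % 4 != 0) for i in range(97)]
print(category_accuracy(dummy))
```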