Latest AI Rules

A practical operating guide, based on official information as of March 6, 2026, for choosing among GPT-5.4, GPT-5.3-Codex xhigh, Claude Sonnet 4.6, Claude Opus 4.6, and Gemini 3.1 Pro across development, business operations, Excel, presentations, legal work, medical literature, and finance.

Updated: 2026-03-06 · With references · blog.abcas.jp

Fast recommendation

Use GPT-5.4 as the broad default, Codex for execution-heavy development, and never let a single model make final calls in high-stakes domains.

Naming note

Strictly speaking, Codex-5.3-xhigh is shorthand for GPT-5.3-Codex run with xhigh reasoning.

Operating baseline

Default

When in doubt, make GPT-5.4 the lead model. One model can then cover consultation, requirements, research, documents, and design.

Execution

Prefer GPT-5.3-Codex xhigh for implementation-and-verification loops that live in the CLI, repository, tests, and repeated fixes.

High stakes

For legal, medical, and financial work, always require primary sources and human review. Models can draft, compare, and structure the issues, but they do not make the final call alone.

Model roles

GPT-5.4

The broad mainline model for research, documents, design, office work, and mixed business decisions.

GPT-5.3-Codex xhigh

The implementation worker for code changes, CI, debugging, and terminal-based execution.

Claude Sonnet 4.6

A cost-efficient all-rounder for high-volume production use, frontend work, and workflow-heavy operations.

Claude Opus 4.6

The reviewer for the hardest tasks: strict review, legal-heavy analysis, finance-heavy analysis, and long multi-step work.

Gemini 3.1 Pro

The huge-context multimodal specialist for large source bundles, PDFs, images, charts, video, and codebase-wide understanding.

Recommendations by use case

Use case	First choice	Second choice	Operational note
Consultation / sparring / requirements	GPT-5.4	Sonnet 4.6	Refs: GPT-5.4 GDPval 83.0 / Sonnet Pace 94. Best when the work moves between ambiguity, structure, and decisions, such as untangling a vague request.
Research / source gathering / synthesis	GPT-5.4	Gemini 3.1 Pro	Refs: GPT-5.4 GDPval 83.0. Use GPT for web-heavy work and Gemini for very large, multimodal source bundles.
Architecture / backend / integration design	GPT-5.4	Opus 4.6	Refs: GPT-5.4 GDPval 83.0 / Opus OSWorld-Verified 72.7. Use GPT for boundary design, consistency, and constraint mapping, and Opus to close out unusually hard edge cases.
Frontend design / prototypes	Sonnet 4.6	GPT-5.4	Refs: Sonnet Pace 94 / Box +15pt. Sonnet balances visual intent and instruction following well. GPT-5.4 is acceptable if you want a single model.
Development execution: coding / CI-CD / debugging / code consistency	GPT-5.3-Codex xhigh	Sonnet 4.6	Refs: Codex Terminal-Bench 2.0 77.3 / SWE-bench Pro 56.8. Use Codex when the repo and terminal are the main workspace, and Sonnet for parallel review or high-volume output.
Code logic / risky-change review	Opus 4.6	GPT-5.4	Refs: Opus Terminal-Bench 2.0 65.4 / OSWorld-Verified 72.7. Use Opus for strict review, long diffs, and passes that check for missed hard bugs.
Business workflows / CRM / contract routing	Sonnet 4.6	GPT-5.4	Refs: Sonnet Pace 94 / GPT-5.4 GDPval 83.0. Strong for heavily branched business processes, contract routing, and longer-running agents.
Business logic / cost structure / operating design	GPT-5.4	Gemini 3.1 Pro	Refs: GPT-5.4 GDPval 83.0. Use GPT to structure components, compare tradeoffs, and write them up; use Gemini when the source set becomes very large.
Sales planning / proposals / go-to-market writing	GPT-5.4	Sonnet 4.6	Refs: GPT-5.4 OfficeQA 68.1 / Sonnet Pace 94. Strong for positioning, proposal structure, comparison tables, and final polish.
Excel / spreadsheets / modeling	GPT-5.4	Opus 4.6	Refs: GPT-5.4 OfficeQA 68.1 / Opus OSWorld-Verified 72.7. GPT has explicit Excel integration and financial-modeling positioning. Opus adds accuracy when the tables and analysis get harder.
PPT / presentations / decks	GPT-5.4	Opus 4.6	Refs: GPT-5.4 OfficeQA 68.1 / Opus OSWorld-Verified 72.7. GPT-5.4 explicitly claims improved presentation quality. Opus is also strong for detail-heavy financial decks.
Long documents / PDFs / charts / tables	Gemini 3.1 Pro	Sonnet 4.6	Refs: Sonnet Pace 94. Gemini is best for massive multimodal context. Sonnet is strong at reading enterprise documents (OfficeQA-style tasks).
Legal / contracts / terms / policy analysis	Opus 4.6	GPT-5.4	Refs: Opus Terminal-Bench 2.0 65.4 / GPT-5.4 GDPval 83.0. High-stakes. The final judgment requires lawyer review. Always record the exact primary sources and their dates.
Medical / clinical literature / research summaries	Gemini 3.1 Pro	GPT-5.4	Refs: GPT-5.4 GDPval 83.0. High-stakes. Use models for literature search and comparative summaries only. Humans make diagnosis, treatment, and prescribing decisions.
Finance / analysis / valuation / diligence	GPT-5.4	Opus 4.6	Refs: GPT-5.4 OfficeQA 68.1 / Opus OSWorld-Verified 72.7. GPT is strong for Excel and finance workflows; Opus is strong for multi-step analysis and document quality.
OpenClaw main agent	GPT-5.4	Sonnet 4.6	Refs: GPT-5.4 GDPval 83.0 / Sonnet Pace 94. Use a broad generalist as the lead, keep Codex as the implementation worker, and use Opus as the auditor.
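
The use-case recommendations can be encoded as a small lookup so that a script or agent harness picks a model consistently. The following is a minimal sketch, not a vendor API: `Route`, `ROUTES`, `HIGH_STAKES`, and `pick_model` are hypothetical names, only a subset of the rows is included, and the model strings are display labels rather than real API model identifiers.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical helper types and names; nothing here comes from a vendor SDK.
@dataclass(frozen=True)
class Route:
    first: str   # first-choice model for the use case
    second: str  # fallback model

# A subset of the recommendation table, keyed by an illustrative use-case slug.
ROUTES = {
    "consultation": Route("GPT-5.4", "Sonnet 4.6"),
    "research": Route("GPT-5.4", "Gemini 3.1 Pro"),
    "dev_execution": Route("GPT-5.3-Codex xhigh", "Sonnet 4.6"),
    "risky_change_review": Route("Opus 4.6", "GPT-5.4"),
    "long_documents": Route("Gemini 3.1 Pro", "Sonnet 4.6"),
    "legal": Route("Opus 4.6", "GPT-5.4"),
    "medical": Route("Gemini 3.1 Pro", "GPT-5.4"),
    "finance": Route("GPT-5.4", "Opus 4.6"),
}

# Domains where a human must own the final decision.
HIGH_STAKES = {"legal", "medical", "finance"}

# The guide's "when in doubt" default.
DEFAULT_MODEL = "GPT-5.4"

def pick_model(use_case: str, first_available: bool = True) -> Tuple[str, bool]:
    """Return (model label, needs_human_review) for a use case."""
    route = ROUTES.get(use_case)
    if route is None:
        # Unknown use case: fall back to the broad default.
        return DEFAULT_MODEL, False
    model = route.first if first_available else route.second
    return model, use_case in HIGH_STAKES
```

Calling `pick_model("legal")` returns the Opus label together with a flag that the output needs human review, which keeps the high-stakes rule attached to the routing decision itself.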

How scores are used

OpenAI references here assume `xhigh` where applicable. Representative values are GPT-5.4 at GDPval 83.0 / OfficeQA 68.1 / SWE-bench Pro 57.7, Codex at Terminal-Bench 2.0 77.3 / SWE-bench Pro 56.8, Sonnet at Pace 94 / Box +15pt, and Opus at Terminal-Bench 2.0 65.4 / OSWorld-Verified 72.7. Gemini is selected mainly for its 1M context and multimodal fit rather than a single public benchmark line.

Benchmark methodology overview

Key evaluation methodologies behind the scores each vendor publishes.

Benchmark	Description	Domain	Score
AA Intelligence Index v4.0	Run by Artificial Analysis. Composite of 10 benchmarks, including GDPval-AA, τ²-Bench, Terminal-Bench Hard, SciCode, and GPQA Diamond	Reasoning, knowledge, math, coding	Points
GDPval	1,320 real-work tasks from 44 occupations across the top 9 US GDP industries. Written by experts averaging 14 years of experience. ICLR 2026 poster. Published by OpenAI	Knowledge work	% (wins or ties vs experts)
GDPval-AA	The GDPval dataset evaluated in an agentic environment (shell + browser). Elo derived from blind pairwise comparisons	Same (agentic version)	Elo
SWE-bench Verified	Bug fixes from real OSS repositories. 500 human-verified tasks. Measures resolve rate	Software engineering	%
SWE-bench Pro	Harder benchmark run by Scale AI. Requires more complex changes	Software engineering (hard)	%
GPQA Diamond	198 PhD-level science multiple-choice questions, limited to items that experts answer correctly and non-experts fail. Random guessing scores ~25%	Physics, chemistry, biology	%
OSWorld-Verified	Desktop tasks performed via screenshots plus keyboard and mouse. Human success rate 72.4%	GUI, browser, app coordination	%
Terminal-Bench 2.0	Success rate on terminal tasks such as file operations, command execution, and debugging	CLI, debugging, file management	%
ARC-AGI-2	Abstract reasoning on novel logic patterns. Humans solve all tasks; pure LLMs score 0%. Cost-accuracy tradeoff is also measured	Pattern recognition, generalization	%
BrowseComp	Accuracy of information understanding and extraction through web browsing	Browsing, information extraction	%
HLE	Humanity's Last Exam. The hardest questions across all academic fields, evaluated without external tools	All academic fields	%

Benchmark comparison by effort level

Effort level naming

OpenAI: none → low → medium → high → xhigh
Anthropic: non-reasoning (low/high) → adaptive reasoning (max)
Google: low → medium → high → max

These levels control how much computation the model spends on reasoning. Higher effort improves accuracy but also increases cost and latency.
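
The three naming ladders can be kept in one ordered structure, which makes it easy to ask "what is the next level up?" when a task fails at a lower effort. This is a minimal sketch under stated assumptions: `EFFORT_LADDERS`, `highest_effort`, and `step_up` are hypothetical names, and the strings are the labels used in this guide, not guaranteed API parameter values.

```python
from typing import Optional

# Effort labels exactly as listed in this guide, ordered from least to most
# reasoning. These are descriptive labels, not verified API parameter values.
EFFORT_LADDERS = {
    "openai": ["none", "low", "medium", "high", "xhigh"],
    "anthropic": ["non-reasoning (low/high)", "adaptive reasoning (max)"],
    "google": ["low", "medium", "high", "max"],
}

def highest_effort(vendor: str) -> str:
    """Return the top rung of a vendor's ladder."""
    return EFFORT_LADDERS[vendor][-1]

def step_up(vendor: str, level: str) -> Optional[str]:
    """Return the next higher effort level, or None if already at the top."""
    ladder = EFFORT_LADDERS[vendor]
    index = ladder.index(level)  # raises ValueError for an unknown label
    if index + 1 < len(ladder):
        return ladder[index + 1]
    return None
```

Because each step up costs more tokens and latency, a harness built on this would typically start low and call `step_up` only after a failed attempt.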

Cross-model comparison at highest effort

Benchmark	GPT-5.4 (xhigh)	Codex-5.3 (xhigh)	Opus 4.6 (max)	Sonnet 4.6 (max)	Gemini 3.1 Pro (high)
AA Intelligence Index v4.0	57	54	53	52	57
GDPval (%)	83.0	70.9	—	—	—
GDPval-AA (Elo)	—	—	1606	1633	~1317
SWE-bench Verified (%)	—	—	80.8	79.6	80.6
SWE-bench Pro (%)	57.7	56.8	—	—	—
GPQA Diamond (%)	92.8	—	91.3	89.9	94.3
OSWorld-Verified (%)	75.0	64.7	72.7	72.5	—
Terminal-Bench 2.0 (%)	—	77.3	65.4	—	68.5
ARC-AGI-2 (%)	83.3 ★	—	68.8	58.3	77.1
BrowseComp (%)	82.7	—	—	—	85.9
HLE (%)	—	—	—	—	44.4

Bold = best in row. ★ = GPT-5.4 Pro (highest reasoning tier). "—" = not published or not evaluated on that benchmark. Scaffolds and tool environments differ across vendors, so compare across columns with care. SWE-bench Verified and SWE-bench Pro are different benchmarks, so their numbers cannot be compared directly.
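
To get a feel for what an Elo gap on GDPval-AA means, the ratings can be converted into an expected head-to-head preference rate. The sketch below uses the conventional logistic Elo formula with a 400-point scale; that scale is an assumption, since this guide does not state how Artificial Analysis parameterizes its Elo, and `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    The 400-point scale is the conventional default and is assumed here;
    the leaderboard's actual parameterization is not given in this guide.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

# Ratings taken from the GDPval-AA row of the table above.
sonnet_vs_opus = elo_expected_score(1633, 1606)  # roughly 0.54 under this assumption
```

Under that assumption, a 27-point gap corresponds to only a slight preference in pairwise comparisons, which is a useful reminder not to over-read small Elo differences.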

Performance variation by effort level

Model	Effort level	AA Int. Index	GPQA Diamond	Token usage	Note
GPT-5 (initial)	high	68 ‡	—	82M	23× token gap between high and minimal
GPT-5 (initial)	medium	67 ‡	—	—	Roughly o3 level
GPT-5 (initial)	low	64 ‡	—	—	Between DeepSeek R1 and o3
GPT-5 (initial)	minimal	44 ‡	—	3.5M	Roughly GPT-4.1 level
GPT-5.4	xhigh	57	92.8%	—	Only xhigh published
Codex-5.3	xhigh	54	—	—	Only xhigh published
Opus 4.6	adaptive, max	53	91.3%	160M	Peak accuracy. Very high token usage
Opus 4.6	non-reasoning, high	46	—	—	No reasoning. −7pt vs max
Sonnet 4.6	adaptive, max	52	89.9%	—	3× the tokens of Sonnet 4.5. 80% cheaper than Opus
Sonnet 4.6	non-reasoning, low	43	74.1%	—	Fastest. GPQA −15.8pt vs max
Gemini 3.1 Pro	thinking, high	57	94.3%	57M	Primarily reported at high. Medium is newly added

How to read this data

‡ marks values from Intelligence Index v3 or earlier. Their scale differs from v4.0, so they cannot be compared directly, but the trend (−24pt from high to minimal) still illustrates the impact of effort. GPT-5.4 and Codex-5.3 have official scores published only at xhigh. The Claude 4.6 models show a large gap between non-reasoning and adaptive reasoning (max), especially on GPQA Diamond, where Sonnet gains +15.8pt. Gemini 3.1 Pro introduced a medium thinking level, but per-benchmark scores at lower effort are not yet public. Overall, raising effort improves accuracy but increases token consumption 3–23×, so the cost decision matters.
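
The deltas quoted above follow directly from the table values, and checking them in code makes the cost side of the tradeoff concrete. This is a small worked example using only numbers from this guide; the helper names `token_ratio`, `point_delta`, and `tokens_per_point` are hypothetical.

```python
def token_ratio(high_tokens: float, low_tokens: float) -> float:
    """How many times more tokens the higher effort level consumed."""
    return high_tokens / low_tokens

def point_delta(high_score: float, low_score: float) -> float:
    """Score difference between two effort levels, rounded to one decimal."""
    return round(high_score - low_score, 1)

def tokens_per_point(high_tokens: float, low_tokens: float,
                     high_score: float, low_score: float) -> float:
    """Extra tokens spent per additional score point gained."""
    return (high_tokens - low_tokens) / (high_score - low_score)

# GPT-5 (initial), high vs minimal: token usage in millions, v3-era index points.
gpt5_ratio = token_ratio(82.0, 3.5)                       # about 23.4x
gpt5_index_drop = point_delta(68, 44)                     # 24 points
gpt5_cost_per_point = tokens_per_point(82.0, 3.5, 68, 44) # about 3.3M tokens per point

# Sonnet 4.6 on GPQA Diamond, adaptive max vs non-reasoning low.
sonnet_gpqa_gain = point_delta(89.9, 74.1)                # 15.8 points

# Opus 4.6 on the AA Intelligence Index, adaptive max vs non-reasoning high.
opus_index_gap = point_delta(53, 46)                      # 7 points
```

The tokens-per-point figure is only indicative, because the gain is not linear: most of the 24-point difference appears between minimal and low, while low to high adds just 4 points.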

High-stakes guardrails

Rule 1

In legal, medical, and financial workflows, do not let the model produce the final conclusion alone. Assume review by an accountable owner and a qualified professional.

Rule 2

Always return to primary, official, or original source documents, and keep the source URL plus the date you checked it.

Rule 3

Use the model for drafting, comparison, issue spotting, table cleanup, and checklist generation. Keep final decision ownership with humans.
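
The three rules can be enforced mechanically as a release gate that refuses to mark a high-stakes draft as final until sources and a reviewer are recorded. The following is a minimal sketch, not a prescribed implementation: `SourceRef`, `Draft`, `release_blockers`, and the domain slugs are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Hypothetical domain slugs matching the guide's high-stakes areas.
HIGH_STAKES_DOMAINS = {"legal", "medical", "finance"}

@dataclass
class SourceRef:
    url: str          # where the primary or official source lives
    checked_on: date  # the date a human verified it (Rule 2)

@dataclass
class Draft:
    domain: str
    text: str
    sources: List[SourceRef] = field(default_factory=list)
    reviewed_by: Optional[str] = None  # accountable human reviewer (Rule 1)

def release_blockers(draft: Draft) -> List[str]:
    """Return reasons the draft cannot be released; an empty list means clear."""
    blockers: List[str] = []
    if draft.domain in HIGH_STAKES_DOMAINS:
        if not draft.sources:
            blockers.append("no primary source recorded")
        if not draft.reviewed_by:
            blockers.append("no accountable human reviewer")
    return blockers
```

The gate only checks that the record exists. It cannot judge whether the review was adequate, so decision ownership still sits with the named reviewer, as Rule 3 requires.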

Two-CLI operating rule

GPT-5.4 CLI

Assign requirements, design, diff review, research, and documentation here.

GPT-5.3-Codex CLI

Assign implementation, tests, CI/CD, debugging, and fix loops here.

Shared rule

Do not edit the same file concurrently. Keep one handoff file. GPT acts as the coordinator and Codex as the builder.
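
One way to apply the shared rule is to record file ownership in the handoff file itself, so each CLI session checks before editing. This is a minimal sketch under stated assumptions: the JSON layout, the `claim` and `release` helpers, and the owner labels are hypothetical, and the read-modify-write here is not atomic, so it guards against overlapping work rather than true simultaneous writes.

```python
import json
from pathlib import Path

def _load(handoff: Path) -> dict:
    """Read the handoff state, or start empty if the file does not exist yet."""
    if handoff.exists():
        return json.loads(handoff.read_text())
    return {"claims": {}}

def claim(handoff: Path, file_path: str, owner: str) -> bool:
    """Try to claim a file for one CLI. Returns False if another owner holds it."""
    state = _load(handoff)
    holder = state["claims"].get(file_path)
    if holder is not None and holder != owner:
        return False
    state["claims"][file_path] = owner
    handoff.write_text(json.dumps(state, indent=2))
    return True

def release(handoff: Path, file_path: str, owner: str) -> None:
    """Release a claim, but only if this owner actually holds it."""
    state = _load(handoff)
    if state["claims"].get(file_path) == owner:
        del state["claims"][file_path]
        handoff.write_text(json.dumps(state, indent=2))
```

In practice the coordinator session would claim design documents while the builder session claims source files, and each releases its claim when it hands work back.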

References

OpenAI: Introducing GPT-5.4
OpenAI: Introducing GPT-5.3-Codex
OpenAI: ChatGPT for Excel and financial data integrations
OpenAI: OpenAI API Pricing
Anthropic: Claude Sonnet 4.6
Anthropic: Claude Opus 4.6
Google DeepMind: Gemini 3.1 Pro model card
Google: Gemini API pricing
Artificial Analysis: LLM Leaderboard & Intelligence Index v4.0
Artificial Analysis: GDPval-AA Leaderboard
OpenAI / ICLR 2026: GDPval: Evaluating AI on Real-World Tasks
SWE-bench: SWE-bench Leaderboards
Scale AI: SWE-Bench Pro Leaderboard
ARC Prize: ARC-AGI-2 Leaderboard
Terminal-Bench: Terminal-Bench 2.0 Leaderboard

Note

The rankings above are operational recommendations inferred from each vendor's official information, public benchmarks, pricing, and use-case fit as of March 6, 2026. They are not an absolute ranking derived from a single standardized benchmark.
