OpenAI Jalapeño: Custom AI Inference Chip | TechPillow

OpenAI Unveils Jalapeño, Its First Custom AI Inference Chip

OpenAI and Broadcom Unveil Jalapeño on 24 June 2026

OpenAI and Broadcom announced Jalapeño on 24 June 2026, marking the first time OpenAI has designed a custom AI accelerator rather than procuring commercially available GPUs from Nvidia or other suppliers. Jalapeño is an application-specific integrated circuit built specifically for large language model inference — the step of running a trained model to generate responses for users. Early internal testing indicates the chip delivers performance per watt substantially better than current commercially available hardware. OpenAI has stated a target of approximately 50 per cent lower cost per token compared with running equivalent inference workloads on Nvidia GPU clusters. The first prototype systems are slated for deployment by the end of 2026, with production scaling through 2027 and into 2028.

Nine Months From a Blank Slate to Tape-Out

Jalapeño went from initial design to manufacturing tape-out in nine months — a pace that Broadcom's silicon implementation team and the broader semiconductor industry have described as among the fastest ever achieved for a high-performance ASIC at this scale. Two factors made the compressed timeline possible. The first was deep software-hardware co-development, with OpenAI's engineering teams contributing model-level knowledge about inference workload characteristics directly into the chip design process. The second was the use of OpenAI's own models to accelerate parts of the chip design and optimisation work itself. This makes Jalapeño one of the first documented cases of large language models being used to improve the design of the hardware that runs large language models — closing a feedback loop between AI software and AI silicon.

Architecture: Reticle-Sized With Eight HBM Stacks

Jalapeño uses a systolic array architecture — a grid of processing elements that pass data from cell to cell in rhythmic lockstep, a design well-suited to the dense matrix multiplications that dominate LLM inference workloads. The chip is reticle-sized, meaning it occupies the maximum die area achievable in a single lithography exposure at TSMC's 3-nanometre process node. Eight stacks of high-bandwidth memory are integrated directly onto the package, reducing the distance data must travel between memory and compute and cutting both the latency and the power cost of moving the enormous parameter sets that frontier models require during inference.
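To make the systolic array idea concrete, here is a minimal software sketch of one common variant, an output-stationary array, in which each processing element keeps a running sum while operands hop one cell per cycle. It is illustrative only: Jalapeño's actual dataflow, array dimensions, and number formats have not been disclosed, and the function name and register layout below are assumptions made for the example.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    Hypothetical teaching model, not Jalapeño's real design.
    PE (i, j) accumulates C[i][j]; values of A move right and values
    of B move down, one hop per cycle, with skewed edge inputs.
    """
    n, depth, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]    # per-PE accumulators
    a_reg = [[0] * m for _ in range(n)]  # A operand currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B operand currently held by each PE

    # The last product reaches the bottom-right PE at cycle n + m + depth - 3.
    for t in range(n + m + depth - 2):
        # Shift A one cell to the right (walk backwards to avoid overwriting).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        # Shift B one cell down.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges: row i of A is delayed by i cycles,
        # column j of B is delayed by j cycles.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < depth else 0
        for j in range(m):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < depth else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19, 22], [43, 50]]
```

The point of the skewed inputs is that PE (i, j) sees A[i][k] and B[k][j] in the same cycle without any global memory access: each operand is fetched once at the edge and then reused as it passes through the grid, which is where the data-movement savings of this style of architecture come from.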

The design makes a deliberate specialisation trade-off. Unlike a GPU, which is a general-purpose parallel processor built to run graphics, simulation, training, and inference workloads, Jalapeño is optimised entirely for inference on large transformer models. That specialisation is what enables the claimed performance-per-watt advantage over GPUs: it is a chip that does one thing and aims to do it as efficiently as silicon physics allows.

The Cost Argument: Why OpenAI Is Building Its Own Silicon

Nvidia's data-centre GPUs, most recently the H100 and B200 series, have been the primary compute substrate for LLM inference since the generative AI boom took off in 2023. In 2026, Nvidia GPUs remain extremely capable, but they carry a high acquisition cost, a substantial power footprint, and a profit margin that Nvidia retains rather than passing downstream to AI service providers. For OpenAI, which runs inference at a scale where compute is among its largest operating costs, the economic case for custom silicon is straightforward: if Jalapeño delivers its targeted 50 per cent cost-per-token reduction, the savings at OpenAI's inference volume are large enough to fund the chip development programme many times over.

The 50 per cent figure is self-reported: it is based on OpenAI's internal benchmarks against workloads of its own choosing, with no disclosed comparison baselines and no independent third-party verification. The realistic production advantage may be narrower, particularly as Nvidia's next-generation Rubin-architecture accelerators enter the market in 2027. However, even a 20 to 30 per cent structural reduction in cost per token at the inference volumes OpenAI runs would represent a significant shift in the company's unit economics. Jalapeño would also make OpenAI the first AI company outside the hyperscalers to deploy purpose-built inference silicon at production scale, joining in-house programmes such as Google's TPUs, AWS's Trainium and Inferentia, and Microsoft's Maia.
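The arithmetic behind the cost argument can be sketched in a few lines. Every input below is a hypothetical placeholder chosen to keep the numbers round: OpenAI has not disclosed its token volume, its baseline cost per token, or the cost of the Jalapeño programme. Only the 50, 30, and 20 per cent scenarios come from the figures discussed above.

```python
def annual_inference_cost(tokens_per_year, cost_per_million_tokens):
    """Total yearly inference spend for a given volume and unit cost."""
    return tokens_per_year / 1_000_000 * cost_per_million_tokens


def annual_savings(tokens_per_year, cost_per_million_tokens, reduction_pct):
    """Yearly savings if cost per token falls by reduction_pct per cent."""
    baseline = annual_inference_cost(tokens_per_year, cost_per_million_tokens)
    return baseline * reduction_pct / 100


def payback_multiple(savings_per_year, programme_cost):
    """How many times one year of savings covers the programme cost."""
    return savings_per_year / programme_cost


# Hypothetical inputs, NOT reported figures.
TOKENS_PER_YEAR = 1_000_000_000_000_000   # assumed volume: 10^15 tokens
BASELINE_COST_PER_MILLION = 2.0           # assumed GPU cost, currency units
PROGRAMME_COST = 500_000_000              # assumed chip programme cost

for pct in (50, 30, 20):
    saved = annual_savings(TOKENS_PER_YEAR, BASELINE_COST_PER_MILLION, pct)
    multiple = payback_multiple(saved, PROGRAMME_COST)
    print(f"{pct}% reduction: saves {saved:,.0f} per year, "
          f"{multiple:.1f}x the assumed programme cost")
```

Under these made-up inputs the savings scale linearly with the reduction percentage, so the conclusion is sensitive mainly to two unknowns: the real baseline cost per token and how much of the headline 50 per cent survives in production.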

What the Jalapeño Chip Means for AI Products Built in India

For Indian technology teams and product companies, the most direct relevance of Jalapeño is through its downstream effect on inference costs. If OpenAI's custom silicon reduces the cost of running GPT-5 and future models, those savings have the potential to translate into lower API pricing for developers and lower compute costs for enterprise customers over the 2027-2028 period. Indian startups and scale-up teams that currently size their AI product architecture around token budgets would benefit from cheaper inference in the same way that decreasing cloud storage costs reshaped the economics of data-heavy product development in the 2010s.
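For a team that sizes its product around a token budget, the effect of a price change is easy to model. The sketch below assumes a fixed monthly budget, a flat price per million tokens, and an average number of tokens per request. All three values, and the idea that any given share of OpenAI's cost savings reaches API prices, are assumptions for illustration rather than announced pricing.

```python
def discounted_price(price_per_million_tokens, price_cut_pct):
    """Price per million tokens after a hypothetical price cut."""
    return price_per_million_tokens * (100 - price_cut_pct) / 100


def requests_per_month(monthly_budget, price_per_million_tokens,
                       tokens_per_request):
    """Whole requests a fixed budget covers at a given token price."""
    tokens_affordable = int(monthly_budget * 1_000_000
                            / price_per_million_tokens)
    return tokens_affordable // tokens_per_request


# Hypothetical inputs, NOT real OpenAI pricing.
MONTHLY_BUDGET = 1000        # currency units per month
CURRENT_PRICE = 10           # currency units per million tokens
TOKENS_PER_REQUEST = 2000    # average prompt plus completion size

for cut in (0, 30, 50):
    price = discounted_price(CURRENT_PRICE, cut)
    served = requests_per_month(MONTHLY_BUDGET, price, TOKENS_PER_REQUEST)
    print(f"{cut}% price cut: {served:,} requests per month")
```

Because request capacity is inversely proportional to price, a 50 per cent price cut doubles the number of requests the same budget covers, while a 30 per cent cut raises it by a little over 40 per cent.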

For teams evaluating whether to build on proprietary API models or invest in running open-weight models on their own infrastructure, Jalapeño is also a signal about the direction of the broader AI silicon market. Google, AWS, Microsoft, and now OpenAI are all fielding custom AI accelerators, a trend that is likely to progressively widen the performance and cost-per-token gap between foundation model providers running purpose-built silicon and organisations attempting to replicate comparable capability on general-purpose GPU clusters. That gap would make the long-term economics of competing on raw model capability increasingly challenging for anyone without access to proprietary silicon at scale.

The Bottom Line

OpenAI and Broadcom unveiled Jalapeño on 24 June 2026 — OpenAI's first custom AI inference chip, manufactured on TSMC's 3-nanometre process with eight HBM stacks integrated on-package and a systolic array architecture optimised specifically for large language model inference. The chip went from blank slate to tape-out in nine months, partly by using OpenAI's own models to accelerate the design process. OpenAI targets approximately 50 per cent lower cost per token compared with Nvidia GPU inference, with prototype deployment by end of 2026 and production scaling through 2027-2028. For Indian product teams building on the OpenAI API, the most direct near-term implication is the potential for lower inference costs downstream as Jalapeño enters production scale.

Frequently Asked Questions

What is OpenAI's Jalapeño chip and what does it do?

Jalapeño is a custom AI inference accelerator that OpenAI co-developed with Broadcom, announced on 24 June 2026. It is an application-specific integrated circuit manufactured on TSMC's 3-nanometre process node, with a reticle-sized compute die and eight stacks of high-bandwidth memory integrated directly on the package. The chip uses a systolic array architecture optimised specifically for the matrix multiplication operations that dominate large language model inference workloads, and is designed entirely for inference rather than training.

How long did it take to design the Jalapeño chip and what made that speed possible?

OpenAI and Broadcom went from initial design to manufacturing tape-out in nine months, which Broadcom has described as one of the fastest ASIC development cycles ever achieved at this performance tier. Two factors enabled the pace: deep software-hardware co-development between OpenAI's model engineers and Broadcom's silicon teams, and the use of OpenAI's own large language models to accelerate parts of the chip design and optimisation process — making Jalapeño one of the first documented cases of LLMs improving the design of the hardware on which LLMs run.

How does Jalapeño compare to Nvidia GPUs in cost and performance?

Based on its internal benchmarks, OpenAI is targeting approximately 50 per cent lower cost per token than running equivalent inference workloads on Nvidia GPU clusters, along with performance per watt substantially better than current commercially available hardware. The chip reportedly outperforms AMD's Instinct MI350-series and Nvidia's Blackwell-based accelerators on the workloads OpenAI tested. These figures are self-reported, without independent third-party verification, and the competitive position will shift as Nvidia's Rubin-architecture accelerators enter the market in 2027.

When will the Jalapeño chip be deployed and what does this mean for API pricing?

Prototype Jalapeño systems are targeted for deployment by the end of 2026, with production scaling through 2027 and into 2028. If Jalapeño delivers its targeted cost-per-token reduction, the downstream economics could translate to lower OpenAI API pricing for developers over the 2027-2028 period. Teams currently sizing their AI product architecture around token budgets should monitor OpenAI's infrastructure announcements over the next 18 months, as the production ramp of Jalapeño could meaningfully shift the unit economics of building on the GPT-5 model family.

Written by

TechPillow Team

Sharing insights on technology, product development, and the Indian tech ecosystem.
