OpenAI has recently released open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. These models come in two variants: gpt-oss-120b and gpt-oss-20b. The smaller model can be run locally on consumer GPUs, while running gpt-oss-120b requires 80GB of VRAM and a more robust computing setup, which many users may not have.
To help, I have compiled a list of free and easy ways to access the larger model. You can run the model locally using Ollama, Transformers, or the vLLM framework. There are also free chat applications where you can test the model, as well as numerous fast and free large language model (LLM) inference providers that you can try. Additionally, you can visit the LLM marketplace to find the best options suited to your use case. Check out our separate guide on how to set up and run GPT-OSS locally with Ollama.
Open-Source Frameworks
OpenAI GPT-OSS is an open-weight model series that you can download and run on your own system, use locally to build applications, and integrate into your projects. The weights are released under the permissive Apache 2.0 license, which allows commercial use.
In this section, we will look at three ways to run the large GPT-OSS-120B model on consumer hardware.
1. Ollama
With Ollama, you don’t necessarily need a GPU; you can run the model entirely on your CPU and RAM. However, token generation will be significantly slower without GPU acceleration.
If you have one or more GPUs, you can offload some of the model's layers to them, which greatly improves generation speed. This does require some configuration to set up properly; a Python sketch of this appears after the setup steps below.

Source: gpt-oss
To get started, install Ollama on Linux:
curl -fsSL https://ollama.com/install.sh | sh
Then, run the following command to download and run the gpt-oss-120b model.
ollama run gpt-oss:120b
It is that simple.
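Once the model has been pulled, you can also call it from Python through Ollama's local REST API. The snippet below is a minimal sketch: it assumes the Ollama server is running on its default port (11434), and it uses the num_gpu option to offload part of the layers to the GPU, as discussed above; treat the exact value as something to tune (or drop entirely) for your hardware.
import requests

# Minimal sketch: chat with the local Ollama server (default port 11434)
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gpt-oss:120b",
        "messages": [{"role": "user", "content": "Explain what an open-weight model is."}],
        "stream": False,
        # Optional: offload roughly 30 layers to the GPU; tune (or remove) for your VRAM
        "options": {"num_gpu": 30},
    },
    timeout=600,
)
print(response.json()["message"]["content"])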
2. Transformers
With the Transformers library, you can download and load the model, run inference, fine-tune the model, and integrate it directly into your application.
For models with 120 billion parameters, you will need to implement model sharding across multiple GPUs. Additionally, you may need to use 8-bit or 4-bit quantization to fit the model into memory.
If your VRAM is still insufficient, you can offload some layers to the CPU. This requires more experience and technical knowledge, but it is manageable once you know the steps; a hedged sketch of this setup follows the basic example below.

Source: openai/gpt-oss-120b
Install the necessary packages:
pip install -U transformers kernels torch
Here is a usage example: create the text-generation pipeline, pass it a prompt, and run it to generate a response.
from transformers import pipeline
import torch

model_id = "openai/gpt-oss-120b"

# Create the text-generation pipeline; device_map="auto" spreads the model across available devices
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the best DataCamp course to learn AI?"},
]

# Generate a response and print the assistant's final message
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
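If you need the sharding and quantization setup described earlier, the following is a minimal sketch rather than a verified recipe for this particular checkpoint: it loads the model with device_map="auto" so Transformers can spread the layers across your GPUs (and spill the rest to CPU RAM), and optionally applies 4-bit quantization through bitsandbytes. It assumes the bitsandbytes and accelerate packages are installed, and whether 4-bit loading works with this specific checkpoint format is an assumption to verify.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "openai/gpt-oss-120b"

# Optional 4-bit quantization (assumption: bitsandbytes supports this checkpoint format)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" shards layers across available GPUs and offloads the rest to CPU RAM
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,  # remove this line to load the original weights
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)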
3. vLLM
vLLM is a high-performance inference engine specifically designed for fast and high-throughput text generation, making it perfect for production-level deployments.
It can be deployed in several ways: on a local server for development and testing, on a private cloud for internal company use, or by small companies and startups that want greater control and data privacy without relying on external APIs.

Source: GPT OSS - vLLM Recipes
Use uv pip for faster installation of Python packages.
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
Run the following command in your terminal to download the model and launch the vLLM server:
vllm serve openai/gpt-oss-120b
Once running, vLLM starts a local API server (default: http://localhost:8000) that you can query via HTTP requests and integrate directly into your applications.
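Because the vLLM server exposes an OpenAI-compatible API, a quick way to test it is to point the OpenAI Python SDK at the local endpoint. This is a minimal sketch that assumes the default port and that no API key has been configured (any placeholder string works in that case):
from openai import OpenAI

# Point the OpenAI SDK at the local vLLM server (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give me one tip for learning AI."}],
    max_tokens=200,
)
print(response.choices[0].message.content)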
Chat Applications
AI chat applications let you try out different models without any setup. All you need to do is visit the website and sign up.
In this section, we will explore the official gpt-oss.com app and T3 Chat, both of which provide free access to the new OpenAI GPT-OSS models.
4. GPT-OSS.com
Gpt-oss.com is the official website where you can test both of OpenAI's open-weight models. It was created in collaboration with Hugging Face, so you will need to sign in with your Hugging Face account to unlock the full capabilities of the models.

Source: Gpt-oss
The app is completely free with unlimited chats. You can experiment with different reasoning levels, switch between modes, and even ask it to build a complete web application from scratch. It’s super fast and requires no setup.
5. T3 Chat
I have been loving T3 Chat ever since I discovered it. It provides lightning-fast access to both open-source and proprietary models. Think of it as a ChatGPT-style application, but for all kinds of AI models, including image generation, vision-language models, and large language models.
Currently, T3 Chat offers free-tier access to both GPT-OSS 20B and GPT-OSS 120B. All you need to do is sign up and start enjoying full functional access to these models.

Source: t3.chat
The best part? Its clean, intuitive UI and the way your generated results are beautifully displayed.
LLM Inference Providers
Inference providers host large language models on their own servers and give users API or web access, allowing them to integrate these models directly into their applications without having to run an inference server on their own infrastructure. This makes it easy for developers to use powerful models without worrying about deployment, scaling, or maintenance.
In this section, we will explore some super-fast and free inference providers that you can use right now to access the 120B variant of the GPT-OSS model.
6. Cerebras
Cerebras is one of the fastest LLM inference providers in the world, capable of delivering up to 1,400 tokens per second with concurrent requests. This makes it an excellent choice for building real-time AI systems where speed is critical.
However, there are a few considerations:
- Pricing: High request rates require a paid plan. The free tier offers only basic usage.
- Accuracy: While Cerebras is extremely fast, its output quality may vary. Some inference providers optimize for speed at the cost of precision.

Source: cerebras.ai
Cerebras provides its own Python SDK, which we will use here. Install it with:
pip install --upgrade cerebras_cloud_sdk
After that, create the client using your API key, send a chat completion request, and stream the response.
import os
from cerebras.cloud.sdk import Cerebras

# Create the client; the API key is read from the CEREBRAS_API_KEY environment variable
client = Cerebras(
    # This is the default and can be omitted
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

# Stream a chat completion from the gpt-oss-120b model
stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Who was Leonardo Da Vinci?",
        }
    ],
    model="gpt-oss-120b",
    stream=True,
    max_completion_tokens=65536,
    temperature=1,
    top_p=1,
    reasoning_effort="high",
)

# Print the response token by token as it arrives
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
7. Groq
I am a big fan of Groq because they provide fast and affordable inference endpoints for popular open-source models. Currently, you can access the 120B variant of OpenAI’s GPT-OSS for free, though with a limited request rate.
You can try the model directly in Groq Studio before integrating it into your applications, making it easy to experiment and fine-tune your prompts.

Source: Groq
Groq is OpenAI compatible, meaning you can use the OpenAI Python SDK to access their models. However, Groq also offers its own optimized SDK, which is better tuned for their infrastructure, and that’s what we will use here.
Install the Groq SDK:
pip install groq
Create the Groq client with your API key, then create a chat completion using the model name and other arguments, and stream the response.
import os
from groq import Groq

# Create the Groq client using the GROQ_API_KEY environment variable
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

# Stream a chat completion from the gpt-oss-120b model
completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?",
        }
    ],
    temperature=1,
    max_completion_tokens=8192,
    top_p=1,
    reasoning_effort="medium",
    stream=True,
    stop=None,
)

# Print the response as it streams in
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
8. Baseten
Baseten is a powerful inference and training platform designed for deploying, scaling, and managing AI models with full control. The best part is that it is on par with Groq and Cerebras in terms of token generation speed.
When you sign up, you receive $5 in deployment credits and $1 in API credits, which you can use to try out the platform's services.

Source: Model API | Baseten
Baseten's API is OpenAI-compatible and it does not have its own SDK, so we will use the OpenAI Python SDK to access the model endpoint. To do this, install the OpenAI Python package:
pip install openai
Next, create the client using your API key and base URL. After that, you can create chat completions by specifying the model name and other parameters. Finally, you can generate streamed responses.
import os
from openai import OpenAI

# Baseten exposes an OpenAI-compatible endpoint, so the OpenAI SDK works directly
client = OpenAI(
    api_key=os.getenv("BASETEN_API_KEY"),
    base_url="https://inference.baseten.co/v1",
)

# Stream a chat completion from the gpt-oss-120b model
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {
            "role": "user",
            "content": "Implement Hello World in Python",
        }
    ],
    stop=[],
    stream=True,
    stream_options={
        "include_usage": True,
        "continuous_usage_stats": True,
    },
    top_p=1,
    max_tokens=1286,
    temperature=1,
    presence_penalty=0,
    frequency_penalty=0,
)

# Print only the content chunks, skipping usage-only chunks
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
LLM Marketplace
An LLM marketplace gives you access to both open-source and closed-source large language models in one place. The models are hosted by different providers and exposed through a single unified API, so you can choose the best option for cost, availability, and performance without managing multiple SDKs or endpoints.
In this section, we will review two of the most popular LLM marketplaces, OpenRouter and Requesty. Both platforms provide free access to the GPT-OSS 120B model.
9. OpenRouter
OpenRouter provides a unified API that gives you access to hundreds of AI models through a single endpoint. It automatically handles fallbacks and selects the most cost‑effective options for your requests.
You can experiment with different models using their LLM Playground, which allows you to test and compare outputs before integrating them into your workflow.

Source: gpt-oss-120b
Integrating OpenRouter is straightforward; simply change the base URL and provide your API key. The rest of the implementation remains the same as with the OpenAI API.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
)

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
print(completion.choices[0].message.content)
10. Requesty
Requesty is similar to OpenRouter, but it intelligently routes your requests across 170+ AI models, including GPT‑OSS 120B. This helps you:
- Reduce API costs
- Avoid SDK lock‑in
- Access advanced features like auto caching, load balancing, and fallback policies.
- Use a simple OpenAI‑compatible interface
You can explore models in the Chat menu, check the leaderboard, view your usage dashboard, and add direct integrations.
When you sign up, you receive $1 in free credits, allowing you to try even paid models.

Source: Requesty
To integrate Requesty, simply update the base URL, API key, and default headers in your OpenAI Python SDK code.
If you want to use a free model, make sure to select the ones provided by Groq.
import os
import openai

ROUTER_API_KEY = os.getenv("REQUESTY_API_KEY")

client = openai.OpenAI(
    api_key=ROUTER_API_KEY,
    base_url="https://router.requesty.ai/v1",
    default_headers={"Authorization": f"Bearer {ROUTER_API_KEY}"},
)

# Example request
response = client.chat.completions.create(
    model="groq/openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)

# Print the result
print(response.choices[0].message.content)
Conclusion
After a long wait, OpenAI has finally released open-weight models. This means you can now fine-tune them, run them locally, and have full control over output generation. You are in charge; you can even build products around them without sharing your data with OpenAI or paying for API access.
This is a big step forward for AI, and over time, we will see new and improved models built on top of OpenAI’s open-source models and frameworks. If you want to start experimenting now, you can:
- Run GPT-OSS 120B locally
- Access it in your browser using free chat applications
- Try fast and free LLM inference providers
- Explore LLM marketplaces for the best and fastest model access
If you’re just starting out with your AI journey, be sure to check out our AI Fundamentals skill track to cover topics including LLMs, ChatGPT, generative AI, AI ethics, and prompt engineering.

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.