AI4Bharat, a research lab at the Indian Institute of Technology (IIT) Madras, has introduced a new, open-source benchmark specifically designed to assess the performance of large language models (LLMs) on Indian languages, as well as on Indian context and safety.
Developed with support from Google Cloud, the Indic LLM-Arena benchmark is a crowd-sourced platform that evaluates LLMs on the basis of votes cast by thousands of anonymous users. The models are then ranked on a “human-in-the-loop” leaderboard, AI4Bharat said in a blog post on Monday, November 10.
Currently, Indic LLM-Arena supports only text-based inputs across multiple Indian languages and code-mixed scenarios. However, AI4Bharat said it plans to expand the benchmark to cover omni models with vision and audio capabilities, as well as AI agents.
“Evaluation is not merely about translating 22 scheduled languages. It is about understanding the natural, fluid way Indians communicate. This includes code-switching (e.g., Hinglish or Tanglish), where users mix multiple languages in a single sentence,” the research lab said.
All anonymised data, code, and pipelines will be released under an open-source licence for community inspection and extension, the AI research lab added.
AI4Bharat’s Indic LLM-Arena comes after several Indian AI developers repeatedly highlighted the lack of local benchmarks for evaluating and comparing the performance of Indic LLMs. Last week, OpenAI launched its own benchmark, called IndQA, which is designed to test a model’s linguistic ability as well as its grasp of Indian cultural context across domains.
The IndQA benchmark comprises 2,278 questions across 12 languages and 10 cultural domains, compiled in partnership with 261 experts from across India, as per the AI startup.
With Indic LLM-Arena, AI4Bharat envisions that startups and researchers will be able to see precisely how their models perform against others on Indic-specific use-cases and languages. “Businesses across domains can use this data to make informed decisions about which models to adopt, mitigating risk and accelerating the deployment of AI that serves their customers,” the organisation said.
How Indic LLM-Arena Works
Stating that the benchmark was inspired by platforms such as lmarena, AI4Bharat said it took a fair, blind, side-by-side comparison approach to developing Indic LLM-Arena.
– First, a user enters a prompt in any language or mix of languages.
– Next, the platform presents responses from two anonymous LLMs (Model A and Model B, for instance). AI4Bharat said the identities of the models are hidden to prevent provider bias.
– The user then votes for the AI-generated response that they find superior or flags the interaction as a tie.
– After thousands of such user-voted battles, AI4Bharat said it uses the Bradley-Terry statistical model to rank the models based on their performance on real-world Indian prompts (a simplified sketch of this ranking approach follows below).
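AI4Bharat has not detailed its exact ranking pipeline, but the Bradley-Terry model it cites is a standard one: each model is assigned a latent "strength" score, and the probability that Model A beats Model B in a battle is strength_A / (strength_A + strength_B). Strengths are fitted to the observed vote outcomes. The Python sketch below illustrates this with Zermelo's classic iterative fitting procedure; the model names and vote tallies are invented for illustration, and ties are ignored for simplicity.

```python
# Minimal, illustrative Bradley-Terry fit over hypothetical vote data.
# Model names and counts are made up; this is not AI4Bharat's actual pipeline.

# wins[(a, b)] = number of battles in which model `a` beat model `b`
wins = {
    ("model_a", "model_b"): 70, ("model_b", "model_a"): 30,
    ("model_a", "model_c"): 40, ("model_c", "model_a"): 60,
    ("model_b", "model_c"): 45, ("model_c", "model_b"): 55,
}

models = sorted({m for pair in wins for m in pair})
strength = {m: 1.0 for m in models}  # initial skill estimates

# Zermelo's iterative algorithm for the Bradley-Terry maximum-likelihood fit:
# P(i beats j) = s_i / (s_i + s_j)
for _ in range(100):
    new = {}
    for i in models:
        total_wins = sum(w for (a, b), w in wins.items() if a == i)
        denom = 0.0
        for j in models:
            if j == i:
                continue
            # total battles between i and j, regardless of who won
            n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
            denom += n_ij / (strength[i] + strength[j])
        new[i] = total_wins / denom if denom else strength[i]
    norm = sum(new.values())  # normalise so strengths sum to 1
    strength = {m: s / norm for m, s in new.items()}

# Leaderboard: models ranked by fitted strength, highest first
for m, s in sorted(strength.items(), key=lambda x: -x[1]):
    print(f"{m}: {s:.3f}")
```

In practice, arena-style leaderboards also attach confidence intervals to these fitted scores, which is why a large volume of votes is needed before stable rankings can be published.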
In the coming months, AI4Bharat said it will publish an updated public leaderboard once enough votes have been collected to reduce the statistical uncertainty in the rankings. It also suggested introducing more granular leaderboards based on language, domain, task, and other criteria.

