India’s Gen AI Quest Needs More Than Just Data

Generative Artificial Intelligence has become the latest trend that’s grabbed technology companies worldwide by the throat, and the situation in India is no different. Meta has Llama-3, Google has Gemini and Microsoft has Copilot, to name just a few in the West. Currently, only two companies in India are working on creating such systems from the ground up: BharatGPT and Krutrim.

“I think when it comes to Gen AI, India is still in very early days,” said Madhav Krishna, founder and CEO at Vahan—a tech company that connects blue-collar workers with prospective employers.

Krishna cites the example of Vahan’s AI chatbot, which calls people who are signing up for jobs as delivery drivers to set up the time and date for their interviews with the likes of Swiggy, Zomato and other delivery aggregators.

Part of the challenge of building Gen AI for the Indian context is the cultural and linguistic diversity, and that’s without even accounting for caste or religion. “The key is to really solve for the dynamics that exist in this country,” said Krishna. “It’s something that a lot of international companies aren’t focusing on.”

One challenge that Vahan has solved is dealing with noise. Krishna highlighted that surroundings in a country like India tend to be very loud, and said his company’s AI has found a way to still respond appropriately in such a challenging environment.

Datasets And Other Issues

In a country with 121 languages and 270 mother tongues, gathering data can become an arduous affair. Companies working in AI, policymakers and technocrats broadly agree that the core problem is accumulating the right datasets in a country marked by its cultural and linguistic diversity.

“We’re lacking a lot in terms of diversity of datasets in India, specific to India, collecting data faster, improving processes to collect it is a challenge,” said Akbar Mohammed, head of Fractal Dimension, which is an interdisciplinary team that works to solve problems regarding things like responsible AI, sustainability, Gen AI adoption strategy and more.

AI In India: The Challenges

Because of India’s diversity, we don’t have enough datasets to build large language models on the same scale as OpenAI does with ChatGPT or Meta does with Llama-3. But why is that the case?

One factor is homogeneity. From a linguistic standpoint, North America by and large speaks one language, or at most three: English, Spanish and French. Compare that to India, where the language spoken varies from state to state.

“India is blessed in terms of diversity, but it also means that we don’t have enough data for accessing those dimensions of diversity,” said Mohammed. Part of it is the lack of diverse datasets; consider, for instance, building an LLM for what people in Nagaland speak versus building one for Hindi speakers.

The issue is further exacerbated by the fact that training a large language model needs vast amounts of data. Even as companies in India work towards building LLMs, questions about evaluation metrics and AI hallucinations are likely to come up. Hallucinations are incorrect or misleading results that an AI model presents as fact. They’re often caused by insufficient training data, or by biases in that data.

Because of these concerns, training AI in India with the data available is difficult. Even if data is available, evaluation is another challenge, according to Mohammed. “To train an AI model, you need good datasets to evaluate them. You need gold standard datasets.” Such datasets help companies check whether the models they are training are producing correct results or hallucinating.

Another issue is that while a large portion of India’s population uses the internet, the data they create might not be accessible to Indian companies working in AI. Tech companies like Meta, Alphabet’s Google and Amazon don’t necessarily host their user data locally.

Most recently, Bhavish Aggarwal, who runs Ola, announced that he would move the ride aggregator’s workload off Microsoft’s cloud computing platform, Azure, and onto Krutrim’s. The migration will not be instantaneous, and it remains to be seen if the platform can handle the workload.

However, efforts are under way to bridge this data gap and make things easier for companies in both the public and private sectors. The National Informatics Centre has set up the Open Government Data Platform India, which hosts government-owned shareable data and information that can be used to develop relevant applications.

Similarly, the first edition of the India AI report, released last year by the Ministry of Electronics and Information Technology, conceptualises the India Dataset Platform (IDP) and the Datasets Program. The new platform would encourage companies across the board to contribute their datasets to speed up the development of AI in the country. The IDP would become a repository of data from various sources, accessible to all. Various features have been proposed, including a pricing model, regulation and data sharing practices.

It has since been renamed the National Data Platform and is listed on the Open Government Data Platform India website as an upcoming program.

The Use Case-Led Approach

Not everyone is convinced that AI in India should be built exactly as it is in the West. “Building AI in India must be use case-led,” said Tanuj Bhojwani, head of People+ai, a non-profit organisation within the EkStep Foundation that is working towards making sure AI is beneficial for everybody in the country.

The problem of the computing power and expense involved in running AI persists. “Currently, the way Gen AI is done is still expensive. Compute and Graphics Processing Units need to be much cheaper,” said Bhojwani, adding that there is nuance in how to approach training models.

The IIT Bombay graduate said that India doesn’t necessarily need Gen AI or LLMs that are the same size as the models being created in the West. Instead, he suggested that building to solve localised problems is what will make India a frontrunner in the AI arms race. “Most jobs require a medium or small-sized AI model once an issue has been analysed by an LLM. When that’s done, it’s just a script that a smaller AI must follow,” he said.

He cites a use case at a hospital in Manipal, where women who have had premature births need to follow a checklist of tasks to ensure their child remains healthy. However, most people are likely to forget to follow such a list.

Here’s where AI as an intervention makes sense: a phone call by an AI to the mothers to follow up on whether they’re following the protocols.

How does this help? There’s a twofold benefit:

1) Mothers don’t need to risk bringing their premature babies out of their homes and wait in long queues at hospitals.
2) The patient load on doctors is reduced, since they don’t have to check up on every single individual, only on the more serious cases.

“It’s a standard script the AI has to follow, and if there are serious concerns, the doctor is alerted and that’s where human intervention happens. All this only costs Re 1,” said Bhojwani.

Several Indian AI companies already work on use case-led solutions, such as Project Udaan, which translates textbooks across Indian languages, and the Wadhwani Institute for Artificial Intelligence, which works on agriculture and health, among others.

Regulation And The Road Ahead

Data scientists across the world have recently stumbled across another problem. The supply of quality datasets required to train an LLM is finite, to the point that tech companies are consuming and parsing high-quality language data faster than the entire world is creating it.

Scientists believe that we’re just two years away from exhausting the high-quality data used to train LLMs, that is, by or even before 2026. By contrast, the reserve of low-quality language and image data is expected to be exhausted only by 2060 at the latest.

Of course, here in India, the challenges are different. One of them is how the government plans to regulate AI and the data that’s being created in the country. “The state’s view on data generated in India is that it’s a sovereign asset,” said Prateek Waghre, executive director at the Internet Freedom Foundation. “That’s going to play a role in the policy positions ultimately taken.” It’s part of the reason for the conceptualisation of, and push for, the India Dataset Platform and Datasets Program.

In the same vein, there is the Digital Personal Data Protection Act, which was notified in the gazette in August 2023 but is not yet operational because the rules needed to bring it into force haven’t been notified.

Two things stand out in the law. First, it doesn’t cover information scraping, raising questions around consent. Second, Section 3(C)2 of the Act states that data provided by an individual for personal or domestic purposes isn’t, in fact, covered by the DPDP Act.

So far, AI regulation has been somewhat touch-and-go, as seen after the deepfake incident involving actor Rashmika Mandanna late last year. This can be partly attributed to pushback from the industry itself, which claimed that stringent regulation could stifle innovation.

“There seems to be a lack of clarity in how the government is treating Gen AI versus machine learning use cases,” said Waghre, adding that it is justified to an extent because the technology is evolving quickly and has caught everyone by surprise. “They need to articulate their position sooner rather than later.”

Others, like Black Dot Public Policy Advisors’ Mandar Kagade, said that navigating intellectual property will play a role in the future, pointing to The New York Times’ lawsuit accusing OpenAI of copyright infringement. The outcome of the trial could have international ramifications, given that there is little transparency on what data is being used to train AI models and how it is being accessed.

“‘Where are you going to get datasets?’, ‘How are you getting those datasets?’, ‘Will you be taking consent, or rely on the fact that these are anonymised datasets’, are all questions we should be asking,” said Kagade.

The government is making efforts to scale up the development of AI in the country. On March 7, it approved the allocation of Rs 10,300 crore ($1.24 billion) to India’s AI Mission.

Most people in the industry are optimistic about AI and its future in India. Fractal Dimension’s Mohammed is convinced that India has the talent, and that a supportive entrepreneurial environment in the AI space is what is sorely needed in addition to a strong research-oriented effort. “We should be funding IITs specifically in AI-related areas. We have enough research institutes, but it has never been a strong point for us.”
