Why SVO vs SOV matters in AI

🌍 SVO vs. SOV — Why Word Order Matters in AI

Most LLMs like GPT-4 are trained on SVO languages like English. But languages like Hindi, Tamil, Marathi, and Bengali follow SOV structure — where the verb comes last.

🔹 English (SVO): “I eat rice.”
🔹 Hindi (SOV): “मैं चावल खाता हूँ” (I rice eat)

This isn’t just grammar trivia. It’s a fundamental challenge in AI. LLMs trained predominantly on English struggle to understand or generate SOV languages fluently unless fine-tuned properly — or built from the ground up on Indic data.

This is why building truly multilingual LLMs isn’t just about translation — it’s about respecting the cognitive architecture of each language.

India’s linguistic diversity isn’t a bug — it’s a feature. Let’s build AI that reflects that.

#AI #LLM #NLP #IndicLanguages #MultilingualAI #GPT #OpenSource #LanguageTechnology #LLMinIndia #SVOvsSOV #BhashaAI
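
To make the word-order point concrete, here is a minimal Python sketch (my own toy illustration, not from the post) that permutes a single hand-parsed clause through all six basic orders:

```python
# A toy sketch: the six logically possible orders of one clause.
# The (subject, verb, object) triple is hand-parsed for illustration;
# real reordering needs a syntactic parser, not string shuffling.
from itertools import permutations

roles = {"S": "I", "V": "eat", "O": "rice"}

for order in permutations("SVO"):
    label = "".join(order)
    clause = " ".join(roles[r] for r in order)
    print(f"{label}: {clause}")

# SVO: I eat rice   (English)
# SOV: I rice eat   (Hindi: मैं चावल खाता हूँ)
# VSO: eat I rice   (e.g., Tagalog)
# VOS: eat rice I
# OSV: rice I eat
# OVS: rice eat I
```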

I was exploring the #sarvamai LLM and TTS models. It is a good start for India 🇮🇳, at least a start, but quite far from a usable, quality outcome. I wish they would spend the government funding on building the foundations rather than just forking another Mistral AI or Llama model.

The core challenge with AI language models is not just producing translations but truly understanding the syntax and semantics embedded in different word orders. SVO languages like English give models a structural logic that SOV languages like Hindi disrupt, which affects fluency in both generation and comprehension. In building multilingual capabilities, how can AI better handle these linguistic nuances? Could leveraging deep linguistic embeddings specific to each language improve adaptability?
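
One speculative reading of “language-specific embeddings”: add a learned per-language vector to every token embedding so downstream attention can condition on the language’s word-order regime. A toy PyTorch sketch; the class name, sizes, and approach are my own illustrative assumptions:

```python
import torch
import torch.nn as nn

class LanguageAwareEmbedding(nn.Module):
    """Toy sketch: token embeddings plus a learned per-language vector."""
    def __init__(self, vocab_size: int, num_langs: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.lang = nn.Embedding(num_langs, d_model)

    def forward(self, token_ids: torch.Tensor, lang_ids: torch.Tensor):
        # Broadcast one language vector across every position, so the
        # model can condition on the word-order regime of each input.
        return self.tok(token_ids) + self.lang(lang_ids).unsqueeze(1)

emb = LanguageAwareEmbedding(vocab_size=32000, num_langs=4, d_model=64)
tokens = torch.randint(0, 32000, (2, 10))   # batch of 2 sequences
langs = torch.tensor([0, 2])                # e.g. 0 = English, 2 = Hindi
print(emb(tokens, langs).shape)             # torch.Size([2, 10, 64])
```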

Siddhartha Mukherjee Interesting challenge... makes me wonder how much bias we bake in just by training on SVO-heavy data. Fine-tuning for SOV feels like just the start, right?

Siddhartha Mukherjee I am surprised that comes up as a challenge. In our experience at E2E Cloud, a large quantity of English training followed by RL on specific languages allows models, even LSTM models, to adapt to the SVO-to-SOV switch, and even to recognise grammatical gender in Hindi text (खाता vs. खाती).
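
As a toy illustration of what such a language-specific RL signal could look like (a hedged sketch of my own, not E2E Cloud's actual setup), here is a reward that favours verb-final Hindi outputs:

```python
# Hypothetical sketch: a toy reward for RL fine-tuning that favours
# verb-final (SOV) Hindi sentences. The tiny verb/auxiliary lexicons
# are illustrative assumptions; a real system would use a POS tagger.
HINDI_VERBS = {"खाता", "खाती", "खाते"}        # forms of "to eat"
AUXILIARIES = {"हूँ", "है", "हैं", "हो"}

def sov_reward(sentence: str) -> float:
    """Return 1.0 if the main verb is sentence-final (ignoring auxiliaries)."""
    tokens = sentence.strip().rstrip("।.").split()
    # Drop trailing auxiliaries, e.g. हूँ in "मैं चावल खाता हूँ".
    while tokens and tokens[-1] in AUXILIARIES:
        tokens.pop()
    return 1.0 if tokens and tokens[-1] in HINDI_VERBS else 0.0

print(sov_reward("मैं चावल खाता हूँ"))  # 1.0: verb-final, rewarded
print(sov_reward("मैं खाता चावल हूँ"))  # 0.0: verb not final, penalised
```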

Please elaborate on where word order is implied in LLMs based on the GPT architecture. My understanding is that LLM training (very roughly) builds relations between a token and the other tokens of the same input sequence, and the order of words in the training data is more or less unimportant within the context window. As a naive example, translating from German (where the finite verb comes second in a main clause) to English and back works with ChatGPT just fine.
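
For what it's worth, order does enter the model explicitly through positional encodings: the same bag of tokens in a different order produces a different input. A minimal numpy sketch of the sinusoidal scheme from the original Transformer paper (the toy word vectors are made up):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]    # (seq_len, 1)
    i = np.arange(d_model)[None, :]      # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy fixed word vectors; a real model learns these.
emb = {"I": np.full(8, 0.1), "eat": np.full(8, 0.2), "rice": np.full(8, 0.3)}

pe = sinusoidal_pe(3, 8)
x_svo = np.stack([emb[w] for w in ["I", "eat", "rice"]]) + pe  # English order
x_sov = np.stack([emb[w] for w in ["I", "rice", "eat"]]) + pe  # Hindi order

# Same bag of words, different order -> different input to the model.
print(np.allclose(x_svo, x_sov))  # False
```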

Tamil doesn't follow this; you can place the subject and verb almost anywhere without the meaning changing.

That's the real origin of wastage in India: more ink, more paper 😂, more words, more talking, and it goes on. Useful solutions (the tokenization idea is sketched in code after this list):

1. Data-centric approaches
- Curate high-quality SOV-language datasets: build domain-specific corpora for Indic languages, focusing on diverse syntactic structures and real-world usage patterns. This addresses the data scarcity that limits current fine-tuning efforts.
- Leverage hybrid datasets: combine monolingual SOV data with parallel SVO-SOV translation pairs to help models internalize structural differences.

2. Architectural adaptations
- SOV-optimized tokenization: develop subword tokenizers trained exclusively on SOV languages to better capture morphological and syntactic nuances (e.g., verb-final constructs).
- Syntax-aware attention mechanisms: modify transformer architectures to prioritize relationships between subjects, objects, and verbs in SOV order.

3. Training strategies
- Two-phase pre-training: first structural priming, training on synthetically reordered SVO→SOV text to establish basic SOV pattern recognition; then native SOV immersion, continuing training on authentic SOV-language content.
- Linguistic regularization: add loss terms that penalize deviations from SOV syntactic rules during training.
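
A minimal sketch of the SOV-optimized tokenization point above, using the Hugging Face `tokenizers` library; the corpus file names are hypothetical placeholders, and the idea is simply that the vocabulary is learned only from SOV text:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Byte-pair-encoding tokenizer with whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# Train exclusively on SOV-language corpora (hypothetical file names),
# so subword merges reflect Indic morphology rather than English.
tokenizer.train(
    files=["hindi_corpus.txt", "tamil_corpus.txt", "bengali_corpus.txt"],
    trainer=trainer,
)
tokenizer.save("sov_bpe_tokenizer.json")
```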

Super cool insights, Siddhartha! LLMs are seriously changing the game in NLP and beyond. Love how you broke it down. Curious—have you come across any standout real-world use cases in industries like healthcare or finance? Always on the lookout for where this tech is making a real impact. #AI #LLM #NLP #TechInAction #MachineLearning

There are also VSO, VOS, OVS, and OSV. VSO (as in Tagalog): “Ate Shyam rice.” VOS (as in some Austronesian languages): “Ate rice Shyam.” Are we getting to a different formulation of how LLMs should learn? Maybe so, because word order is not the only thing that differentiates languages.

Deutsch (German) is also Subject-Object-Verb, at least in subordinate clauses.
