Large language models (LLMs) are widely deployed today; they have rapidly transformed multiple sectors, such as, healthcare, education, software, law, finance, and governance. Their ability to generate textual and informational content like humans, automate agentic tasks, and workflows, and support in decision-making has led to widespread deployment.
Despite undeniable benefits though, LLMs present significant (albeit not insurmountable) technical, ethical, and societal challenges. These drawbacks must be carefully examined to ensure responsible and sustainable use of this powerful technology.
A most serious drawback of deployed LLMs is the tendency to generate erroneous and fabricated information, also referred to as “hallucination.” These models operate by predicting likely ‘successor-words’ rather than verifying facts. As a result, they may produce confident, yet incorrect responses.

The hallucinations may manifest in: (a) sentence contradictions, that is, two sentences of the LLM output conflict with each other; (b) prompt contradictions, when AI-generated content is contrary to the original prompt; (c) factual contradictions of incorrectly generated content; and, (d) random hallucinations, wherein the output incorporates completely unrelated details.
The limitations are further pronounced due the LLM’s algorithmic foundations (specifically, gradient descent) frequently tending to stall in sub-optimal saddle spots in word predictions. Because such saddle spots are flat, the training algorithm’s ‘slope’ becomes zero and this deceives the model into thinking optimization is attained. The model then stops improving; this yields sub-optimal results.
Model sensitivity to hyperparameters—these are configurations used to govern the training process; unlike model parameters, that is, weights, these are external to the model and are set prior to training— slows down training too.
The limitations of LLMs are extremely dangerous in many fields, such as law, medicine, etc., wherein reasoning complexity is high and mistakes can potentially have serious ramifications. Hence, overreliance or excessive trust in AI outputs can reduce critical evaluation, increasing the spread of misinformation and poor decision-making.
Efforts to mitigate the problems entails the use of various quantitative and other methods, including: (i) retrieval-augmented generation (RAG); (ii) chain-of-thought prompting to improve reasoning; (iii) multiple agentic systems to verify the accuracy and consistency of results; (iv) fine-tuning of LLMs with domain-specific knowledge and by singling out problematic, high-influence tokens; and, (v) detection mechanisms for potential hallucinations in low-confidence scenarios.
Unfortunate though, each of the aforesaid methods for combating hallucinations themselves are fraught with flaws and suffer from short-comings.
RAG techniques—these enhance accuracy by connecting the LLM to external, authoritative knowledge bases before generating responses—fail to yield relevant documents, rank irrelevant ones too highly or suffer from outdated embedding models, thereby reinforcing incorrect answers.
In fact, the RAG model may struggle to extract, besides failing to include the information retrieved in the context. Chunking errors, that is, errors resulting from breaking data into fixed-size chunks can split critical information, which makes it difficult for the LLM to understand context. In the case of legal LLMs, knowledge evolves constantly based on applicable enactments, statutory amendments, procedural mechanisms and judicial pronouncements.
Hence, integrating the LLM with authoritative and up-to-date legal knowledge repositories for generating or validating responses is a distinct challenge. Further, if the retrieved data is inaccurate, or the model fails to properly ‘ground’ the answer in the retrieved context, it will still hallucinate.
The CoT prompting approach—it is generally used to train models and revolves around encouraging the LLM to breakdown reasoning into a series of intermediate steps in order to reduce errors—suffers also from increased latency and computational costs due to the generation of multiple steps.
Other drawbacks of CoT prompting include reliance on large-scale models (>100 billion parameters) for achieving distinct performance gains, error propagation where one mistake breaks the chain, potential for generating deceptive, and, logical-sounding hallucinations and inefficiency for simple tasks.
Legal reasoning if always complex and subjective too, particularly in respect of statutory interpretation and jurisprudential assessments. This increases the risk of error propagation, etc.
MAMs—models wherein specialized agents (that is, a planner, analyzer, researcher, verifier, writer, etc.)—could introduce significant, often counter-intuitive, shortcomings in verification. Such common limitations include:
Finetuning of LLMs for specific domains too has its pitfalls. Specialised training frequently leads to ‘catastrophic forgetting’, that is the degradation of the model’s general-purpose foundational reasoning capability.

Besides, specialised domains suffer from inadequacy of good quality labeled data, consequent to which effective training is difficult. Further, the finetuned knowledge becomes quickly outdated; the dynamic and obsolescent nature of knowledge necessitates frequent and expensive retraining.
The finetuning may also result in ‘overfitting’, wherein the model memorizes particular training examples, thereby failing to generalize for new or unseen data. Techniques such as the Parameter-Efficient Fine-Tuning (PEFT) (e.g., LoRA) reduce computational costs, while Retrieval-Augmented Generation (RAG) is better for keeping information current. But these also have inherent limitations.
Distinguishing potential hallucinations in low-confidence scenarios is tough, primarily because they emanate not just from errors, but inherent characteristics of how models fill gaps in their knowledge. Often times, the model presents ‘confidently wrong’ (incorrect) information with great authoritativeness.
Information for legal reasoning is sparse or outdated, particularly if the issue on hand is res integra. In such situations LLMs generally lack ‘ground truth’, more so when it pertains to niche, rare or post-training cutoff information; or, to complex contexts. The hallucination fills such gaps with fluent, plausible sounding, but inaccurate content. Detection of such ‘drift’ away from factual or authentic information is hard to spot.
Many times, models may incorrectly self-correct a valid response in low-confidence scenarios based on a previous incorrect, but confidently generated statement. Thus, hallucination detection methods in LLMs fail on account of:
In reality, LLMs at this stage do not possess true reasoning capabilities or contextual understanding. They lack awareness of context, intent, causality and emotion. They are devoid of ‘common sense’ and incapable of human-like ‘understanding’ of content. This means they may generate technically correct, but contextually inappropriate responses.
Therefore, these models struggle with complex, multi-step legal tasks. Agentic errands that entail logical reasoning, judicial thinking, ethical considerations, deep understanding of intricate facts, knowledge of applicable laws, analysis of complex inter-disciplinary queries, synthesis of information across multiple, disparate sources, statutory interpretation, etc., often exacerbate this limitation.
LLMs still need human supervision and cannot fully replace professional insight, expertise and experience.
Since LLMs are trained on massive, non-curated datasets reflecting real-world inequities and inequalities, they may reproduce or amplify prejudices related to age, race, gender, religion, or socio-economic status.
Thus, the perpetuation of societal biases remains a major risk in LLMs; and, mitigating the risk and assuring algorithmic fairness pose serious challenges. It may even require regulatory oversight. Ethical concerns too arise regarding accountability when AI-driven decisions are unfair, unjust, inequitable, or mala fide.
LLMs are generally trained on information datasets available at a particular time; hence, they may lack real-time and up-to-date information for generating accurate and authentic outputs. Such “knowledge cutoffs” are a significant drawback in the legal and judicial field, wherein the body of knowledge is constantly evolving and expanding.
Obviously, these deficiencies can result in unfair outcomes and discriminatory outputs in legal analysis, and public policy.
LLMs process large volumes of personal (perhaps private) and confidential data, which poses risks of data leakage, cyberattacks, and unauthorized access. Users may unknowingly expose sensitive information while interacting with the models.
Using cloud-based LLMs for proprietary or personal data presents risks of data breaches or leaks. It also introduces risks at each learning phase of the LLM—pre-training, fine-tuning and in-context learning.
LLMs are susceptible to prompt injection attacks, wherein users manipulate the model into ignoring safety guardrails or guidelines to generate harmful content or exfiltrate data. Other security vulnerabilities, such as, adversarial attacks and data poisoning, may also be exploited.
The ability of LLMs to generate realistic and persuasive content has also enabled malicious uses. These include automated scams, phishing, social engineering, fake news, automated profiling, and large-scale disinformation campaigns.
As LLM-generated content becomes harder to distinguish from originally created content of humans, the verification of authenticity is a major challenge. These risks highlight the need for stronger privacy safeguards and security frameworks.
LLMs often function as “black box” systems, making it difficult to explain or understand how specific outputs are generated. This lack of clarity and transparency is a definite problem in regulated sectors that require accountability and auditability. The lack of trust and fear of the unknown are stumbling blocks in contexts that mandate legal compliance.

The development and deployment of large language models require extensive financial investments, energy resources and computational infrastructure. This leads to concerns about operational and ecological costs, not to mention the widening of the global digital divide.
Excessive reliance on legal LLMs may weaken argumentative creativity, legal reasoning, critical thinking, analytical ability, and professional expertise. Such skill degradation will definitely impact application of mind, which is vital for the legal domain.
The use of legal LLMs has no doubt evolved from simple information processing applications to strategic and complex decision-support systems. Yet, while LLMs offer immense potential, their drawbacks cannot be ignored. Lack of guardrails, accountability, risk management, and compliance are all distinct speed bumps.
Responsible deployment requires strong governance, transparency, ethical safeguards, and human oversight. The speed of deployment too is a big challenge. A balanced approach will enable societies to harness their benefits while minimizing associated risks.
Nonetheless, the uses of AI in fully automated legal workflows and autonomous actions are a far cry from the current state-of-art. At this stage in their evolution, they are best viewed as productivity tools that augment human endeavors, rather than as a potential threat to professionals practising law.
At BLAZE VENTUERS, we have qualified professionals and elaborate processes to help inventors and enterprises create and protect sustainable IP assets in the field of AI Research & Development and to monetise them effectively.