Language Detector

Detect the language of any text instantly using n-gram frequency analysis. Identifies 40+ languages with confidence scores — no upload, no API, fully client-side.

How to Use

Identify the language of any text:

Enter your text: Type, paste, or click a sample button to load text in any language. The detector works best with at least 50 characters, though it can identify scripts (Arabic, Cyrillic, CJK) from just a few characters.
Click Detect Language: The tool analyzes the text using trigram frequency profiles and Unicode script detection. Results appear instantly with no network delay.
Review results: The top match shows the detected language with a confidence score. Scroll down to see alternative candidates ranked by similarity. Languages whose script no other supported language shares (Korean, Thai, Greek) are identified with near-100% accuracy.

About This Tool

N-Gram Frequency Analysis

This tool uses trigram (three-character sequence) frequency analysis, a form of the n-gram text categorization described by Cavnar and Trenkle in 1994 and still used in production systems today. Every language has characteristic trigram distributions: English has high frequencies for "the", "ing", and "ion"; German for "sch", "ein", and "der"; Japanese for common hiragana sequences. The tool builds a trigram profile of your text and compares it against pre-computed profiles for 40+ languages using cosine similarity; the closest profile is the top match.
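The comparison described above can be sketched in a few lines. This is a minimal Python illustration, not the tool's own JavaScript: the function names (`trigram_profile`, `cosine_similarity`, `rank_languages`) are hypothetical, and the two reference profiles are built from a single sentence each, whereas real profiles would come from large corpora.

```python
import math
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping three-character sequences in lowercased text."""
    cleaned = " ".join(text.lower().split())  # collapse runs of whitespace
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # empty or too-short input has no profile to compare
    return dot / (norm_a * norm_b)


# Toy reference profiles (assumption: one sentence per language, for illustration).
REFERENCE = {
    "en": trigram_profile(
        "the quick brown fox jumps over the lazy dog and then runs into the forest"
    ),
    "de": trigram_profile(
        "der schnelle braune fuchs springt über den faulen hund und läuft in den wald"
    ),
}


def rank_languages(text: str) -> list:
    """Return (code, similarity) pairs, best match first."""
    profile = trigram_profile(text)
    scores = [(code, cosine_similarity(profile, ref)) for code, ref in REFERENCE.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_languages("the dog runs over the fox")` puts `"en"` first because the input shares trigrams such as "the" and "he " with the English profile. The similarity value itself can serve as a rough confidence score.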

Trigram analysis achieves over 95% accuracy for texts longer than 50 characters across major world languages. It outperforms word-based methods because trigrams capture morphological patterns that transcend vocabulary — even made-up words in a language will contain its characteristic letter combinations.

Script Detection

Before trigram analysis, the tool identifies the writing script using Unicode code point ranges. This immediately narrows the candidates. Among the supported languages, Cyrillic text is Russian, Ukrainian, Bulgarian, or Serbian; Arabic script is Arabic, Persian, or Urdu; Hangul is Korean. Chinese and Japanese share Han characters, so an additional heuristic separates them: Japanese text contains hiragana and katakana alongside kanji, while Chinese text is predominantly hanzi. Script detection alone can identify languages with unique scripts (Thai, Georgian, Armenian, Tamil, Telugu) with near-perfect accuracy.
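As a rough sketch of this narrowing step, the Python below maps characters to scripts by code point range and then to candidate languages. The range table is a simplified subset of the Unicode blocks, and the names `SCRIPT_RANGES`, `CANDIDATES`, and `candidate_languages` are assumptions made for illustration, not the tool's internals.

```python
from collections import Counter

# Simplified Unicode block ranges (inclusive). Extended blocks are omitted.
SCRIPT_RANGES = [
    ("Latin", 0x0041, 0x024F),
    ("Greek", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Armenian", 0x0530, 0x058F),
    ("Arabic", 0x0600, 0x06FF),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Thai", 0x0E00, 0x0E7F),
    ("Georgian", 0x10A0, 0x10FF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("Han", 0x4E00, 0x9FFF),
    ("Hangul", 0xAC00, 0xD7AF),
]

# Script -> ISO 639-1 candidates, following the groupings described above.
CANDIDATES = {
    "Cyrillic": ["ru", "uk", "bg", "sr"],
    "Arabic": ["ar", "fa", "ur"],
    "Hangul": ["ko"],
    "Han": ["zh"],
    "Thai": ["th"],
    "Georgian": ["ka"],
    "Armenian": ["hy"],
    "Tamil": ["ta"],
    "Telugu": ["te"],
    "Greek": ["el"],
}


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script, ignoring digits and punctuation."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        cp = ord(ch)
        for name, lo, hi in SCRIPT_RANGES:
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def candidate_languages(text: str) -> list:
    """Return narrowed candidates; an empty list means no narrowing was possible."""
    counts = script_counts(text)
    if not counts:
        return []
    # Any kana alongside Han characters points to Japanese rather than Chinese.
    if counts["Hiragana"] or counts["Katakana"]:
        return ["ja"]
    dominant = counts.most_common(1)[0][0]
    return CANDIDATES.get(dominant, [])
```

Latin-script input returns an empty list here, which signals that the trigram comparison has to choose among all Latin-script languages. A single-candidate result such as `["ko"]` needs no further analysis.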

Accuracy and Limitations

Accuracy depends on text length and language similarity. Languages with a script of their own (Korean, Thai, Greek) are identified from as few as 5 characters. Scripts shared by several languages (Arabic script for Arabic, Persian, and Urdu; Han characters for Chinese and Japanese) are recognized just as quickly, but telling the languages apart within them takes more text. Closely related Latin-script languages (Spanish/Portuguese, Norwegian/Danish/Swedish, Indonesian/Malay) require 100+ characters for reliable differentiation. Mixed-language texts are classified by the dominant language. Very short inputs (under 20 characters) may produce unreliable results.
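One way to act on these limits is to attach a reliability note to each result. The sketch below encodes the thresholds from this section (5, 20, and 100 characters); the function name `reliability_note`, the three labels, and the small sets of language codes are hypothetical and cover only a subset of the languages mentioned.

```python
# Groups of closely related Latin-script languages named in this section.
CLOSE_GROUPS = [{"es", "pt"}, {"no", "da", "sv"}, {"id", "ms"}]

# A subset of languages whose script is not shared (assumption for illustration).
UNIQUE_SCRIPT_CODES = {"ko", "th", "el"}


def reliability_note(text: str, detected_code: str) -> str:
    """Label a detection as 'ok', 'uncertain', or 'unreliable' from its length."""
    length = len(text.strip())  # len() counts Unicode code points, not bytes
    if detected_code in UNIQUE_SCRIPT_CODES:
        return "ok" if length >= 5 else "unreliable"
    if length < 20:
        return "unreliable"
    if length < 100 and any(detected_code in group for group in CLOSE_GROUPS):
        return "uncertain"
    return "ok"
```

For example, a 28-character Spanish sentence would be labeled `"uncertain"` because Spanish and Portuguese need 100+ characters to separate reliably, while a five-character Korean greeting is labeled `"ok"`.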

Language Codes

Results use ISO 639-1 two-letter language codes (e.g., en for English, ja for Japanese, ar for Arabic). These codes are used in HTML lang attributes, HTTP Accept-Language headers, and translation APIs. For text analysis, see Word Counter and Readability Analyzer.
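To show where these codes end up, here is a small Python sketch that turns detected codes into an HTML `lang` attribute and an `Accept-Language` header value. The helper names and the 0.1 step between quality values are illustrative choices, not something the tool produces.

```python
# A few ISO 639-1 codes and their English names.
LANGUAGE_NAMES = {"en": "English", "ja": "Japanese", "ar": "Arabic"}


def html_lang_attribute(code: str) -> str:
    """Format a language code as an HTML lang attribute."""
    return f'lang="{code}"'


def accept_language(codes: list) -> str:
    """Build an Accept-Language value, most preferred language first.

    The first code carries the implicit quality 1.0; each later code gets a
    quality value 0.1 lower, with a floor of 0.1.
    """
    parts = []
    for index, code in enumerate(codes):
        if index == 0:
            parts.append(code)
        else:
            quality = max(0.1, 1.0 - 0.1 * index)
            parts.append(f"{code};q={quality:.1f}")
    return ", ".join(parts)
```

With the ranked candidates `["ja", "en", "ar"]`, `accept_language` returns `ja, en;q=0.9, ar;q=0.8`, and `html_lang_attribute("ja")` returns `lang="ja"`.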

Why Use This Tool

Instant Language Identification

Language detection runs entirely in your browser with no API calls. The trigram profiles are embedded in the page code (~5KB), so detection is instantaneous — no network latency, no rate limits, no API keys. This makes it ideal for batch processing or integration into workflows where you need to identify languages before routing text to translators or locale-specific pipelines.

Common Use Cases

Content routing: Identify the language of user-submitted text to route it to the appropriate translator, support agent, or content moderation queue.
Data cleaning: Filter datasets by language — remove non-English entries from an English corpus, or separate multilingual data into language-specific buckets.
Localization testing: Verify that translated content is in the expected language before deployment. Catch cases where placeholder text was left untranslated.
SEO analysis: Detect the language of competitor pages or scraped content to understand market targeting.
Email/message triage: Automatically categorize incoming messages by language for multilingual customer support.
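The routing and data-cleaning cases above come down to grouping texts by detected language. A minimal Python sketch follows; `bucket_by_language` accepts any detector function, and `toy_detect` is a hypothetical stand-in that only separates Cyrillic from everything else so the example runs on its own.

```python
from collections import defaultdict
from typing import Callable, Iterable


def bucket_by_language(texts: Iterable, detect: Callable) -> dict:
    """Group texts into lists keyed by the language code the detector returns."""
    buckets = defaultdict(list)
    for text in texts:
        buckets[detect(text)].append(text)
    return dict(buckets)


def toy_detect(text: str) -> str:
    """Stand-in detector (assumption): 'ru' if any Cyrillic letter, else 'en'."""
    if any("\u0400" <= ch <= "\u04FF" for ch in text):
        return "ru"
    return "en"


def keep_only(texts: Iterable, detect: Callable, code: str) -> list:
    """Data cleaning: keep only the texts detected as the given language."""
    return [text for text in texts if detect(text) == code]
```

Swapping `toy_detect` for a real detector leaves the grouping logic unchanged, which keeps the routing step independent of how detection is done.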

Privacy

100% client-side processing. Your text is analyzed entirely in your browser — no data is transmitted to any server. Related tools include Word Counter, Readability Analyzer, Text Diff, and Unicode Inspector.

FAQ

How does the language detection work?

The tool uses trigram frequency analysis — a well-established computational linguistics technique. It builds a profile of three-character sequences (trigrams) from your text and compares it against pre-computed profiles for 40+ languages. The language whose trigram profile most closely matches your text is identified as the detected language. This approach achieves over 95% accuracy for texts longer than 50 characters.

What languages are supported?

Over 40 languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Polish, Vietnamese, Thai, Greek, Hebrew, Swedish, Norwegian, Danish, Finnish, Czech, Romanian, Hungarian, Indonesian, Malay, and more.

How much text is needed for accurate detection?

As few as 20-30 characters can give reasonable results, but accuracy improves significantly with more text. For texts over 100 characters, detection accuracy typically exceeds 98%. Very short texts (under 20 characters) or texts mixing multiple languages may produce unreliable results.

Can it detect multiple languages in one text?

The tool detects the dominant language of the entire input. For texts that mix languages, it will identify the language that makes up the majority of the content. Future versions may support per-sentence language detection.

Is my text sent to a server?

No. All language detection runs entirely in your browser using JavaScript. The trigram profiles are embedded in the page code. No text is transmitted over the network.