Language Detector
Detect the language of any text instantly using n-gram frequency analysis. Identifies 40+ languages with confidence scores — no upload, no API, fully client-side.
How to Use
Identify the language of any text:
- Enter your text: Type, paste, or click a sample button to load text in any language. The detector works best with at least 50 characters, though it can recognize the script (Arabic, Cyrillic, Han, and so on) from just a few characters.
- Click Detect Language: The tool analyzes the text using trigram frequency profiles and Unicode script detection. Results appear instantly with no network delay.
- Review results: The top match shows the detected language with a confidence score. Scroll down to see alternative candidates ranked by similarity. Languages written in a script of their own (Korean, Thai, Greek) are identified with near-100% accuracy.
About This Tool
N-Gram Frequency Analysis
This tool uses trigram (three-character sequence) frequency analysis, an approach in the family of n-gram text categorization popularized by Cavnar and Trenkle in 1994 and still used in production systems today. Every language has characteristic trigram distributions: English has high frequencies for "the", "ing", and "ion"; German for "sch", "ein", and "der"; Japanese for common hiragana sequences. The tool compares your text's trigram profile against pre-computed profiles for 40+ languages using cosine similarity.
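The comparison described above can be sketched in TypeScript as follows. This is a minimal illustration, not the tool's actual code: the names `trigramProfile`, `cosineSimilarity`, and `rankLanguages` are hypothetical, the normalization (lowercasing, collapsing whitespace, digits, and common punctuation) is an assumption, and the two short reference sentences stand in for the pre-computed profiles a real detector would ship.

```typescript
// Sparse trigram frequency vector: trigram -> count.
type TrigramProfile = { [trigram: string]: number };

// Build a trigram profile. The normalization here is an assumption of this
// sketch: lowercase, collapse whitespace/digits/common punctuation to one
// space, and pad with a space so word boundaries form trigrams too.
function trigramProfile(text: string): TrigramProfile {
  const cleaned =
    " " + text.toLowerCase().replace(/[\s\d.,;:!?()"']+/g, " ").trim() + " ";
  const profile: TrigramProfile = {};
  for (let i = 0; i + 3 <= cleaned.length; i++) {
    const gram = cleaned.substring(i, i + 3);
    profile[gram] = (profile[gram] || 0) + 1;
  }
  return profile;
}

// Cosine similarity between two sparse vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: TrigramProfile, b: TrigramProfile): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  Object.keys(a).forEach((k) => {
    const va = a[k] || 0;
    normA += va * va;
    dot += va * (b[k] || 0);
  });
  Object.keys(b).forEach((k) => {
    const vb = b[k] || 0;
    normB += vb * vb;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Rank candidate languages by similarity, best match first.
// `references` maps a language code to sample text; a real detector would
// store pre-computed profiles instead of rebuilding them on every call.
function rankLanguages(
  text: string,
  references: { [code: string]: string }
): { code: string; score: number }[] {
  const input = trigramProfile(text);
  return Object.keys(references)
    .map((code) => ({
      code: code,
      score: cosineSimilarity(input, trigramProfile(references[code] || "")),
    }))
    .sort((x, y) => y.score - x.score);
}

// Toy reference samples (hypothetical, far too small for real use).
const samples = {
  en: "the quick brown fox jumps over the lazy dog and then the other thing",
  de: "der schnelle braune fuchs springt über den faulen hund und dann schon ein",
};
const ranked = rankLanguages("the weather is nice and the sun is shining", samples);
console.log(ranked);
```

With reference text this small the scores are only indicative; the sketch is meant to show the shape of the computation, not realistic accuracy.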
Trigram analysis typically achieves over 95% accuracy on texts longer than 50 characters across major world languages. It tends to be more robust than word-list methods because trigrams capture morphological patterns that do not depend on vocabulary: even made-up words in a language will contain its characteristic letter combinations.
Script Detection
Before trigram analysis, the tool identifies the writing script using Unicode code point ranges. This immediately narrows the candidates: among the languages the tool supports, Cyrillic text is Russian, Ukrainian, Bulgarian, or Serbian; Arabic-script text is Arabic, Persian, or Urdu; Hangul is Korean. For text containing Han characters, additional heuristics distinguish Chinese from Japanese: Japanese text mixes hiragana and katakana with kanji, while Chinese text is predominantly hanzi. For languages written in a script of their own (Thai, Georgian, Armenian, Tamil, Telugu), script detection alone identifies the language with near-perfect accuracy.
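A rough sketch of this script-first step is shown below, again in TypeScript. The function names and the candidate table are hypothetical, the ranges cover only a subset of the relevant Unicode blocks, and the sketch works on UTF-16 code units, which is sufficient here only because every listed block lies in the Basic Multilingual Plane.

```typescript
type ScriptName =
  | "Latin" | "Greek" | "Cyrillic" | "Arabic"
  | "Thai" | "Kana" | "Han" | "Hangul";

interface ScriptRange { script: ScriptName; from: number; to: number }

// Inclusive ranges taken from the Unicode block chart (subset only).
const SCRIPT_RANGES: ScriptRange[] = [
  { script: "Latin", from: 0x0041, to: 0x005a },    // A-Z
  { script: "Latin", from: 0x0061, to: 0x007a },    // a-z
  { script: "Latin", from: 0x00c0, to: 0x024f },    // accented Latin (approximate)
  { script: "Greek", from: 0x0370, to: 0x03ff },
  { script: "Cyrillic", from: 0x0400, to: 0x04ff },
  { script: "Arabic", from: 0x0600, to: 0x06ff },
  { script: "Thai", from: 0x0e00, to: 0x0e7f },
  { script: "Kana", from: 0x3040, to: 0x30ff },     // hiragana + katakana
  { script: "Han", from: 0x4e00, to: 0x9fff },      // CJK Unified Ideographs
  { script: "Hangul", from: 0xac00, to: 0xd7af },   // Hangul Syllables
];

// Candidate language codes per script (illustrative, not the tool's list).
const SCRIPT_CANDIDATES: { [script: string]: string[] } = {
  Hangul: ["ko"],
  Thai: ["th"],
  Greek: ["el"],
  Cyrillic: ["ru", "uk", "bg", "sr"],
  Arabic: ["ar", "fa", "ur"],
};

// Count how many code units fall into each known script.
function countScripts(text: string): { [script: string]: number } {
  const counts: { [script: string]: number } = {};
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    SCRIPT_RANGES.forEach((r) => {
      if (unit >= r.from && unit <= r.to) {
        counts[r.script] = (counts[r.script] || 0) + 1;
      }
    });
  }
  return counts;
}

// Narrow the candidates using the dominant script. An empty result means
// the script alone decides nothing (e.g. Latin) and trigrams are needed.
function candidatesByScript(text: string): string[] {
  const counts = countScripts(text);
  let best = "";
  let bestCount = 0;
  Object.keys(counts).forEach((s) => {
    const c = counts[s] || 0;
    if (c > bestCount) {
      best = s;
      bestCount = c;
    }
  });
  if (best === "Han" || best === "Kana") {
    // Any kana alongside Han characters points to Japanese.
    return (counts["Kana"] || 0) > 0 ? ["ja"] : ["zh"];
  }
  return SCRIPT_CANDIDATES[best] || [];
}

console.log(candidatesByScript("안녕하세요"));       // ["ko"]
console.log(candidatesByScript("日本語のテキスト")); // ["ja"]
console.log(candidatesByScript("Привет"));           // ["ru", "uk", "bg", "sr"]
```

When more than one candidate remains, as with Cyrillic or Arabic script, the trigram comparison only has to choose among that shorter list.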
Accuracy and Limitations
Accuracy depends on text length and language similarity. Languages written in a script of their own (Korean, Thai, Greek) can be identified from as few as 5 characters. Closely related Latin-script languages (Spanish/Portuguese, Norwegian/Danish/Swedish, Indonesian/Malay) need more text, 100+ characters, for reliable differentiation. Mixed-language texts are classified by the dominant language. Very short inputs (under 20 characters) in a script shared by several languages may produce unreliable results.
Language Codes
Results use ISO 639-1 two-letter language codes (e.g., en for English, ja for Japanese, ar for Arabic). These codes are used in HTML lang attributes, HTTP Accept-Language headers, and translation APIs. For text analysis, see Word Counter and Readability Analyzer.
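As one illustration of putting a detected code to use, the sketch below wraps text in an HTML element carrying a lang attribute. The helper names are hypothetical, and the check only verifies the two-lowercase-letter shape of an ISO 639-1 code, not that the code is actually assigned.

```typescript
// Shape check only: two lowercase ASCII letters.
function hasIso6391Shape(code: string): boolean {
  return /^[a-z]{2}$/.test(code);
}

// Escape the characters that are significant in HTML text and attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap text in a span tagged with the detected language code.
function wrapWithLang(text: string, code: string): string {
  if (!hasIso6391Shape(code)) {
    throw new Error("Not a two-letter language code: " + code);
  }
  return '<span lang="' + code + '">' + escapeHtml(text) + "</span>";
}

console.log(wrapWithLang("こんにちは", "ja")); // <span lang="ja">こんにちは</span>
```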
Why Use This Tool
Instant Language Identification
Language detection runs entirely in your browser with no API calls. The trigram profiles are embedded in the page code (~5KB), so detection is instantaneous — no network latency, no rate limits, no API keys. This makes it ideal for batch processing or integration into workflows where you need to identify languages before routing text to translators or locale-specific pipelines.
Common Use Cases
- Content routing: Identify the language of user-submitted text to route it to the appropriate translator, support agent, or content moderation queue.
- Data cleaning: Filter datasets by language — remove non-English entries from an English corpus, or separate multilingual data into language-specific buckets.
- Localization testing: Verify that translated content is in the expected language before deployment. Catch cases where placeholder text was left untranslated.
- SEO analysis: Detect the language of competitor pages or scraped content to understand market targeting.
- Email/message triage: Automatically categorize incoming messages by language for multilingual customer support.
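Several of these use cases reduce to grouping texts by detected language. The TypeScript sketch below shows one way that could look. It assumes a detector function returning a code and a confidence between 0 and 1; that shape, the `DetectFn` type, and the toy detector are all hypothetical stand-ins rather than this tool's interface. Low-confidence results go into a bucket keyed "und", the ISO 639-2 code for an undetermined language.

```typescript
// Assumed detector shape: language code plus a confidence in [0, 1].
type DetectFn = (text: string) => { code: string; confidence: number };

// Group texts by detected language; results below minConfidence are
// collected under "und" (undetermined) for manual review.
function partitionByLanguage(
  texts: string[],
  detect: DetectFn,
  minConfidence: number
): { [code: string]: string[] } {
  const buckets: { [code: string]: string[] } = {};
  texts.forEach((t) => {
    const result = detect(t);
    const key = result.confidence >= minConfidence ? result.code : "und";
    const bucket = buckets[key] || [];
    bucket.push(t);
    buckets[key] = bucket;
  });
  return buckets;
}

// Toy detector for demonstration only: any Cyrillic letter means "ru",
// everything else is "en", with low confidence for very short input.
const toyDetect: DetectFn = (text) => {
  if (/[\u0400-\u04ff]/.test(text)) return { code: "ru", confidence: 0.9 };
  if (text.length < 20) return { code: "en", confidence: 0.3 };
  return { code: "en", confidence: 0.8 };
};

const buckets = partitionByLanguage(
  ["Привет, как дела?", "ok", "This sentence is long enough to trust."],
  toyDetect,
  0.5
);
console.log(buckets);
```

Keeping the threshold as a parameter lets a data-cleaning job be strict while a routing job accepts weaker matches.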
Privacy
100% client-side processing. Your text is analyzed entirely in your browser — no data is transmitted to any server. Related tools include Word Counter, Readability Analyzer, Text Diff, and Unicode Inspector.