Text Similarity Checker

Compare two texts using five similarity metrics: cosine similarity, Jaccard index, Sørensen-Dice coefficient, normalized Levenshtein distance, and LCS ratio. Includes sentence alignment and word overlap analysis, all computed client-side.

How to Use

Compare two texts in three steps:

Paste both texts — Enter the original in Text A and the comparison text in Text B. Click a sample button to try pre-loaded pairs: paraphrased texts, same-topic texts, or unrelated texts.
Click Compare — The tool computes five similarity metrics simultaneously: Cosine Similarity, Jaccard Index, Sørensen-Dice Coefficient, Normalized Levenshtein Distance, and Longest Common Subsequence Ratio.
Review results — See the overall weighted similarity score, individual metric breakdown with explanations, aligned sentence pairs, and shared/unique word analysis. Copy the summary with the clipboard button.

About This Tool

Five Complementary Metrics

No single metric captures all aspects of text similarity, so this tool computes five. Cosine Similarity represents each text as a term-frequency vector and measures the angle between the two vectors. It is largely length-independent: a 100-word text and a 1,000-word text on the same topic can still score high, provided their relative word frequencies are similar. Jaccard Index is the ratio of shared unique words to total unique words across both texts. Because it ignores how often each word occurs, every unshared word counts fully, so it penalizes vocabulary divergence more harshly than cosine.
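
As a rough illustration of these two word-level metrics, here is a minimal Python sketch. The tool itself runs in JavaScript and its tokenization rules are not documented here, so the `tokenize` helper (lowercasing, splitting on non-word characters) and the handling of empty input are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenization: lowercase, then take runs of word characters.
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    # Term-frequency vectors, stored sparsely as word -> count.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


def jaccard_index(a: str, b: str) -> float:
    # Shared unique words divided by all unique words.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0  # convention chosen here for empty input
    return len(sa & sb) / len(sa | sb)
```

Repeating a text doubles every word count but leaves the angle unchanged, so `cosine_similarity("the cat sat", "the cat sat the cat sat")` stays at 1 (up to rounding). `jaccard_index("the cat sat", "the cat ran")` is 0.5: two shared words out of four unique words.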

Sørensen-Dice Coefficient operates on character bigrams (overlapping pairs of characters), making it sensitive to spelling variations that word-level metrics treat as complete mismatches, and to some extent to word order. Normalized Levenshtein Distance counts the minimum number of character edits (insertions, deletions, substitutions) needed to transform one text into the other, normalized by the longer text's length. Longest Common Subsequence (LCS) Ratio finds the longest sequence of words that appears in both texts in the same order (not necessarily contiguously), which makes it good at detecting preserved passages with minor insertions.
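
The bigram-based Dice coefficient can be sketched as follows. This is an illustrative Python version, not the tool's own code: whether bigrams are counted as a multiset or a set, and how whitespace and case are normalized, are assumptions made here.

```python
from collections import Counter


def char_bigrams(text: str) -> Counter:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a: str, b: str) -> float:
    ba, bb = char_bigrams(a), char_bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0  # convention chosen here for texts shorter than 2 chars
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((ba & bb).values())
    return 2 * overlap / total
```

For example, "night" and "nacht" share only the bigram "ht" out of four bigrams each, so `dice_coefficient("night", "nacht")` is 2 × 1 / 8 = 0.25.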

Sentence Alignment

Beyond aggregate scores, the tool performs sentence-level alignment. Every sentence from Text A is compared against every sentence in Text B using cosine similarity. Pairs scoring above 30% (0.3) are displayed as aligned matches, sorted by similarity. This reveals exactly which passages overlap, which is essential for identifying copied or paraphrased sections in academic, legal, or journalistic contexts.
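
The all-pairs comparison described above might look roughly like this in Python. The sentence splitter (break after `.`, `!`, or `?` followed by whitespace) and the tokenizer are simplifying assumptions for the sketch; only the 0.3 threshold and the sort order come from the description.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? plus whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def cosine(a: str, b: str) -> float:
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(count * vb[term] for term, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def align_sentences(text_a: str, text_b: str, threshold: float = 0.3):
    # Compare every sentence in A with every sentence in B.
    pairs = []
    for sent_a in split_sentences(text_a):
        for sent_b in split_sentences(text_b):
            score = cosine(sent_a, sent_b)
            if score > threshold:
                pairs.append((score, sent_a, sent_b))
    # Highest-scoring pairs first.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs
```

Calling `align_sentences("The cat sat on the mat. Stocks fell sharply today.", "A cat sat on a mat. It rained in Paris.")` returns a single pair, matching the two cat sentences at a score of 0.5. The other three combinations share no words and are filtered out.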

Interpreting Scores

Scores above 80% indicate near-identical or verbatim-copied text. 50-80% suggests heavy paraphrasing or shared source material. 20-50% typically means same-topic or same-domain content. Below 20% indicates unrelated texts. The word overlap analysis shows which vocabulary is shared versus unique to each text, providing insight into how the similarity score was formed. For deeper text analysis, see Text Diff and Keyword Extractor.
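
A small Python sketch of the score bands and the word overlap breakdown. The band labels paraphrase the text above. How exact boundary values such as 80% are assigned, and the tokenization, are assumptions made for this example.

```python
import re


def interpret(score: float) -> str:
    # score is a fraction in [0, 1]; boundary handling is an assumption.
    if score > 0.8:
        return "near-identical or verbatim"
    if score >= 0.5:
        return "heavy paraphrasing or shared source"
    if score >= 0.2:
        return "same topic or domain"
    return "unrelated"


def word_overlap(a: str, b: str) -> dict[str, list[str]]:
    # Assumed tokenization: lowercase runs of word characters.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    return {
        "shared": sorted(words_a & words_b),
        "only_a": sorted(words_a - words_b),
        "only_b": sorted(words_b - words_a),
    }
```

For "the cat sat" versus "the cat ran", the shared words are "cat" and "the", with "sat" unique to the first text and "ran" unique to the second.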

Why Use This Tool

Instant Multi-Metric Comparison

All five metrics compute in your browser with zero network calls. The Levenshtein and LCS algorithms use optimized two-row dynamic programming for memory efficiency. Processing is near-instantaneous for texts up to several thousand words. No API keys, no rate limits, no usage quotas.
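
To show what two-row dynamic programming means in practice, here is a Python sketch of both algorithms. Each keeps only the previous and current table rows instead of the full matrix. Converting the edit distance to a similarity as 1 minus distance over the longer length, and normalizing the LCS length by the longer text's word count, are assumptions made here; the tool's exact formulas may differ.

```python
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a  # size the rows by the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    # Assumed conversion from distance to similarity.
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def lcs_length(x: list[str], y: list[str]) -> int:
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    x, y = a.lower().split(), b.lower().split()
    longest = max(len(x), len(y))
    # Assumed normalization: divide by the longer text's word count.
    return 1.0 if longest == 0 else lcs_length(x, y) / longest
```

Memory use is proportional to one row rather than to the product of both lengths, while running time is still proportional to that product. As a check, `levenshtein("kitten", "sitting")` is 3, and `lcs_ratio("the quick brown fox", "the quick red brown fox")` is 4 / 5 = 0.8.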

Common Use Cases

Plagiarism screening: Compare student submissions, articles, or reports against known sources to detect copied or closely paraphrased content.
Content deduplication: Identify near-duplicate articles, product descriptions, or knowledge base entries that should be merged or removed.
Translation QA: Compare back-translated text against the original to assess translation fidelity at the vocabulary level.
SEO content auditing: Check if pages on your site have too much content overlap, which can dilute search rankings. Google recommends unique content per URL.
Writing revision tracking: Quantify how much a text changed between drafts by comparing revisions side by side.

Privacy

100% client-side processing. Both texts remain in your browser. Related tools: Text Diff, Readability Analyzer, Sentiment Analyzer, and Word Counter.

FAQ

What similarity metrics are used?

Five complementary metrics: Cosine Similarity (term-frequency vector angle), Jaccard Index (unique word set overlap), Sørensen-Dice Coefficient (character bigram overlap), Normalized Levenshtein Distance (character edit distance), and Longest Common Subsequence Ratio (preserved word sequences). Each captures a different aspect of text similarity.

How is the overall similarity score calculated?

It is a weighted average of all five metrics: Cosine (30%), Dice (25%), LCS (20%), Jaccard (15%), and Levenshtein (10%). Cosine and Dice are weighted higher because they tend to be more robust for natural language comparison. The weights are this tool's heuristic choice rather than an established standard.
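
The weighting can be written out as a short Python sketch. The weights are the ones listed above. The function and key names are hypothetical, and the sketch assumes every input is already a similarity in the range 0 to 1 (so the Levenshtein value is the similarity form, not the raw distance).

```python
# Weights as stated in the answer above; they sum to 1.
WEIGHTS = {
    "cosine": 0.30,
    "dice": 0.25,
    "lcs": 0.20,
    "jaccard": 0.15,
    "levenshtein": 0.10,
}


def overall_score(metrics: dict[str, float]) -> float:
    # metrics maps each metric name to a similarity in [0, 1] (assumption).
    missing = WEIGHTS.keys() - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())
```

For example, metric values of 0.9 (cosine), 0.8 (Dice), 0.7 (LCS), 0.6 (Jaccard), and 0.5 (Levenshtein) give 0.27 + 0.20 + 0.14 + 0.09 + 0.05 = 0.75, or 75%.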

Can this detect plagiarism?

It can detect verbatim copying and close paraphrasing. The sentence alignment feature highlights similar passages. However, it cannot detect sophisticated paraphrasing, translation-based plagiarism, or idea-level similarity — those require semantic embeddings from neural language models.

Is my text sent to a server?

No. All comparison runs entirely in your browser using JavaScript. No text is transmitted over the network. Both texts remain on your device.

How does sentence alignment work?

Each sentence from Text A is compared against every sentence from Text B using cosine similarity. Pairs scoring above 0.3 (30%) are shown as aligned matches, sorted by similarity. This highlights which specific passages overlap between the two texts.