Introduction
In natural language processing, language identification is the task of determining which language a given piece of text is written in. One common approach is to build an n-gram language model from a corpus of text. Such models can be based on characters, words, or other symbols. N-grams are simply all sequences of n adjacent characters (or words) that appear in your source text. For example, given the word ice, the character bigrams (n = 2) are ic and ce. Some approaches also include word-boundary markers, but for simplicity we'll stick to character-based examples here. Our goal in this article is to build a simple language identifier. Our dataset consists of character bigram frequencies extracted from corpora of five languages: English, Spanish, German, French, and Italian.
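To make the bigram definition concrete, here's a throwaway snippet (plain Python, nothing from the rest of the article is assumed yet):

word = "ice"
print([word[i:i+2] for i in range(len(word) - 1)])  # prints ['ic', 'ce']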
Getting Started
If you're using Google Colab, you can skip this section. If not, to get started make sure you have the following installed on your system:
- Python 3.10 or newer
- An integrated development environment of choice (Jupyter Notebook, PyCharm, or VSCode)
- The following libraries installed (see the pip command below):
- pandas, numpy, matplotlib, datasets (Hugging Face), tqdm, ipython
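If any of these are missing, a single pip command along these lines should cover them (package names as published on PyPI):

pip install pandas numpy matplotlib datasets tqdm ipython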
Imports
Copy the following imports:
import pandas as pd
from datasets import load_dataset
from collections import Counter
import re
from tqdm import tqdm
import json
import numpy as np
import matplotlib.pyplot as plt
import math
from IPython.display import display, Markdown
Text Preprocessing Methods
def clean_text(text):
    # Lowercase the text and keep only letters (a-z plus accented characters used by our target languages).
    return re.sub(r'[^a-zäöüßáéíóúñàâêëèïîôùûÿçìò]', '', text.lower())

def extract_bigrams(text):
    # Return every pair of adjacent characters in the text.
    return [text[i:i+2] for i in range(len(text) - 1)]
- Our clean_text() method lowercases the text and keeps only the letters a-z plus the accented characters used by our target languages.
- Our extract_bigrams() method extracts all character bigrams found within the given text.
Testing Our Text Preprocessing Methods
Let's test our clean_text() and extract_bigrams() methods:
dummy_text = "Are there any numbers or special characters within our text123?!?"
clean_dummy_text = clean_text(dummy_text)
dummy_bigrams = extract_bigrams(clean_dummy_text)
display(Markdown("Original Text: " + dummy_text))
display(Markdown("Clean Text: " + clean_dummy_text))
display(Markdown("Bigrams: " + ",".join(dummy_bigrams)))
Output:
Original Text: Are there any numbers or special characters within our text123?!?
Clean Text: arethereanynumbersorspecialcharacterswithinourtext
Bigrams: ar,re,et,th,he,er,re,ea,an,ny,yn,nu,um,mb,be,er,rs,so,or,rs,sp,pe,ec,ci,ia,al,lc,ch,ha,ar,ra,ac,ct,te,er,rs,sw,wi,it,th,hi,in,no,ou,ur,rt,te,ex,xt
Visualizing Our Test Bigrams
Let's now visualize our test bigrams using Matplotlib. This will allow us to see what bigrams are most common in our text.
bigram_counts = Counter(dummy_bigrams)
bigram_labels, bigram_freqs = zip(*bigram_counts.most_common())
plt.figure(figsize=(16,16))
plt.barh(bigram_labels, bigram_freqs, color='skyblue')
plt.gca().invert_yaxis()
plt.title('Visualization of Extracted Bigrams')
plt.xlabel("Frequency of Bigrams")
plt.show()
Data Collection
Our goal here is to identify the following languages: English, German, Spanish, French, and Italian. To do that, we need to collect bigram frequencies from a large corpus of text in each of those languages. Think of this as building a profile for each language: each profile gets its own DataFrame of bigram frequencies and probabilities. The dataset we'll use comes from AllenAI/C4, a colossal, cleaned version of Common Crawl's web crawl corpus that contains text in multiple languages, and we can load it easily with Hugging Face's datasets module. For each language we'll process 10,000 samples, skipping any whose cleaned text is shorter than 50 characters, and from the resulting counts we'll keep only the top 1,000 most common bigrams.
Important Constants
MAX_SAMPLES = 10000
MIN_SAMPLE_TEXT_LENGTH = 50
TOP_N = 1000
Data Collection Method
def collect_data(language):
    # Stream the C4 split for this language so we never download the full corpus.
    dataset = load_dataset("allenai/c4", name=language, split="train", streaming=True)
counter = Counter()
total_bigrams = 0
processed = 0
for sample in tqdm(dataset, desc=f"{language} samples", unit=" samples", leave=False):
if processed >= MAX_SAMPLES:
break
text = clean_text(sample.get('text'))
if len(text) < MIN_SAMPLE_TEXT_LENGTH:
continue
bigrams = extract_bigrams(text)
if not bigrams:
continue
counter.update(bigrams)
total_bigrams += len(bigrams)
processed += 1
    # Add-one (Laplace) smoothing: each observed bigram gets one extra count,
    # and the denominator grows by the vocabulary size accordingly.
    vocab_size = len(counter)
    smoothed = {bigram: (count + 1) / (total_bigrams + vocab_size) for bigram, count in counter.items()}
    # Keep only the TOP_N most probable bigrams as this language's profile.
    top_bigrams = sorted(smoothed.items(), key=lambda x: -x[1])[:TOP_N]
    df = pd.DataFrame([
        {"bigram": bigram, "frequency": counter[bigram], "probability": round(prob, 8)}
        for bigram, prob in top_bigrams
    ])
return df
Creating Our DataFrames
df_en = collect_data('en')
print(df_en)
df_de = collect_data('de')
print(df_de)
df_es = collect_data('es')
print(df_es)
df_fr = collect_data('fr')
print(df_fr)
df_it = collect_data('it')
print(df_it)
Sample Output:
bigram frequency probability
0 th 411664 2.417233e-02
1 in 340419 1.998893e-02
2 he 321331 1.886811e-02
3 er 283262 1.663276e-02
4 an 275088 1.615280e-02
.. ... ... ...
995 yí 2 1.800000e-07
996 ób 2 1.800000e-07
997 äi 2 1.800000e-07
998 iß 2 1.800000e-07
999 ßh 2 1.800000e-07
[1000 rows x 3 columns]
....
Data Transformation
We now take each bigram's probability and convert it into a log probability by taking its natural logarithm. By definition, a log probability is simply the logarithm of a probability, and it is used widely in probability theory and computer science. We convert our probabilities to log probabilities for the following reasons:
- Probabilities range from 0 to 1, while log probabilities range from −∞ to 0.
- Log probabilities improve numerical stability when the probabilities are very small (see the quick sketch below).
- Log probabilities speed things up by turning multiplications into additions:
- log(p1 * p2) = log(p1) + log(p2)
- Log probabilities are commonly used when working with likelihoods.
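To see the numerical-stability point concretely, here is a quick sketch (a toy example, separate from our identifier) that multiplies many tiny probabilities directly and then does the same work in log space:

import numpy as np

# Toy example: 500 bigrams, each with probability 1e-4.
probs = np.full(500, 1e-4)

product = np.prod(probs)         # underflows to 0.0 in double precision
log_sum = np.sum(np.log(probs))  # stays at a perfectly usable number

print(product)  # 0.0
print(log_sum)  # about -4605.17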
To transform our probabilities into log probabilities we do the following:
def apply_log_probability_dict(df):
    # Add a log_probability column, then return a {bigram: log probability} lookup dict.
    df['log_probability'] = np.log(df['probability'])
    return df.set_index('bigram')['log_probability'].to_dict()
Applying the log transformation to our dataframes would look like this:
logprob_en = apply_log_probability_dict(df_en)
logprob_de = apply_log_probability_dict(df_de)
logprob_es = apply_log_probability_dict(df_es)
logprob_fr = apply_log_probability_dict(df_fr)
logprob_it = apply_log_probability_dict(df_it)
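As a quick sanity check, you can peek at a few entries of one of these dictionaries (the exact bigrams and values will depend on your crawl sample):

for bigram, logprob in list(logprob_en.items())[:5]:
    print(bigram, round(logprob, 4))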
Summing Our Log Probabilities
We will sum the log probabilities of all the bigrams in a sentence to get a total score for each language. The result will always be negative; remember, log probabilities range from −∞ to 0.
def score_sentence(text, logprob_dict):
    bigrams = extract_bigrams(clean_text(text))
    # Bigrams missing from a language's top-N profile fall back to a fixed penalty
    # (the exact value is a design choice; here we assume a very small probability).
    unseen = np.log(1e-8)
    return sum(logprob_dict.get(bigram, unseen) for bigram in bigrams)
Here is an example of what our output would look like:
en:-143.76 | de:-145.18 | es:-121.76 | fr:-129.56 | it:-133.38
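For reference, a line like the one above can be produced with a small sketch along these lines (the formatting is just a suggestion):

sentence = "Buenos días y buenas noches."
profiles = {'en': logprob_en, 'de': logprob_de, 'es': logprob_es, 'fr': logprob_fr, 'it': logprob_it}
scores = {lang: score_sentence(sentence, logprobs) for lang, logprobs in profiles.items()}
print(" | ".join(f"{lang}:{round(score, 2)}" for lang, score in scores.items()))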
Applying the Softmax Formula for Confidence
The softmax function is used to express how confidently the system believes the sentence belongs to the predicted language. After retrieving the summed scores, softmax transforms these scores into a probability distribution that adds up to 1. This allows us to compare the languages on an equal scale and determine how strongly the highest-scoring language stands out from the rest. A higher softmax value means the model is more certain in its prediction, while lower values indicate closer competition among languages. We're only retrieving the winning confidence score.
def detect_language(sentence):
scores = {
'en': score_sentence(sentence, logprob_en),
'de': score_sentence(sentence, logprob_de),
'es': score_sentence(sentence, logprob_es),
'fr' : score_sentence(sentence, logprob_fr),
'it' : score_sentence(sentence, logprob_it)
}
    winner = max(scores, key=scores.get)
    # Softmax: shift by the max score for numerical stability, exponentiate, then normalize.
    max_score = max(scores.values())
    exp_scores = {k: np.exp(v - max_score) for k, v in scores.items()}
    total = sum(exp_scores.values())
    confidence = exp_scores[winner] / total
return {
'language': winner,
'confidence': confidence,
'scores': {k: round(v, 2) for k, v in scores.items()}
}
This code may seem like a lot, so let's walk through what we're doing and how the softmax function is applied, step by step.
Listing Our Scores
The formula is fairly simple to understand. Let's assume the scores for the phrase "Buenos días y buenas noches." were:
| language code | score |
|---|---|
| en | -143.76 |
| de | -145.18 |
| es | -121.76 |
| fr | -129.56 |
| it | -133.38 |
Identifying The Max Score
We first need to identify the maximum score, i.e. our winner. In this case it's es with a score of -121.76.
Shifting Our Scores
Subtract the maximum score from each score to stabilize the exponentiation step. The formula is simply: shifted score = z - max(z).
| language code | Shifted Score (z-max) |
|---|---|
| en | -22.00 |
| de | -23.42 |
| es | 0.00 |
| fr | -7.80 |
| it | -11.62 |
Exponentiate Shifted Scores
Next, apply np.exp() to each shifted value. Remember, Euler's number (e) is the base used by np.exp(), so we are computing e^-22.00, e^-23.42, and so on.
| language code | exp(shifted score) |
|---|---|
| en | 2.789468e-10 |
| de | 6.742535e-11 |
| es | 1.000000e+00 |
| fr | 4.097350e-04 |
| it | 8.984587e-06 |
Summing The Exponentials
Next we sum the exponentials, which comes out to about 1.0004.
Softmax Probabilities
Finally, divide each exponential by the sum to get the final set of probabilities:
| language code | Softmax Probability |
|---|---|
| en | 2.788301e-10 |
| de | 6.739713e-11 |
| es | 9.995815e-01 |
| fr | 4.097350e-04 |
| it | 8.984587e-06 |
Our system is 0.9995844287888059 confident (roughly 99.96%) that the phrase "Buenos días y buenas noches." was written in Spanish, while the confidence assigned to English, German, French, and Italian is effectively zero.
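The whole walkthrough can be reproduced in a few lines of NumPy. This is a standalone sketch using the rounded scores from the table above, so the printed values will differ slightly from the exact ones shown earlier:

import numpy as np

scores = {'en': -143.76, 'de': -145.18, 'es': -121.76, 'fr': -129.56, 'it': -133.38}

values = np.array(list(scores.values()))
shifted = values - values.max()          # shift so the best score becomes 0
exp_scores = np.exp(shifted)             # exponentiate the shifted scores
softmax = exp_scores / exp_scores.sum()  # normalize so the values sum to 1

for lang, prob in zip(scores, softmax):
    print(f"{lang}: {prob:.10f}")
# es comes out on top with a probability of roughly 0.9996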
Putting Our Model to the Test
Let's put our model to the test by passing in a list of sentences and seeing what the scores and confidence values turn out to be. Our test data contains sentences written in English, German, Spanish, French, and Italian.
def get_results(test_sentences):
for sentence in test_sentences:
result = detect_language(sentence)
display(Markdown(f"[{result['language'].upper()}] {sentence}"))
display(Markdown(f" → en:{result['scores']['en']} | de:{result['scores']['de']} | es:{result['scores']['es']} | fr:{result['scores']['fr']} | it:{result['scores']['it']} | confidence: {result['confidence']}"))
test_sentences = [
"This is a test sentence in English",
"Das ist ein Test in Deutsch.",
"Esta es una oración de prueba en español.",
"The quick brown fox jumps over the lazy dog.",
"Straße und Fuß sind deutsche Wörter.",
"Niño y mañana son palabras españolas.",
"Hello world from Python!",
"Guten Morgen und auf Wiedersehen!",
"Buenos días y buenas noches.",
"Ceci est une phrase de test en français.",
"Mi piace passeggiare in centro città durante il weekend.",
"Le soleil brillant se couchait derrière l'horizon lointain, répandant une lumière dorée sur les collines ondulantes.",
"Il sole arancione brillante tramontava dietro l'orizzonte distante, diffondendo una calda luce dorata sulle colline ondulate e sul fiume sereno sottostante."
]
get_results(test_sentences=test_sentences)
Output:
[EN] This is a test sentence in English
→ en:-132.53 | de:-135.82 | es:-144.91 | fr:-138.08 | it:-140.45 | confidence: 0.9600376730030188
[DE] Das ist ein Test in Deutsch.
→ en:-104.14 | de:-98.17 | es:-108.49 | fr:-103.52 | it:-105.75 | confidence: 0.9922230796798067
[ES] Esta es una oración de prueba en español.
→ en:-212.92 | de:-209.85 | es:-165.64 | fr:-202.2 | it:-198.0 | confidence: 0.9999999999999909
[EN] The quick brown fox jumps over the lazy dog.
→ en:-219.93 | de:-236.21 | es:-232.47 | fr:-232.53 | it:-234.28 | confidence: 0.9999923288958297
[FR] Straße und Fuß sind deutsche Wörter.
→ en:-196.12 | de:-169.44 | es:-161.88 | fr:-141.32 | it:-159.87 | confidence: 0.9999999900816506
[ES] Niño y mañana son palabras españolas.
→ en:-211.75 | de:-213.9 | es:-165.36 | fr:-195.45 | it:-206.73 | confidence: 0.999999999999915
[EN] Hello world from Python!
→ en:-107.55 | de:-122.95 | es:-124.29 | fr:-124.13 | it:-125.19 | confidence: 0.99999965791378
[DE] Guten Morgen und auf Wiedersehen!
→ en:-153.42 | de:-137.59 | es:-157.76 | fr:-155.75 | it:-159.64 | confidence: 0.9999998520959484
[ES] Buenos días y buenas noches.
→ en:-143.76 | de:-145.18 | es:-121.76 | fr:-129.56 | it:-133.38 | confidence: 0.9995844287888059
[FR] Ceci est une phrase de test en français.
→ en:-180.06 | de:-179.2 | es:-175.82 | fr:-163.96 | it:-178.69 | confidence: 0.9999921703433644
[IT] Mi piace passeggiare in centro città durante il weekend.
→ en:-274.66 | de:-277.27 | es:-278.2 | fr:-265.75 | it:-253.46 | confidence: 0.9999954216283861
[FR] Le soleil brillant se couchait derrière l'horizon lointain, répandant une lumière dorée sur les collines ondulantes.
→ en:-583.35 | de:-581.77 | es:-563.21 | fr:-514.41 | it:-548.31 | confidence: 0.999999999999998
[IT] Il sole arancione brillante tramontava dietro l'orizzonte distante, diffondendo una calda luce dorata sulle colline ondulate e sul fiume sereno sottostante.
→ en:-699.51 | de:-737.65 | es:-691.31 | fr:-700.46 | it:-668.71 | confidence: 0.9999999998463776
Visualizing Our Bigram Log Probabilities Across Languages
Let's visualize the log-probability scores across all five language profiles for a sentence of your choosing. A good test sentence is: "The quick brown fox jumps over the lazy dog."
import matplotlib.pyplot as plt
user_input = input("Type a sentence in English, German, Spanish, French, or Italian: ")
bigrams = extract_bigrams(clean_text(user_input))
x = list(range(len(bigrams)))
logs_en = [logprob_en.get(bg, 0) for bg in bigrams]
logs_de = [logprob_de.get(bg, 0) for bg in bigrams]
logs_es = [logprob_es.get(bg, 0) for bg in bigrams]
logs_fr = [logprob_fr.get(bg, 0) for bg in bigrams]
logs_it = [logprob_it.get(bg, 0) for bg in bigrams]
plt.figure(figsize=(12, 5))
plt.plot(x, logs_en, marker='o', label="English (EN)")
plt.plot(x, logs_de, marker='o', label="German (DE)")
plt.plot(x, logs_es, marker='o', label="Spanish (ES)")
plt.plot(x, logs_fr, marker='o', label="French (FR)")
plt.plot(x, logs_it, marker='o', label="Italian (IT)")
plt.xticks(x, bigrams, rotation=45)
plt.xlabel("Bigrams")
plt.ylabel("Log Probability")
plt.title(f"Bigram Log-Probabilities Across Languages\n\"{user_input}\"")
plt.grid(False)
plt.legend()
plt.tight_layout()
plt.show()
get_results(test_sentences=[user_input])
Conclusion
It is possible to identify the language of a text simply by comparing the frequencies of its common bigrams against known bigram probabilities for different languages, with a scoring system as simple as summing log probabilities. Each language has a unique statistical profile of common, uncommon, and rare bigrams. By leveraging these methods we were able to identify the language of most of our test sentences, including long sentences and paragraphs, although one short German test sentence was misclassified as French.