The Arabic diacriticized corpus

Posted on Sun 17 December 2023 in Language

بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

This article details the process used to create a diacriticized corpus from books in Maktaba Shamela, along with some basic research on the result.

Downloading the original & modified corpora

Maktaba Shamela used to make its books available for download on its old website, but since the site was updated, it has become very difficult to get raw copies of the books in EPUB or other formats.

They perhaps missed a golden opportunity to use Git to track changes and updates for all books.

Nevertheless, I was able to obtain an archive of all the books (extracted from the old website) via Telegram, available here: @shamela_kindle.

If you are using Linux, run the following commands to create a single zip archive and then extract all the books from that archive:

cat shamela_epub.z01 shamela_epub.z02 shamela_epub.z03 shamela_epub.z04 shamela_epub.zip > combined.zip
7z x combined.zip
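After extraction, you may want to confirm the books are all in place. Here is a minimal Python sketch (assuming the same extraction path used in the scripts below) that counts the EPUB files per category folder:

# Sketch: count extracted EPUB files per top-level category folder.
# Adjust the path to wherever you extracted combined.zip.
from pathlib import Path
from collections import Counter

extracted = Path('/home/path/to/extracted-files/')
counts = Counter(path.parent.name for path in extracted.rglob('*.epub'))

for folder, count in sorted(counts.items()):
    print(f'{folder}: {count} books')
print(f'Total: {sum(counts.values())} books')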

The diacriticized corpus can be downloaded here:

Shamela Diacritics Corpus

Install Python packages

Create a project directory with a virtual environment and activate it:

mkdir diacritics-measure
virtualenv diacritics-measure/
source diacritics-measure/bin/activate

Install required Python packages:

pip install ebooklib camel-tools matplotlib fuzzywuzzy

Scripts and research output

Create a script called 01_measure.py and add this:

'''
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. 
'''
import os
import ebooklib
from ebooklib import epub
import re
from camel_tools.utils.charsets import AR_DIAC_CHARSET, AR_LETTERS_CHARSET

def count_diacritics(word):
    num_letters = sum(1 for d in word if d in AR_LETTERS_CHARSET)
    num_diacritics = sum(1 for c in word if c in AR_DIAC_CHARSET)
    return num_letters, num_diacritics

def main():
    input_folder = '/home/path/to/extracted-files/'
    output_file = 'output.txt'

    with open(output_file, 'w', encoding='utf-8') as output:
        for root, dirs, files in os.walk(input_folder):
            for file in files:
                if file.endswith('.epub'):
                    file_path = os.path.join(root, file)
                    book = epub.read_epub(file_path)

                    # Count number of letters and diacritics per word
                    num_letters = 0
                    num_diacritics = 0
                    for item in book.get_items():
                        if item.get_type() == ebooklib.ITEM_DOCUMENT:
                            text = item.get_content().decode('utf-8')
                            words = re.findall(r'[\u0600-\u06FF]+', text)
                            for word in words:
                                word_letters, word_diacritics = count_diacritics(word)
                                num_letters += word_letters
                                num_diacritics += word_diacritics

                    # Calculate percentage of text that has diacritics
                    # (skip books with no Arabic letters to avoid dividing by zero)
                    if num_letters == 0:
                        continue
                    percentage = (num_diacritics / num_letters) * 100

                    # Get the title of the book
                    title_data = book.get_metadata('DC', 'title')
                    title = title_data[0][0]

                    # Write the folder name, EPUB title, and percentage to the output file
                    output.write(f"{root}, {title}, {percentage:.2f}%\n")

if __name__ == '__main__':
    main()

The script calculates the percentage of diacritics in each book and writes one line per book to output.txt, which is then used for further processing.
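As a quick illustration of what count_diacritics measures, here is a small self-contained example; the word is just a sample, and the expected counts assume the standard camel-tools charsets:

# Quick check of the letter/diacritic counting used in 01_measure.py.
from camel_tools.utils.charsets import AR_DIAC_CHARSET, AR_LETTERS_CHARSET

def count_diacritics(word):
    num_letters = sum(1 for d in word if d in AR_LETTERS_CHARSET)
    num_diacritics = sum(1 for c in word if c in AR_DIAC_CHARSET)
    return num_letters, num_diacritics

letters, diacritics = count_diacritics('كِتَابٌ')  # sample fully diacriticized word
print(letters, diacritics)                   # should print 4 letters, 3 diacritics
print(f'{diacritics / letters * 100:.2f}%')  # roughly 75% for this word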

Create a script called 02_diacritic_sort.py and add this:

'''
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. 
'''
import csv
import operator

# Read the file
with open('output.txt', 'r') as input_file:
    reader = csv.reader(input_file)
    data = list(reader)

# Pair each line's last field (the percentage) with the full line;
# titles may contain commas, so only the last field is reliable
extracted_data = [(line[-1], line) for line in data if len(line) > 2]

# Convert percentages to floats for sorting
for i, item in enumerate(extracted_data):
    percentage = item[0].strip('%')
    extracted_data[i] = (float(percentage), item[1])

# Sort the data
sorted_data = sorted(extracted_data, key=operator.itemgetter(0), reverse=True)

# Write the sorted data to a new file
with open('hi-hark.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    for _, line in sorted_data:
        writer.writerow(line)

This script creates a sorted CSV file (hi-hark.csv) which we can use for creating graphs.
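One detail worth noting: book titles can themselves contain commas, which is why the script keys off the last field rather than a fixed column. A small illustration with placeholder values:

import csv
from io import StringIO

# A line from output.txt whose title contains a comma (placeholder values):
line = '/home/path/to/extracted-files/0, A Title, With A Comma, 63.27%'

# The comma inside the title produces an extra field,
# so the percentage is always the *last* field, not a fixed column.
row = next(csv.reader(StringIO(line)))
print(len(row))         # 4 fields instead of the usual 3
print(row[-1].strip())  # 63.27%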

Create a script called 03_diacritic_sort.py and add this:

'''
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. 
'''
import csv
import matplotlib.pyplot as plt

# Read the file
with open('hi-hark.csv', 'r') as input_file:
    reader = csv.reader(input_file)
    data = list(reader)

# Extract the last entry (percentage) in each line
percentages = [float(line[-1].strip('%')) for line in data if len(line) > 2]

# Sort the percentages from highest to lowest
sorted_percentages = sorted(percentages, reverse=True)

# Plot the percentage against the number of books
plt.plot(sorted_percentages)
plt.xlabel('Number of Books')
plt.ylabel('Percentage')
plt.title('Percentage vs. Number of Books')

plt.savefig('output.png')

This script reads the CSV file and generates a graph of the percentage of diacritics against the number of books. It shows a roughly inverse exponential curve: the number of books with a high percentage of diacritics drops off steeply.
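To put rough numbers on that drop-off, a short sketch can count how many books meet a few thresholds using the same hi-hark.csv; the thresholds here are only examples, and the counts will depend on your own run:

import csv

# Count how many books meet or exceed a few diacritic-percentage thresholds.
with open('hi-hark.csv', 'r') as f:
    percentages = [float(row[-1].strip().strip('%')) for row in csv.reader(f) if row]

for threshold in (90, 70, 50, 30, 10):
    count = sum(1 for p in percentages if p >= threshold)
    print(f'>= {threshold}%: {count} of {len(percentages)} books')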

Create a script called 04_cat_sort.py and add this:

'''
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. 
'''
import csv
import re

# Function to extract category number and percentage from directory path
def extract_category_number_and_percentage(row):
    # Extract category number
    directories = row[0].split('/')
    last_directory = directories[-1]
    match = re.match(r'\d+', last_directory)
    if match:
        category_number = int(match.group())
    else:
        category_number = 0  # or some other default value

    # Extract percentage
    if len(row) > 2:
        percentage = row[-1].strip('%')
        percentage = float(percentage)
    else:
        percentage = 0  # or some other default value

    return category_number, percentage

# Read the CSV file
with open('output.txt', 'r') as csvfile:
    reader = csv.reader(csvfile)
    data = list(reader)

# Sort the data based on the category number and percentage
data.sort(key=lambda row: extract_category_number_and_percentage(row), reverse=True)

# Write the sorted data to a new CSV file
with open('cat-order-hi-hark.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)

The CSV file generated by this script is used in the next step to create graphs showing the percentage of diacritics against the number of books per category.

Create a script called 05_cat_plot.py and add this:

'''
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. 
'''
import csv
import matplotlib.pyplot as plt
import os
import re

# Function to extract category number and percentage from directory path
def extract_category_number_and_percentage(row):
    # Extract category number
    directories = row[0].split('/')
    last_directory = directories[-1]
    match = re.match(r'\d+', last_directory)
    if match:
        category_number = int(match.group())
    else:
        category_number = 0  # or some other default value

    # Extract percentage
    if len(row) > 2:
        percentage = row[-1].strip('%')
        percentage = float(percentage)
    else:
        percentage = 0  # or some other default value

    return category_number, percentage

# Read the file
with open('cat-order-hi-hark.csv', 'r') as input_file:
    reader = csv.reader(input_file)
    data = list(reader)

# Extract category number and percentage for each row
categories_and_percentages = [extract_category_number_and_percentage(row) for row in data]

# Group data by category
grouped_data = {}
for category, percentage in categories_and_percentages:
    if category not in grouped_data:
        grouped_data[category] = []
    grouped_data[category].append(percentage)

# Ensure the output directory for the per-category graphs exists
os.makedirs('output', exist_ok=True)

# Plot the percentage against the number of books for each category
for category, percentages in grouped_data.items():
    sorted_percentages = sorted(percentages, reverse=True)
    plt.plot(sorted_percentages)
    plt.xlabel('Number of Books')
    plt.ylabel('Percentage')
    plt.title(f'Percentage vs. Number of Books for Category {category}')
    plt.savefig(f'output/output_{category}.png')
    plt.clf()  # Clear the current figure for the next plot

This script reads the category-sorted CSV file and generates one graph per category, plotting the percentage of diacritics against the number of books.

Below are the images of the graphs for each category (10 categories):

The rest of the images can be found at:

Diacritics Result per category

Create a script called 06_new_corpus.py and add this:

'''
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. 
'''
import os
import re
import shutil

import ebooklib
from ebooklib import epub
from camel_tools.utils.charsets import AR_DIAC_CHARSET, AR_LETTERS_CHARSET

def count_diacritics(word):
    num_letters = sum(1 for d in word if d in AR_LETTERS_CHARSET)
    num_diacritics = sum(1 for c in word if c in AR_DIAC_CHARSET)
    return num_letters, num_diacritics

def main():
    input_folder = '/home/path/to/extracted-files/'
    output_folder = '/home/path/to/new-corpus/'

    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith('.epub'):
                file_path = os.path.join(root, file)
                book = epub.read_epub(file_path)

                # Count number of letters and diacritics per word
                num_letters = 0
                num_diacritics = 0
                for item in book.get_items():
                    if item.get_type() == ebooklib.ITEM_DOCUMENT:
                        text = item.get_content().decode('utf-8')
                        words = re.findall(r'[\u0600-\u06FF]+', text)
                        for word in words:
                            word_letters, word_diacritics = count_diacritics(word)
                            num_letters += word_letters
                            num_diacritics += word_diacritics

                # Calculate percentage of text that has diacritics
                # (skip books with no Arabic letters to avoid dividing by zero)
                if num_letters == 0:
                    continue
                percentage = (num_diacritics / num_letters) * 100

                # Get the title of the book
                title_data = book.get_metadata('DC', 'title')
                title = title_data[0][0]

                # If the percentage of diacritics is 50% or higher, copy the book to a new folder
                if percentage >= 50:
                    # Use only the last part of the root path (the category folder) as the new folder name
                    new_folder = os.path.join(output_folder, os.path.basename(root))
                    os.makedirs(new_folder, exist_ok=True)
                    shutil.copy(file_path, new_folder)

if __name__ == '__main__':
    main()

I had some issues using the CSV files to create the new diacriticized corpus. Therefore, I modified the first script (into 06_new_corpus.py above) and re-ran it to copy all the books with 50% or higher diacritics into their own folders.

The output folder is what I uploaded to the archive.org link shared above, and here it is once again:

Shamela Diacritics Corpus

A corpus to build upon

The corpus provided here can be used for multiple use cases. I will mention two of them related to research I am doing.

Firstly, the corpus can be used for diacritization itself. One idea I had was to use a tool like ChatGPT or a state-of-the-art Arabic diacritization model to generate diacritics for a specific book. The other diacriticized books in that book's category can then serve as a reference, to check whether the machine-generated diacritics are accurate and match similar words, phrases or sentences in the corpus.
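This is where the fuzzywuzzy package installed earlier could come in. Below is a minimal sketch, with placeholder sentences and an arbitrary similarity threshold, of matching a machine-diacritized sentence against reference sentences drawn from the corpus:

# Sketch: fuzzy-match machine-generated diacritics against corpus sentences.
# Both the generated sentence and the references are placeholders.
from fuzzywuzzy import fuzz

generated = 'بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ'  # output of a diacritization model
references = [
    'بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ',  # sentence taken from a corpus book
]

for reference in references:
    score = fuzz.ratio(generated, reference)
    if score >= 90:  # arbitrary similarity threshold
        print(f'Close match (score {score}): {reference}')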

Secondly, this diacriticized corpus can be used for text-to-speech. In this case, books that are highly diacriticized can be used to create audio data that can be trained for TTS purposes using PiperTTS or CoquiTTS.

The scripts above, specifically script 06, could be used to create a more concentrated corpus by adjusting the diacritics threshold to suit your needs, for example 90%, 80% or 70%, as sketched below.
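As a minimal sketch of that adjustment (the --threshold argument name is my own choice, not part of the original script), script 06 could read the threshold from the command line instead of hard-coding 50:

# Sketch: run as e.g.  python 06_new_corpus.py --threshold 80
import argparse

parser = argparse.ArgumentParser(description='Build a diacriticized sub-corpus')
parser.add_argument('--threshold', type=float, default=50.0,
                    help='minimum percentage of diacritics a book must have')
args = parser.parse_args()

# ...then inside main(), replace the hard-coded check with:
#     if percentage >= args.threshold: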

Feedback

I am always seeking feedback on these projects. It sometimes feels like these areas of research, specifically Classical Arabic and Islam (with regard to NLP, AI and machine learning), have very few research groups focused on them.


If you don't know how to use RSS and want email updates on my new content, consider Joining my Newsletter

The original content of this blog is a Waqf solely for the Pleasure of Allah. You are hereby granted full permission to copy, download, distribute, publish and share this content without modification under condition that full attribution is given to this author by creating a link either above or below the content that links back to the original source of the content. For any questions or ambiguity, you are requested to contact me via email for clarification.