Papers
Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
مِسراج — Misraj AI
Built on Trust. Measured by Impact.
The next-generation Arabic AI lab — building the foundational infrastructure for Arabic language understanding, generation, and document intelligence.
🧭 About Us
Misraj AI is the AI research division of Misraj Technology, a Saudi-based technology group with over 10 years of experience delivering enterprise digital solutions across 15 sectors. Our AI lab is dedicated to a singular mission: making Arabic a first-class language in the modern AI era.
We develop open models, large-scale datasets, rigorous benchmarks, and production-ready AI systems — all purpose-built for Arabic, a morphologically rich language that has long been underserved by mainstream AI research.
From our research lab to operational products, we build a comprehensive system that enables governments and enterprises to adopt AI with confidence, depth, and speed.
📊 15+ research papers · 35+ billion open Arabic data tokens · Honored by AI Pioneers
🏢 Areas of Expertise
Our AI solutions span critical industry verticals, combining deep domain knowledge with state-of-the-art Arabic NLP:
- 🏥 Healthcare Technology — Clinical documentation and Arabic medical NLP
- 🏦 Financial Technology — Document intelligence for banking and finance
- ⚖️ Legal Technology — Contract analysis and legal document processing
- 🎓 Educational Technology — Arabic learning and knowledge systems
- 🏛️ Administrative Technology — Government and enterprise document automation
📈 Open Benchmarks & SOTA Results
We develop rigorous, expert-verified benchmarks to establish clear performance standards for Arabic AI. Our models consistently lead these benchmarks against both open-source and commercial competitors.
| Benchmark | Focus | Key Performance (SOTA) |
|---|---|---|
| Misraj-DocOCR | Arabic Document OCR | Baseer achieves 0.25 WER, outperforming Azure AI and Gemini 2.5 Pro. |
| KITAB-Reviewed | PDF-to-Markdown | Baseer leads on structure, with a TEDS of 56 and a MARS score of 68.13. |
| Tarjama-25 | Bidirectional Translation | Mutarjim (1.5B) outperforms models 20x its size (including GPT-4o mini) in EN→AR. |
| SadeedDiac-25 | Arabic Diacritization | Sadeed achieves a competitive 1.2% Diacritic Error Rate (DER). |
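For readers unfamiliar with the WER column above, here is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the table does not state how Misraj-DocOCR normalizes or tokenizes text, so this may differ from the benchmark's actual scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Assumption: words are split on whitespace with no normalization.
    A real benchmark may normalize punctuation or diacritics first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the cat sat down", "the cat sit down"))  # 0.25
```

Under this definition, a WER of 0.25 corresponds to roughly one word-level error for every four reference words; lower is better.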
📦 Open Datasets
Our large-scale datasets provide the foundational fuel for high-performance Arabic model training.
| Dataset | Description | Scale |
|---|---|---|
| msdd | Misraj Structured Document Dataset | 26.4M rows |
| mudd | Misraj Unstructured Document Dataset | 4.76M rows |
| Arabic-Image-Captioning | Multimodal Arabic captioning pairs | 100M pairs |
| Sadeed Tashkeela | Cleaned & expert-filtered diacritization corpus | 1.05M samples |
📊 35+ billion open Arabic data tokens released and growing.
📬 Connect With Us
| Platform | Link |
|---|---|
| 🌐 Misraj AI | misraj.ai/en |
| 🌐 Misraj Technology | misraj.sa/en |
| 🔵 Baseer OCR | baseerocr.com |
| 🤗 Hugging Face | huggingface.co/Misraj |
| LinkedIn | linkedin.com/company/aimisraj |
| 🐦 X / Twitter | @aimisraj |
| 💻 GitHub | github.com/misraj-ai |
| | @misraj__ai |