
Muhammad N. ElNokrashy

Applied Scientist at Microsoft, Advanced Technology Lab, Cairo.

Graduate of Computer Science from The American University in Cairo (AUC).

My north star is human-AI collaboration.

I work primarily on language, with some vision. Keywords: NLP, generative LMs, MT. In addition:

  • Structured Languages: Consumption + Generation of structured messages from/to/between generalist LMs;
  • Alignment: Bias recognition | Interpretability | Robustness | Grounding | Teachability;
  • Programming Languages: PL UX Design | Type systems | Memory safety | Static contracts + capabilities;

Among other forays on the path towards useful, helpful, and trustworthy machine intelligence.

I'm open to academic collabs and mentorship!

Please include the antonyms of the words "High Negatives" in the subject-line.

Publications

New Featured!
Investigating Cultural Alignment of Large Language Models

Badr AlKhamissi; Muhammad ElNokrashy; Mai AlKhamissi; Mona Diab;

(ACL 2024) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics
Abstract

The intricate relationship between language and culture has long been a subject of exploration within the realm of linguistic anthropology. Large Language Models (LLMs), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? Our study reveals that these models demonstrate greater cultural alignment along two dimensions -- firstly, when prompted with the dominant language of a specific culture, and secondly, when pretrained with a refined mixture of languages employed by that culture. We quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references. Specifically, we replicate a survey conducted in various regions of Egypt and the United States through prompting LLMs with different pretraining data mixtures in both Arabic and English with the personas of the real respondents and the survey questions. Further analysis reveals that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. Finally, we introduce Anthropological Prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment. Our study emphasizes the necessity for a more balanced multilingual pretraining dataset to better represent the diversity of human experience and the plurality of different cultures with many implications on the topic of cross-lingual transfer.
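A minimal sketch of the survey-simulation setup described above, for illustration only; the persona fields, prompt wording, and the simple match-rate metric here are assumptions, not the paper's exact templates or scoring:

```python
# Illustrative sketch of persona-conditioned survey prompting (assumptions:
# persona fields, prompt wording, and a simple match-rate metric).

def persona_prompt(persona: dict, question: str) -> str:
    """Ask the model to answer a survey question as a specific respondent."""
    return (
        f"You are a {persona['age']}-year-old {persona['occupation']} "
        f"from {persona['region']}.\n"
        f"Survey question: {question}\n"
        f"Answer with one of the provided options."
    )

def alignment_rate(model_answers, human_answers) -> float:
    """Fraction of questions where the model's choice matches the real respondent's."""
    matches = sum(m == h for m, h in zip(model_answers, human_answers))
    return matches / len(human_answers)
```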

New Featured!
A Context-Contrastive Inference Approach To Partial Diacritization

Muhammad ElNokrashy; Badr AlKhamissi;

(ArabicNLP 2024) Proceedings of The Second Arabic Natural Language Processing Conference
Abstract

Diacritization plays a pivotal role in improving readability and disambiguating the meaning of Arabic texts. Substantial efforts have been devoted to Full Diacritization, which includes all marks on every eligible character. Comparatively overlooked is Partial Diacritization, which is the selection of a small subset of characters to be diacritized to aid comprehension where needed. Research has indicated that excessive diacritic usage can hinder skilled reading, causing slower reading speeds and reduced accuracy. In this light, we introduce a novel approach to Partial Diacritization which integrates seamlessly with existing Arabic diacritization systems. Our method examines each word twice, once with context and once without, and retains only the diacritics that show disparities between both inferences. Furthermore, we introduce novel indicators for measuring partial diacritization quality, contributing significantly to this area of research.
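A minimal sketch of the context-contrastive idea, for illustration only; it assumes any full-diacritization callable that returns per-character diacritic labels and yields the same character segmentation in both passes:

```python
from typing import Callable, List

# `diacritize(words)` is any existing full-diacritization system returning,
# for each input word, a list of per-character diacritic labels ("" = none).
Diacritizer = Callable[[List[str]], List[List[str]]]

def partial_diacritics(words: List[str], diacritize: Diacritizer) -> List[List[str]]:
    """Keep a diacritic only where the in-context and isolated passes disagree."""
    with_context = diacritize(words)                        # one pass over the sentence
    without_context = [diacritize([w])[0] for w in words]   # each word in isolation

    kept = []
    for ctx_marks, iso_marks in zip(with_context, without_context):
        word_marks = [
            ctx if ctx != iso else ""   # disagreement: the context changes the reading
            for ctx, iso in zip(ctx_marks, iso_marks)
        ]
        kept.append(word_marks)
    return kept
```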

New
eBLEU: Unexpectedly Good Machine Translation Evaluation Using Simple Word Embeddings

Muhammad ElNokrashy; Tom Kocmi;

WMT23 (Metrics Task)
Abstract

We propose eBLEU, a metric inspired by the BLEU metric that uses embedding similarities instead of string matches. We introduce meaning diffusion vectors to enable matching n-grams of semantically similar words in a BLEU-like algorithm, using efficient, non-contextual word embeddings like fastText. On WMT23 data, eBLEU beats BLEU and ChrF by around 3.8% system-level score, approaching BERTScore at -0.9% absolute difference. In WMT22 scenarios, eBLEU outperforms f101spBLEU and ChrF in MQM by 2.2%-3.6%. Curiously, on MTurk evaluations, eBLEU surpasses past methods by 3.9%-8.2% (f200spBLEU, COMET-22). eBLEU presents an interesting middle ground between traditional metrics and pretrained metrics.
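As a toy illustration of embedding-based matching (unigram precision only; the actual eBLEU algorithm uses meaning diffusion vectors, n-grams, and corpus-level aggregation, which are not reproduced here):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def soft_unigram_precision(hyp_tokens, ref_tokens, emb):
    """Credit each hypothesis token with its best cosine match in the reference.

    `emb` maps a token to a vector, e.g. from a fastText model (assumed interface).
    """
    if not hyp_tokens:
        return 0.0
    credits = []
    for h in hyp_tokens:
        best = max(cosine(emb(h), emb(r)) for r in ref_tokens)
        credits.append(max(best, 0.0))   # negative similarity earns no credit
    return sum(credits) / len(hyp_tokens)
```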

SOTA
Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To Word--Definition Alignment

Ahmed ElBakry *; Mohamed Gabr; Muhammad ElNokrashy; Badr AlKhamissi *;

The First Arabic NLP Conference (ArabicNLP-23) (colocated with EMNLP-23)
Abstract

A Reverse Dictionary is a tool enabling users to discover a word based on its provided definition, meaning, or description. Such a technique proves valuable in various scenarios, aiding language learners who possess a description of a word without its identity, and benefiting writers seeking precise terminology. These scenarios often encapsulate what is referred to as the "Tip-of-the-Tongue" (TOT) phenomenon. In this work, we present our winning solution for the Arabic Reverse Dictionary shared task. This task focuses on deriving a vector representation of an Arabic word from its accompanying description. The shared task encompasses two distinct subtasks: the first involves an Arabic definition as input, while the second employs an English definition. For the first subtask, our approach relies on an ensemble of finetuned Arabic BERT-based models, predicting the word embedding for a given definition. The final representation is obtained through averaging the output embeddings from each model within the ensemble. In contrast, the most effective solution for the second subtask involves translating the English test definitions into Arabic and applying them to the finetuned models originally trained for the first subtask. This straightforward method achieves the highest score across both subtasks.
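The ensemble step for the first subtask can be sketched as follows (illustrative only; `predict_embedding` stands in for whatever interface the finetuned models expose):

```python
import numpy as np

def ensemble_embedding(definition: str, models) -> np.ndarray:
    """Average the word embedding predicted by each finetuned model in the ensemble."""
    preds = [m.predict_embedding(definition) for m in models]   # assumed model API
    return np.mean(np.stack(preds, axis=0), axis=0)
```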

Featured!
Shadow-Cave Models: How Plato's Allegory Illuminates the Limitations of Large Language Models

Muhammad ElNokrashy; Badr AlKhamissi;

[preprint]
Abstract

This work outlines some limitations of Large Language Models (LLMs) and connects them to views on the philosophies of knowledge and perception, and then dubs them Shadow-Cave Models. One explanation is proposed for the astonishing performance of LLMs in the early 2020s in logic and knowledge tasks, even beyond the domain of language modeling. We argue that the mechanism which allows LLMs to perform well on some tests for these capabilities is also a limitation on higher understanding abilities.

In Submission
DWAtt: Depth-wise Attention for Efficient Text Classification

Muhammad ElNokrashy *; Badr AlKhamissi *; Mona Diab;

LREC 2024; ENLSP 2022 (@ NeurIPS 2022)
Abstract

Language Models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We put forward that, when using or finetuning deep pretrained models, intermediate layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of needed samples or steps. To test this, we propose a new layer fusion method: Depth-Wise Attention (DWAtt), to help re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline -- all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion shows 3.68-9.73% F1 gain at different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints.
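A simplified PyTorch sketch of depth-wise layer fusion by attention, for illustration; this is a reduced reading of the idea (a per-token attention over the stack of layer states), not the paper's exact DWAtt module:

```python
import torch
import torch.nn as nn

class DepthWiseAttention(nn.Module):
    """Fuse each token's hidden states across all layers with a learned attention."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)   # query from the last layer's state
        self.key = nn.Linear(d_model, d_model)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, d_model)
        last = layer_states[-1]                                  # (batch, seq, d)
        q = self.query(last).unsqueeze(0)                        # (1, batch, seq, d)
        k = self.key(layer_states)                               # (L, batch, seq, d)
        scores = (q * k).sum(-1) / last.size(-1) ** 0.5          # (L, batch, seq)
        weights = torch.softmax(scores, dim=0).unsqueeze(-1)     # attend over depth
        return (weights * layer_states).sum(dim=0)               # (batch, seq, d)
```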

Featured!
Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation

Muhammad N. ElNokrashy; Amr Hendy; Mohamed Maher; Mohamed Afify; Hany Hassan;

AMTA 2022 (Association for Machine Translation in the Americas)
Abstract

This paper proposes a simple yet effective method to improve direct (X-to-Y) translation for both cases: zero-shot and when direct data is available. We modify the input tokens at both the encoder and decoder to include signals for the source and target languages. We show a performance gain when training from scratch, or finetuning a pretrained model with the proposed setup. In the experiments, our method shows a gain of nearly 10.0 BLEU points on in-house datasets depending on the checkpoint selection criteria. In a WMT evaluation campaign, From-English performance improves by 4.17 and 2.87 BLEU points in the zero-shot setting and when direct data is available for training, respectively, while X-to-Y improves by 1.29 BLEU over the zero-shot baseline and 0.44 over the many-to-many baseline. In the low-resource setting, we see a 1.5~1.7 point improvement when finetuning on X-to-Y domain data.
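The token modification itself is tiny; a sketch (the tag formats below are illustrative, not necessarily the exact ones used in the paper):

```python
def add_language_tokens(src_tokens, tgt_tokens, src_lang: str, tgt_lang: str):
    """Prefix encoder and decoder inputs with source/target language signals."""
    enc_input = [f"<{src_lang}>", f"<2{tgt_lang}>"] + src_tokens   # source + desired target
    dec_input = [f"<{tgt_lang}>"] + tgt_tokens                     # target-side signal
    return enc_input, dec_input

# Example: an Arabic -> French direction that may have no direct training data (zero-shot).
enc, dec = add_language_tokens(["مرحبا"], ["Bonjour"], "ar", "fr")
```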

Venue Spotlight
The Emergence of Abstract and Episodic Neurons in Episodic Meta-RL

Badr AlKhamissi *; Muhammad ElNokrashy *; Michael Spranger;

MemARI (Memory in Artificial and Real Intelligence) (@ NeurIPS 2022)
Abstract

In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.

The Shared Task on Gender Rewriting

Bashar Alhafni; Nizar Habash; Houda Bouamor; Ossama Obeid; Sultan Alrowili; Daliyah Alzeer; Khawlah M Alshanqiti; Ahmed ElBakry; Muhammad ElNokrashy; Mohamed Gabr; Abderrahmane Issam; Abdelrahim Qaddoumi; K Vijay-Shanker; Mahmoud Zyate;

WANLP 2022 (EMNLP 2022)
[preprint]
Abstract

In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop. The task of gender rewriting refers to generating alternatives of a given sentence to match different target user gender contexts (e.g., female speaker with a male listener, a male speaker with a male listener, etc.). This requires changing the grammatical gender (masculine or feminine) of certain words referring to the users. In this task, we focus on Arabic, a gender-marking morphologically rich language. A total of five teams from four countries participated in the shared task.

SOTA
Adapting MARBERT for Improved Arabic Dialect Identification

Badr AlKhamissi; Mohamed Gabr; Muhammad ElNokrashy; Khaled Essam;

WANLP 2021 (EACL 2021)
Abstract

In this paper, we tackle the Nuanced Arabic Dialect Identification (NADI) shared task (Abdul-Mageed et al., 2021) and demonstrate state-of-the-art results on all of its four subtasks. Tasks are to identify the geographic origin of short Dialectal (DA) and Modern Standard Arabic (MSA) utterances at the levels of both country and province. Our final model is an ensemble of variants built on top of MARBERT that achieves an F1-score of 34.03% for DA on the country-level development set, an improvement of 7.63% over previous work.

Deep Spiking Neural Networks with Resonate-and-Fire Neurons

Badr AlKhamissi *; Muhammad N. ElNokrashy *; David Bernal-Casas;

Preprint
[preprint]
Abstract

In this work, we explore a new Spiking Neural Network (SNN) formulation with Resonate-and-Fire (RAF) neurons (Izhikevich, 2001) trained with gradient descent via back-propagation. The RAF-SNN, while more biologically plausible, achieves performance comparable to or higher than conventional models in the Machine Learning literature across different network configurations, using similar or fewer parameters. Strikingly, the RAF-SNN proves robust against noise induced at testing/training time, under both static and dynamic conditions. Against CNN on MNIST, we show 25% higher absolute accuracy with N(0, 0.2) induced noise at testing time. Against LSTM on N-MNIST, we show 70% higher absolute accuracy with 20% induced noise at training time.
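For intuition, a discretized simulation of the resonate-and-fire dynamics of Izhikevich (2001) might look like the sketch below; the constants and reset rule are assumptions, and the paper's trainable RAF-SNN formulation differs:

```python
def raf_neuron(inputs, b=-0.1, omega=2.0, dt=0.01, threshold=1.0):
    """Simulate one resonate-and-fire neuron as a damped complex oscillator."""
    z = 0.0 + 0.0j          # complex state: real = current-like, imag = voltage-like
    spikes = []
    for i in inputs:
        z = z + dt * ((b + 1j * omega) * z + i)   # Euler step of dz/dt = (b + i*omega) z + I
        fired = z.imag >= threshold               # spike when the "voltage" part crosses threshold
        spikes.append(int(fired))
        if fired:
            z = 0.0 + 0.0j                        # simple reset (an assumption, for illustration)
    return spikes
```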

Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

Muhammad N. ElNokrashy *; Amr Hendy *; Mohamed Abdelghaffar *; Mohamed Afify; Ahmed Tawfik; Hany Hassan Awadalla;

WMT20 (EMNLP 2020)
Abstract

This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7% and 5% relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.
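Conceptually, the combination step can be sketched as below (the normalization and uniform weights are assumptions for illustration, not the submission's exact recipe):

```python
import numpy as np

def combine_scores(laser_scores, classifier_scores, devkit_scores, weights=(1.0, 1.0, 1.0)):
    """Mix per-sentence-pair scores from the three sources into one filtering score."""
    def znorm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-9)     # put scores on a comparable scale
    sources = (laser_scores, classifier_scores, devkit_scores)
    parts = [w * znorm(s) for s, w in zip(sources, weights)]
    return np.sum(parts, axis=0)                     # higher = keep the pair
```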

SOTA
Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization

Badr AlKhamissi *; Muhammad ElNokrashy *; Mohamed Gabr;

WANLP 2020 (COLING 2020)
Abstract

We propose a novel architecture for labelling character sequences that achieves state-of-the-art results on the Tashkeela Arabic diacritization benchmark. The core is a two-level recurrence hierarchy that operates on the word and character levels separately—enabling faster training and inference than comparable traditional models. A cross-level attention module further connects the two and opens the door for network interpretability. The task module is a softmax classifier that enumerates valid combinations of diacritics. This architecture can be extended with a recurrent decoder that optionally accepts priors from partially diacritized text, which improves results. We employ extra tricks such as sentence dropout and majority voting to further boost the final result. Our best model achieves a WER of 5.34%, outperforming the previous state-of-the-art with a 30.56% relative error reduction.
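A much-simplified PyTorch sketch of the two-level recurrence (illustrative structure only; dimensions are arbitrary, and the cross-level attention, recurrent decoder, and training tricks of the paper are omitted):

```python
import torch
import torch.nn as nn

class HierarchicalDiacritizer(nn.Module):
    def __init__(self, n_chars, n_words, n_labels, d=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d)
        self.word_emb = nn.Embedding(n_words, d)
        self.word_rnn = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.char_rnn = nn.LSTM(3 * d, d, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d, n_labels)   # one class per valid diacritic combination

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n_words); char_ids: (batch, n_words, chars_per_word)
        word_states, _ = self.word_rnn(self.word_emb(word_ids))      # (B, W, 2d)
        B, W, C = char_ids.shape
        chars = self.char_emb(char_ids)                               # (B, W, C, d)
        ctx = word_states.unsqueeze(2).expand(-1, -1, C, -1)          # broadcast word context
        char_in = torch.cat([chars, ctx], dim=-1).view(B * W, C, -1)  # fuse the two levels
        char_states, _ = self.char_rnn(char_in)                       # (B*W, C, 2d)
        return self.classifier(char_states).view(B, W, C, -1)         # per-character logits
```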



People: Mentorship + Collaboration

Mentors

Collaborators

Mentees



Positions + Roles

  • Applying Science at Microsoft Mobile Experiences

    February 2023 – Present  |  Microsoft ATL, Cairo, Egypt

  • Applying Science at Microsoft Translator

    Q2 2020 – February 2023  |  Microsoft ATL, Cairo, Egypt

    • Researched low-resource MT, introducing tools to improve data efficiency and general model performance.
    • Researched multilingual MT, introducing modeling methods to improve general and zero-shot performance.
    • Contributed to the development of an internal modeling and training framework.

  • Applying Science at LUIS

    Q4 2018 – Q2 2020  |  Microsoft ATL, Cairo, Egypt

    • Researched non-ML methods for text classification and slot filling.
    • Introduced a light, interpretable word embedding method.

  • Graduate Teaching Assistant

    Q4 2018 – Q4 2018  |  The American University in Cairo, Cairo, Egypt

    • Conducted lab sessions for the advanced Embedded Systems course under Dr. M. Shalan.
    • Held office hours and marked assignments for Practical Machine Deep Learning, a course on modern machine learning.