
Muhammad N. ElNokrashy
Applied Scientist at Microsoft, Advanced Technology Lab, Cairo.
Graduate of Computer Science from The American University in Cairo (AUC).
My north star is human-AI collaboration.
I work primarily on language, with some vision. Keywords: NLP, generative LMs, MT. In addition:
- Structured Languages: Consumption + Generation of structured messages from/to/between generalist LMs;
- Alignment: Bias recognition | Interpretability | Robustness | Grounding | Teachability;
- Programming Languages: PL UX Design | Type systems | Memory safety | Static contracts + capabilities;
Among other forays on the path towards useful, helpful, and trustworthy machine intelligence.
I'm open to academic collabs and mentorship!
Please include the antonyms of the words "High Negatives" in the subject-line.
Publications
Abstract
The intricate relationship between language and culture has long been a subject of exploration within the realm of linguistic anthropology. Large Language Models (LLMs), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? Our study reveals that these models demonstrate greater cultural alignment along two dimensions -- firstly, when prompted with the dominant language of a specific culture, and secondly, when pretrained with a refined mixture of languages employed by that culture. We quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references. Specifically, we replicate a survey conducted in various regions of Egypt and the United States by prompting LLMs with different pretraining data mixtures, in both Arabic and English, with the personas of the real respondents and the survey questions. Further analysis reveals that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. Finally, we introduce Anthropological Prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment. Our study emphasizes the necessity of a more balanced multilingual pretraining dataset to better represent the diversity of human experience and the plurality of different cultures, with implications for cross-lingual transfer.
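A minimal sketch of how persona-conditioned survey prompting could be simulated; the prompt wording, persona fields, and model interface here are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch of persona-conditioned survey prompting.
# Prompt wording and the model call are assumptions, not the paper's exact protocol.

def build_survey_prompt(persona: dict, question: str, options: list[str]) -> str:
    """Compose a prompt that asks the model to answer as a given survey respondent."""
    persona_desc = ", ".join(f"{k}: {v}" for k, v in persona.items())
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        f"You are answering a sociological survey as the following person: {persona_desc}.\n"
        f"Question: {question}\n"
        f"Options:\n{numbered}\n"
        "Reply with the number of the option this person would most likely choose."
    )

persona = {"region": "Cairo", "age": 34, "gender": "female", "education": "university"}
prompt = build_survey_prompt(
    persona,
    "How important is religion in your life?",
    ["Very important", "Rather important", "Not very important", "Not at all important"],
)
# response = some_chat_model(prompt)  # any instruction-tuned LLM; alignment is then
# measured by comparing model answers against the real survey responses for that persona group.
```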
Abstract
Diacritization plays a pivotal role in improving readability and disambiguating the meaning of Arabic texts. Substantial efforts have been devoted to Full Diacritization, which includes all marks on every eligible character. Comparatively overlooked is Partial Diacritization: selecting a small subset of characters to be diacritized, to aid comprehension where needed. Research has indicated that excessive diacritic usage can hinder skilled reading, causing slower reading speeds and reduced accuracy. In this light, we introduce a novel approach to Partial Diacritization which integrates seamlessly with existing Arabic diacritization systems. Our method examines each word twice, once with context and once without, and retains only the diacritics that show disparities between the two inferences. Furthermore, we introduce novel indicators for measuring partial diacritization quality, contributing significantly to this area of research.
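A minimal sketch of the selection rule described above: diacritize each word with and without sentence context, and keep only the marks where the two inferences disagree. The `diacritize` callable stands in for any existing Arabic diacritization system (an assumption about its interface).

```python
# Sketch of "keep the diacritic only where context changes the reading".
# `diacritize` is a placeholder for an existing full-diacritization system; it is
# assumed to return, per word, a list of diacritic marks aligned to the word's characters.

from typing import Callable, List

Diacritizer = Callable[[List[str]], List[List[str]]]  # words -> per-word, per-character marks

def partial_diacritics(words: List[str], diacritize: Diacritizer) -> List[List[str]]:
    with_context = diacritize(words)                       # one pass over the full sentence
    without_context = [diacritize([w])[0] for w in words]  # each word in isolation
    selected = []
    for ctx_marks, iso_marks in zip(with_context, without_context):
        # Keep the contextual mark only where the two inferences disagree:
        # that is where the diacritic carries disambiguating information.
        selected.append([c if c != i else "" for c, i in zip(ctx_marks, iso_marks)])
    return selected
```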
Abstract
We propose eBLEU, a metric inspired by BLEU that uses embedding similarities instead of string matches. We introduce meaning diffusion vectors to enable matching n-grams of semantically similar words in a BLEU-like algorithm, using efficient, non-contextual word embeddings like fastText. On WMT23 data, eBLEU beats BLEU and ChrF by around 3.8% in system-level score, approaching BERTScore at -0.9% absolute difference. In WMT22 scenarios, eBLEU outperforms f101spBLEU and ChrF in MQM by 2.2%-3.6%. Curiously, on MTurk evaluations, eBLEU surpasses past methods (f200spBLEU, COMET-22) by 3.9%-8.2%. eBLEU presents an interesting middle ground between traditional metrics and pretrained metrics.
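An illustrative sketch, not the paper's exact formulation: a soft, embedding-based n-gram precision in the spirit of eBLEU, where an n-gram is represented by the mean of its word vectors and credited by its best cosine match in the reference. The meaning diffusion vectors themselves are not reproduced here, and the embedding table is assumed (e.g., fastText vectors loaded into a dict).

```python
# Soft n-gram precision with non-contextual embeddings (simplified, eBLEU-like).
# `emb` is assumed to map each token to a NumPy vector (e.g., fastText word vectors).

import numpy as np

def ngram_vecs(tokens, emb, n):
    """Mean word vector for each n-gram in the sentence."""
    return [np.mean([emb[t] for t in tokens[i:i + n]], axis=0)
            for i in range(len(tokens) - n + 1)]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def soft_ngram_precision(cand, ref, emb, n=2):
    c_vecs, r_vecs = ngram_vecs(cand, emb, n), ngram_vecs(ref, emb, n)
    if not c_vecs or not r_vecs:
        return 0.0
    # Each candidate n-gram is credited by its best semantic match in the reference,
    # instead of requiring an exact string match as in BLEU.
    return sum(max(cos(c, r) for r in r_vecs) for c in c_vecs) / len(c_vecs)
```

As in BLEU, such per-order precisions would typically be combined with a geometric mean and a brevity penalty to obtain a sentence or system score.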
Abstract
A Reverse Dictionary is a tool enabling users to discover a word based on its provided definition, meaning, or description. Such a technique proves valuable in various scenarios, aiding language learners who possess a description of a word without its identity, and benefiting writers seeking precise terminology. These scenarios often encapsulate what is referred to as the "Tip-of-the-Tongue" (TOT) phenomenon. In this work, we present our winning solution for the Arabic Reverse Dictionary shared task. This task focuses on deriving a vector representation of an Arabic word from its accompanying description. The shared task encompasses two distinct subtasks: the first involves an Arabic definition as input, while the second employs an English definition. For the first subtask, our approach relies on an ensemble of finetuned Arabic BERT-based models, predicting the word embedding for a given definition. The final representation is obtained through averaging the output embeddings from each model within the ensemble. In contrast, the most effective solution for the second subtask involves translating the English test definitions into Arabic and applying them to the finetuned models originally trained for the first subtask. This straightforward method achieves the highest score across both subtasks.
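A sketch of the ensembling step under stated assumptions: each finetuned Arabic BERT-based model is wrapped as a function mapping a definition string to a predicted embedding vector; the wrappers and any translation step are placeholders, not the actual submission code.

```python
# Hypothetical sketch of embedding-prediction ensembling by averaging.
# Each element of `models` is assumed to be a callable: definition string -> np.ndarray.

import numpy as np

def ensemble_predict(definition: str, models) -> np.ndarray:
    """Average the embedding predicted by each ensemble member for one definition."""
    preds = np.stack([m(definition) for m in models])  # (num_models, embed_dim)
    return preds.mean(axis=0)

# For the English subtask, the description above suggests translating the definition
# into Arabic first, then reusing the same Arabic-finetuned ensemble:
# vec = ensemble_predict(translate_en_to_ar(definition_en), models)
```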
Abstract
This work outlines some limitations of Large Language Models (LLMs), connects them to views from the philosophies of knowledge and perception, and then dubs them Shadow-Cave Models. One explanation is proposed for the astonishing performance of LLMs in the early 2020s on logic and knowledge tasks, even beyond the domain of language modeling. We argue that the mechanism which allows LLMs to perform well on some tests of these capabilities is also a limitation on higher understanding abilities.
Abstract
Language Models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We put forward that, when using or finetuning deep pretrained models, intermediate layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of needed samples or steps. To test this, we propose a new layer fusion method: Depth-Wise Attention (DWAtt), to help re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline -- all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion shows 3.68-9.73% F1 gain at different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints.
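A minimal PyTorch reading of attention over layer outputs, in the spirit of DWAtt. The exact parameterization in the paper may differ; this sketch assumes that, for each token, a learned query attends over that token's hidden states from all layers to produce the fused representation fed to the task head.

```python
# One plausible layer-fusion-by-attention module (illustrative, not the paper's exact DWAtt).

import torch
import torch.nn as nn

class LayerFusionAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))  # shared learned query over layers
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, d_model), i.e. all intermediate outputs
        k = self.key(layer_states)                                   # (L, B, T, D)
        v = self.value(layer_states)                                 # (L, B, T, D)
        scores = torch.einsum("d,lbtd->lbt", self.query, k) / k.size(-1) ** 0.5
        weights = scores.softmax(dim=0)                              # attention over the layer axis
        fused = torch.einsum("lbt,lbtd->btd", weights, v)            # (B, T, D)
        return fused                                                 # fed to the downstream task head
```

The Concat baseline mentioned above would instead concatenate the per-layer states along the feature dimension and project them back to `d_model`.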
Abstract
This paper proposes a simple yet effective method to improve direct (X-to-Y) translation both in the zero-shot case and when direct data is available. We modify the input tokens at both the encoder and decoder to include signals for the source and target languages. We show a performance gain when training from scratch, or when finetuning a pretrained model with the proposed setup. In the experiments, our method shows a gain of nearly 10.0 BLEU points on in-house datasets, depending on the checkpoint selection criteria. In a WMT evaluation campaign, From-English performance improves by 4.17 and 2.87 BLEU points in the zero-shot setting and when direct data is available for training, respectively. X-to-Y performance improves by 1.29 BLEU over the zero-shot baseline, and by 0.44 over the many-to-many baseline. In the low-resource setting, we see a 1.5-1.7 point improvement when finetuning on X-to-Y domain data.
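A sketch of the input-tagging idea: prepend explicit source/target language signals to both the encoder and decoder token streams. The tag format and placement here are assumptions for illustration; the paper's exact token scheme may differ.

```python
# Illustrative language-signal tagging for a multilingual encoder-decoder setup.

def tag_example(src_tokens, tgt_tokens, src_lang: str, tgt_lang: str):
    encoder_input = [f"<{src_lang}>", f"<2{tgt_lang}>"] + src_tokens  # source lang + "to-target" signal
    decoder_input = [f"<{tgt_lang}>"] + tgt_tokens                    # target lang signal on the decoder side
    return encoder_input, decoder_input

enc, dec = tag_example(["مرحبا", "بالعالم"], ["Hello", "world"], "ar", "en")
# enc -> ['<ar>', '<2en>', 'مرحبا', 'بالعالم']
# dec -> ['<en>', 'Hello', 'world']
```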
Abstract
In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.
Abstract
In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop. The task of gender rewriting refers to generating alternatives of a given sentence to match different target user gender contexts (e.g., female speaker with a male listener, a male speaker with a male listener, etc.). This requires changing the grammatical gender (masculine or feminine) of certain words referring to the users. In this task, we focus on Arabic, a gender-marking morphologically rich language. A total of five teams from four countries participated in the shared task.
Abstract
In this paper, we tackle the Nuanced Arabic Dialect Identification (NADI) shared task (Abdul-Mageed et al., 2021) and demonstrate state-of-the-art results on all four of its subtasks. The tasks are to identify the geographic origin of short Dialectal Arabic (DA) and Modern Standard Arabic (MSA) utterances at both the country and province levels. Our final model is an ensemble of variants built on top of MARBERT that achieves an F1-score of 34.03% for DA at the country level on the development set -- an improvement of 7.63% over previous work.
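A hypothetical sketch of ensembling MARBERT-based classifiers by averaging their class probabilities; it assumes Hugging Face-style sequence-classification wrappers, and does not reproduce the specific variants, weighting, or preprocessing of the actual submission.

```python
# Probability-averaging ensemble over finetuned classifier variants (illustrative).

import torch

@torch.no_grad()
def ensemble_country_probs(models, tokenizer, text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    probs = [m(**inputs).logits.softmax(dim=-1) for m in models]  # each: (1, num_countries)
    return torch.stack(probs).mean(dim=0)                         # averaged class distribution
```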
Abstract
In this work, we explore a new Spiking Neural Network (SNN) formulation with Resonate-and-Fire (RAF) neurons (Izhikevich, 2001) trained with gradient descent via back-propagation. The RAF-SNN, while more biologically plausible, achieves performance comparable to or higher than conventional models in the Machine Learning literature across different network configurations, using similar or fewer parameters. Strikingly, the RAF-SNN proves robust against noise induced at testing/training time, under both static and dynamic conditions. Against CNN on MNIST, we show 25% higher absolute accuracy with N(0, 0.2) induced noise at testing time. Against LSTM on N-MNIST, we show 70% higher absolute accuracy with 20% induced noise at training time.
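A discretized resonate-and-fire neuron in the spirit of Izhikevich (2001), written as a plain NumPy sketch: the complex state follows a damped oscillation, and a spike is emitted when the voltage-like imaginary part crosses a threshold. The training machinery (surrogate gradients, layer sizes) of the RAF-SNN is not shown, and the parameters below are illustrative.

```python
# Single resonate-and-fire (RAF) neuron, Euler-discretized (illustrative parameters).

import numpy as np

def raf_neuron(inputs, b=-0.1, omega=10.0, dt=1e-3, threshold=0.1):
    """Simulate one RAF neuron; `inputs` is a 1-D array of input current per time step."""
    z = 0.0 + 0.0j                                     # complex state: Im(z) acts like a membrane voltage
    spikes = np.zeros(len(inputs))
    for t, i_t in enumerate(inputs):
        z = z + dt * ((b + 1j * omega) * z + i_t)      # damped oscillation around rest, driven by input
        if z.imag >= threshold:                        # fire when the oscillation resonates high enough
            spikes[t] = 1.0
            z = 0.0 + 0.0j                             # simple reset (one of several conventions)
        # inputs arriving in phase with omega accumulate; off-phase inputs tend to cancel
    return spikes
```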
Abstract
This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7% and 5% relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.
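The abstract does not spell out how the three score sources are combined, so the following is only a hedged illustration: normalize each source and mix with tunable weights to rank sentence pairs for filtering.

```python
# Illustrative score combination for parallel-corpus filtering (the actual rule is an assumption).

import numpy as np

def combine_scores(laser, classifier, devkit, weights=(1.0, 1.0, 1.0)):
    """Each argument is an array of per-sentence-pair scores; returns one ranking score per pair."""
    def z(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-9)       # z-normalize so the sources are comparable
    w1, w2, w3 = weights
    return w1 * z(laser) + w2 * z(classifier) + w3 * z(devkit)
```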
Abstract
We propose a novel architecture for labelling character sequences that achieves state-of-the-art results on the Tashkeela Arabic diacritization benchmark. The core is a two-level recurrence hierarchy that operates on the word and character levels separately—enabling faster training and inference than comparable traditional models. A cross-level attention module further connects the two and opens the door for network interpretability. The task module is a softmax classifier that enumerates valid combinations of diacritics. This architecture can be extended with a recurrent decoder that optionally accepts priors from partially diacritized text, which improves results. We employ extra tricks such as sentence dropout and majority voting to further boost the final result. Our best model achieves a WER of 5.34%, outperforming the previous state-of-the-art with a 30.56% relative error reduction.
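A compact PyTorch reading of the two-level recurrence idea: a word-level RNN runs over word embeddings, a character-level RNN runs within each word, and the word state is injected into the character states before classification over diacritic combinations. The cross-level attention module, recurrent decoder, sentence dropout, and voting from the full model are omitted, and all dimensions are illustrative.

```python
# Minimal two-level (word + character) recurrence sketch for diacritization.

import torch
import torch.nn as nn

class TwoLevelDiacritizer(nn.Module):
    def __init__(self, n_chars, n_words, n_classes, d=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d)
        self.word_emb = nn.Embedding(n_words, d)
        self.word_rnn = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.char_rnn = nn.LSTM(d + 2 * d, d, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d, n_classes)   # one class per valid diacritic combination

    def forward(self, word_ids, char_ids):
        # word_ids: (B, W); char_ids: (B, W, C) characters of each word
        B, W, C = char_ids.shape
        word_ctx, _ = self.word_rnn(self.word_emb(word_ids))        # (B, W, 2d) word-level context
        chars = self.char_emb(char_ids)                             # (B, W, C, d)
        word_ctx = word_ctx.unsqueeze(2).expand(-1, -1, C, -1)      # broadcast each word's state to its chars
        char_in = torch.cat([chars, word_ctx], dim=-1).view(B * W, C, -1)
        char_out, _ = self.char_rnn(char_in)                        # (B*W, C, 2d) character-level context
        return self.classifier(char_out).view(B, W, C, -1)          # per-character diacritic logits
```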