MorphPiece: Moving away from Statistical Language Representation

Bridging the Gap for Tokenizer-Free Language Models

Mandy Guo

ArXiv, 2019

Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings

Bernd Bohnet

arXiv (Cornell University), 2018

Unsupervised Tokenization Learning

Anton Kolonin

Cornell University - arXiv, 2022

Improving Tokenisation by Alternative Treatment of Spaces

Harish Tayyar Madabushi

2022

Word and Sentence Tokenization with Hidden Markov Models

Kay-Michael Würzner

Journal for Language Technology and Computational Linguistics, 2013

Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

Ljiljana Dolamic

arXiv (Cornell University), 2023

Overview of the progression of state-of-the-art language models

TELKOMNIKA JOURNAL

TELKOMNIKA Telecommunication Computing Electronics and Control, 2024

Enhancing recurrent neural network-based language models by word tokenization

Shahenda Sarhan

Human-centric Computing and Information Sciences, 2018

On Losses for Modern Language Models

Stéphane Aroca-Ouellette, Frank Rudzicz

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Jason Kessler

ArXiv, 2021

Improving language models by retrieving from trillions of tokens

Roman Ring

Cornell University - arXiv, 2021

Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

Ahmed El-Kishky

2019 IEEE International Conference on Big Data (Big Data), 2019

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Payal Bajaj

ArXiv, 2021

Investigating the effect of sub-word segmentation on the performance of transformer language models

Anisya Katinskaya

arXiv (Cornell University), 2023

SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

Veton Matoshi

arXiv (Cornell University), 2023

KR-BERT: A Small-Scale Korean-Specific Language Model

Suzi Park

2020

Text preparation through extended tokenization

Marcus Hassler

Data Mining VII: Data, Text and Web Mining and their Business Applications, 2006

Language-Independent Text Tokenization Using Unsupervised Deep Learning

Aladdin Hafez

Intelligent Automation & Soft Computing

How Much Does Tokenization Affect Neural Machine Translation?

Mercedes García Martínez

2018

An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing

Sara Stymne

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

On the Importance of Tokenization in Arabic Embedding Models

Mairaj Syed

Proceedings of the Fifth Arabic Natural Language Processing Workshop, 2020

UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging

Jan Hajič

Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2019

An Extensive Analysis Between Different Language Models: GPT-3, BERT and MACAW

Palash Rambhia

Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies

Deniz Yuret

2009

CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation

Christian Chiarcos

2019

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Noa Nabeshima

ArXiv, 2021

N-Grammer: Augmenting Transformers with latent n-grams

Aurko Roy

arXiv (Cornell University), 2022

A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics

Logan Kearsley

A Morpho-Syntactically Informed LSTM-CRF Model for Named Entity Recognition

Petya Osenova

2019

Investigating linguistic knowledge in a maximum entropy token-based language model

Keith Hall

2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007

GiBERT: Enhancing BERT with Linguistic Information using a Lightweight Gated Injection Method

Maria Liakata

Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

Fast WordPiece Tokenization

Alex Salcianu

2020

Ascent of Pre-trained State-of-the-Art Language Models

Lakshmi Kurup

Algorithms for intelligent systems, 2020

The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy

Benoît Sagot

Give your Text Representation Models some Love: the Case for Basque

Xabier Saralegi

2020
