MorphPiece: Moving away from Statistical Language Representation
Related papers
Bridging the Gap for Tokenizer-Free Language Models
arXiv, 2019
Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings
arXiv, 2018
Unsupervised Tokenization Learning
arXiv, 2022
Improving Tokenisation by Alternative Treatment of Spaces
2022
Word and Sentence Tokenization with Hidden Markov Models
Journal for Language Technology and Computational Linguistics, 2013
Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT
arXiv, 2023
Overview of the progression of state-of-the-art language models
TELKOMNIKA Telecommunication Computing Electronics and Control, 2024
Enhancing recurrent neural network-based language models by word tokenization
Human-centric Computing and Information Sciences, 2018
On Losses for Modern Language Models
Stéphane Aroca-Ouellette, Frank Rudzicz
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020
Efficient Domain Adaptation of Language Models via Adaptive Tokenization
arXiv, 2021
Improving language models by retrieving from trillions of tokens
arXiv, 2021
Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
2019 IEEE International Conference on Big Data (Big Data), 2019
COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
arXiv, 2021
Investigating the effect of sub-word segmentation on the performance of transformer language models
arXiv, 2023
SCALE: Scaling up the Complexity for Advanced Language Model Evaluation
arXiv, 2023
KR-BERT: A Small-Scale Korean-Specific Language Model
2020
Text preparation through extended tokenization
Data Mining VII: Data, Text and Web Mining and their Business Applications, 2006
Language-Independent Text Tokenization Using Unsupervised Deep Learning
Intelligent Automation & Soft Computing
How Much Does Tokenization Affect Neural Machine Translation?
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
On the Importance of Tokenization in Arabic Embedding Models
Proceedings of the Fifth Arabic Natural Language Processing Workshop, 2020
An Extensive Analysis Between Different Language Models: GPT-3, BERT and MACAW
Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies
2009
CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation
2019
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
arXiv, 2021
N-Grammer: Augmenting Transformers with latent n-grams
arXiv, 2022
A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics
A Morpho-Syntactically Informed LSTM-CRF Model for Named Entity Recognition
2019
Investigating linguistic knowledge in a maximum entropy token-based language model
2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007
GiBERT: Enhancing BERT with Linguistic Information using a Lightweight Gated Injection Method
Findings of the Association for Computational Linguistics: EMNLP 2021, 2021
2020
Ascent of Pre-trained State-of-the-Art Language Models
Algorithms for intelligent systems, 2020
The ParisNLP Entry at the CoNLL UD Shared Task 2017: A Tale of a #ParsingTragedy
Give your Text Representation Models some Love: the Case for Basque
2020