  1. aclanthology.org

    %0 Conference Proceedings
    %T A Comparison of Different Tokenization Methods for the Georgian Language
    %A Mikaberidze, Beso
    %A Saghinadze, Temo
    %A Mikaberidze, Guram
    %A Kalandadze, Raphael
    %A Pkhakadze, Konstantine
    %A van Genabith, Josef
    %A Ostermann, Simon
    %A van der Plas, Lonneke
    %A Müller, Philipp
    %Y Abbas, Mourad
    %Y Freihat, Abed Alhakim ...
  3. aclanthology.org

    making informed tokenization choices in future language model developments for Georgian. 1 Introduction. Tokenization is a fundamental process in most natural language processing (NLP) tasks that involves breaking down a text into smaller units called tokens. It is one of the first processes conducted in most approaches and is particularly ...
  4. trails-dfki.github.io

    While the impact of tokenization on language modeling is well-researched in richly resourced languages, fewer studies on this topic exist for challenging low-resource languages. In this work, we present the first systematic evaluation of tokenization methods for Georgian, a low-resource language with high morphological complexity. We compare standard subword tokenizers, such as WordPiece, Byte ...
  5. semanticscholar.org

    This work presents the first systematic evaluation of tokenization methods for Georgian, a low-resource language with high morphological complexity, and evaluates the performance of all tokenizers on masked language modeling and on four downstream tasks: part-of-speech tagging, named entity recognition, toxicity detection, and sentiment analysis. While the impact of tokenization on language ...
  6. A Comparison of Different Tokenization Methods for the Georgian Language Beso Mikaberidze; Temo Saghinadze; Guram Mikaberidze; Raphael Kalandadze; Konstantine Pkhakadze; Josef van Genabith; Simon Ostermann; Lonneke van der Plas; Philipp Müller. In: Proceedings of the 7th International Conference on Natural Language and Speech Processing. ...
  7. cordis.europa.eu

    Mar 10, 2023. GAIN. Grant agreement ID: 101078950 DOI 10.3030 ... A Comparison of Different Tokenization Methods for the Georgian Language. Author(s): B. Mikaberidze, T. Saghinadze, G ... Müller Published in: Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP-2024), 2024, ISBN 979-8 ...
  8. Oct 7, 2024. This paper discusses the T5 model, explaining how tokenization and transfer learning affect language models, and it introduces methods for improving token efficiency. Paper link 3.
  9. Feb 22, 2024. Why Tokenization Matters. Think of tokenization as the process of translating text into the language that large language models (LLMs) understand: numbers. Just like we need special tools to translate between different languages, tokenization is the bridge between human language and AI.
