JTokkit aims to be a fast and efficient tokenizer designed for natural language processing tasks with OpenAI models. It provides an easy-to-use interface for tokenizing input text, for ...
This repository contains all the code for reproducing the experiments from the paper "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" Given a BPE tokenizer, our attack infers ...