Unsupervised text tokenizer for Neural Network-based text generation.
224 matches across 8 categories.
| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | contrib/nlcodec/bpe_model_trainer_nlcodec.h | 31 | // ─── Data Structures ───────────────────────────────────────────────────────── |
| MEDIUM | contrib/nlcodec/bpe_model_trainer_nlcodec.h | 115 | // ─── Fast BPE Training Function ───────────────────────────────────────────── |
| MEDIUM | contrib/nlcodec/benchmark.sh | 15 | # ─── Defaults ───────────────────────────────────────────────────────────────── |
| MEDIUM | contrib/nlcodec/benchmark.sh | 47 | # ─── Parse arguments ───────────────────────────────────────────────────────── |
| MEDIUM | contrib/nlcodec/benchmark.sh | 60 | # ─── Step 1: Build SentencePiece if needed ──────────────────────────────────── |
| MEDIUM | contrib/nlcodec/benchmark.sh | 75 | # ─── Step 2: Download CC-100 data if needed ─────────────────────────────────── |
| MEDIUM | contrib/nlcodec/benchmark.sh | 119 | # ─── Step 3: Run benchmark ─────────────────────────────────────────────────── |
| MEDIUM | data/Scripts.txt | 1383 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1391 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1401 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1410 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1418 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1515 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1524 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1531 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1540 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1546 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1583 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1591 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1598 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1605 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1613 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1649 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1656 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1662 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1669 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1680 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1686 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1696 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1804 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1813 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1821 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1828 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1902 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1911 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1920 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1926 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1932 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1939 | # ================================================ |
| MEDIUM | data/Scripts.txt | 1999 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2006 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2012 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2026 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2034 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2044 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2085 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2093 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2101 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2108 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2115 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2121 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2185 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2195 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2201 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2211 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2252 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2259 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2267 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2342 | # ================================================ |
| MEDIUM | data/Scripts.txt | 2350 | # ================================================ |
| 85 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | CMakeLists.txt | 1 | # Copyright 2018 Google Inc. |
| LOW | python/setup.py | 1 | #!/usr/bin/env python |
| LOW | python/test/sentencepiece_test.py | 1 | #!/usr/bin/python |
| LOW | contrib/nlcodec/bpe_model_trainer_nlcodec.h | 1 | // Copyright 2024 nlcodec / Thamme Gowda |
| LOW | third_party/darts_clone/darts.h | 1 | #ifndef DARTS_H_ |
| LOW | third_party/darts_clone/darts.h | 41 | // <progress_func_type> is the type of callback functions for reporting the |
| LOW | third_party/darts_clone/darts.h | 101 | // Disallows operator=. |
| LOW | third_party/darts_clone/darts.h | 201 | } |
| LOW | third_party/darts_clone/darts.h | 221 | // it will be called in build() so that the caller can check the progress of |
| LOW | third_party/darts_clone/darts.h | 241 | // when and only when a memory allocation fails. |
| LOW | third_party/darts_clone/darts.h | 261 | // In the above example, the lengths are { 1, 2, 2 }, not { 4, 5, 5 }. |
| LOW | third_party/darts_clone/darts.h | 281 | template <class U> |
| LOW | .github/workflows/requirements/cibuildwheel.txt | 1 | # |
| LOW | .github/workflows/requirements/cibuildwheel.txt | 81 | # -c base.txt |
| LOW | data/Scripts.txt | 1 | # Scripts-9.0.0.txt |
| LOW | data/gen_unicode_scripts_code.pl | 1 | #!/usr/bin/perl |
| LOW | src/filesystem.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/freelist.h | 1 | // Copyright 2018 Google Inc. |
| LOW | src/unicode_script.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/model_interface.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/model_interface.h | 21 | #include <utility> |
| LOW | src/CMakeLists.txt | 1 | # Copyright 2018 Google Inc. |
| LOW | src/bpe_model.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/testharness.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/testharness.h | 121 | .IsEq(std::string(a), std::string(b), #a, #b) |
| LOW | src/word_model_trainer.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/sentencepiece_trainer.h | 1 | // Copyright 2018 Google Inc. |
| LOW | src/char_model_trainer.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/trainer_interface.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/common.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/common.h | 21 | #include "third_party/absl/log/check.h" |
| LOW | src/unigram_model_trainer.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/unigram_model.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/sentencepiece_processor.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/sentencepiece_processor.h | 81 | // Simple and language independent tokenizer and de-tokenizer for |
| LOW | src/sentencepiece_processor.h | 101 | // Usage: |
| LOW | src/sentencepiece_processor.h | 121 | // SentencePieceText spt; |
| LOW | src/sentencepiece_processor.h | 321 | std::vector<std::vector<std::string>> *pieces) const; |
| LOW | src/sentencepiece_processor.h | 341 | // in https://arxiv.org/abs/1804.10959 (nbest_size < 0 means l = infinity) |
| LOW | src/sentencepiece_processor.h | 381 | ////////////////////////////////////////////////////////////// |
| LOW | src/sentencepiece_processor.h | 741 | |
| LOW | src/spec_parser.h | 1 | // Copyright 2016 Google LLC. |
| LOW | src/unicode_script_map.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/util.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/util.h | 21 | #include <algorithm> |
| LOW | src/util.h | 321 | if (condition) { \ |
| LOW | src/builder.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/builder.h | 61 | // be implemented with a simple longest matching string-to-string |
| LOW | src/builder.h | 81 | // are normalized into ZYX. When we implement this normalization with |
| LOW | src/trainer_factory.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/pretokenizer_for_training.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/normalizer.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/normalizer.h | 21 | #include <utility> |
| LOW | src/normalizer.h | 41 | // Returns the UTF8 byte length of matched string. |
| LOW | src/normalizer.h | 101 | void Init(); |
| LOW | src/init.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/model_factory.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/word_model.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/bpe_model_trainer.h | 1 | // Copyright 2016 Google Inc. |
| LOW | src/char_model.h | 1 | // Copyright 2016 Google Inc. |
| 4 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| HIGH | python/test/sentencepiece_test.py | 231 | # suppress logging (redirect to /dev/null) |
| HIGH | python/test/sentencepiece_test.py | 271 | # suppress logging (redirect to /dev/null) |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | python/setup.py | 17 | |
| LOW | python/test/sentencepiece_test.py | 22 | |
| LOW | python/src/sentencepiece/__init__.py | 7 | |
| LOW | python/src/sentencepiece/__init__.py | 1208 | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | python/src/sentencepiece/__init__.py | 562 | |
| MEDIUM | python/src/sentencepiece/__init__.py | 868 | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | python/setup.py | 129 | |
| LOW | python/src/sentencepiece/__init__.py | 776 | |
| LOW | python/src/sentencepiece/__init__.py | 1084 | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | python/test/sentencepiece_test.py | 913 | def test_override_normalize_spec(self): |
| LOW | python/src/sentencepiece/__init__.py | 27 | def _swig_setattr_nondynamic_instance_variable(set): |
| LOW | python/src/sentencepiece/__init__.py | 40 | def _swig_setattr_nondynamic_class_variable(set): |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | python/test/gil_release_test.py | 85 | # Check if GIL is explicitly disabled (Python 3.13+ free-threaded build) |