Repository Analysis

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

13.5 Low AI signal
Adjusted Score: 13.5
Raw Score: 13.5
Time Factor: 100%
Last Push: 2026-05-30
Stars: 11,866
Language: C++
Lines of Code: 30,874
Files: 73
Pattern Hits: 224
Scan Date: 2026-05-31

Severity Breakdown

CRITICAL: 0 · HIGH: 2 · MEDIUM: 147 · LOW: 75

Pattern Findings

224 matches across 8 categories.

Decorative Section Separators: 145 hits · 322 pts
Severity  File  Line  Snippet
MEDIUM  contrib/nlcodec/bpe_model_trainer_nlcodec.h  31  // ─── Data Structures ─────────────────────────────────────────────────────────
MEDIUM  contrib/nlcodec/bpe_model_trainer_nlcodec.h  115  // ─── Fast BPE Training Function ─────────────────────────────────────────────
MEDIUM  contrib/nlcodec/benchmark.sh  15  # ─── Defaults ─────────────────────────────────────────────────────────────────
MEDIUM  contrib/nlcodec/benchmark.sh  47  # ─── Parse arguments ─────────────────────────────────────────────────────────
MEDIUM  contrib/nlcodec/benchmark.sh  60  # ─── Step 1: Build SentencePiece if needed ────────────────────────────────────
MEDIUM  contrib/nlcodec/benchmark.sh  75  # ─── Step 2: Download CC-100 data if needed ───────────────────────────────────
MEDIUM  contrib/nlcodec/benchmark.sh  119  # ─── Step 3: Run benchmark ───────────────────────────────────────────────────
MEDIUM  data/Scripts.txt  1383  # ================================================
MEDIUM  data/Scripts.txt  1391  # ================================================
MEDIUM  data/Scripts.txt  1401  # ================================================
MEDIUM  data/Scripts.txt  1410  # ================================================
MEDIUM  data/Scripts.txt  1418  # ================================================
MEDIUM  data/Scripts.txt  1515  # ================================================
MEDIUM  data/Scripts.txt  1524  # ================================================
MEDIUM  data/Scripts.txt  1531  # ================================================
MEDIUM  data/Scripts.txt  1540  # ================================================
MEDIUM  data/Scripts.txt  1546  # ================================================
MEDIUM  data/Scripts.txt  1583  # ================================================
MEDIUM  data/Scripts.txt  1591  # ================================================
MEDIUM  data/Scripts.txt  1598  # ================================================
MEDIUM  data/Scripts.txt  1605  # ================================================
MEDIUM  data/Scripts.txt  1613  # ================================================
MEDIUM  data/Scripts.txt  1649  # ================================================
MEDIUM  data/Scripts.txt  1656  # ================================================
MEDIUM  data/Scripts.txt  1662  # ================================================
MEDIUM  data/Scripts.txt  1669  # ================================================
MEDIUM  data/Scripts.txt  1680  # ================================================
MEDIUM  data/Scripts.txt  1686  # ================================================
MEDIUM  data/Scripts.txt  1696  # ================================================
MEDIUM  data/Scripts.txt  1804  # ================================================
MEDIUM  data/Scripts.txt  1813  # ================================================
MEDIUM  data/Scripts.txt  1821  # ================================================
MEDIUM  data/Scripts.txt  1828  # ================================================
MEDIUM  data/Scripts.txt  1902  # ================================================
MEDIUM  data/Scripts.txt  1911  # ================================================
MEDIUM  data/Scripts.txt  1920  # ================================================
MEDIUM  data/Scripts.txt  1926  # ================================================
MEDIUM  data/Scripts.txt  1932  # ================================================
MEDIUM  data/Scripts.txt  1939  # ================================================
MEDIUM  data/Scripts.txt  1999  # ================================================
MEDIUM  data/Scripts.txt  2006  # ================================================
MEDIUM  data/Scripts.txt  2012  # ================================================
MEDIUM  data/Scripts.txt  2026  # ================================================
MEDIUM  data/Scripts.txt  2034  # ================================================
MEDIUM  data/Scripts.txt  2044  # ================================================
MEDIUM  data/Scripts.txt  2085  # ================================================
MEDIUM  data/Scripts.txt  2093  # ================================================
MEDIUM  data/Scripts.txt  2101  # ================================================
MEDIUM  data/Scripts.txt  2108  # ================================================
MEDIUM  data/Scripts.txt  2115  # ================================================
MEDIUM  data/Scripts.txt  2121  # ================================================
MEDIUM  data/Scripts.txt  2185  # ================================================
MEDIUM  data/Scripts.txt  2195  # ================================================
MEDIUM  data/Scripts.txt  2201  # ================================================
MEDIUM  data/Scripts.txt  2211  # ================================================
MEDIUM  data/Scripts.txt  2252  # ================================================
MEDIUM  data/Scripts.txt  2259  # ================================================
MEDIUM  data/Scripts.txt  2267  # ================================================
MEDIUM  data/Scripts.txt  2342  # ================================================
MEDIUM  data/Scripts.txt  2350  # ================================================
85 more matches not shown…
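
The separator rows listed above share a shape: a comment marker followed by a long run of `─` or `=` characters. The following is a minimal Python sketch of one way such lines could be flagged. The regular expression, the eight-character threshold, and the name `is_decorative_separator` are assumptions made for illustration; the report does not describe the scanner's actual rule.

```python
import re

# Assumed rule (not taken from the scanner): a line that starts with a
# `//` or `#` comment marker and ends with a run of at least eight
# identical `─` or `=` characters.
DECORATIVE = re.compile(r"^\s*(?://|#).*([─=])\1{7,}\s*$")


def is_decorative_separator(line: str) -> bool:
    """Return True when the line looks like a decorative comment separator."""
    return DECORATIVE.match(line) is not None


# Sample lines modeled on the snippets in the table above.
samples = [
    "// ─── Data Structures " + "─" * 57,
    "# " + "=" * 48,
    "# Copyright 2018 Google Inc.",
]
for line in samples:
    print(is_decorative_separator(line), line[:30])
```

Under this assumed rule, an ordinary license header such as `# Copyright 2018 Google Inc.` is not flagged, which matches its appearance under a different category in the report.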

Over-Commented Block: 64 hits · 64 pts
Severity  File  Line  Snippet
LOW  CMakeLists.txt  1  # Copyright 2018 Google Inc.
LOW  python/setup.py  1  #!/usr/bin/env python
LOW  python/test/sentencepiece_test.py  1  #!/usr/bin/python
LOW  contrib/nlcodec/bpe_model_trainer_nlcodec.h  1  // Copyright 2024 nlcodec / Thamme Gowda
LOW  third_party/darts_clone/darts.h  1  #ifndef DARTS_H_
LOW  third_party/darts_clone/darts.h  41  // <progress_func_type> is the type of callback functions for reporting the
LOW  third_party/darts_clone/darts.h  101  // Disallows operator=.
LOW  third_party/darts_clone/darts.h  201  }
LOW  third_party/darts_clone/darts.h  221  // it will be called in build() so that the caller can check the progress of
LOW  third_party/darts_clone/darts.h  241  // when and only when a memory allocation fails.
LOW  third_party/darts_clone/darts.h  261  // In the above example, the lengths are { 1, 2, 2 }, not { 4, 5, 5 }.
LOW  third_party/darts_clone/darts.h  281  template <class U>
LOW  .github/workflows/requirements/cibuildwheel.txt  1  #
LOW  .github/workflows/requirements/cibuildwheel.txt  81  # -c base.txt
LOW  data/Scripts.txt  1  # Scripts-9.0.0.txt
LOW  data/gen_unicode_scripts_code.pl  1  #!/usr/bin/perl
LOW  src/filesystem.h  1  // Copyright 2016 Google Inc.
LOW  src/freelist.h  1  // Copyright 2018 Google Inc.
LOW  src/unicode_script.h  1  // Copyright 2016 Google Inc.
LOW  src/model_interface.h  1  // Copyright 2016 Google Inc.
LOW  src/model_interface.h  21  #include <utility>
LOW  src/CMakeLists.txt  1  # Copyright 2018 Google Inc.
LOW  src/bpe_model.h  1  // Copyright 2016 Google Inc.
LOW  src/testharness.h  1  // Copyright 2016 Google Inc.
LOW  src/testharness.h  121  .IsEq(std::string(a), std::string(b), #a, #b)
LOW  src/word_model_trainer.h  1  // Copyright 2016 Google Inc.
LOW  src/sentencepiece_trainer.h  1  // Copyright 2018 Google Inc.
LOW  src/char_model_trainer.h  1  // Copyright 2016 Google Inc.
LOW  src/trainer_interface.h  1  // Copyright 2016 Google Inc.
LOW  src/common.h  1  // Copyright 2016 Google Inc.
LOW  src/common.h  21  #include "third_party/absl/log/check.h"
LOW  src/unigram_model_trainer.h  1  // Copyright 2016 Google Inc.
LOW  src/unigram_model.h  1  // Copyright 2016 Google Inc.
LOW  src/sentencepiece_processor.h  1  // Copyright 2016 Google Inc.
LOW  src/sentencepiece_processor.h  81  // Simple and language independent tokenizer and de-tokenizer for
LOW  src/sentencepiece_processor.h  101  // Usage:
LOW  src/sentencepiece_processor.h  121  // SentencePieceText spt;
LOW  src/sentencepiece_processor.h  321  std::vector<std::vector<std::string>> *pieces) const;
LOW  src/sentencepiece_processor.h  341  // in https://arxiv.org/abs/1804.10959 (nbest_size < 0 means l = infinity)
LOW  src/sentencepiece_processor.h  381  //////////////////////////////////////////////////////////////
LOW  src/sentencepiece_processor.h  741
LOW  src/spec_parser.h  1  // Copyright 2016 Google LLC.
LOW  src/unicode_script_map.h  1  // Copyright 2016 Google Inc.
LOW  src/util.h  1  // Copyright 2016 Google Inc.
LOW  src/util.h  21  #include <algorithm>
LOW  src/util.h  321  if (condition) { \
LOW  src/builder.h  1  // Copyright 2016 Google Inc.
LOW  src/builder.h  61  // be implemented with a simple longest matching string-to-string
LOW  src/builder.h  81  // are normalized into ZYX. When we implement this normalization with
LOW  src/trainer_factory.h  1  // Copyright 2016 Google Inc.
LOW  src/pretokenizer_for_training.h  1  // Copyright 2016 Google Inc.
LOW  src/normalizer.h  1  // Copyright 2016 Google Inc.
LOW  src/normalizer.h  21  #include <utility>
LOW  src/normalizer.h  41  // Returns the UTF8 byte length of matched string.
LOW  src/normalizer.h  101  void Init();
LOW  src/init.h  1  // Copyright 2016 Google Inc.
LOW  src/model_factory.h  1  // Copyright 2016 Google Inc.
LOW  src/word_model.h  1  // Copyright 2016 Google Inc.
LOW  src/bpe_model_trainer.h  1  // Copyright 2016 Google Inc.
LOW  src/char_model.h  1  // Copyright 2016 Google Inc.
4 more matches not shown…

Cross-Language Confusion: 2 hits · 15 pts
Severity  File  Line  Snippet
HIGH  python/test/sentencepiece_test.py  231  # suppress logging (redirect to /dev/null)
HIGH  python/test/sentencepiece_test.py  271  # suppress logging (redirect to /dev/null)

Unused Imports: 4 hits · 4 pts
Severity  File  Line  Snippet
LOW  python/setup.py  17
LOW  python/test/sentencepiece_test.py  22
LOW  python/src/sentencepiece/__init__.py  7
LOW  python/src/sentencepiece/__init__.py  1208

Dead Code: 2 hits · 4 pts
Severity  File  Line  Snippet
MEDIUM  python/src/sentencepiece/__init__.py  562
MEDIUM  python/src/sentencepiece/__init__.py  868

Deep Nesting: 3 hits · 3 pts
Severity  File  Line  Snippet
LOW  python/setup.py  129
LOW  python/src/sentencepiece/__init__.py  776
LOW  python/src/sentencepiece/__init__.py  1084

Hyper-Verbose Identifiers: 3 hits · 3 pts
Severity  File  Line  Snippet
LOW  python/test/sentencepiece_test.py  913  def test_override_normalize_spec(self):
LOW  python/src/sentencepiece/__init__.py  27  def _swig_setattr_nondynamic_instance_variable(set):
LOW  python/src/sentencepiece/__init__.py  40  def _swig_setattr_nondynamic_class_variable(set):

Redundant / Tautological Comments: 1 hit · 2 pts
Severity  File  Line  Snippet
LOW  python/test/gil_release_test.py  85  # Check if GIL is explicitly disabled (Python 3.13+ free-threaded build)
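
The per-category figures can be cross-checked against the summary numbers. The short Python sketch below copies the hit and point counts from the category headings, confirms that they add up to the 224 pattern hits and to the severity breakdown, and then computes points per 1,000 lines of code. That last figure comes out at about 13.5, the same as the raw score, but this relationship is inferred from the numbers and is not stated anywhere in the report. All variable names are illustrative.

```python
# (hits, points) per category, copied from the category headings in the report.
categories = {
    "Decorative Section Separators": (145, 322),
    "Over-Commented Block": (64, 64),
    "Cross-Language Confusion": (2, 15),
    "Unused Imports": (4, 4),
    "Dead Code": (2, 4),
    "Deep Nesting": (3, 3),
    "Hyper-Verbose Identifiers": (3, 3),
    "Redundant / Tautological Comments": (1, 2),
}
severity = {"CRITICAL": 0, "HIGH": 2, "MEDIUM": 147, "LOW": 75}
lines_of_code = 30_874

total_hits = sum(hits for hits, _ in categories.values())
total_points = sum(points for _, points in categories.values())

# Both totals should agree with the reported 224 pattern hits.
assert total_hits == 224
assert sum(severity.values()) == total_hits

# Assumption: the raw score may be points per 1,000 lines of code.
# The report does not document its scoring formula.
points_per_kloc = total_points / lines_of_code * 1000

print(f"hits={total_hits} points={total_points} per_kloc={points_per_kloc:.1f}")
```

If the assumption holds, the adjusted score would follow by multiplying by the 100% time factor, which leaves the value unchanged.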